SJSU CS 265 - Rsync - D2898379

Home> Schools> San Jose State University> Computer Science (CS) > CS 265> Rsync

DOC PREVIEW

SJSU CS 265 - Rsync

School name San Jose State University

Course Cs 265- Cryptography and Computer Security

Pages 7

This preview shows page 1-2 out of 7 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 7 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Rsync: Efficient File Synchronization Using HashingBy David ShaoProjectCS 265, Spring 2004Instructor: Dr. StampDue Monday, April 5, 2004IntroductionIdeal CaseIdealized ProtocolBasic AlgorithmWeak and Strong HashesMD4 as the Strong HashModifications to the Idealized AlgorithmRolling Weaker HashCombining Ordinary and Weighted SummationExtra Signature for Entire FileSynchronizing Entire DirectoriesFuture Direction for RsyncReferencesRsync: Efficient File Synchronization Using HashingBy David ShaoProjectCS 265, Spring 2004Instructor: Dr. StampDue Monday, April 5, 2004Rsync: Efficient File Synchronization Using Hashing David ShaoIntroductionSuppose after making small changes to a file one wishes to synchronize the file’s contents with a backup at a remote location over a relatively slow network connection. Based on work for his Ph.D. thesis, Andrew Tridgell proposed and initially implemented an algorithm that attempts to minimize both the amount of data that needs to be transmitted and the number of round trip communications between the client that wishes to receive the synchronized file and the server that has the most up-to-date version of the file [Tridgell, 1999]. Tridgell’s initial implementation has grown into the open source rsync project [Rsync], one that enables the synchronization of entire directories for Unix-like environments. For this report, we follow Tridgell’s exposition of the basic ideas behind his file synchronization algorithm. [Tridgell, mostly Chapter 3]Ideal CaseSuppose we know that the most recent version of the file, which we will say resides on the sender, and its earlier version remote backup, which we will say resides on the receiver, differ only in relatively few places, similar to the figure below where the two versions differ only in the locations colored respectively red and yellow.Figure 1 Sender and Receiver Versions of the File Differ Only in a Few BlocksFor initial simplicity we can assume the versions of the file have the same length (just pad with 0 bytes if not), and we also assume the files can be divided into some number of equally sized blocks with relatively few blocks having any bytes different between versions.We wish to synchronize the receiver’s version with the sender’s version. In this ideal case, the receiver already has most of the correct blocks—it just doesn’t know it. We therefore can hope tolimit the amount of data that has to be sent by only transmitting smaller representatives instead offull blocks, in other words, by using hashes of blocks.Idealized ProtocolWe examine an initial attempt to synchronize files with the following properties:1) The receiver only has to transmit the hashes of its blocks all at once to avoid latency fromround trip communication.2) The sender only has to transmit a sequence of commands and blocks for the receiver to reconstruct the sender’s version of the file, again all at once.3) Receiver and sender will tolerate an extremely small probability of inaccuracy in synchronizing the file, perhaps smaller than the probability of machine or network error.Property 3) cannot be avoided for the general case of trying to synchronize files that might have arbitrary data if we wish to do better than just transmitting the entire file.Basic Algorithm1. Receiver computes hashes for its blocks and transmits the hashes to the sender.2. Sender stores the hashes from the receiver.3. For each block of its version of the file, the sender first computes the hash of its block and then sees if it matches any of the hashes from the sender.CS 265, Spring 2004 Page 2 of 7 1/14/2019Rsync: Efficient File Synchronization Using Hashing David Shao4. If the sender finds a matching hash, it records that the receiver only needs to copy data from the receiver’s own blocks.5. If the sender does not find a matching hash, it records that the sender needs to send to the receiver the unmatched block.6. Sender transmits to the receiver the sequence of commands from parts 3 and 4 that tell the receiver how to reconstruct the sender’s version of the file.For the ideal case where most corresponding blocks are identical, the above algorithm will decrease the amount of data that needs to be transmitted by approximately the ratio between the size of a hash of a block and the full size of a block.However, there is an implicit assumption that the regions where the files match are neatly lined upwith each other in corresponding blocks. For an easy example where that assumption is not true,consider the common case where one is editing a source code file, perhaps adding a small comment at the beginning. That simple change of adding a few bytes throws off all of the alignments of the blocks, potentially resulting in all blocks being different in some small part, whilemost of the files are the same if the positioning of one were to be shifted slightly.Weak and Strong HashesTo deal with the case of small alterations, Tridgell proposed having the sender recalculate a hash not just at every block, but at every byte. But if a hash is calculated at every byte, the amount of work is increased by the size of a block. If the blocks were sized around 100 bytes, the amount of work keeping the same hash would be two orders of magnitude more, and Tridgell profiled blocks of larger size. In order for this extra work to be reasonable, the hash must be weakened.In a seeming contradiction, the hash cannot be too weak because there would be false matches of blocks, leading to corruption of files; yet the hash cannot be too strong or it would take too longto calculate for each byte. Tridgell’s solution is to use two hashes: a weaker one that can be easily calculated that weeds out most of the non-matches and a stronger one that is less frequently calculated that ensures with extremely high probability that there is no false match.MD4 as the Strong HashAccording to Tridgell, he chose MD4 as his strong hash algorithm because it was readily available, it was relatively fast, and it had a decent reputation. Tridgell’s rsync algorithm is not dependent on the exact strong hash method other than the important property that the sender and receiver must agree on which hash functions to use in common.Modifications to the Idealized Algorithm1’. The receiver calculates both a weaker and a stronger hash for each block and sends both setsof hashes to the sender.2’. The sender stores both sets of hashes. Tridgell’s initial implementation

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2 out of 7 pages.

SJSU CS 265 - Rsync

Sign up for free to view:

Please select your school