Rsync Working Mechanism (Translation)
This article is a translation of "How Rsync Works", an article recommended by the official rsync site. It mainly covers rsync terminology and a simplified account of how rsync works. Not every part is translated: the foreword is skipped, but its original English text is kept below for the sake of completeness.
How Rsync Works
A Practical Overview
Foreword
The original Rsync technical report and Andrew Tridgell's PhD thesis (pdf) are both excellent documents for understanding the theoretical mathematics and some of the mechanics of the rsync algorithm. Unfortunately they are more about the theory than the implementation of the rsync utility (hereafter referred to as Rsync).
In this document I hope to describe...
- A non-mathematical overview of the rsync algorithm.
- How that algorithm is implemented in the rsync utility.
- The protocol, in general terms, used by the rsync utility.
- The identifiable roles the rsync processes play.
This document should be able to serve as a guide for programmers needing something of an entrée into the source code but the primary purpose is to give the reader a foundation from which he may understand
- Why rsync behaves as it does.
- The limitations of rsync.
- Why a requested feature is unsuited to the code-base.
This document describes in general terms the construction and behaviour of Rsync. In some cases details and exceptions that would contribute to specific accuracy have been sacrificed for the sake of meeting the broader goals.
Processes and Roles
When we talk about rsync, we use specific terms to refer to the various processes and their roles in a task. People need a common language to communicate easily and accurately, and within a given context it is just as important to use fixed terms for the same things. On the rsync mailing list there is often confusion about roles and processes. For these reasons I will define a few terms for roles and processes that are used throughout the rest of this document.
Client | Role | The client starts the synchronization.
Server | Role | The remote rsync process or remote system that the client connects to, whether within a local transfer, through a remote shell, or over a network socket. Server is a general term and should not be confused with the daemon. Once the connection between client and server is established, the sender and receiver roles are used to distinguish the two ends instead.
Daemon | Role and process | An rsync process that waits for connections from clients. On some platforms this would be called a service.
Remote shell | Role and set of processes | One or more processes that provide the connection between the rsync client and the rsync server on a remote system.
Sender | Role and process | The rsync process that has access to the source files being synchronized.
Receiver | Role and process | As a role, the receiver is the destination system. As a process, the receiver is the process that receives the update data and writes it to disk.
Generator | Process | The generator process identifies changed files and manages the file-level logic.
Process Startup
When an rsync client starts, it first establishes a connection with a server process. The two ends of this connection can communicate through pipes or over a network socket.
When rsync talks to a remote non-daemon server through a remote shell, startup works by forking the remote shell, which in turn starts an rsync server process on the remote system. The rsync client and server then communicate through pipes via the remote shell. As far as the rsync processes are concerned, no network is involved. In this mode the options for the server process are passed on the command line used to start the remote shell.
When rsync communicates with an rsync daemon, it talks over a network socket. This is the only kind of rsync communication that can be called network-aware. In this mode the rsync options must be sent over the socket, as described below.
At the very start of client-server communication, each side sends the highest protocol version it supports, and both sides then use the lower of the two versions for the transfer. If the connection is in daemon mode, the rsync options are sent from the client to the server, and then the exclude list is transmitted. From this point on, the client-server relationship matters only for the delivery of error and log messages. (Note: from here on, the two roles sender and receiver are used to describe the two ends of the rsync connection.)
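The version handshake described above comes down to taking the smaller of the two advertised maximums. A minimal sketch, in which the function name and the version numbers are illustrative assumptions rather than anything taken from the rsync source:

```python
def negotiate_protocol(local_max: int, remote_max: int) -> int:
    """Return the protocol version both ends will use: the lower maximum."""
    return min(local_max, remote_max)


# Hypothetical version numbers, chosen only for illustration.
agreed = negotiate_protocol(local_max=31, remote_max=29)
print(agreed)
```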
A local rsync job (where both source and destination are on local file systems) is handled like a push. The client (Translator's note: the end holding the source files) becomes the sender and forks a server process to fulfill the receiver role. The client/sender and server/receiver then communicate with each other through pipes.
The File List
The file list contains not only the path names but also attributes such as mode, owner, permissions, file size, and mtime. If the "--checksum" option is used, it also includes file-level checksums.
Once the rsync connection is established, the first thing that happens is that the sender builds its file list. Each entry of the file list is transmitted (shared) to the receiver.
When this is done, each side sorts the file list by path relative to the base directory (the sorting algorithm depends on the protocol version in use). Once the list is sorted, all references to files are made by their index in the file list.
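Because both ends sort the same entries with the same rule, an index alone is enough to name a file afterwards. The sketch below illustrates that idea. The `FileEntry` fields and the plain byte-wise ordering are assumptions made for the example; the real ordering depends on the protocol version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str   # path relative to the base directory
    mode: int
    uid: int
    gid: int
    size: int
    mtime: int


def sort_file_list(entries):
    # Assumed ordering: byte-wise comparison of the relative path.
    return sorted(entries, key=lambda entry: entry.path.encode("utf-8"))


sender_list = sort_file_list([
    FileEntry("src/main.c", 0o644, 1000, 1000, 2048, 1_500_000_000),
    FileEntry("README", 0o644, 1000, 1000, 512, 1_500_000_100),
    FileEntry("src/Makefile", 0o644, 1000, 1000, 300, 1_500_000_200),
])

# The receiver gets the same entries (here in a different order) and sorts
# them with the same rule, so index N means the same file on both sides.
receiver_list = sort_file_list(reversed(sender_list))
```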
After the receiver has received the file list, it forks the generator process, which together with the receiver process completes the pipeline.
The Pipeline
Rsync is heavily pipelined. This means that its processes communicate in one direction. Once the file list has been shared, the pipeline behaves as follows:
generator --> sender --> receiver
The output of the generator is the input of the sender, and the output of the sender is the input of the receiver. Each process runs independently and is delayed only when the pipeline stalls or when it is waiting for disk I/O or CPU resources.
(Note: although data flows in one direction only, each process passes its output to the next process as soon as it has finished a piece of work and then moves on to the next piece, and the next process starts working on that data as soon as it arrives. So although the processes form a pipeline, they work independently and in parallel, with little waiting on one another.)
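The one-directional flow can be imitated with three workers joined by two queues. This is only a sketch of the pipeline shape: threads stand in for rsync's separate processes, the queue items are placeholder tuples, and all names are hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def generator_stage(file_ids, to_sender):
    for file_id in file_ids:
        to_sender.put(("checksums-for", file_id))
    to_sender.put(DONE)


def sender_stage(from_generator, to_receiver):
    while (item := from_generator.get()) is not DONE:
        _, file_id = item
        to_receiver.put(("delta-for", file_id))
    to_receiver.put(DONE)


def receiver_stage(from_sender, results):
    while (item := from_sender.get()) is not DONE:
        results.append(item[1])


def run_pipeline(file_ids):
    # Small bounded queues: a stage blocks only when the next one falls behind.
    gen_to_send = queue.Queue(maxsize=2)
    send_to_recv = queue.Queue(maxsize=2)
    results = []
    threads = [
        threading.Thread(target=generator_stage, args=(file_ids, gen_to_send)),
        threading.Thread(target=sender_stage, args=(gen_to_send, send_to_recv)),
        threading.Thread(target=receiver_stage, args=(send_to_recv, results)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Each stage starts on the next item as soon as it has handed off the previous one, which is the behaviour the note above describes.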
The Generator
The generator process compares the file list with its local directory tree. If the "--delete" option is specified, then before its main work begins the generator first identifies local files that are not on the sender (Note: the generator is a receiver-side process) and deletes them on the receiver side.
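The "--delete" pre-pass amounts to a set difference between the local tree and the sender's file list. A minimal sketch, assuming both are available as collections of relative paths (the function name is hypothetical):

```python
def files_to_delete(local_paths, sender_paths):
    """Return local paths that are absent from the sender's file list, sorted."""
    return sorted(set(local_paths) - set(sender_paths))


stale = files_to_delete(
    local_paths=["a.txt", "b.txt", "old.log"],
    sender_paths=["a.txt", "b.txt", "new.txt"],
)
```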
The generator then starts its main work, stepping through the file list one file at a time. Each file is checked to see whether it can be skipped. In the most common mode of operation, a file is not skipped if its mtime or size differs. If the "--checksum" option is specified, a file-level checksum is generated and compared. Directories, block devices, and symbolic links are never skipped. Missing directories are also created on the destination.
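The skip decision for regular files can be sketched as a small predicate. `FileInfo` and `can_skip` are hypothetical names, and the checksum branch is a simplification that assumes a file-level checksum is already available for both sides.

```python
from typing import NamedTuple, Optional


class FileInfo(NamedTuple):
    size: int
    mtime: int
    checksum: Optional[bytes] = None  # only needed in checksum mode


def can_skip(src: FileInfo, dest: Optional[FileInfo],
             use_checksum: bool = False) -> bool:
    """Decide whether a regular file can be skipped by the generator."""
    if dest is None:
        # Nothing exists at the destination, so the file must be transferred.
        return False
    if use_checksum:
        # Checksum mode: compare content checksums instead of mtime.
        return src.size == dest.size and src.checksum == dest.checksum
    # Common mode: skip only when both size and mtime are unchanged.
    return src.size == dest.size and src.mtime == dest.mtime
```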
If a file is not skipped, any version of it that already exists at the destination path becomes the basis file (Note: the basis file is used as a source of matching data, so that the sender does not need to send the parts that match it). To make this remote data matching possible, block checksums are created for the basis file and sent to the sender immediately after the file's index number (file ID). An empty set of block checksums is sent for new files (that is, files the sender has but the receiver does not) and when the "--whole-file" option is specified. (Note: the generator sends each file's set of block checksums to the sender as soon as it is ready, rather than sending the block checksums of all files at once.)
The block size and the size of the block checksum are calculated for each file from that file's size. (Note: the rsync command also lets you specify the block size manually.)
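The sketch below shows one way per-block checksums for a basis file could be produced. Everything specific in it is an assumption for illustration: the square-root rule with a floor of 700 bytes for the block size, `zlib.adler32` as a stand-in weak checksum, and SHA-256 as a stand-in strong checksum. The text above only states that these sizes are derived from the file size.

```python
import hashlib
import math
import zlib


def choose_block_size(file_size: int, minimum: int = 700) -> int:
    # Assumed heuristic: roughly the square root of the file size,
    # but never smaller than a fixed minimum.
    return max(minimum, math.isqrt(file_size))


def block_checksums(data: bytes, block_size: int):
    """Return one (index, weak, strong) tuple per block of the basis data."""
    sums = []
    for index, offset in enumerate(range(0, len(data), block_size)):
        block = data[offset:offset + block_size]  # the last block may be short
        sums.append((index, zlib.adler32(block), hashlib.sha256(block).digest()))
    return sums
```

With no basis data the function returns an empty list, which corresponds to the empty checksum set sent for a new file.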
The Sender
The sender process reads from the generator one file index number and its accompanying set of block checksums at a time.
For each file sent by the generator, the sender stores the block checksums and builds a hash index of them for fast lookup.
It then reads the local file and generates a checksum for the block beginning at the first byte. This checksum is looked up in the set sent by the generator. If no match is found, the non-matching byte is appended to the accumulated non-matching data, and the block starting at the next byte (that is, the second byte) is checksummed and compared, and so on until the whole file has been processed. This technique is known as the "rolling checksum".
If a block checksum of the source file matches an entry in the checksum set, the block is considered a matching block. Any accumulated non-matching data is then sent to the receiver, followed by the offset and length of the matching block within the receiver's file. (Note: blocks have a fixed size, so for most blocks the offset or block number alone would be enough, but when a file is split into fixed-size blocks the last block may be shorter than the others, so the length is sent as well to make sure the match covers exactly the right bytes.) The sender's checksum calculation then moves on to the byte following the matching block and continues computing and comparing checksums. (Note: after a match the window advances by a whole block; after a non-match it advances by one byte.)
In this way every matching block can be identified, even when the blocks appear in a different order or at different offsets in the files at the two ends. This process is at the very core of the rsync algorithm.
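The matching loop can be sketched as follows. This is a simplified illustration, not rsync's implementation: the weak checksum is a textbook two-part rolling sum, SHA-256 stands in for the strong checksum, the whole file is held in memory, and the instruction tuples `("data", bytes)` and `("copy", offset, length)` are an invented format.

```python
import hashlib

MOD = 1 << 16


def weak_sum(block: bytes):
    """Two-part weak checksum that can be updated one byte at a time."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide a full-size window forward by one byte without rescanning it."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b


def strong_sum(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()


def compute_delta(source: bytes, basis: bytes, block_size: int):
    # Hash index of the basis blocks: weak sum -> [(offset, length, strong sum)].
    table = {}
    for offset in range(0, len(basis), block_size):
        block = basis[offset:offset + block_size]
        table.setdefault(weak_sum(block), []).append(
            (offset, len(block), strong_sum(block)))

    ops, literal = [], bytearray()

    def flush():
        if literal:
            ops.append(("data", bytes(literal)))
            literal.clear()

    pos, sums = 0, None
    while pos < len(source):
        window = source[pos:pos + block_size]
        if sums is None:
            sums = weak_sum(window)
        match = None
        for offset, length, strong in table.get(sums, ()):
            if length == len(window) and strong == strong_sum(window):
                match = (offset, length)
                break
        if match:
            flush()                      # send accumulated non-matching data
            ops.append(("copy", match[0], match[1]))
            pos += len(window)           # jump past the whole matching block
            sums = None
        else:
            literal.append(source[pos])  # one more non-matching byte
            nxt = pos + block_size
            if nxt < len(source):
                sums = roll(*sums, source[pos], source[nxt], block_size)
            else:
                sums = None              # window shrinks near the end of file
            pos += 1
    flush()
    return ops
```

The weak sum is cheap to roll and is only used to find candidates; the strong sum confirms a candidate before it is treated as a match.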
In this way the sender gives the receiver instructions for reconstructing the source file as a new destination file. These instructions detail all the matching data that can be copied directly from the basis file (provided one exists on the receiver side) and include any raw data that is not available on the receiver side (Note: literal data). At the end of each file's processing a whole-file checksum is also sent (Translator's note: this is a file-level checksum), and the sender then proceeds to the next file.
Generating the rolling checksums and searching the checksum set for matches requires a good deal of CPU power. Of all the rsync processes, the sender is the most CPU-intensive.
The Receiver
The receiver reads the data sent by the sender and identifies each file by its file index number. It then opens the local file (called the basis file) and creates a temporary file.
The receiver then reads the non-matching data (literal data) and the matching-block records from the sender's stream. When literal data is read, it is written to the temporary file. When a matching-block record is received, the receiver seeks to the block's offset in the basis file and copies the matching block to the temporary file. In this way the temporary file is built from beginning to end.
When the temporary file has been completely built, its checksum is generated and compared with the checksum sent by the sender. If the checksums do not match, the temporary file is deleted and the file is reprocessed in a second phase. If it fails a second time, an error is reported.
Once the temporary file has been built successfully, its owner, permissions, and mtime are set, and it is then renamed to replace the basis file.
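The receiver's steps can be sketched as a single function. This is an illustration under several assumptions: instructions arrive as `("data", bytes)` or `("copy", offset, length)` tuples (an invented format), SHA-256 stands in for the whole-file checksum, a basis file is assumed to exist, only the mtime is restored, and the second-phase retry is left to the caller.

```python
import hashlib
import os
import tempfile


def rebuild_file(basis_path, ops, expected_digest, dest_path, mtime):
    """Build a temporary file from the instructions, verify it, then rename it."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    digest = hashlib.sha256()
    try:
        with os.fdopen(fd, "wb") as tmp, open(basis_path, "rb") as basis:
            for op in ops:
                if op[0] == "data":
                    chunk = op[1]            # literal data from the sender
                else:
                    _, offset, length = op   # matching block in the basis file
                    basis.seek(offset)
                    chunk = basis.read(length)
                tmp.write(chunk)
                digest.update(chunk)
        if digest.digest() != expected_digest:
            raise ValueError("whole-file checksum mismatch")
        os.utime(tmp_path, (mtime, mtime))
        os.replace(tmp_path, dest_path)      # rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)              # discard the failed temporary file
        raise
```

Writing to a temporary file in the destination directory and renaming it at the end means the old file stays intact until the new one has passed the checksum comparison.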
Of all the rsync processes, the receiver is the most disk-intensive, because it copies data from the basis file to the temporary file. Small files may still be in the disk cache, which mitigates this, but for large files the cache may thrash because the generator has already moved on to other files, and the sender adds further latency. Since data may be read at random from one file and written to another, a so-called seek storm can occur if the working set is larger than the disk cache, hurting performance further.
The Daemon
Like many other daemons, the rsync daemon forks a child process for each connection. At startup it parses the rsyncd.conf file to determine which modules exist and to set the global options.
When a connection is received for a defined module, the daemon forks a child process to handle it. The child process then reads the rsyncd.conf file and sets the options for the requested module. This may include a chroot to the module path and dropping setuid and setgid privileges for the process. After that, the daemon child process behaves just like a normal rsync server and may take the role of either sender or receiver.
The Rsync Protocol
A well-designed communication protocol has a number of characteristics.
- Everything is sent in well-defined packets with a header and an optional body or data payload.
- Each packet's header specifies a type and/or command.
- The length of each packet is known.
In addition to these characteristics, protocols differ in their degree of statefulness, inter-packet independence, human readability, and ability to re-establish a disconnected session.
The rsync protocol has none of these characteristics. Data is transferred as an unbroken stream of bytes. With the exception of the non-matching file data, there are no length specifiers or counts. Instead, the meaning of each byte depends on its context as defined by the protocol level.
For example, when the sender is sending the file list, it simply sends each file list entry and terminates the list with a null byte. The generator sends file numbers and sets of block checksums to the sender in the same way.
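The toy encoding below illustrates what it means for a byte's meaning to depend on context. It is not rsync's wire format: the entry layout (one non-zero flags byte followed by a fixed 4-byte size) is invented for the example. A zero byte ends the list only when it appears where a flags byte is expected; zero bytes inside the size field are ordinary data.

```python
import io
import struct


def write_entries(stream, entries):
    """Write (flags, size) entries back to back, then a null terminator."""
    for flags, size in entries:
        if not 1 <= flags <= 255:
            raise ValueError("flags must fit in one non-zero byte")
        stream.write(bytes([flags]))
        stream.write(struct.pack(">I", size))
    stream.write(b"\x00")


def read_entries(stream):
    """Read entries until a zero byte appears in the flags position."""
    entries = []
    while True:
        first = stream.read(1)
        if not first:
            raise EOFError("stream ended before the terminator")
        if first == b"\x00":
            return entries
        (size,) = struct.unpack(">I", stream.read(4))
        entries.append((first[0], size))


buffer = io.BytesIO()
write_entries(buffer, [(1, 0), (2, 300)])
buffer.seek(0)
decoded = read_entries(buffer)
```

There is no packet header and no entry count here, so a reader that loses its place in the stream has no way to resynchronize, which is part of why such a protocol is hard to debug.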
Over a reliable connection this method of communication works quite well, and it certainly has less data overhead than a formal protocol. Unfortunately, it also makes the protocol very difficult to document and debug. Each version of the protocol may differ subtly, so changes can only be anticipated by knowing the exact protocol version.
Notes
This document is a work in progress. The author expects that it has some glaring oversights and some portions that may be more confusing than enlightening for some readers. It is hoped that this could evolve into a useful reference.
Specific suggestions for improvement are welcome, as would be a complete rewrite.
When reprinting, please credit the source: http://www.cnblogs.com/f-ck-need-u/p/7221535.html