When you want to copy large files between two machines, combining nc (netcat) and pigz (parallel gzip) is a simple and efficient choice. But what if you need to distribute the same files to several machines at once? At Tumblr this is a common requirement, for example when we want to bring up several MySQL slave servers quickly at the same time.
You could copy the data from the source machine to each target machine one after another, but the total time is multiplied by the number of targets. Alternatively, you could copy from the source machine to all the target machines in parallel. However, the source machine's bandwidth (among other factors) is then shared between the transfers, so this is not necessarily fast either.
Fortunately, you can do better with a few UNIX tools. Combining tee with a FIFO (named pipe) forms a fast file distribution chain: each machine in the chain stores the files and at the same time passes them on to the next link.
First, pick one target machine to be the last link of the distribution chain. On this machine you only need to listen with nc (assuming port 1234), pipe the incoming stream through pigz to decompress it, and pipe the result to tar to unpack it.
- nc -l 1234 | pigz -d | tar xvf -
Then, working backward from the end of the chain, set up the other target machines. Each of them also listens, decompresses, and unpacks, but before decompressing it uses the tee command to copy the data into a named pipe (FIFO). A second shell pipeline reads from that FIFO and forwards the still-compressed data to the next link of the chain at the same time:
- mkfifo myfifo
- nc hostname_of_next_box 1234 <myfifo &
- nc -l 1234 | tee myfifo | pigz -d | tar xvf -
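The tee-and-FIFO step on an intermediate machine can be tried out on a single box without any network. The following is only a minimal local sketch under a few assumptions: gzip stands in for pigz, a background reader on the FIFO stands in for the nc connection to the next box, and the directory names (src, this_box, next_box) are made up for illustration.

```shell
#!/bin/sh
# Local sketch of one intermediate link in the chain.
# Assumptions: gzip replaces pigz, and a background reader on the FIFO
# replaces "nc hostname_of_next_box 1234". All paths are hypothetical.
set -eu

work=$(mktemp -d)
mkdir "$work/src" "$work/this_box" "$work/next_box"
printf 'hello chain\n' > "$work/src/data.txt"

mkfifo "$work/myfifo"

# "Next link": reads the still-compressed stream from the FIFO and unpacks it.
( gzip -dc < "$work/myfifo" | tar xf - -C "$work/next_box" ) &

# "This link": tee duplicates the compressed stream into the FIFO while
# the rest of the pipeline decompresses and unpacks it locally.
tar cf - -C "$work/src" data.txt \
    | gzip -c \
    | tee "$work/myfifo" \
    | gzip -dc \
    | tar xf - -C "$work/this_box"

# Wait for the background reader to finish, then compare both copies.
wait
cmp "$work/src/data.txt" "$work/this_box/data.txt"
cmp "$work/src/data.txt" "$work/next_box/data.txt"
echo "both copies match"
```

Note that tee sits before the decompression step, so the next link receives the same compressed stream this link received and the data is only compressed once, at the source.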
Finally, on the source machine, start the transfer by sending the data to the first machine in the distribution chain:
- tar cv some_files | pigz | nc hostname_of_first_box 1234
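For a longer chain it can help to write out which command goes on which box before starting anything. The helper below is a hypothetical sketch, not part of the original procedure: chain_commands and the host names box1 to box3 are invented for illustration, and it only prints the commands described above, it does not run them. It also assumes a netcat variant that accepts "nc -l PORT"; some variants need "nc -l -p PORT" instead.

```shell
#!/bin/sh
# Hypothetical helper: print the command for each target box in the chain,
# last box first, since the listeners are set up from the end of the chain.
# Usage: chain_commands PORT first_box [next_box ...]
chain_commands() {
    port=$1
    shift
    out=""
    while [ "$#" -gt 0 ]; do
        host=$1
        shift
        if [ "$#" -gt 0 ]; then
            # Intermediate link: forward the compressed stream to the next box ($1).
            cmd="mkfifo myfifo; nc $1 $port <myfifo & nc -l $port | tee myfifo | pigz -d | tar xvf -"
        else
            # Last link: only receive, decompress, and unpack.
            cmd="nc -l $port | pigz -d | tar xvf -"
        fi
        # Prepend so that the end of the chain is listed first.
        out="$host: $cmd
$out"
    done
    printf '%s' "$out"
}

chain_commands 1234 box1 box2 box3
```

The output lists box3 first and box1 last, which is the order in which the listeners should be started before the source machine begins sending.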
In my tests, each machine in the distribution chain may cost 3%-10% of the performance (compared with a 1-to-1 copy). Even so, this is significantly more efficient than copying to the machines one by one or distributing from the source to multiple machines at the same time.