A Simple Implementation of Shuffle

Source: Internet
Author: User
Tags: serialization, shuffle

In a distributed system, shuffle is a critical operation: it directly determines whether an algorithm is feasible to execute in a distributed environment. Shuffle is also called data exchange ("exchange" in some parallel systems). The simplest way to implement it is to split the data by some partitioning strategy (hash partitioning or range partitioning, for example), and then either transmit each shard directly or persist it first for fault tolerance. Comparing the two:

1. Direct transmission: shards are sent as soon as the data is partitioned, going straight from memory to the memory of another node without touching the disk. This lets the network run at full speed, but gives no fault tolerance: if anything goes wrong, the whole job must be rerun. Unless your system runs on only a handful of machines (a personal opinion), that trade-off is a poor one, since distributed jobs generally involve a lot of data and long execution times.

2. Persist, then transmit: slower, but fault tolerance is guaranteed because the data is on disk. The cost is that you can typically reach at most about 1/2 of the network bandwidth (based on my own implementation experience).

Enough abstract talk; let's look directly at a simple implementation of shuffle, considering the case where data is compressed to disk. In classical MapReduce theory the work divides across two kinds of nodes, a map phase and a reduce phase, and the map node provides the fault tolerance, i.e., it compresses its output and spills it to disk. Here we speak of an upper end and a lower end, because data movement is pull-based and execution starts from the root of the execution topology: the reduce end is the upper part, the map end is the lower part:

The upper end sends the execution subtree rooted at its position down to the lower end (consider the upper end first), and notice what this step requires: the subtree must be transferred to a remote node for execution. Starting from first principles, transmission involves the network. In C/C++ you can pick a networking library, but to challenge yourself you can use the socket() syscall directly: create the socket, which returns a file descriptor; bind that descriptor to an address and port; and listen on it (both network IO and disk IO can be represented by a file descriptor).

The next step is to serialize the execution subtree, which is essential. Could you just send a piece of source code instead? No: the code you send would have to be compiled on arrival, and its dynamic link libraries would have to be shipped along with it. Only a serialized object can be sent. Once the lower end receives the serialized object sent from the top, it immediately deserializes it and invokes execution; an actor concurrency library (such as Theron) can be used here.

Since the lower end is now executing the deserialized subtree, it opens a socket to communicate with the upper end (which has already established its socket and is waiting for connections). The lower end connects immediately; the upper end accepts with the accept() function, which returns a file descriptor for that connection. Because there are multiple nodes at the lower end, the upper end ends up with an array of fds, which it can poll to receive the data sent from below (just as a server in a client-server program holds an fd for each client so it can communicate with all of them). Now it is time to transfer the data.
The lower end starts producing data. It may spill the data to disk and record where it put it (the record can go to the master node, or to a dedicated MapOutputTracker), and then transmit: each piece of data is written, according to its partition number, to the corresponding local file descriptor. That descriptor was obtained locally at connect time, and writing the local data to it sends the data over the network to the upper end.

The following picture is a schematic of connection setup between two upper-end nodes and three lower-end nodes:


Once the lower end has written its data to disk, that phase is complete and the pipeline is broken; the data is then sent from disk. There may be a buffer at the lower end. If present, it is a partitioned buffer: what it stores is a matrix of blocks, keyed by partition, and filling and draining it is a producer-consumer problem. If there is no buffer, data is transmitted directly. The upper end can also have a buffer, but that buffer need not be partitioned, because all blocks arriving at a given node belong to the same partition. If there is no buffer there, blocks are simply processed in the order they arrive.

But what if the data is not persisted to disk and is transmitted directly, so that the upper and lower ends form one complete pipeline? If data were transmitted record by record, no cache layer would be needed; but transmission is by block, and once you transmit by block, a buffer must be used, because you cannot know which block will fill up first. Does the upper end need a buffer too? Consider doing without one: receive a piece, forward a piece, synchronously. Synchronous handoff would be very slow and would also stall the lower end. So instead the receiver accepts data asynchronously into a buffer (multiplexing the fds with the select() syscall), and, following the producer-consumer pattern, pthread_create spawns another thread to take data out of the buffer. So in the fully pipelined case, two buffers are unavoidable, one at the upper end and one at the lower end.

(All prose so far; it's not finished yet. Diagrams and code are still to come.)
