Excerpted from Chapter 14 of "Big Data Daily Notes: Architecture and Algorithms".
In a distributed computing environment, the first problem faced when mining massive graph data is how to distribute the data evenly across different servers. For non-graph data this problem is usually straightforward: records are independent of one another, so there is no special constraint on the data-splitting algorithm beyond keeping the server loads as balanced as possible. Graph records, by contrast, are strongly coupled. An improper partition of graph data can not only cause load imbalance between machines but also greatly increase the network communication between them (see Figure 14-5). Since graph mining algorithms typically run many rounds of iteration, the cost of a poor partition is amplified in every round and can seriously slow down the whole system. Reasonable partitioning of graph data is therefore crucial to the efficiency of offline graph mining applications, yet it remains a problem that has not been well solved.
What, then, counts as a reasonable or good partition of graph data, and by what standard is it judged? As the example above shows, the two main factors for measuring the quality of a graph partition are machine load balance and total network communication volume. If load balance alone is considered, it is best to spread the graph nodes across the servers as evenly as possible, but this does not keep the total network communication small (in the right-hand partition of Figure 14-5, the load is balanced but network communication is heavy). If network communication alone is considered, the nodes of each densely connected subgraph can be placed on the same machine, which effectively reduces network traffic but makes load balance hard to achieve: one large dense subgraph will overload a single machine. A reasonable partitioning method must therefore strike a stable balance between these two factors in order to reach the best overall system performance.
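To make the two criteria concrete, here is a minimal sketch (not from the book; the toy graph and function names are invented) that computes both metrics for a candidate node-to-machine assignment:

```python
def partition_metrics(edges, assign, num_machines):
    """Return (per-machine node counts, number of cut edges) for a
    node-to-machine assignment of an undirected graph."""
    loads = [0] * num_machines
    for machine in assign.values():
        loads[machine] += 1
    # An edge is "cut" -- and costs network communication -- when its
    # two endpoints land on different machines.
    cut_edges = sum(1 for u, v in edges if assign[u] != assign[v])
    return loads, cut_edges

# A dense triangle (0, 1, 2) plus a pendant node 3, on 2 machines.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
keep_triangle = {0: 0, 1: 0, 2: 0, 3: 1}   # low traffic, unbalanced
balanced      = {0: 0, 1: 0, 2: 1, 3: 1}   # balanced, more traffic
print(partition_metrics(edges, keep_triangle, 2))  # ([3, 1], 1)
print(partition_metrics(edges, balanced, 2))       # ([2, 2], 2)
```

Even on this four-node graph, improving one metric worsens the other, which is exactly the tension described above.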
The following sections describe two approaches to cutting graph data that start from different premises, together with typical partitioning algorithms and the corresponding mathematical analysis. One point deserves emphasis up front: when choosing a concrete partitioning algorithm, it is not the case that the more sophisticated the algorithm, the more likely it is to be adopted in a real system. Readers are invited to think about why; the answer is given later in this section.
14.3.1 The Cut-Edge Method (Edge-Cut)
The problem is this: given a massive graph and P machines, how should the graph be cut into P subgraphs? There are two different ways to approach this graph-cutting problem.
The cut-edge method represents the most common idea: the cutting line may only pass through the edges connecting graph nodes, dividing the complete graph into P subgraphs. Figure 14-6 shows a graph of seven nodes distributed across three machines; the left side illustrates the cut-edge method, where the number on each graph node indicates the machine the node is assigned to.
After the graph is cut by the cut-edge method, each graph node is assigned to exactly one machine, but every cut edge must be stored on both of the machines holding its endpoints. Moreover, a cut edge implies remote communication between machines during graph computation. Clearly, both the extra storage overhead and the communication overhead of the system depend on the number of cut edges: the more edges are cut, the higher the storage overhead and the higher the communication overhead.
As noted above, there are two criteria for judging whether a graph partition is reasonable: machine load balance and inter-machine communication volume. For the cut-edge method, every concrete partitioning algorithm therefore pursues the same goal: assign graph nodes to the machines of the cluster as evenly as possible while minimizing the number of cut edges.
Formally, let A(v) denote the machine that node v is assigned to. The goal is to find the assignment that cuts the fewest edges while spreading the nodes almost evenly:

minimize |{(u, v) ∈ E : A(u) ≠ A(v)}|
subject to max_m |{v ∈ V : A(v) = m}| ≤ λ · |V| / P

Here |V|/P is the number of nodes per machine under a perfectly even split, and λ ≥ 1 is an imbalance factor that controls how uniform the node allocation must be: when λ = 1 a perfectly even distribution is required, and the larger the value, the higher the degree of imbalance allowed.
From this formal description we can see that when λ equals 1, the problem is essentially the classic balanced p-way partitioning problem in graph theory. However, because graph-partitioning algorithms of this kind have high time complexity, they are unsuitable for processing large-scale data and are rarely used in real large-scale scenarios.
In practical graph computing systems, the common strategy is to partition nodes randomly: a hash function assigns nodes evenly to the machines in the cluster without considering edge cuts at all. Both Pregel and GraphLab adopt this policy. The method is fast, simple, and easy to implement, but Theorem 14.1 shows that it cuts the vast majority of the edges in the graph: if nodes are assigned to P machines uniformly at random, the expected fraction of cut edges is 1 − 1/P.
By Theorem 14.1, a cluster of 10 machines cuts about 90% of the edges, and a cluster of 100 machines cuts about 99%. This kind of partition is clearly very inefficient.
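The effect of Theorem 14.1 is easy to reproduce empirically. The sketch below (illustrative only; the function names and the random test graph are invented) assigns each node to a random machine and measures the fraction of edges cut:

```python
import random

def random_node_partition(num_nodes, num_machines, seed=0):
    """Hash-style placement as in Pregel/GraphLab: each node is
    assigned to a uniformly random machine, ignoring the edges."""
    rng = random.Random(seed)
    return {v: rng.randrange(num_machines) for v in range(num_nodes)}

def cut_fraction(edges, assign):
    """Fraction of edges whose endpoints fall on different machines."""
    return sum(1 for u, v in edges if assign[u] != assign[v]) / len(edges)

# A random test graph: 2000 nodes, 20000 random edges.
rng = random.Random(42)
edges = [(rng.randrange(2000), rng.randrange(2000)) for _ in range(20000)]
for num_machines in (10, 100):
    assign = random_node_partition(2000, num_machines)
    # Expected cut fraction is 1 - 1/P: roughly 0.90 and 0.99 here.
    print(num_machines, round(cut_fraction(edges, assign), 2))
```

The measured fractions land very close to the theoretical 1 − 1/P for any reasonably large graph.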
14.3.2 The Cut-Vertex Method (Vertex-Cut)
The cut-vertex method represents a different way of cutting the graph. Unlike the cut-edge method, the cutting line may only pass through graph nodes, never through edges, so a node that is cut may appear in several subgraphs at the same time. The right side of Figure 14-6 shows the cut-vertex method: the node at the center of the graph is cut into three parts, meaning that after the cut it appears in all three subgraphs simultaneously.
In contrast to the cut-edge method, under the cut-vertex method each edge is assigned to exactly one machine and is never stored twice; however, the cut nodes are stored repeatedly on multiple machines, so there is still extra storage overhead. This kind of cut also raises a new problem: graph algorithms continually update node values during iteration, and because a node may be stored on several machines, multiple copies of its data exist. The consistency of a node's value across its copies must therefore be maintained. A typical solution to this problem is given in a later section.
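As a rough preview of what such a solution looks like, the sketch below is purely illustrative (real systems use a more elaborate synchronization protocol, and every name here is invented): one replica acts as the master, mirrors send their locally computed partial results to it, and the combined value is broadcast back.

```python
class ReplicatedVertex:
    """A vertex cut across several machines: one replica is the
    master, the rest are mirrors that must be kept consistent."""

    def __init__(self, machines):
        self.master = machines[0]                # authoritative copy
        self.mirrors = list(machines[1:])
        self.value = {m: 0.0 for m in machines}  # per-machine copy

    def update(self, partials):
        """One synchronization round: gather each machine's locally
        computed partial result, combine them at the master, then
        broadcast the combined value back to every replica."""
        combined = sum(partials.values())        # combine step (a sum here)
        for machine in self.value:               # broadcast step
            self.value[machine] = combined
        return combined

v = ReplicatedVertex(machines=[0, 1, 2])
print(v.update({0: 1.0, 1: 2.0, 2: 0.5}))  # 3.5, now held by every replica
```

The communication in the gather and broadcast steps is exactly the overhead discussed in the next paragraph.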
Since no edges are cut in the cut-vertex method, does that mean the machines need not communicate with each other? No: communication overhead is still incurred in keeping the values of the cut nodes consistent. For the cut-vertex method, the goal of every concrete algorithm is therefore to distribute the edges evenly across the machines of the cluster while minimizing the number of node replicas.
Formally, let A(e) denote the machine that edge e is assigned to, and let A(v) denote the set of machines holding a replica of node v (every machine that owns one of v's edges). The goal is to minimize the average number of replicas per node while spreading the edges almost evenly:

minimize (1/|V|) · Σ_{v ∈ V} |A(v)|
subject to max_m |{e ∈ E : A(e) = m}| ≤ λ · |E| / P

Here |E|/P is the number of edges per machine under a perfectly even split, and λ ≥ 1 is again the imbalance factor controlling how uniform the edge allocation must be: when λ = 1 a perfectly even distribution is required, and the larger the value, the higher the degree of imbalance allowed.
Similarly, because complex graph-partitioning algorithms have too high a time complexity, random even placement of edges is the strategy most commonly used in real systems.
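Random edge placement is just as easy to sketch (illustrative code with invented names): each edge is hashed to a machine, and a node is replicated on every machine that received one of its edges.

```python
import random

def random_edge_partition(edges, num_machines, seed=0):
    """Assign each edge to a uniformly random machine; record, for
    each vertex, the set of machines holding one of its edges."""
    rng = random.Random(seed)
    replicas = {}                    # vertex -> set of machines
    loads = [0] * num_machines       # edges per machine
    for u, v in edges:
        m = rng.randrange(num_machines)
        loads[m] += 1
        replicas.setdefault(u, set()).add(m)
        replicas.setdefault(v, set()).add(m)
    return replicas, loads

def replication_factor(replicas):
    """Average number of copies per vertex: the quantity the
    cut-vertex objective tries to minimize."""
    return sum(len(s) for s in replicas.values()) / len(replicas)

# Star graph: a hub (node 0) joined to 10 leaves, over 4 machines.
edges = [(0, leaf) for leaf in range(1, 11)]
replicas, loads = random_edge_partition(edges, 4)
print(round(replication_factor(replicas), 2))
```

Note how only the high-degree hub gets replicated on several machines, while every leaf keeps a single copy; this is why edge placement handles power-law graphs well.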
In the real world, the degree distribution of most graphs follows a power law. Both theory and practice have shown that for graph data obeying this law, random edge placement (a cut-vertex method) outperforms random node placement (a cut-edge method), with computing efficiency at least an order of magnitude higher. For graph data in typical situations, then, the cut-vertex method is much better than the cut-edge method.
Question: why haven't the more complex and more effective partitioning algorithms become more popular?
Answer: a graph mining job generally runs in two phases.
Phase 1: centralized partitioning and distribution of the graph data. Phase 2: distributed graph computation.
If a complex partitioning algorithm is used, load balance is good and inter-machine traffic is low, so the second phase runs efficiently. But a complex algorithm not only carries a high development cost; the time spent in the first phase is also very high, and may even exceed the efficiency gains produced in the second phase. A global efficiency trade-off is therefore required.
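The trade-off can be made concrete with a toy cost model (all numbers below are hypothetical, chosen only to illustrate the point):

```python
def total_job_time(partition_time, per_iteration_time, iterations):
    """Phase 1 (partitioning) runs once; phase 2 (computation) pays
    its per-round cost once per iteration."""
    return partition_time + per_iteration_time * iterations

# Hypothetical costs: the sophisticated partitioner is 100x slower up
# front but makes each iteration 3x faster.
for iterations in (10, 50):
    simple = total_job_time(5, 60, iterations)    # random hash partition
    smart = total_job_time(500, 20, iterations)   # complex partition
    print(iterations, simple, smart)
# With 10 iterations the simple hash partition wins (605 vs 700);
# with 50 iterations the sophisticated one wins (3005 vs 1500).
```

Whether a sophisticated partitioner pays off thus depends on how many iterations amortize its up-front cost, which is exactly the global trade-off the answer describes.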