Excerpt from "Big Data Day know: Architecture and Algorithms", Chapter 14.
When a large amount of data is to be mined in a distributed computing environment, the first problem is how to distribute the data evenly across servers. For non-graph data this is usually straightforward: records are largely independent of one another, so the partitioning algorithm is not tightly constrained and only needs to keep the machine load as balanced as possible. Graph data is different. Because graph records are strongly coupled, an unreasonable partitioning not only causes load imbalance between machines but also greatly increases the network communication between them (see Figure 14-5). Graph mining algorithms also tend to run for many iterations, which significantly amplifies the effect of a poor partitioning and can seriously slow down the whole system. Reasonable partitioning of graph data is therefore very important to the efficiency of offline graph mining applications, yet it remains a problem without a very good solution.
What, then, is a reasonable way to partition graph data, and what should the judging criteria be? As the example above shows, two factors matter: machine load balance and total network traffic. If only load balance is considered, the graph data should be distributed across the servers as evenly as possible, but this does not guarantee that total network traffic stays low (in Figure 14-5, the partitioning on the right is better balanced but requires more network communication). If only network traffic is considered, all nodes of a densely connected subgraph should be placed on the same machine as far as possible. This effectively reduces traffic but makes load balance hard to achieve, because a large densely connected subgraph puts a high load on a single machine. A reasonable partitioning method therefore needs to find a stable balance between these two factors in order to optimize overall system performance.
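The two criteria can be made concrete with a small sketch. The graph and both node assignments below are hypothetical and are not the ones in Figure 14-5; the helper name `partition_quality` is also an assumption made for illustration. It reports the per-machine node count (load) and the number of edges whose endpoints sit on different machines (a proxy for network traffic).

```python
from collections import Counter

def partition_quality(edges, assignment):
    """Return (nodes per machine, number of cut edges) for a node assignment."""
    load = Counter(assignment.values())
    cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    return dict(load), cut

# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by the edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Assignment A keeps each triangle on one machine.
by_cluster = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
# Assignment B alternates nodes between the two machines.
alternating = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}

print(partition_quality(edges, by_cluster))   # load is 3/3, 1 edge cut
print(partition_quality(edges, alternating))  # load is 3/3, 5 edges cut
```

Both assignments are perfectly balanced, yet the second cuts five times as many edges, which is the sense in which load balance alone does not guarantee low traffic.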
The following sections describe two kinds of graph partitioning that start from different ideas, and introduce typical concrete partitioning algorithms together with their mathematical analysis. One point should be stressed first: when choosing a concrete partitioning algorithm, a more complex algorithm is not more likely to be adopted in a real system. Readers may want to think about why; the answer is given later.
14.3.1 Edge-cut Method
The problem now is: given a huge graph and p machines, how should the graph be cut into p subgraphs? There are two different approaches to this graph-cutting problem.
The edge-cut method represents the most common approach: the cutting line may only pass through edges that connect graph nodes, and the complete graph is divided into p subgraphs by cutting edges. Figure 14-6 shows 7 nodes being distributed to 3 machines. The left side shows the edge-cut method, where the number on each graph node is the number of the machine the node is assigned to.
After the graph is cut, each graph node is assigned to exactly one machine, but the data of a cut edge is stored on both machines, and in graph computation a cut edge implies remote communication between machines. The additional storage and communication overhead therefore depends on the number of cut edges: the more edges the cut passes through, the higher the storage and communication overhead the system must bear.
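The storage rule described above (every node on one machine, every cut edge stored on both machines involved) can be sketched as follows. The path graph and its assignment are hypothetical, and `edge_cut_storage` is an illustrative helper rather than part of any real system.

```python
def edge_cut_storage(edges, assignment, p):
    """Count edges stored per machine and the number of cut edges."""
    stored = [0] * p
    cut = 0
    for u, v in edges:
        mu, mv = assignment[u], assignment[v]
        stored[mu] += 1          # the edge is stored with its first endpoint
        if mu != mv:             # a cut edge is stored a second time
            stored[mv] += 1
            cut += 1
    return stored, cut

# Hypothetical path graph 0-1-2-3 split across two machines.
edges = [(0, 1), (1, 2), (2, 3)]
assignment = {0: 0, 1: 0, 2: 1, 3: 1}

stored, cut = edge_cut_storage(edges, assignment, p=2)
print(stored, cut)
```

The total number of stored edge copies equals the number of edges plus the number of cut edges, so the extra storage grows one-for-one with the cut.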
As mentioned earlier, graph partitioning is judged by two factors: load balance and inter-machine traffic. For the edge-cut method, every concrete algorithm therefore pursues the same goal: assign the graph nodes to the machines in the cluster as evenly as possible while minimizing the number of cut edges.
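That is, the edge-cut method looks for an assignment A of nodes to machines that cuts as few edges as possible, subject to every machine receiving roughly the same number of nodes:

$$\min_{A} \left|\{(u,v) \in E : A(u) \neq A(v)\}\right| \quad \text{subject to} \quad \max_{m} \left|\{v \in V : A(v) = m\}\right| \le \lambda \frac{|V|}{p}$$

Here A(v) is the machine that node v is assigned to, |V|/p is the number of nodes each machine receives when all nodes are shared evenly among the p machines, and λ ≥ 1 is the imbalance factor. Adjusting λ controls how evenly the nodes are distributed: when its value is 1, the nodes must be shared completely evenly; the higher the value, the greater the imbalance that is tolerated.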
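As a small illustration of the balance condition, the sketch below checks whether the most loaded machine holds at most λ·|V|/p nodes. The 7-node, 3-machine assignment is hypothetical (it only borrows the sizes mentioned for Figure 14-6), the helper name `satisfies_balance` is an assumption, and treating the bound as non-strict is also an assumption.

```python
from collections import Counter

def satisfies_balance(assignment, p, lam):
    """True if no machine holds more than lam * |V| / p nodes."""
    loads = Counter(assignment.values())
    max_load = max(loads.values())
    # Compare max_load <= lam * |V| / p without dividing.
    return max_load * p <= lam * len(assignment)

# Hypothetical assignment of 7 nodes to 3 machines (loads 3, 2, 2).
assignment = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}

print(satisfies_balance(assignment, p=3, lam=1.0))  # 3 nodes > 7/3
print(satisfies_balance(assignment, p=3, lam=1.3))  # 3 nodes <= 1.3 * 7/3
```

With 7 nodes and 3 machines an exactly even split is impossible, so the factor has to be slightly above 1 before any assignment satisfies the condition.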
As the formal description above shows, when λ is approximately 1 the problem is essentially the balanced p-way partitioning problem from graph cutting. There is a large body of research on this problem (interested readers can consult reference [4] of this chapter), but because these graph-cutting algorithms have high time complexity, they are poorly suited to large-scale data and are seldom used in real large-scale scenarios.
In real graph computing systems, the common strategy is random node assignment: nodes are distributed to the machines of the cluster by a hash function, without careful consideration of which edges are cut. Both Pregel and GraphLab adopt this strategy. The method is fast, simple and easy to implement, but Theorem 14.1 shows that it cuts most of the edges of the graph.
By Theorem 14.1, if the cluster consists of 10 machines, the proportion of cut edges is approximately 90%, that is, 90% of the edges are cut; with 100 machines, 99% of the edges are cut. This kind of partitioning is clearly inefficient.
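The figures quoted above are consistent with an expected cut fraction of 1 − 1/p for p machines, since two endpoints placed independently and uniformly land on the same machine with probability 1/p. That this is the exact statement of Theorem 14.1 is an assumption here. The simulation below is a rough check on a hypothetical random graph, using a seeded random assignment as a stand-in for a hash function.

```python
import random

def random_cut_fraction(num_nodes, num_edges, p, seed=0):
    """Assign nodes to p machines uniformly at random and measure the cut fraction."""
    rng = random.Random(seed)
    assignment = [rng.randrange(p) for _ in range(num_nodes)]
    cut = 0
    for _ in range(num_edges):
        u, v = rng.sample(range(num_nodes), 2)  # two distinct endpoints
        if assignment[u] != assignment[v]:
            cut += 1
    return cut / num_edges

for p in (10, 100):
    measured = random_cut_fraction(2000, 20000, p)
    print(p, measured, 1 - 1 / p)  # measured fraction next to 1 - 1/p
```

The measured fraction should land close to 0.90 for 10 machines and 0.99 for 100, matching the numbers in the text.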
14.3.2 Vertex-cut Method
The vertex-cut method represents a different approach to cutting a graph. Unlike the edge-cut method, the cutting line may only pass through graph nodes rather than edges, and a node that is cut may appear in several of the resulting subgraphs at the same time. The right side of Figure 14-6 shows the vertex-cut method: the node in the center is cut into three parts, which means that it appears in all three subgraphs after the cut.
In contrast to the edge-cut method, each edge is assigned to a single machine and is never stored more than once, but a cut node is stored on several machines, so there is still additional storage overhead. A further problem is that graph algorithms constantly update node values during iteration. Since a node may have copies on several machines, the consistency of the node value across those copies has to be maintained. A typical solution to this problem is described later, in the discussion of the PowerGraph system.
Since no edge is cut in the vertex-cut method, is there no communication overhead between machines? That is not the case: communication is still needed to keep the values of the cut nodes consistent. For the vertex-cut method, every concrete algorithm therefore pursues the same goal: distribute the edges to the machines of the cluster as evenly as possible while minimizing the number of nodes that are cut.
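That is, the vertex-cut method looks for an assignment A of edges to machines that minimizes the average number of replicas per node, subject to every machine receiving roughly the same number of edges:

$$\min_{A} \frac{1}{|V|} \sum_{v \in V} |A(v)| \quad \text{subject to} \quad \max_{m} \left|\{e \in E : A(e) = m\}\right| \le \lambda \frac{|E|}{p}$$

Here A(e) is the machine that edge e is assigned to, A(v) is the set of machines holding a replica of node v, |E|/p is the number of edges each machine receives when all edges are shared evenly among the p machines, and λ ≥ 1 is the imbalance factor. Adjusting λ controls how evenly the edges are distributed: when its value is 1, the edges must be shared completely evenly; the higher the value, the greater the imbalance that is tolerated.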
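The two quantities in this objective, the average number of replicas per node and the number of edges per machine, can be computed with a short sketch. The star graph and its edge assignment are hypothetical (chosen so that the center node is split three ways, as in the description of Figure 14-6), and `vertex_cut_quality` is an illustrative name.

```python
from collections import Counter, defaultdict

def vertex_cut_quality(edges, edge_assignment):
    """Return (average replicas per node, edges per machine) for an edge assignment."""
    replicas = defaultdict(set)
    for (u, v), machine in zip(edges, edge_assignment):
        # A node is replicated on every machine that holds one of its edges.
        replicas[u].add(machine)
        replicas[v].add(machine)
    avg_replicas = sum(len(ms) for ms in replicas.values()) / len(replicas)
    return avg_replicas, dict(Counter(edge_assignment))

# Hypothetical star graph: center node 0 connected to leaves 1..6.
edges = [(0, leaf) for leaf in range(1, 7)]
# Two edges per machine, so the center node is replicated on all three machines.
edge_assignment = [0, 0, 1, 1, 2, 2]

avg_replicas, edge_load = vertex_cut_quality(edges, edge_assignment)
print(avg_replicas, edge_load)
```

Here the center node has three replicas and each leaf has one, so the average is 9/7, while every machine holds exactly two edges.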
Similarly, because complex graph-cutting algorithms have too high a time complexity, the method most commonly used in real systems is random, even assignment of edges to machines.
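A minimal sketch of random edge assignment is shown below. Hashing the endpoint pair with CRC32 is an assumption made for illustration; the text does not say which hash function real systems use, and the complete graph on 50 nodes is a hypothetical input.

```python
import zlib
from collections import Counter

def place_edge(u, v, p):
    """Map an edge to one of p machines via a deterministic hash of its endpoints."""
    key = f"{u}-{v}".encode()
    return zlib.crc32(key) % p

# Hypothetical input: all 1225 edges of the complete graph on 50 nodes.
edges = [(u, v) for u in range(50) for v in range(u + 1, 50)]
p = 4

placement = [place_edge(u, v, p) for u, v in edges]
edges_per_machine = Counter(placement)
print(edges_per_machine)
```

Each edge is placed independently of every other edge, so no global view of the graph is needed, which is what makes the approach cheap compared with an explicit graph-cutting algorithm.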
The distribution of edges across nodes in most real-world graphs follows a power law. Theory and practice have both shown that, for graph data of this kind, random edge assignment (a vertex-cut method) outperforms random node assignment (an edge-cut method), with computational efficiency at least an order of magnitude higher. Overall, for graph data in general, the vertex-cut method is clearly preferable to the edge-cut method.
Think: why are the more complex and more effective partitioning algorithms not more popular?
Answer: in general, a graph mining job is divided into two phases.
Phase one: centralized partitioning and distribution of the graph data; phase two: distributed graph computation.
If a complex graph-cutting algorithm is used, machine load is balanced and inter-machine communication is low, so phase two runs efficiently. However, a complex algorithm not only has a high development cost but also makes phase one expensive, and the time spent there can even exceed the efficiency gained in phase two. The choice of partitioning algorithm therefore requires a trade-off over the efficiency of the whole job.
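The trade-off can be illustrated with simple arithmetic. All of the timings below are invented for illustration and do not come from any measurement; the model (a one-off partitioning cost plus a fixed cost per iteration) is also a simplifying assumption.

```python
def total_time(partition_time, time_per_iteration, iterations):
    """Phase one (partitioning) plus phase two (iterative computation)."""
    return partition_time + time_per_iteration * iterations

# Hypothetical costs in seconds: cheap partitioning with slow iterations,
# versus expensive partitioning with fast iterations.
simple = dict(partition_time=10, time_per_iteration=30)
complex_ = dict(partition_time=500, time_per_iteration=8)

for iterations in (20, 100):
    t_simple = total_time(iterations=iterations, **simple)
    t_complex = total_time(iterations=iterations, **complex_)
    print(iterations, t_simple, t_complex)
```

Under these assumed numbers the simple partitioner wins for a 20-iteration job (610 s against 660 s), and the complex one only pays off once the job runs long enough (1300 s against 3010 s at 100 iterations).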