/* Copyright notice: This article may be reproduced freely, provided the original source and the author are credited. */
Author: Zhang Junlin
Excerpt from Chapter 14 of "Big Data Day know: Architecture and Algorithms"
For offline-mining graph computation, there are many practical systems with excellent performance and distinctive features, such as Pregel, Giraph, Hama, PowerGraph, GraphLab, and GraphChi. By analyzing these systems, we can abstract some common computational models for offline-mining graph computation.
This section divides the common computational models into two categories: graph programming models and graph computation paradigms. Programming models are aimed mainly at application developers who build on a graph computing system, while computation paradigms are the concern of the developers of the graph computing system itself. For programming models, this section introduces the node-centric programming model and its refinement, the GAS programming model; for computation paradigms, it focuses on the synchronous execution model and the asynchronous execution model. These models are widely used in current large-scale graph mining systems.
14.4.1 Node-Centric Programming Model
The node-centric programming model (vertex-centric programming model) was first proposed by the Pregel system, and most large-scale graph computing systems for offline mining adopt it as their programming model.
For a graph G = (V, E), the node-centric programming model takes each graph node vertex ∈ V as the center of computation. The application developer defines a node update function Function(vertex) that is specific to the application. This function can read and change the weight of the graph node vertex and the weights of the edges associated with it, and it can even change the graph structure by adding and removing edges. Function(vertex) is executed for all nodes in the graph to transform the state of the graph (including node information and edge information), and this is iterated until some stopping criterion is met.
A typical node update function Function(vertex) basically follows this logic: it first collects information from the in-edges and out-edges of vertex, transforms that information with a function F() defined for node weights, and updates the weight of vertex with the computed value. Then, taking the node's new weight and each edge's original weight as input, it applies a function G() defined for edges and uses the transformed values to update the edge weights one by one. In this way, the node update function of vertex updates a part of the graph state.
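The logic described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `Graph` class and the concrete bodies of `F` and `G` are hypothetical (the text names the two functions but does not define them), and the helper `incident_edges` is introduced here for convenience.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny directed graph with weights on nodes and edges (illustrative only)."""
    node_weight: dict = field(default_factory=dict)  # node -> weight
    edge_weight: dict = field(default_factory=dict)  # (src, dst) -> weight

    def incident_edges(self, v):
        # All in-edges and out-edges of v.
        return [e for e in self.edge_weight if v in e]


def F(node_w, edge_ws):
    # Hypothetical node transform: average of own weight and incident edge weights.
    values = [node_w] + edge_ws
    return sum(values) / len(values)


def G(new_node_w, edge_w):
    # Hypothetical edge transform: move the edge weight halfway toward the node weight.
    return 0.5 * (edge_w + new_node_w)


def update(graph, v):
    """Node update function: gather from edges, update the node, then update the edges."""
    edges = graph.incident_edges(v)
    gathered = [graph.edge_weight[e] for e in edges]
    graph.node_weight[v] = F(graph.node_weight[v], gathered)
    for e in edges:  # edges are updated one after another
        graph.edge_weight[e] = G(graph.node_weight[v], graph.edge_weight[e])


g = Graph(node_weight={"a": 1.0, "b": 2.0},
          edge_weight={("a", "b"): 3.0})
update(g, "a")
```

A real system would call `update` for every node in each iteration and stop once the application's stopping criterion is met.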
The node-centric programming model is highly expressive. Research has shown that many kinds of problems, including many graph mining, data mining, machine learning, and even linear algebra problems, can be expressed and solved with it. This is the fundamental reason why the node-centric programming model is so widely adopted.
14.4.2 GAS Programming Model
The GAS model can be seen as a finer-grained refinement of the node-centric programming model: it increases computational concurrency by subdividing the computation further. The GAS model explicitly splits the node update function Function(vertex) into three successive processing phases: the information gathering phase (Gather), the apply phase (Apply), and the distribution phase (Scatter). With this explicit division, what was originally one monolithic computation is broken into smaller steps that can be executed concurrently, which further improves the concurrent processing performance of the system.
Assume that the node currently being computed is u; the GAS model is explained on this basis.
During the information gathering phase, the information on all adjacent nodes and connected edges of node u is collected through a generalized sum (accumulation) function.
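The three phases can be written compactly in the notation commonly used for the GAS model in the PowerGraph paper. The symbols are assumptions relative to this text: D_u is the data on node u, D_(u,v) the data on edge (u, v), Nbr[u] the set of neighbors of u, ⊕ a commutative and associative sum, and g, a, s the user-defined gather, apply, and scatter functions.

```latex
\begin{align*}
\Sigma &\leftarrow \bigoplus_{v \in \mathrm{Nbr}[u]} g\bigl(D_u, D_{(u,v)}, D_v\bigr) && \text{(gather)} \\
D_u^{\text{new}} &\leftarrow a\bigl(D_u, \Sigma\bigr) && \text{(apply)} \\
\forall v \in \mathrm{Nbr}[u]:\; D_{(u,v)} &\leftarrow s\bigl(D_u^{\text{new}}, D_{(u,v)}, D_v\bigr) && \text{(scatter)}
\end{align*}
```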
With these three phases, we can define a highly abstract, node-centric GAS computation model. In the GAS model, how a node's in-edges and out-edges are used in the gathering and distribution phases depends on the application. For example, in PageRank the gathering phase considers only in-edge information and the distribution phase considers only out-edge information, whereas in a social graph like Facebook's, where an edge expresses a friendship, all edges are included in both the gathering and the distribution phase.
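To make the PageRank case concrete, here is a minimal single-machine sketch of PageRank organized as gather, apply, and scatter steps. It is not the implementation of any particular system: the function name `pagerank_gas`, the damping value, the iteration count, and the sample graph are all assumptions, and the sketch assumes every node appears as a key and has at least one out-edge.

```python
def pagerank_gas(out_edges, damping=0.85, iterations=20):
    """Synchronous PageRank written as gather / apply / scatter steps."""
    nodes = list(out_edges)
    in_edges = {v: [] for v in nodes}
    for u, targets in out_edges.items():
        for v in targets:
            in_edges[v].append(u)

    rank = {v: 1.0 / len(nodes) for v in nodes}
    # Edge data: the contribution carried along each edge (u, v).
    contrib = {(u, v): rank[u] / len(out_edges[u])
               for u in nodes for v in out_edges[u]}

    for _ in range(iterations):
        new_rank = {}
        for u in nodes:
            # Gather: sum contributions over in-edges only.
            total = sum(contrib[(w, u)] for w in in_edges[u])
            # Apply: compute the new node value from the gathered sum.
            new_rank[u] = (1.0 - damping) / len(nodes) + damping * total
        rank = new_rank
        for u in nodes:
            # Scatter: push the new value onto out-edges only.
            for v in out_edges[u]:
                contrib[(u, v)] = rank[u] / len(out_edges[u])
    return rank


# Hypothetical 3-node graph in which every node has at least one out-edge.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_gas(graph)
```

Because the gather step touches only in-edges and the scatter step only out-edges, each phase can be parallelized over edges independently, which is the source of the extra concurrency the GAS model aims for.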
14.4.3 Synchronous Execution Model
The synchronous execution model is defined in contrast to the asynchronous execution model. Graph computation usually requires multiple rounds of iteration. In the node-centric programming model, each graph node invokes the user-defined function Function(vertex) in every iteration, and this function changes the state of vertex and of its associated edges. If such a state change can be seen and used by other nodes within the same iteration, that is, if the change is immediately visible, the pattern is called the asynchronous execution model. If all state changes become visible and usable only in the next iteration, the pattern is called the synchronous execution model. A system using the synchronous execution model usually has a synchronization point within an iteration or between two successive iterations. Its purpose is to ensure that every node has received the state updated in the current iteration before the next iteration begins.
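The difference in visibility can be illustrated with a small sketch that propagates the minimum label along a path graph. The algorithm, the function names, and the graph are assumptions chosen for illustration; the only point is that the synchronous step reads a snapshot from the previous iteration, while the asynchronous step reads values written earlier in the same iteration.

```python
def min_label_step_sync(labels, neighbors):
    # Synchronous: every node reads the snapshot from the previous iteration.
    snapshot = dict(labels)
    return {v: min([snapshot[v]] + [snapshot[n] for n in neighbors[v]])
            for v in labels}


def min_label_step_async(labels, neighbors):
    # Asynchronous: updates are written in place and are visible immediately.
    labels = dict(labels)
    for v in labels:
        labels[v] = min([labels[v]] + [labels[n] for n in neighbors[v]])
    return labels


def iterations_to_converge(step, labels, neighbors):
    # Count the rounds that actually changed some label.
    rounds = 0
    while True:
        updated = step(labels, neighbors)
        if updated == labels:
            return rounds
        labels, rounds = updated, rounds + 1


# Hypothetical path graph 0 - 1 - 2 - 3 - 4; each node starts with its own id as label.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
start = {v: v for v in neighbors}
sync_rounds = iterations_to_converge(min_label_step_sync, start, neighbors)
async_rounds = iterations_to_converge(min_label_step_async, start, neighbors)
```

With this visiting order the asynchronous variant converges in 1 round and the synchronous variant in 4. The asynchronous count depends on the order in which nodes are visited, which hints at why asynchronous results are harder to reason about.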
In practical systems, two typical synchronous execution models are the BSP model and the MapReduce model. For an introduction to the BSP model and its relationship to the MapReduce model, see the chapter "Machine Learning: Paradigms and Architecture" of this book; it is not repeated here. The use of the MapReduce computation model in graph computation is described below. In general, because many graph mining algorithms are iterative, MapReduce is not a good fit for such problems. However, because Hadoop is so widely deployed, some graph computations in practice are still carried out with the MapReduce mechanism.
14.4.4 Asynchronous Execution Model
The asynchronous execution model is defined in contrast to the synchronous execution model. Because no data synchronization is required and updated data can be used within the same iteration, algorithms converge quickly, and system throughput and execution efficiency are significantly higher than with the synchronous model. The asynchronous model also has a corresponding disadvantage: it is hard to reason about the correctness of a program. Because data updates take effect immediately, different execution orders of the nodes are likely to produce different results. In particular, when graph nodes are updated concurrently, race conditions and data inconsistency may occur. The system implementation must therefore consider how to avoid these problems, which makes its mechanisms more complex than those of the synchronous model.
The following uses GraphLab as an example to explain the data consistency problem of the asynchronous execution model. GraphLab is well suited to non-natural-graph computations in machine learning, such as Markov random fields (MRF), stochastic gradient descent (SGD), and other machine learning algorithms.
Before explaining the data consistency problem of the asynchronous model, let us look at the concept of the scope of a graph node, introduced in the GraphLab paper. For a node v in graph G, its scope S_v includes node v itself, all edges associated with node v, and all graph nodes adjacent to node v. The scope is defined because, in the node-centric programming model, it reflects the range of graph objects, and the data bound to them, that the node update function f(v) can touch.
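As a small illustration of this definition, the scope of a node can be computed from an edge set as follows. The function name `scope`, the dictionary layout of the result, and the sample edges are assumptions made for this sketch.

```python
def scope(v, edges):
    """Return the scope S_v of node v: v itself, its incident edges, and its adjacent nodes."""
    incident = {e for e in edges if v in e}
    adjacent = {n for e in incident for n in e if n != v}
    return {"node": v, "edges": incident, "neighbors": adjacent}


# Hypothetical path graph a - b - c - d.
edges = {("a", "b"), ("b", "c"), ("c", "d")}
s_b = scope("b", edges)
```

Here `s_b` contains node b, the edges (a, b) and (b, c), and the adjacent nodes a and c.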
Under the concurrent asynchronous execution model, three data consistency conditions of different strength can be defined (see figure 14-12). Ordered from strong to weak by the strength of their constraints, they are: full consistency (Full Consistency), edge consistency (Edge Consistency), and node consistency (Vertex Consistency).
Full consistency means that while the node update function f(v) of node v is executing, no other update function reads or changes the data of any object within the scope S_v of node v. Therefore, under full consistency, only graph nodes that have no common adjacent nodes can be computed in parallel: if two graph nodes share an adjacent node, their scopes must intersect, and executing them concurrently may cause a race condition, which violates the definition of full consistency.
Slightly weaker than full consistency is edge consistency, which means that while the node update function f(v) of node v is executing, no other update function reads or writes node v or any of the edges adjacent to it. Compared with full consistency the condition is relaxed: other update functions may read the data of the graph nodes adjacent to node v. Under edge consistency, graph nodes that share no common edge can be computed in parallel, because two nodes u and v that have no shared edge necessarily satisfy the edge consistency condition.
Weaker still is node consistency, which means that while the node update function f(v) of node v is executing, no other update function reads or writes the data of node v. Node consistency, the weakest of the three, allows the most concurrency precisely because its constraints are weak. The price is that, unless the application logic guarantees that f(v) reads and writes only the data of the node itself, race conditions arise easily and make the results of the program inconsistent.
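The three conditions can be compared by checking which pairs of nodes may run their update functions at the same time. The sketch below is a simplified reading of the definitions above, not GraphLab's actual scheduler: the function names and the sample graph are assumptions, and "full" is interpreted as requiring the two scopes to share no node.

```python
from itertools import combinations


def neighbors_of(v, edges):
    return {n for e in edges if v in e for n in e if n != v}


def can_run_together(u, v, edges, model):
    """Whether the update functions of distinct nodes u and v may run concurrently."""
    nu, nv = neighbors_of(u, edges), neighbors_of(v, edges)
    if model == "full":    # scopes must not overlap at all
        return ({u} | nu).isdisjoint({v} | nv)
    if model == "edge":    # u and v must not share an edge
        return v not in nu
    if model == "vertex":  # only the node itself is protected
        return u != v
    raise ValueError(f"unknown consistency model: {model}")


# Hypothetical path graph a - b - c - d - e.
edges = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
nodes = sorted({n for e in edges for n in e})
allowed = {m: [p for p in combinations(nodes, 2) if can_run_together(*p, edges, m)]
           for m in ("full", "edge", "vertex")}
```

On this five-node path, 3 of the 10 node pairs may run together under full consistency, 6 under edge consistency, and all 10 under node consistency, which matches the ordering from strong to weak.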
The choice of consistency model has a great influence on the correctness of the results of a parallel program. Correctness of a parallel execution can be judged by whether its result is consistent with that of a sequential execution. Accordingly, "sequential consistency" can be defined as follows:
If, for every possible concurrent execution order, there always exists a sequential execution whose result is identical to that of the concurrent execution, then the concurrent program is said to satisfy sequential consistency.
Sequential consistency helps us verify that transforming a sequentially executed program into a parallel one is correct. In a parallel asynchronous graph computing environment, the following three scenarios satisfy sequential consistency.
Scenario one: the full consistency condition is satisfied.
Scenario two: the edge consistency condition is satisfied, and the node update function f(v) does not modify the data of adjacent nodes.
Scenario three: the node consistency condition is satisfied, and the node update function f(v) reads and writes only the data of the node itself.
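The role of the extra restriction in scenario three can be illustrated with a deliberately simplified simulation. Everything here is an assumption for illustration: the "interleaved" schedule models just one possible concurrent execution (all updates read before any of them writes), and the two update functions are hypothetical.

```python
from itertools import permutations


def sequential(order, values, neighbors, update):
    values = dict(values)
    for v in order:
        values[v] = update(v, values, neighbors)  # each update sees earlier writes
    return values


def interleaved(values, neighbors, update):
    # One possible concurrent schedule: all updates read first, then all write.
    return {v: update(v, values, neighbors) for v in values}


def reads_neighbors(v, values, neighbors):
    # Reads the data of adjacent nodes as well as its own.
    return values[v] + sum(values[n] for n in neighbors[v])


def reads_self_only(v, values, neighbors):
    # Touches only the node's own data, as required in scenario three.
    return values[v] * 2


def is_serializable(update, values, neighbors):
    """True if the interleaved result equals the result of some sequential order."""
    outcome = interleaved(values, neighbors, update)
    return any(sequential(order, values, neighbors, update) == outcome
               for order in permutations(values))


neighbors = {"a": ["b"], "b": ["a"]}  # hypothetical two-node graph
values = {"a": 1, "b": 2}
```

In this toy setting, `is_serializable(reads_self_only, values, neighbors)` holds, while `is_serializable(reads_neighbors, values, neighbors)` does not: the interleaved run yields a state that no sequential order produces.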
Application developers can use these three scenarios when designing algorithms, to balance concurrency against correctness: the weaker the consistency condition, the higher the concurrency, but also the higher the probability of races and the harder it is to guarantee correct results. If the application can state clearly which data its node update function touches, it can choose among the scenarios above and achieve better concurrent performance while still guaranteeing correct results.