You are welcome to reprint this article. Please credit the source, huichiro.
Summary
Parallel processing of graphs has long been a hot topic. Two questions matter here: first, how to parallelize graph algorithms, and second, which framework to run them on. Spark is a very good parallel processing framework, so porting parallel graph algorithms to it is a natural step.
GraphX is a parallel implementation of some common graph algorithms on Spark, and it also provides a rich API. This article takes a first look at the code architecture of GraphX and at how PageRank is implemented in it.
Why did Google win the search engine war?
When Google was still in its infancy, Yahoo dominated the search engine field. The wall in front of Google looked almost impossible to climb.
However, the world is unpredictable. Today the saying "for outside matters, ask Google" is simply a fact, and Yahoo has been reduced to a bystander.
One factor behind this reversal is that Google invented the PageRank algorithm, which significantly improved search accuracy. It is no exaggeration to say that PageRank gave Google a firm foothold in the search engine competition.
A search engine has to get a few key things right (personal opinion):
- To attract users, you must have excellent search accuracy.
- Once it has users, it can sell advertising, and better-targeted advertising increases profit.
Both of these depend on excellent algorithms.
Back to the topic: PageRank is a specific application of graph theory.
Graph Theory
Graph theory is a very important part of discrete mathematics. Below is an undirected connected graph.
Vertex
A, B, C, D, and E in the graph are called vertices.
Edge
The link between two vertices is called an edge.
Mathematical representation of a graph
When I was in college, I never understood why I had to study that wretched linear algebra. Only in the last couple of days, while reading the book The Beauty of Mathematics, did I realize that linear algebra is indispensable in some areas of computer applications.
We can easily understand plane geometry and solid geometry (one is two-dimensional, the other three-dimensional), but linear algebra deals with high-dimensional problems that cannot be grasped intuitively, which is why it feels hard. If you want to understand why mathematics has so many branches and how they relate to one another, I strongly recommend reading Mathematical Bridge: A Viewing Tour of Advanced Mathematics.
In mathematics, what is used to represent a graph? The answer is the matrix from linear algebra. Think of the incidence matrix of a graph and the adjacency matrix of a graph. In short, linear algebra is all about matrices. The following is a specific example.
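As a minimal sketch of the adjacency-matrix representation, the Scala below builds the matrix for a small undirected graph. The vertex labels A to E come from the text; the edge list is a made-up assumption, since the original figure is not available, and the object and function names are only illustrative.

```scala
object AdjacencyMatrixSketch {
  // Vertex labels from the text; the edges below are hypothetical.
  val vertices: Vector[String] = Vector("A", "B", "C", "D", "E")
  val edges: Seq[(String, String)] =
    Seq(("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E"))

  // Build an n x n matrix whose entry (i, j) is 1 if vertices i and j share an edge.
  def adjacency(vs: Vector[String], es: Seq[(String, String)]): Array[Array[Int]] = {
    val index = vs.zipWithIndex.toMap
    val m = Array.fill(vs.length, vs.length)(0)
    for ((u, v) <- es) {
      m(index(u))(index(v)) = 1
      m(index(v))(index(u)) = 1 // undirected graph: the matrix is symmetric
    }
    m
  }

  def main(args: Array[String]): Unit =
    adjacency(vertices, edges).foreach(row => println(row.mkString(" ")))
}
```

Each row of the printed matrix corresponds to one vertex, and the row sum is that vertex's degree.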
Parallel Graph Processing
As mentioned earlier, a graph can be expressed as a matrix. To some extent, parallelizing graph problems therefore becomes a matter of parallelizing matrix operations.
Take matrix multiplication as an example to see whether it can be processed in parallel.
The product A x B is used below to describe the parallel process.
Divide matrices A and B into four sub-matrices each
After the first alignment
Sub-matrix multiplication
After multiplication, the sub-matrices of A are shifted left and the sub-matrices of B are shifted up.
Merge the computed results
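The steps above (split, align, multiply, shift, merge) resemble what is usually called Cannon's algorithm. The following is a sequential Scala sketch of those steps under that assumption; the grid size q, the helper names, and the sample matrices are all illustrative, and a real implementation would run each grid cell on a separate worker.

```scala
object BlockMultiplySketch {
  type Matrix = Vector[Vector[Double]]

  // Plain (sequential) matrix product, used on the individual blocks.
  def multiply(a: Matrix, b: Matrix): Matrix =
    a.map(row => b.transpose.map(col => row.zip(col).map { case (x, y) => x * y }.sum))

  def add(a: Matrix, b: Matrix): Matrix =
    a.zip(b).map { case (r1, r2) => r1.zip(r2).map { case (x, y) => x + y } }

  // Cut a square matrix (side divisible by q) into a q x q grid of blocks.
  def split(m: Matrix, q: Int): Vector[Vector[Matrix]] = {
    val s = m.length / q
    Vector.tabulate(q, q) { (i, j) =>
      m.slice(i * s, (i + 1) * s).map(_.slice(j * s, (j + 1) * s))
    }
  }

  // Glue a grid of blocks back into one matrix.
  def merge(blocks: Vector[Vector[Matrix]]): Matrix =
    blocks.flatMap(rowOfBlocks => rowOfBlocks.transpose.map(_.flatten))

  def blockMultiply(a: Matrix, b: Matrix, q: Int): Matrix = {
    val ab = split(a, q)
    val bb = split(b, q)
    val s = a.length / q
    // Initial alignment: block row i of A moves left by i, block column j of B moves up by j.
    var aGrid = Vector.tabulate(q, q)((i, j) => ab(i)((j + i) % q))
    var bGrid = Vector.tabulate(q, q)((i, j) => bb((i + j) % q)(j))
    var cGrid = Vector.fill(q, q)(Vector.fill(s, s)(0.0))
    for (_ <- 0 until q) {
      // Every grid cell multiplies its current pair of blocks; the cells are independent.
      cGrid = Vector.tabulate(q, q)((i, j) => add(cGrid(i)(j), multiply(aGrid(i)(j), bGrid(i)(j))))
      // Shift the blocks of A one step left and the blocks of B one step up.
      val aNext = Vector.tabulate(q, q)((i, j) => aGrid(i)((j + 1) % q))
      val bNext = Vector.tabulate(q, q)((i, j) => bGrid((i + 1) % q)(j))
      aGrid = aNext
      bGrid = bNext
    }
    merge(cGrid)
  }

  def main(args: Array[String]): Unit = {
    val a: Matrix = Vector.tabulate(4, 4)((i, j) => (i * 4 + j).toDouble)
    val b: Matrix = Vector.tabulate(4, 4)((i, j) => (i + j).toDouble)
    blockMultiply(a, b, 2).foreach(row => println(row.mkString(" ")))
  }
}
```

Because the multiplications inside one round touch disjoint blocks, they are the part that can be handed to separate workers; only the shifts require communication.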
Parallel Graph Processing Framework, starting with Pregel
There are two key points in the previous section:
- Graphs are represented by matrices. Operations on graphs are matrix operations.
- Matrix multiplication can be parallelized, as the step-by-step demonstration showed.
You may say: OK, I understand. So which parallel processing framework is suitable for graph computing? You have probably thought of MapReduce.
MapReduce is a good parallel processing framework, but it has serious drawbacks for graph computing, mainly because intermediate results have to be written to disk between steps, which is very inefficient.
Google proposed Pregel, a remarkable framework designed specifically for parallel graph processing. The dynamic view during execution is as follows.
Pregel has the following advantages:
- Cascade scalability
- Highly fault tolerant
- Expressive enough to represent a wide range of common graph algorithms
Pregel Computing Model
The computing model has three important pieces:
- Per-vertex processing logic: vertexProgram
- Message sending, used for communication between adjacent vertices: sendMessage
- Message merging logic: messageCombining
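To make the three pieces concrete, here is a single-machine Scala sketch of a Pregel-style superstep loop. It is not GraphX's Pregel API: the `run` function, the `Edge` class, and the rule that only edges leaving a recently updated vertex may send are simplifying assumptions. The demo uses single-source shortest paths on a made-up graph.

```scala
object PregelSketch {
  // A directed edge with an attribute (here: a length). Hypothetical helper type.
  final case class Edge(src: Long, dst: Long, attr: Double)

  // Run supersteps until no messages are produced or maxIterations is reached.
  def run[V, M](
      vertices: Map[Long, V],
      edges: Seq[Edge],
      initialMsg: M,
      maxIterations: Int = 20)(
      vertexProgram: (Long, V, M) => V,
      sendMessage: (Edge, V, V) => Iterator[(Long, M)],
      messageCombiner: (M, M) => M): Map[Long, V] = {
    // Superstep 0: every vertex receives the initial message.
    var state = vertices.map { case (id, v) => id -> vertexProgram(id, v, initialMsg) }
    var active: Set[Long] = state.keySet
    var i = 0
    while (active.nonEmpty && i < maxIterations) {
      // Edges whose source was updated in the last superstep may send messages.
      val sent = edges.filter(e => active(e.src))
        .flatMap(e => sendMessage(e, state(e.src), state(e.dst)))
      // Combine all messages addressed to the same vertex.
      val inbox: Map[Long, M] = sent.groupBy(_._1).map {
        case (id, msgs) => id -> msgs.map(_._2).reduce(messageCombiner)
      }
      // Only vertices that received a message run the vertex program.
      state = state ++ inbox.map { case (id, m) => id -> vertexProgram(id, state(id), m) }
      active = inbox.keySet
      i += 1
    }
    state
  }

  // Demo: shortest distances from vertex 1 on a small hypothetical graph.
  def demo(): Map[Long, Double] = {
    val inf = Double.PositiveInfinity
    val edges = Seq(Edge(1, 2, 1.0), Edge(2, 3, 2.0), Edge(1, 3, 5.0), Edge(3, 4, 1.0))
    val init = Map(1L -> 0.0, 2L -> inf, 3L -> inf, 4L -> inf)
    run(init, edges, inf)(
      (_, dist, msg) => math.min(dist, msg),
      (e, srcDist, dstDist) =>
        if (srcDist + e.attr < dstDist) Iterator((e.dst, srcDist + e.attr)) else Iterator.empty,
      (a, b) => math.min(a, b))
  }

  def main(args: Array[String]): Unit =
    println(demo().toSeq.sortBy(_._1).mkString(", "))
}
```

The loop stops on its own once no vertex can improve a neighbour's distance, which is the "vote to halt" behaviour of the model in miniature.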
Implementation of Pregel in Spark
Thank you for sticking with me this far. This post covers a lot of ground and is somewhat difficult, so let me lay out the logic so far as a diagram.
What the diagram says is this: GraphX uses a parallel processing framework, Spark, to implement graph algorithms that can be executed in parallel.
The sentence above is the main point of this post. Please read it carefully.
- Whether an algorithm can be parallelized has nothing to do with Spark.
- Whether an algorithm can be parallelized must be established mathematically.
- Using Spark to implement an algorithm already proven to be parallelizable is a good fit, precisely because GraphX supports the Pregel graph computing model.
Graph, an important concept in GraphX
Undoubtedly, Graph itself is a very important concept in GraphX.
Member variables
The important member variables in Graph are
- vertices
- edges
- triplets
Why introduce triplets? Mainly because of the Pregel computing model: a triplet records the contents of an edge together with those of its two vertices.
Member Functions
Function categories
- Operations on all vertices or edges that do not change the graph structure, such as mapEdges and mapVertices
- subgraph, similar to filter in collection operations
- Graph partitioning, that is, the partition operation, which is critical for Spark computing: partitioning is precisely what makes parallel processing possible. Different PartitionStrategy choices bring different benefits. Hashing is used to divide the whole graph into multiple partitions.
- outerJoinVertices, the outer join operation on vertices
Graph operations: GraphOps
The common graph algorithms are gathered in the GraphOps class. A Graph is implicitly converted to GraphOps:
implicit def graphToGraphOps[VD: ClassTag, ED: ClassTag] (g: Graph[VD, ED]): GraphOps[VD, ED] = g.ops
The following operations are supported:
- collectNeighborIds
- collectNeighbors
- collectEdges
- joinVertices
- filter
- pickRandomVertex
- pregel
- pageRank
- staticPageRank
- connectedComponents
- triangleCount
- stronglyConnectedComponents
RDD
RDD is the core of the Spark system. Which new RDDs does GraphX introduce?
- VertexRDD
- EdgeRDD
Compared with EdgeRDD, VertexRDD is the more important one, and it supports many operations, mostly concerned with merging vertex attributes. Merging inevitably brings in relational algebra and set theory, so in VertexRDD we see many SQL-like terms, such as leftJoin and innerJoin.
As for the differences between leftJoin, innerJoin, and outerJoin, I suggest you look them up.
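For readers who prefer code to a search, the general semantics of the three joins can be sketched with plain Scala maps. This illustrates the relational idea only, not the VertexRDD implementation; the data and function names are hypothetical.

```scala
object JoinSketch {
  // Hypothetical data: vertex id -> name, and vertex id -> rank.
  val names: Map[Long, String] = Map(1L -> "alice", 2L -> "bob", 3L -> "carol")
  val ranks: Map[Long, Double] = Map(2L -> 0.5, 3L -> 1.5, 4L -> 0.2)

  // Inner join: keep only the keys present on both sides.
  def innerJoin[A, B](l: Map[Long, A], r: Map[Long, B]): Map[Long, (A, B)] =
    l.collect { case (k, a) if r.contains(k) => k -> ((a, r(k))) }

  // Left join: keep every key of the left side; the right value may be missing.
  def leftJoin[A, B](l: Map[Long, A], r: Map[Long, B]): Map[Long, (A, Option[B])] =
    l.map { case (k, a) => k -> ((a, r.get(k))) }

  // Full outer join: keep the keys of both sides; either value may be missing.
  def outerJoin[A, B](l: Map[Long, A], r: Map[Long, B]): Map[Long, (Option[A], Option[B])] =
    (l.keySet ++ r.keySet).iterator.map(k => k -> ((l.get(k), r.get(k)))).toMap

  def main(args: Array[String]): Unit = {
    println(innerJoin(names, ranks))
    println(leftJoin(names, ranks))
    println(outerJoin(names, ranks))
  }
}
```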
GraphX scenario analysis: storing and loading graphs
In mathematical computation a graph is represented by a matrix from linear algebra. But how should it be stored?
Your data structures teacher surely covered many methods, so I will not embarrass myself by repeating them here.
However, in a big data environment, what if the graph is so large that the vertex and edge data do not fit in a single file? Use HDFS.
And what if a single machine does not have enough memory to load it? Use lazy loading: the data is loaded only when it is really needed, and it is distributed across different machines in a cascaded fashion.
In general, all vertex-related content is saved in one file, vertexFile, and all edge-related information in another file, edgeFile.
When a specific graph is generated, the edges express the links between its vertices, so the graph structure is captured as well.
GraphLoader
GraphLoader is used to load and generate graphs in GraphX. Its most important function is edgeListFile, which is defined as follows.
    def edgeListFile(
        sc: SparkContext,
        path: String,
        canonicalOrientation: Boolean = false,
        minEdgePartitions: Int = 1,
        edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
        vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
      : Graph[Int, Int] =
    {
      val startTime = System.currentTimeMillis

      // Parse the edge data table directly into edge partitions
      val lines = sc.textFile(path, minEdgePartitions).coalesce(minEdgePartitions)
      val edges = lines.mapPartitionsWithIndex { (pid, iter) =>
        val builder = new EdgePartitionBuilder[Int, Int]
        iter.foreach { line =>
          if (!line.isEmpty && line(0) != '#') {
            val lineArray = line.split("\\s+")
            if (lineArray.length < 2) {
              logWarning("Invalid line: " + line)
            }
            val srcId = lineArray(0).toLong
            val dstId = lineArray(1).toLong
            if (canonicalOrientation && srcId > dstId) {
              builder.add(dstId, srcId, 1)
            } else {
              builder.add(srcId, dstId, 1)
            }
          }
        }
        Iterator((pid, builder.toEdgePartition))
      }.persist(edgeStorageLevel).setName("GraphLoader.edgeListFile - edges (%s)".format(path))
      edges.count()

      logInfo("It took %d ms to load the edges".format(System.currentTimeMillis - startTime))

      GraphImpl.fromEdgePartitions(edges, defaultVertexAttr = 1, edgeStorageLevel = edgeStorageLevel,
        vertexStorageLevel = vertexStorageLevel)
    } // end of edgeListFile
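The parsing rules inside edgeListFile can be tried without a Spark cluster. The following is a plain-Scala sketch of just the line-parsing step; the `parse` function and the sample lines are hypothetical, and unlike the loader above this sketch silently drops lines with fewer than two fields.

```scala
object EdgeListParseSketch {
  // Parse edge-list lines into (srcId, dstId) pairs, skipping blank lines and '#' comments.
  def parse(lines: Seq[String], canonicalOrientation: Boolean = false): Seq[(Long, Long)] =
    lines.filter(l => l.nonEmpty && l(0) != '#').flatMap { line =>
      val parts = line.split("\\s+")
      if (parts.length < 2) {
        None // assumption of this sketch: malformed lines are dropped
      } else {
        val (src, dst) = (parts(0).toLong, parts(1).toLong)
        // With canonical orientation, every edge is stored as (smaller id, larger id).
        if (canonicalOrientation && src > dst) Some((dst, src)) else Some((src, dst))
      }
    }

  def main(args: Array[String]): Unit = {
    val sample = Seq("# follower edges", "", "2 1", "3\t4", "bad")
    println(parse(sample, canonicalOrientation = true))
  }
}
```

The input format is therefore one edge per line, two whitespace-separated vertex ids, with `#` starting a comment line.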
Application example: PageRank
What is PageRank
PageRank is a proprietary Google algorithm used to measure the importance of a specific webpage relative to other webpages in the search engine index. It was invented by Larry Page and Sergey Brin in the late 1990s. PageRank implements the concept of link value as a ranking factor: it regards a link to a page as a vote for that page's importance.
Core idea of PageRank
"On the Internet, if a web page is linked to by many other web pages, it is widely recognized and trusted, and so it ranks high." (From The Beauty of Mathematics, Chapter 10)
You may say this is too simple; isn't it the same as saying nothing? How can it be expressed mathematically?
Well, I thought so too at first, and only understood it a little after reading it several times. The analysis goes as follows:
- The relationships between web pages are expressed as a graph.
- The link from page A to page B represents the possibility (probability) that a user moves from page A to page B.
- The ranking of all web pages is represented by a one-dimensional vector B.
The links between all web pages are represented by matrix A, and the rankings of all web pages by vector B.
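With A and B defined this way, the ranking can be found by repeatedly multiplying B by A until it stops changing. Below is a minimal Scala sketch of that iteration. The three-page link structure is hypothetical, the convention that a(i)(j) is the probability of moving from page j to page i is an assumption, and the damping (reset) factor that practical implementations add is left out.

```scala
object PageRankMatrixSketch {
  // One step B' = A x B; a(i)(j) is assumed to be the probability of moving from page j to page i.
  def step(a: Vector[Vector[Double]], b: Vector[Double]): Vector[Double] =
    a.map(row => row.zip(b).map { case (x, y) => x * y }.sum)

  // Repeat B <- A x B until the ranking changes by less than tol (or maxIter is hit).
  def rank(a: Vector[Vector[Double]], tol: Double = 1e-9, maxIter: Int = 1000): Vector[Double] = {
    val n = a.length
    var b = Vector.fill(n)(1.0 / n) // start from a uniform ranking
    var delta = Double.MaxValue
    var i = 0
    while (delta > tol && i < maxIter) {
      val next = step(a, b)
      delta = next.zip(b).map { case (x, y) => math.abs(x - y) }.sum
      b = next
      i += 1
    }
    b
  }

  // Hypothetical web: page 0 links to pages 1 and 2, page 1 links to 2, page 2 links to 0.
  val links: Vector[Vector[Double]] = Vector(
    Vector(0.0, 0.0, 1.0),
    Vector(0.5, 0.0, 0.0),
    Vector(0.5, 1.0, 0.0))

  def main(args: Array[String]): Unit =
    println(rank(links).map(r => f"$r%.3f").mkString(", "))
}
```

Each call to `step` is one matrix-vector product, which is exactly the operation the earlier section showed can be split into independent blocks.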
How to parallelize PageRank
The mathematics above shows that "computing web page rankings can be abstracted as matrix multiplication", and we saw at the beginning that matrix multiplication can be processed in parallel.
That concludes the theory; next comes the engineering. Using the Pregel model, the main functions defined in PageRank are as follows.
vertexProgram
    // attr is (current rank, last change in rank); msgSum is the combined incoming message
    def vertexProgram(id: VertexId, attr: (Double, Double), msgSum: Double): (Double, Double) = {
      val (oldPR, lastDelta) = attr
      val newPR = oldPR + (1.0 - resetProb) * msgSum
      (newPR, newPR - oldPR)
    }
sendMessage
    // Send along an edge only while the source's last change still exceeds the tolerance tol
    def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
      if (edge.srcAttr._2 > tol) {
        Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
      } else {
        Iterator.empty
      }
    }
messageCombiner
    // Messages addressed to the same vertex are summed
    def messageCombiner(a: Double, b: Double): Double = a + b
A little reflection
The PageRank example shows how the mathematical theory we pick up in everyday study can be used to solve practical problems.
"What you learn is always valuable; whether you get to use it depends on fortune."
Complete code
    // Connect to the Spark cluster
    val sc = new SparkContext("spark://master.amplab.org", "research")

    // Load my user data and parse into tuples of user id and attribute list
    val users = (sc.textFile("graphx/data/users.txt")
      .map(line => line.split(",")).map( parts => (parts.head.toLong, parts.tail) ))

    // Parse the edge data which is already in userId -> userId format
    val followerGraph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")

    // Attach the user attributes
    val graph = followerGraph.outerJoinVertices(users) {
      case (uid, deg, Some(attrList)) => attrList
      // Some users may not have attributes so we set them as empty
      case (uid, deg, None) => Array.empty[String]
    }

    // Restrict the graph to users with usernames and names
    val subgraph = graph.subgraph(vpred = (vid, attr) => attr.size == 2)

    // Compute the PageRank
    val pagerankGraph = subgraph.pageRank(0.001)

    // Get the attributes of the top pagerank users
    val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
      case (uid, attrList, Some(pr)) => (pr, attrList.toList)
      case (uid, attrList, None) => (0.0, attrList.toList)
    }

    println(userInfoWithPageRank.vertices.top(5)(Ordering.by(_._2._1)).mkString("\n"))
Conclusion
This article emphasizes that Spark is a distributed parallel computing framework. Whether Spark is appropriate depends on the mathematical model of the problem: if the problem can be processed in parallel, use Spark; if it cannot, Spark is not a good fit.
It is worth looking back over the mathematics touched on in this example.
Once again, I strongly recommend Mathematical Bridge.
References
- The Beauty of Mathematics
- Mathematical Bridge: A Viewing Tour of Advanced Mathematics
- Big Data