Apache Spark Source Code Reading 14: GraphX Implementation Analysis

Last Update:2014-07-07 Source: Internet

Author: User


You are welcome to repost this article; please credit the source, huichiro.

Summary

Parallel graph processing has long been a hot topic. Two questions matter here: first, how to parallelize graph algorithms, and second, which framework is suitable for running them in parallel. Since Spark is a very good parallel processing framework, porting parallel algorithms to it is a natural step.

GraphX is a parallel implementation of some common graph algorithms on Spark, and it also provides a rich set of APIs. This article takes a first look at the code architecture of GraphX and at how PageRank is implemented in it.

Why did Google win the search engine war?

When Google was still in its infancy, Yahoo dominated the search engine field. The wall standing in front of Google looked almost hopeless to climb.

However, the world is unpredictable. Today "for outside matters, ask Google" is an accepted fact, and Yahoo has been reduced to a supporting role.

One factor behind this reversal is that Google invented the PageRank algorithm, which significantly improved search accuracy. It is no exaggeration to say that PageRank is what gave Google a firm foothold in the search engine competition.

A search engine has to get a few key things right (personal opinion):

To attract users, it must have excellent search accuracy.
Once it has users, it can run advertising, and better-targeted advertising is what makes a profit.

Both of these depend on excellent algorithms.

Back to the topic: PageRank is a specific application of graph theory.

Graph Theory

Graph theory is a very important part of discrete mathematics. Below is an undirected connected graph.

Vertex

A, B, C, D, and E in the graph are called vertices.

Edge

A link between two vertices is called an edge.

Mathematical representation of a graph

When I was in college, I never understood why I had to learn this linear algebra stuff. Only in the last couple of days, while reading the book The Beauty of Mathematics, did I realize that linear algebra is indispensable in some areas of computing.

We can easily understand plane and solid geometry (one is two-dimensional and the other three-dimensional), while linear algebra deals with higher-dimensional problems that cannot be grasped intuitively, which is what makes it hard. If you want to understand why mathematics has so many branches and how they relate to one another, I strongly recommend reading Mathematical Bridge: A Viewing Tour of Advanced Mathematics.

In mathematics, what is used to represent a graph? The answer is the matrix from linear algebra. Think of the incidence matrix and the adjacency matrix of a graph. In short, linear algebra is all about matrices. The following is a specific example.
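As a minimal sketch of the adjacency matrix idea, the plain Scala below builds the matrix of a small undirected graph. The vertex labels match the A to E of the earlier figure, but the edge list is invented for illustration, and the names (AdjacencyMatrixSketch, adjacencyMatrix, degree) are hypothetical helpers, not part of GraphX.

```scala
object AdjacencyMatrixSketch {
  // Hypothetical example graph: 5 vertices labelled A..E, undirected edges by index.
  val labels: Vector[String] = Vector("A", "B", "C", "D", "E")
  val edges: Seq[(Int, Int)] = Seq((0, 1), (0, 2), (1, 3), (2, 3), (3, 4))

  // Build an n x n adjacency matrix; an undirected edge sets both (i, j) and (j, i).
  def adjacencyMatrix(n: Int, edges: Seq[(Int, Int)]): Array[Array[Int]] = {
    val m = Array.fill(n, n)(0)
    for ((i, j) <- edges) {
      m(i)(j) = 1
      m(j)(i) = 1
    }
    m
  }

  val matrix: Array[Array[Int]] = adjacencyMatrix(labels.size, edges)

  // The degree of a vertex is the sum of its row.
  def degree(v: Int): Int = matrix(v).sum

  def main(args: Array[String]): Unit =
    matrix.foreach(row => println(row.mkString(" ")))
}
```

Because the graph is undirected, the matrix is symmetric; a directed graph would set only one of the two entries per edge.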

Parallel Graph Processing

As mentioned earlier, a graph can be expressed as a matrix. To some extent, then, the problem of parallelizing graph computation becomes the problem of parallelizing matrix operations.

Take matrix multiplication as an example and see whether it can be processed in parallel.

The product A x B is used below to walk through the parallel process.

Divide matrices A and B into four blocks each.

After the initial alignment

Submatrix multiplication

After each multiplication, the submatrices of A are shifted left and the submatrices of B are shifted up.

Merge the results
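The steps above (align, multiply blocks, shift A left and B up, merge) match what is usually called Cannon's algorithm. The sketch below simulates it sequentially in plain Scala on a 2 x 2 grid of blocks, assuming square matrices of even size. All names are hypothetical, and on a real cluster each block product inside one step would run on a different worker.

```scala
object BlockMultiplySketch {
  type Matrix = Vector[Vector[Double]]

  def add(x: Matrix, y: Matrix): Matrix =
    x.zip(y).map { case (r, s) => r.zip(s).map { case (a, b) => a + b } }

  // Plain row-by-column multiplication, used for each pair of blocks.
  def multiply(x: Matrix, y: Matrix): Matrix = {
    val yt = y.transpose
    x.map(row => yt.map(col => row.zip(col).map { case (a, b) => a * b }.sum))
  }

  // Split a square matrix of even size into a 2 x 2 grid of blocks.
  def split(m: Matrix): Vector[Vector[Matrix]] = {
    val h = m.size / 2
    Vector.tabulate(2, 2) { (bi, bj) =>
      m.slice(bi * h, (bi + 1) * h).map(_.slice(bj * h, (bj + 1) * h))
    }
  }

  // Reassemble a grid of blocks into one matrix.
  def merge(blocks: Vector[Vector[Matrix]]): Matrix =
    blocks.flatMap(blockRow => blockRow.transpose.map(_.flatten))

  def cannon(a: Matrix, b: Matrix): Matrix = {
    val q = 2
    val h = a.size / 2
    val (ab, bb) = (split(a), split(b))
    // Initial alignment: shift block row i of A left by i, block column j of B up by j.
    var as: Vector[Vector[Matrix]] = Vector.tabulate(q, q)((i, j) => ab(i)((j + i) % q))
    var bs: Vector[Vector[Matrix]] = Vector.tabulate(q, q)((i, j) => bb((i + j) % q)(j))
    var cs: Vector[Vector[Matrix]] = Vector.fill(q, q)(Vector.fill(h, h)(0.0))
    for (_ <- 0 until q) {
      // The q * q block products of one step are independent of each other.
      cs = Vector.tabulate(q, q)((i, j) => add(cs(i)(j), multiply(as(i)(j), bs(i)(j))))
      // Shift the blocks of A one position left and the blocks of B one position up.
      as = Vector.tabulate(q, q)((i, j) => as(i)((j + 1) % q))
      bs = Vector.tabulate(q, q)((i, j) => bs((i + 1) % q)(j))
    }
    merge(cs)
  }
}
```

The result of cannon(a, b) should equal multiply(a, b); the point of the block form is that no worker ever needs more than one block of A and one block of B at a time.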

Parallel Graph Processing Framework, starting with Pregel

There are two key points in the previous section:

Graphs are represented by matrices, so operations on graphs are matrix operations.
Matrix multiplication can be parallelized, as the step-by-step walk-through above shows.

You may say: OK, I get it. So which parallel processing framework is suitable for graph computing? You have probably thought of MapReduce.

Although MapReduce is also a good parallel processing framework, it has many drawbacks for graph computing, mainly because intermediate results have to be written to disk between steps, which is very inefficient.

Google proposed Pregel, a remarkable framework designed specifically for parallel graph processing. The dynamic view during execution is as follows.

Pregel has the following advantages:

Scalability
High fault tolerance
The ability to express a wide range of common graph algorithms

Pregel Computing Model

The computing model has three important parts:

Per-vertex processing logic: vertexProgram
Message sending, used for communication between adjacent vertices: sendMessage
Message merging logic: messageCombiner
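To make the three parts concrete, here is a minimal single-machine imitation of the Pregel loop in plain Scala. It is a sketch under simplifying assumptions (everything in memory, sequential execution) and not the GraphX API; the names run, Edge, and components are hypothetical. The example propagates the smallest vertex id along edges, which labels each connected component.

```scala
object LocalPregelSketch {
  type VertexId = Long
  final case class Edge(src: VertexId, dst: VertexId)

  // Repeat: send messages -> combine per target vertex -> run the vertex program,
  // until no messages are produced or maxIterations is reached.
  def run[V, M](
      vertices: Map[VertexId, V],
      edges: Seq[Edge],
      initialMsg: M,
      maxIterations: Int)(
      vertexProgram: (VertexId, V, M) => V,
      sendMessage: (Edge, V, V) => Iterator[(VertexId, M)],
      messageCombiner: (M, M) => M): Map[VertexId, V] = {
    // Superstep 0: every vertex receives the initial message.
    var state = vertices.map { case (id, v) => id -> vertexProgram(id, v, initialMsg) }
    var active: Set[VertexId] = state.keySet
    var i = 0
    while (active.nonEmpty && i < maxIterations) {
      // Only edges touching a vertex that received a message last round may send.
      val sent = edges
        .filter(e => active(e.src) || active(e.dst))
        .flatMap(e => sendMessage(e, state(e.src), state(e.dst)))
      // Merge all messages addressed to the same vertex.
      val inbox: Map[VertexId, M] =
        sent.groupBy(_._1).map { case (id, ms) => id -> ms.map(_._2).reduce(messageCombiner) }
      // Run the vertex program on every vertex that received a message.
      state = state ++ inbox.map { case (id, m) => id -> vertexProgram(id, state(id), m) }
      active = inbox.keySet
      i += 1
    }
    state
  }

  // Example: each vertex starts with its own id and adopts the smallest id it hears about.
  val vertices: Map[VertexId, Long] = Map(1L -> 1L, 2L -> 2L, 3L -> 3L, 4L -> 4L, 5L -> 5L)
  val edges: Seq[Edge] = Seq(Edge(1L, 2L), Edge(2L, 3L), Edge(4L, 5L))

  val components: Map[VertexId, Long] =
    run(vertices, edges, Long.MaxValue, 10)(
      (id, attr, msg) => math.min(attr, msg),
      (e, s, d) =>
        if (s < d) Iterator((e.dst, s))
        else if (d < s) Iterator((e.src, d))
        else Iterator.empty,
      (a, b) => math.min(a, b))
}
```

The loop ends on its own once no vertex changes, because sendMessage then returns an empty iterator for every edge.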

Implementation of Pregel in spark

Thank you for sticking with me this far; this post covers a lot of ground and is a little difficult. Let me sum up the logic so far.

In one sentence: GraphX uses Spark, a parallel processing framework, to run parallel algorithms on graphs.

This whole post is about the sentence above, so please read it carefully.

Whether an algorithm can be parallelized has nothing to do with Spark.
Whether an algorithm can be parallelized has to be proved mathematically.
Using Spark to implement an algorithm that has been proved parallelizable is a good choice, because GraphX supports the Pregel graph computing model.

Graph, an important concept in GraphX

Undoubtedly, Graph itself is a very important concept in GraphX.

Member variables

The important member variables in Graph are

vertices
edges
triplets

Why introduce triplets? Mainly because of the Pregel computing model: a triplet records an edge together with the contents of the vertices at both of its ends.
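A triplet can be pictured as an edge joined with the attributes of its two endpoints. The plain Scala sketch below uses hypothetical stand-in case classes (Edge, Triplet) whose field names follow GraphX's EdgeTriplet (srcId, srcAttr, dstId, dstAttr, attr); it only illustrates the idea and says nothing about how GraphX builds the view internally.

```scala
object TripletSketch {
  type VertexId = Long
  final case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

  // Hypothetical stand-in for a triplet: an edge plus both endpoint attributes.
  final case class Triplet[VD, ED](
      srcId: VertexId, srcAttr: VD, dstId: VertexId, dstAttr: VD, attr: ED)

  // Join every edge with the attributes of its source and destination vertex.
  def triplets[VD, ED](vertices: Map[VertexId, VD], edges: Seq[Edge[ED]]): Seq[Triplet[VD, ED]] =
    edges.map(e => Triplet(e.srcId, vertices(e.srcId), e.dstId, vertices(e.dstId), e.attr))

  // Invented sample data.
  val vertices: Map[VertexId, String] = Map(1L -> "alice", 2L -> "bob", 3L -> "carol")
  val edges: Seq[Edge[String]] = Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows"))

  val described: Seq[String] =
    triplets(vertices, edges).map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
}
```

With both endpoint attributes at hand, a sendMessage function can decide what to send along an edge without any extra lookup.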

Member Functions

Function categories

Operations on all vertices or edges that do not change the graph structure, such as mapEdges and mapVertices
subgraph, similar to filter in set operations
Graph partitioning, that is, the partition operation, is critical for Spark computing: it is precisely partitioning that makes parallel processing possible. Different PartitionStrategy choices bring different benefits. A hash is used to divide the whole graph into multiple partitions.
outerJoinVertices, an outer join on vertices
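As a rough sketch of what "using a hash to divide the graph" means, the plain Scala below assigns each edge to a partition by hashing its endpoint pair. This is an illustration only: the names are hypothetical and it is not the code of any GraphX PartitionStrategy.

```scala
object HashPartitionSketch {
  // Pick a partition in [0, numParts) from the hash of the (src, dst) pair.
  // floorMod keeps the result non-negative even for negative hash codes.
  def partitionOf(src: Long, dst: Long, numParts: Int): Int =
    java.lang.Math.floorMod((src, dst).hashCode(), numParts)

  // Group an edge list into partitions; each group could be processed by a different worker.
  def partition(edges: Seq[(Long, Long)], numParts: Int): Map[Int, Seq[(Long, Long)]] =
    edges.groupBy { case (src, dst) => partitionOf(src, dst, numParts) }
}
```

The same edge always lands in the same partition, so repeated passes over the graph touch the same worker for a given edge.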

Common graph algorithms and operations: GraphOps

The common graph algorithms are gathered in the GraphOps class. A Graph is implicitly converted to GraphOps:

implicit def graphToGraphOps[VD: ClassTag, ED: ClassTag]
    (g: Graph[VD, ED]): GraphOps[VD, ED] = g.ops

The following operations are supported:

collectNeighborIds
collectNeighbors
collectEdges
joinVertices
filter
pickRandomVertex
pregel
pageRank
staticPageRank
connectedComponents
triangleCount
stronglyConnectedComponents

RDD

RDD is the core of the Spark system. Which new RDDs does GraphX introduce?

VertexRDD
EdgeRDD

Compared with EdgeRDD, VertexRDD is the more important of the two and supports more operations, most of them concerned with merging vertex attributes. Merging inevitably brings in relational algebra and set theory, so VertexRDD contains many SQL-like terms, such as

leftJoin
innerJoin

As for the differences between leftJoin, innerJoin, and outerJoin, I suggest you look them up.
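For readers who prefer code to a search engine, the difference can be sketched with ordinary Scala maps standing in for vertex collections. The helper names and sample data are hypothetical; the shapes of the combining functions are modelled on the VertexRDD methods, where innerJoin sees a value from both sides and leftJoin sees an Option for the right side.

```scala
object JoinSketch {
  type VertexId = Long

  // Invented sample data: vertex 2 has a name but no rank.
  val names: Map[VertexId, String] = Map(1L -> "alice", 2L -> "bob", 3L -> "carol")
  val ranks: Map[VertexId, Double] = Map(1L -> 0.5, 3L -> 1.5)

  // Inner join: keep only the ids present on both sides.
  def innerJoin[A, B, C](left: Map[VertexId, A], right: Map[VertexId, B])(
      f: (VertexId, A, B) => C): Map[VertexId, C] =
    left.collect { case (id, a) if right.contains(id) => id -> f(id, a, right(id)) }

  // Left join: keep every id of the left side; the right value may be missing.
  def leftJoin[A, B, C](left: Map[VertexId, A], right: Map[VertexId, B])(
      f: (VertexId, A, Option[B]) => C): Map[VertexId, C] =
    left.map { case (id, a) => id -> f(id, a, right.get(id)) }

  val inner: Map[VertexId, (String, Double)] =
    innerJoin(names, ranks)((id, n, r) => (n, r))
  val left: Map[VertexId, (String, Double)] =
    leftJoin(names, ranks)((id, n, r) => (n, r.getOrElse(0.0)))
}
```

Here inner drops vertex 2 because it has no rank, while left keeps it and falls back to 0.0.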

GraphX scenario analysis: storing and loading graphs

In mathematical calculations a graph is represented by a matrix from linear algebra. But how should it be stored?

When you studied data structures, your teacher surely covered many methods, so I will not embarrass myself by repeating them here.

However, in a big data environment, what if the graph is huge and the vertex and edge data do not fit in a single file? Use HDFS.

And what if one machine does not have enough memory to load it? Use lazy loading: the data is loaded only when it is really needed, spread across different machines, and loaded in a cascading manner.

In general, all vertex-related content is saved in one file, vertexFile, and all edge-related information in another file, edgeFile.

When a specific graph is built, the edges express the links between the vertices, and with them the structure of the graph emerges.

GraphLoader

GraphLoader is used to load and generate graphs in GraphX. Its most important function is edgeListFile, which is defined as follows.

def edgeListFile(
    sc: SparkContext,
    path: String,
    canonicalOrientation: Boolean = false,
    minEdgePartitions: Int = 1,
    edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
    vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
  : Graph[Int, Int] =
{
  val startTime = System.currentTimeMillis

  // Parse the edge data table directly into edge partitions
  val lines = sc.textFile(path, minEdgePartitions).coalesce(minEdgePartitions)
  val edges = lines.mapPartitionsWithIndex { (pid, iter) =>
    val builder = new EdgePartitionBuilder[Int, Int]
    iter.foreach { line =>
      if (!line.isEmpty && line(0) != '#') {
        val lineArray = line.split("\\s+")
        if (lineArray.length < 2) {
          logWarning("Invalid line: " + line)
        }
        val srcId = lineArray(0).toLong
        val dstId = lineArray(1).toLong
        if (canonicalOrientation && srcId > dstId) {
          builder.add(dstId, srcId, 1)
        } else {
          builder.add(srcId, dstId, 1)
        }
      }
    }
    Iterator((pid, builder.toEdgePartition))
  }.persist(edgeStorageLevel).setName("GraphLoader.edgeListFile - edges (%s)".format(path))
  edges.count()

  logInfo("It took %d ms to load the edges".format(System.currentTimeMillis - startTime))

  GraphImpl.fromEdgePartitions(edges, defaultVertexAttr = 1, edgeStorageLevel = edgeStorageLevel,
    vertexStorageLevel = vertexStorageLevel)
} // end of edgeListFile
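The per-line parsing inside edgeListFile can be tried without a cluster. The plain Scala sketch below mirrors that logic on an in-memory list of lines; parseEdges and the sample lines are hypothetical. One deliberate difference: the quoted code appears to only log a warning for a line with fewer than two fields and then still reads the second field, whereas this sketch skips such lines.

```scala
object EdgeListParseSketch {
  // Parse lines of the form "srcId dstId"; empty lines and lines starting with '#' are ignored.
  // With canonicalOrientation = true every edge is stored with srcId <= dstId.
  def parseEdges(lines: Seq[String], canonicalOrientation: Boolean = false): Seq[(Long, Long)] =
    lines
      .filter(line => line.nonEmpty && line(0) != '#')
      .map(_.split("\\s+"))
      .filter(_.length >= 2) // assumption of this sketch: malformed lines are skipped
      .map { fields =>
        val srcId = fields(0).toLong
        val dstId = fields(1).toLong
        if (canonicalOrientation && srcId > dstId) (dstId, srcId) else (srcId, dstId)
      }

  // Invented sample input.
  val sample: Seq[String] = Seq("# follower list", "1 2", "3\t1", "", "4 2", "bad")
}
```

Calling parseEdges(sample) keeps the direction found in the file, while parseEdges(sample, canonicalOrientation = true) turns the edge from 3 to 1 into an edge from 1 to 3.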

Application example: PageRank

What is PageRank?

PageRank is a proprietary Google algorithm used to measure the importance of a specific web page relative to the other web pages in the search engine index. It was invented by Larry Page and Sergey Brin in the late 1990s. PageRank treats link value as a ranking factor: it regards a link to a page as a vote that indicates the page's importance.

Core idea of PageRank

"On the Internet, if a web page is linked to by many other web pages, it is widely recognized and trusted, and so it ranks high." (From The Beauty of Mathematics, Chapter 10)

You may say: that is too simple, it tells me next to nothing. How can it be expressed mathematically?

Well, I thought so at first too, and only understood it a little after reading it several times. The analysis goes as follows:

The relationships between web pages are expressed as a graph.
A link between two web pages represents the probability that a user moves from one page to the other.
The ranks of all web pages are represented by a one-dimensional vector B.

The links between all web pages are represented by matrix A, and the ranks of all web pages by vector B.
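With A and B defined this way, ranking becomes repeated matrix-vector multiplication: start from a uniform B and keep replacing it with A times B. The plain Scala sketch below does this for an invented three-page web. The damping factor d = 0.85 and the (1 - d) / n smoothing term are assumptions commonly used with PageRank, not something stated in this article, and all names are hypothetical.

```scala
object PageRankPowerIterationSketch {
  // links(j) lists the pages that page j links to (invented 3-page web, no dead ends).
  val links: Vector[Seq[Int]] = Vector(Seq(1, 2), Seq(2), Seq(0))
  val n: Int = links.size

  // Matrix A: a(i)(j) is the probability of moving from page j to page i.
  val a: Vector[Vector[Double]] =
    Vector.tabulate(n, n)((i, j) => if (links(j).contains(i)) 1.0 / links(j).size else 0.0)

  // One step: B_next = (1 - d) / n + d * (A * B), with damping factor d.
  def step(b: Vector[Double], d: Double): Vector[Double] =
    a.map(row => (1 - d) / n + d * row.zip(b).map { case (x, y) => x * y }.sum)

  // Start from the uniform vector and apply the step a fixed number of times.
  def pageRank(iterations: Int, d: Double = 0.85): Vector[Double] =
    (0 until iterations).foldLeft(Vector.fill(n)(1.0 / n))((b, _) => step(b, d))

  val ranks: Vector[Double] = pageRank(50)
}
```

In this toy web, page 1 has only one incoming link, and that link carries just half of page 0's vote, so it ends up with the lowest rank.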

How to parallelize PageRank

Okay, the mathematics above shows that "the calculation of web page ranking can be abstracted as matrix multiplication", and the parallel processing of matrix multiplication was demonstrated at the beginning.

That concludes the theory; next comes the engineering implementation. Using the Pregel model, the main functions defined in PageRank are as follows.

vertexProgram

def vertexProgram(id: VertexId, attr: (Double, Double), msgSum: Double): (Double, Double) = {
  val (oldPR, lastDelta) = attr
  val newPR = oldPR + (1.0 - resetProb) * msgSum
  (newPR, newPR - oldPR)
}

sendMessage

def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
  if (edge.srcAttr._2 > tol) {
    Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
  } else {
    Iterator.empty
  }
}

messageCombiner

def messageCombiner(a: Double, b: Double): Double = a + b
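To see how these three functions cooperate, the plain Scala sketch below runs simplified copies of them in a small sequential loop. Several things here are assumptions, not taken from the article: each edge is weighted by 1 / outDegree(src), every vertex starts at (0.0, 0.0) and first receives resetProb / (1 - resetProb), and only vertices that received a message in the previous round send again. The graph and all helper names are hypothetical.

```scala
object DeltaPageRankSketch {
  val resetProb = 0.15
  val tol = 0.0001

  // Invented 3-page graph as (src, dst) links.
  val links: Seq[(Int, Int)] = Seq((0, 1), (0, 2), (1, 2), (2, 0))
  val outDegree: Map[Int, Int] = links.groupBy(_._1).map { case (src, es) => src -> es.size }

  // attr is (current rank, last change); the rank grows by the damped sum of incoming deltas.
  def vertexProgram(attr: (Double, Double), msgSum: Double): (Double, Double) = {
    val (oldPR, _) = attr
    val newPR = oldPR + (1.0 - resetProb) * msgSum
    (newPR, newPR - oldPR)
  }

  // A source forwards its last change, scaled by the edge weight, only while it exceeds tol.
  def sendMessage(src: Int, dst: Int, srcAttr: (Double, Double)): Option[(Int, Double)] =
    if (srcAttr._2 > tol) Some((dst, srcAttr._2 / outDegree(src))) else None

  def messageCombiner(a: Double, b: Double): Double = a + b

  def run(maxIterations: Int = 100): Map[Int, Double] = {
    val initialMessage = resetProb / (1.0 - resetProb)
    val ids = links.flatMap { case (s, d) => Seq(s, d) }.distinct
    var state: Map[Int, (Double, Double)] =
      ids.map(v => v -> vertexProgram((0.0, 0.0), initialMessage)).toMap
    var active: Set[Int] = state.keySet
    var i = 0
    while (active.nonEmpty && i < maxIterations) {
      val sent = links
        .filter { case (src, _) => active(src) }
        .flatMap { case (src, dst) => sendMessage(src, dst, state(src)).toList }
      val inbox = sent.groupBy(_._1).map { case (v, ms) =>
        v -> ms.map(_._2).reduce((a, b) => messageCombiner(a, b))
      }
      state = state ++ inbox.map { case (v, m) => v -> vertexProgram(state(v), m) }
      active = inbox.keySet
      i += 1
    }
    state.map { case (v, (pr, _)) => v -> pr }
  }
}
```

Passing deltas instead of full ranks means a vertex whose rank has stopped changing falls silent, which is what lets the loop stop once every change drops below tol. Under these assumptions the ranks are not normalized: they add up to roughly the number of vertices rather than to 1.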

A little inspiration

The PageRank example shows how the mathematical theory we learn in ordinary times can be used to solve practical problems.

"What you learn is always valuable; whether you ever get to use it is a matter of luck."

Complete code

import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader

// Connect to the Spark cluster
val sc = new SparkContext("spark://master.amplab.org", "research")

// Load my user data and parse into tuples of user id and attribute list
val users = (sc.textFile("graphx/data/users.txt")
  .map(line => line.split(",")).map( parts => (parts.head.toLong, parts.tail) ))

// Parse the edge data which is already in userId -> userId format
val followerGraph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")

// Attach the user attributes
val graph = followerGraph.outerJoinVertices(users) {
  case (uid, deg, Some(attrList)) => attrList
  // Some users may not have attributes so we set them as empty
  case (uid, deg, None) => Array.empty[String]
}

// Restrict the graph to users with usernames and names
val subgraph = graph.subgraph(vpred = (vid, attr) => attr.size == 2)

// Compute the PageRank
val pagerankGraph = subgraph.pageRank(0.001)

// Get the attributes of the top pagerank users
val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
  case (uid, attrList, Some(pr)) => (pr, attrList.toList)
  case (uid, attrList, None) => (0.0, attrList.toList)
}

println(userInfoWithPageRank.vertices.top(5)(Ordering.by(_._2._1)).mkString("\n"))

Summary

This article emphasizes that Spark is a distributed parallel computing framework. Whether Spark can be used depends on the mathematical model of the problem: if the problem can be processed in parallel, Spark can be put to good use.

Take another look at the mathematics mentioned in the example.

Once again, I strongly recommend Mathematical Bridge.

References

The Beauty of Mathematics
Mathematical Bridge: A Viewing Tour of Advanced Mathematics
Big Data

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
