LDA Background
LDA (Latent Dirichlet Allocation) is a topic model and one of the most powerful models in the field of topic clustering; through multiple rounds of iteration it can cluster a set of feature vectors by topic. It is currently widely used for text topic clustering.
LDA has many open-source implementations. The ones in wide use today, all capable of distributed, parallel processing of large-scale corpora, include Microsoft's LightLDA, Google's PLDA and PLDA+, and Spark LDA. These implementations are described below:
LightLDA relies on Multiverso, Microsoft's own parameter server implementation, whose underlying layer uses MPI or ZeroMQ for messaging. The LDA model (the word-topic matrix) is held by the parameter server, which provides parameter query and update services to the document training processes.
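As a rough illustration of this interaction pattern, the sketch below shows a hypothetical word-topic parameter-server interface; the names and signatures are invented for illustration and do not correspond to Multiverso's real API.

// Hypothetical interface sketch of the parameter-server pattern described above.
trait WordTopicParameterServer {
  // Query one row of the word-topic matrix for a given word.
  def get(wordId: Int): Array[Long]
  // Push an increment to (word, topic) after a worker resamples a token.
  def add(wordId: Int, topic: Int, delta: Long): Unit
}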
PLDA and PLDA+ use MPI message passing and divide the MPI processes into word processes and doc processes. The doc processes train the documents, while the word processes provide model query and update services to the doc processes.
Spark LDA has two implementations: 1. a version based on the Gibbs sampling principle and implemented with GraphX (the EMLDAOptimizer and DistributedLDAModel in the Spark documentation); 2. a version based on the variational inference principle (the OnlineLDAOptimizer and LocalLDAModel in the Spark documentation).
LightLDA, PLDA, PLDA+, and Spark LDA Comparison
In terms of the corpus size it can handle, LightLDA is far ahead of PLDA and Spark LDA.
Tested on a cluster of 10 servers (8 cores, 40 GB memory each):
LightLDA can handle hundreds of millions of documents, millions of words, and thousands of topics. This processing power lets LightLDA train most corpora with ease; Microsoft, using a cluster of dozens of machines, has trained on one-tenth of the data crawled by the Bing search engine.
Compared with LightLDA, PLDA+ handles a much smaller scale, with an upper limit of roughly: number of words * number of topics (the model size) < 500 million. When the corpus reaches this limit, the MPI cluster either terminates due to insufficient memory or iterates very slowly because memory is swapped frequently. Although PLDA+ is sensitive to the vocabulary size and the number of topics, it is not very sensitive to the number of documents; with a small vocabulary and a small number of topics, tens of millions of documents can be handled easily.
The scale metric for Spark LDA's GraphX implementation is the vertex data of the graph, i.e. (number of documents + number of words) * number of topics, with an upper limit of roughly number of documents * number of topics < 5 billion (since the vocabulary is usually small relative to the number of documents, the vertex data is approximately the number of documents * number of topics). Beyond this scale the Spark cluster enters a near-frozen state, and nodes OOM one after another until the task fails.
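A rough back-of-the-envelope check, assuming 8 bytes per double and ignoring serialization and JVM object overhead, makes this limit plausible for the 10-node cluster above:

    5 * 10^9 vertex-topic entries * 8 bytes per double ≈ 40 GB of raw vertex data

GraphX also replicates vertex attributes to the edge partitions and materializes message RDDs during each iteration, so the effective footprint is a multiple of this figure and quickly approaches the cluster's 10 * 40 GB = 400 GB of memory.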
The bottleneck of the Spark LDA implemented with variational inference is the number of words * number of topics, i.e. what we call the model size, capped at roughly 100 million. Why does this bottleneck exist? Because in the variational inference implementation the model is stored as a local matrix: each partition computes part of the model, and the partial matrices are then reduced onto the driver. When the model is too large, the driver node's memory cannot hold the matrices sent from all the partitions.
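A minimal sketch of this aggregation pattern, using Breeze matrices; the per-document update here just adds raw counts so the sketch stays self-contained, whereas the real OnlineLDAOptimizer performs the variational E-step.

import breeze.linalg.DenseMatrix
import org.apache.spark.rdd.RDD

// Each partition builds a local k x vocabSize statistics matrix; reduce then sums
// them on the driver, so the driver must hold the full k x vocabSize model in memory.
def aggregateStats(docs: RDD[Array[(Int, Double)]], k: Int, vocabSize: Int): DenseMatrix[Double] =
  docs.mapPartitions { it =>
    val local = DenseMatrix.zeros[Double](k, vocabSize)
    it.foreach { termCounts =>
      termCounts.foreach { case (term, cnt) =>
        (0 until k).foreach(j => local(j, term) += cnt) // stand-in for the E-step update
      }
    }
    Iterator(local)
  }.reduce(_ + _) // every partition's k x vocabSize matrix flows back to the driver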
In convergence speed, LightLDA is much faster than PLDA, PLDA+, and Spark LDA. In a test on a small corpus (300,000 documents, 100,000 words, 1,000 topics), LightLDA : PLDA+ : Spark LDA (GraphX) = 1 : 4 : 50.
Why do the various LDA implementations differ so much in the corpus size they can handle? This is related to how they are implemented: different LDAs have different bottlenecks. Here we discuss Spark LDA; the other LDAs will be covered in follow-up articles.
Spark LDA
The Spark machine learning library MLlib implements two versions of LDA, referred to here as Spark EM LDA and Spark Online LDA. They take the same input data, but their internal implementations and underlying principles are completely different. Spark EM LDA is implemented with GraphX and trains the model by manipulating the edge and vertex data of a graph. Spark Online LDA uses a sampling approach: it draws a subset of documents for each training step and obtains the final model through many such steps. For parameter estimation, Spark EM LDA estimates the model parameters following the Gibbs sampling principle, while Spark Online LDA estimates them with Bayesian variational inference. For model storage, Spark EM LDA stores the trained topic-word model on GraphX graph vertices, which is distributed storage; Spark Online LDA stores the topic-word model in a matrix, which is local storage. These differences show how Spark EM LDA and Spark Online LDA diverge, and each has its own bottleneck: Spark EM LDA spends a considerable amount of time in shuffles during training, which slows it down severely, while Spark Online LDA's matrix storage means the matrix size directly limits the number of topics and words in the training document set. In addition, Spark EM LDA updates the model once per iteration over the whole corpus, whereas Spark Online LDA updates the model after each sampled batch of documents, so the Spark Online LDA model is updated more promptly and converges faster.
Spark EM LDA GraphX Implementation Principle
Spark EM LDA estimates parameters based on the Gibbs sampling principle. Most LDA training processes that infer parameters with Gibbs sampling proceed as follows:
Every word in every document is assigned to a topic. The general idea of LDA training is that, in each round of iteration, a new topic is re-sampled for every word of every document according to the Gibbs sampling formula; for the detailed principle see the article Parameter Estimation for Text Analysis.
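For reference, the standard collapsed Gibbs sampling formula from that article, with the current word's assignment excluded from the counts (denoted by the subscript -i), is:

    p(z_i = k | z_{-i}, w) ∝ (n_{m,-i}^k + α_k) * (n_{k,-i}^t + β_t) / Σ_{t'} (n_{k,-i}^{t'} + β_{t'})

where n_m^k is the number of words in document m assigned to topic k, n_k^t is the number of times word t is assigned to topic k, and α, β are the Dirichlet priors.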
The core of the LDA algorithm is thus re-selecting a topic for every word of every document. GraphX implements this process cleverly: it treats each document-to-word pair as an edge, with the word frequency as the edge data, so the corpus is structured as a graph, and the per-document, per-word operations over the corpus become operations over the edges of the graph; processing an edge RDD is the most common processing pattern in GraphX.
GraphX stores the n_m^k (document-topic) and n_k^t (topic-word) count vectors on the document vertices and word vertices respectively, and stores the word frequency information on the edges. In this way the whole document clustering result matrix, the model matrix, and the corpus word frequency matrix are all expressed in the graph structure, and the LDA algorithm becomes a traversal over the edges. Thanks to how naturally Gibbs-based LDA maps onto this model, and to the multi-round iterative nature of machine learning, it is implemented simply and efficiently on GraphX, forming Spark MLlib LDA.
Spark EM LDA Initialization
The input data for Spark LDA is a word frequency matrix RDD[(Long, Vector)], whose storage format is shown in the following table:
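As a usage sketch, assuming an existing SparkContext sc, such an input can be built and trained with the MLlib API as below; the counts are illustrative and chosen to match the edge examples later in this section.

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Word frequency matrix: one row per document, (docId, term-count vector).
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(2.0, 1.0, 3.0, 4.0)),
  (1L, Vectors.dense(3.0, 0.0, 2.0, 5.0))))

// "em" selects the GraphX-based EMLDAOptimizer discussed here;
// "online" would select the variational OnlineLDAOptimizer instead.
val model = new LDA().setK(10).setMaxIterations(20).setOptimizer("em").run(corpus)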
To number document vertices and word vertices uniformly, Spark LDA assigns vertex IDs to both: document vertex IDs increment from 0, and word vertex IDs decrement from -1. The word frequency matrix of the table above is thus converted into the following table:
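A minimal sketch of this ID mapping (Spark's LDA code contains equivalent private helpers; the names here follow that convention but are reproduced from memory):

// Map a term's column index (0, 1, 2, ...) to a word vertex ID (-1, -2, -3, ...) and back;
// document vertex IDs are simply the document IDs (0, 1, 2, ...).
def term2index(term: Int): Long = -(1 + term.toLong)
def index2term(termIndex: Long): Int = -(1 + termIndex).toInt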
Spark LDA then generates GraphX edges from the document-to-word relationships, in the format (source vertex ID, destination vertex ID, word frequency), as shown in the following table:
(0, -1, 2.0), (0, -2, 1.0), (0, -3, 3.0), (0, -4, 4.0), ...
(1, -1, 3.0), (1, -2, 0.0), (1, -3, 2.0), (1, -4, 5.0), ...
...
Spark LDA Edge Construction
M: number of documents in the corpus
V: number of words in the frequency matrix (vocabulary size)
D: the M*V word frequency matrix

for each document m in [0, M-1]:
    for each word w of document m, w in [-1, -V]:
        generate an Edge(m, w, D[m][w]) as an element of the edge RDD
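A hedged Scala sketch of this construction, repeating the term2index helper from above for self-containment; the real EMLDAOptimizer.initialize likewise skips zero counts and iterates only the non-zero entries of each vector.

import org.apache.spark.graphx.Edge
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def term2index(term: Int): Long = -(1 + term.toLong) // as above

// Turn each (docId, term-count vector) row into edges (doc vertex -> word vertex, count).
def buildEdges(corpus: RDD[(Long, Vector)]): RDD[Edge[Double]] =
  corpus.flatMap { case (docId, termCounts) =>
    termCounts.toArray.zipWithIndex.collect {
      case (cnt, term) if cnt != 0.0 => Edge(docId, term2index(term), cnt)
    }
  }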
In this way all documents in the corpus are built into an RDD[(Long, Long, Double)]; GraphX then indexes it further inside the RDD partitions and optimizes it into the edge RDD.
Spark LDA Vertex Vector Construction
GraphX uses the edge RDD to initialize the vertex RDD.
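A minimal sketch of this initialization, assuming Breeze dense vectors for the k-dimensional topic counts; the real code similarly draws a random topic distribution per edge, scales it by the word frequency, and sums the contributions at each endpoint.

import breeze.linalg.{DenseVector => BDV, normalize}
import org.apache.spark.graphx.{Edge, VertexId}
import org.apache.spark.rdd.RDD
import scala.util.Random

// Each edge contributes a random k-dimensional topic-count vector, scaled by the word
// frequency, to both endpoints; summing per vertex yields the initial document-topic
// (n_m^k) and topic-word (n_k^t) vectors.
def initVertices(edges: RDD[Edge[Double]], k: Int): RDD[(VertexId, BDV[Double])] =
  edges.flatMap { edge =>
    val gamma = normalize(BDV.fill[Double](k)(Random.nextDouble()), 1.0) * edge.attr
    Seq((edge.srcId, gamma), (edge.dstId, gamma))
  }.reduceByKey(_ + _)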
After this initialization, Spark LDA has described the corpus as a GraphX Graph object, which contains a vertex RDD covering both documents and words and an edge RDD of document-to-word edges. Each vertex carries a k-dimensional topic distribution vector, and each edge carries the word frequency data.
Spark LDA Iteration
Pseudo code:
Source:
val sendMsg: EdgeContext[TopicCounts, TokenCount, (Boolean, TopicCounts)] => Unit =
  (edgeContext) => {
    // Compute N_{wj} gamma_{wjk}
    val N_wj = edgeContext.attr
    // E-STEP: Compute gamma_{wjk} (smoothed topic distributions), scaled by token count
    // N_{wj}.
    val scaledTopicDistribution: TopicCounts =
      computePTopic(edgeContext.srcAttr, edgeContext.dstAttr, N_k, W, eta, alpha) *= N_wj
    edgeContext.sendToDst((false, scaledTopicDistribution))
    edgeContext.sendToSrc((false, scaledTopicDistribution))
  }
// The Boolean is a hack to detect whether we could modify the values in-place.
// TODO: Add zero/seqOp/combOp option to aggregateMessages. (SPARK-5438)
val mergeMsg: ((Boolean, TopicCounts), (Boolean, TopicCounts)) => (Boolean, TopicCounts) =
  (m0, m1) => {
    val sum =
      if (m0._1) {
        m0._2 += m1._2
      } else if (m1._1) {
        m1._2 += m0._2
      } else {
        m0._2 + m1._2
      }
    (true, sum)
  }
// M-STEP: Aggregation computes new N_{kj}, N_{wk} counts.
val docTopicDistributions: VertexRDD[TopicCounts] =
  graph.aggregateMessages[(Boolean, TopicCounts)](sendMsg, mergeMsg)
    .mapValues(_._2)
In the source code, sendMsg corresponds to lines 7-8 of the pseudocode, and mergeMsg corresponds to line 9.
Step 2 of the pseudocode computes WV, the sum of the vectors of all word vertices. Spark applies the filter operator to the vertex RDD to obtain the word vertex RDD, and then calls fold over its values to sum the vertex vectors. The Scala code is:
graph.vertices.filter(isTermVertex).values.fold(BDV.zeros[Double](numTopics))(_ += _)
Steps 3-9 of the pseudocode are implemented with GraphX's aggregateMessages method. Steps 3-8 belong to the map phase of aggregateMessages: for each edge (srcId, DV, freq, dstId, WV), i.e. the document vertex vector, the word frequency, and the word vertex vector, a message msg (a k-dimensional vector) is generated and sent to the vertices at both ends. The formula used to generate the message is steps 4-5 of the pseudocode, which is the same as the Gibbs sampling formula in step 13 of the LDA Gibbs implementation.
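For reference, the per-topic value computed by computePTopic in the source above is, up to normalization over k:

    γ_{wjk} ∝ (n_k^t + η - 1) * (n_m^k + α - 1) / (n_k + W(η - 1))

where W is the vocabulary size, η and α are the topic-word and document-topic smoothing parameters, and n_k is the k-th component of the global sum WV from step 2; the message sent to both endpoints is this normalized vector scaled by the word frequency N_{wj}.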
Step 9, the reduce phase of aggregateMessages, is where each vertex sums all the messages it receives. The user-supplied message aggregation function mergeMsg is simply vector addition.
In summary, the Spark LDA iteration consists of three steps: (1) compute WV, the sum of all word vertex data; (2) call aggregateMessages to generate, send, and aggregate the messages; (3) use the aggregated messages at each vertex as its new vertex data.
The Spark LDA implementation differs slightly from the classic LDA Gibbs algorithm. In step 13 of the LDA Gibbs algorithm, before a new topic is selected for a word, the word's current topic is removed from the counts, and after the new topic is selected the global counters n_m^k, n_k^t, n_m, and n_k are updated. Because the document-topic distribution n_m^k and the topic-word distribution n_k^t change after every word is processed, they influence the topic choice of the next word; the LDA Gibbs algorithm is therefore a fine-grained algorithm. In addition, the input of the LDA Gibbs algorithm is the token sequence of each segmented document, not the word frequency matrix, so repeated occurrences of the same word in a document are processed separately.
In Spark LDA's iteration, by contrast, the vertex vectors and the word vertex sum WV stay unchanged within an iteration, which is equivalent to keeping the global counters n_m^k, n_k^t, n_m, and n_k constant during the iteration, i.e. the document-topic and topic-word distributions are held fixed. In the map phase, every word of every document produces a topic-selection message msg (a k-dimensional vector), and in the reduce phase these messages are aggregated (summed) onto the corresponding document and word vertices. Only after the iteration completes are the aggregated vertex vectors used as the graph's new vertex data, updating the statistics; the global counter n_k (WV) is then recomputed from the new word vertices at the start of the next round of iteration.