LDA training: optimizing GibbsLDA++-0.2 with OpenMP


OpenMP is short for "Open Multi-Processing"; it is commonly used for concurrent, multithreaded programming on a single machine.


1. The GibbsLDA++ training framework is broadly as follows:

loop: the training process iterates n times
{
    loop: traverse each training sample (referred to as a doc)
    {
        loop: traverse each word in the training sample
        {
            loop: the Gibbs sampling process, traversing each topic
            {
                update: the doc-topic matrix,
                        the word-topic matrix,
                        the doc-topic marginal distribution vector,
                        the word-topic marginal distribution vector,
                        the current word's distribution vector over the topics
            }
        }
    }
}

LDA training is done mainly through the Gibbs sampling process. Held in memory are: (1) all training samples, and (2) the doc-topic matrix, the word-topic matrix, and several topic-related distribution vectors. The whole training process is the four-level nested loop above. When the dictionary is small and the sample collection is not large, Gibbs sampling is CPU-bound rather than memory-bound, which means the CPU is the bottleneck of the whole training.
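For reference, the in-memory state listed above can be sketched roughly as follows, using the member names of GibbsLDA++'s model class (a simplified sketch; allocation and initialization are omitted):

// Simplified sketch of the counters GibbsLDA++ keeps in memory during training
// (member names follow the library's model class; allocation is omitted).
struct lda_state {
    int     M, V, K;  // number of docs, vocabulary size, number of topics
    int   **nw;       // word-topic matrix: nw[w][k] = count of word w assigned to topic k (V x K)
    int   **nd;       // doc-topic matrix:  nd[m][k] = words of doc m assigned to topic k (M x K)
    int    *nwsum;    // word-topic marginal: nwsum[k] = total words assigned to topic k (size K)
    int    *ndsum;    // doc-topic marginal:  ndsum[m] = total words in doc m (size M)
    int   **z;        // z[m][n] = current topic assignment of the n-th word of doc m
    double *p;        // temporary: the current word's (unnormalized) distribution over the topics (size K)
};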


2. Trying to use OpenMP for single-machine parallelization of the training process

OpenMP is mainly used to take advantage of the multiple cores of a single CPU: a loop is divided across the cores, for example by splitting one loop into two smaller loops, placing them on two cores, and then merging the results. There is one requirement for doing this: data independence. After being split into two loops, the two loops must not modify the same variable or memory region; if they do, there is a race ("competition"), because the order and timing of memory writes differ between runs and the program produces different results.
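As a minimal standalone illustration (not from the original post; the array a, the bound N, and the variable sum are invented for the example), the first loop below is safe to split because each iteration writes only its own element, while the second would race on the shared variable sum unless OpenMP is told to reduce it:

// Minimal illustration of the data-independence requirement (hypothetical example).
// Build with: g++ -O2 -fopenmp independence_demo.cpp
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int N = 1000000;
    std::vector<double> a(N);

    // Safe: iteration i writes only a[i], so the halves assigned to the two
    // cores never touch the same memory.
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
    }

    // Racy without help: every iteration writes the shared variable sum, which
    // is exactly the "competitive" situation described above. The reduction
    // clause removes the race by giving each thread a private copy of sum and
    // merging the copies at the end.
    double sum = 0.0;
    #pragma omp parallel for num_threads(2) reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i];
    }

    printf("sum = %.1f\n", sum);
    return 0;
}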

2.1 Using OpenMP on the innermost loop (traversing the topics)

The innermost loop mainly computes the current word's distribution vector over the topics. While traversing the topics, this vector can be partitioned by topic into regions that do not interfere with each other. The code is as follows:

#pragma omp parallel for num_threads(2)
for (int k = 0; k < K; k++)
{
    p[k] = (nw[w][k] + beta) / (nwsum[k] + Vbeta) *
           (nd[m][k] + alpha) / (ndsum[m] + Kalpha);
}

The "#pragma omp parallel for num_threads(2)" directive above tells the compiler to run the loop on two cores (two threads).
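Note that the pragma only takes effect when OpenMP support is enabled at compile time. Assuming the GibbsLDA++ sources are built with g++ (exact file and target names depend on the makefile), that means adding -fopenmp to the compile and link flags, for example:

# add -fopenmp when compiling and linking GibbsLDA++ (assuming g++)
g++ -O2 -fopenmp -c model.cpp -o model.o
g++ -O2 -fopenmp *.o -o lda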

The test uses the sample training data that ships with GibbsLDA++; the results are as follows.

Without OpenMP, Gibbs sampling time: 62.657 s

With OpenMP, Gibbs sampling time: 94.626 s

With OpenMP, efficiency is actually lower.

The explanation that comes to mind is that the loop over K (topics) is too short (K = 100 in this experiment), and it sits in the innermost position, so it is entered the most times. Splitting the K-loop across two cores therefore triggers a very large number of fork/join operations, each with its own overhead, and that overhead outweighs the gain from running the two halves independently, so performance degrades.
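A rough way to see this effect in isolation is the micro-benchmark below (not from the original post; the loop body is only a stand-in for the real sampling formula). It times one million invocations of a K = 100 loop with and without the parallel-for directive, so the per-invocation fork/join overhead becomes visible:

// Micro-benchmark (not from the original post) showing the fork/join overhead of
// parallelizing a very short loop; K = 100 matches the experiment above.
// Build with: g++ -O2 -fopenmp overhead_demo.cpp
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int K = 100;           // number of topics, as in the experiment above
    const int calls = 1000000;   // the inner loop is entered once per word per iteration
    std::vector<double> p(K, 0.0);

    // Serial version of the short K-loop, invoked many times.
    double t0 = omp_get_wtime();
    for (int c = 0; c < calls; c++) {
        for (int k = 0; k < K; k++) {
            p[k] = (k + 0.1) / (c + 1.0);
        }
    }
    double serial = omp_get_wtime() - t0;

    // Same work, but each invocation pays the cost of forking/joining two threads.
    double t1 = omp_get_wtime();
    for (int c = 0; c < calls; c++) {
        #pragma omp parallel for num_threads(2)
        for (int k = 0; k < K; k++) {
            p[k] = (k + 0.1) / (c + 1.0);
        }
    }
    double parallel = omp_get_wtime() - t1;

    // Print a result element so the compiler cannot discard the loops.
    printf("p[0] = %.6f\nserial:   %.3f s\nparallel: %.3f s\n", p[0], serial, parallel);
    return 0;
}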

2.2 Using OpenMP on the second loop (traversing the training samples)

The second loop traverses the training samples, so applying OpenMP here effectively partitions the training samples and trains the partitions separately. The code is as follows:

#pragma omp parallel for num_threads(2)
for (int m = 0; m < M; m++)
{
    for (int n = 0; n < ptrndata->docs[m]->length; n++)
    {
        int topic = sampling(m, n);
        z[m][n] = topic;
    }
}

With this modification, the running time of Gibbs sampling drops considerably; the results are as follows:

Without OpenMP, Gibbs sampling time: 62.657 s

With OpenMP, Gibbs sampling time: 40.068 s, which cuts the training time by about one third.

However, there is a data-dependence problem: after the second loop is split into two parallel loops, the inner loop of each performs Gibbs sampling for a word, which reads and updates the doc-topic matrix, the word-topic matrix, and their marginal distributions. Looked at this way, the algorithm should produce incorrect results. However, consider it from another angle:

(1) This is data parallelism: the doc set is partitioned, so the doc-topic matrix and the doc marginal distribution are also partitioned into disjoint parts that do not interfere with each other. The word-topic matrix and the current word's per-topic distribution vector, however, are shared and can interfere.

(2) In the word-topic matrix, each word has its own row, so different words update different regions. When the parallel loops are processing different words they do not affect each other; when they process the same word, the updates only need to accumulate correctly, which is also acceptable (a minimal sketch of such an atomic accumulation follows this list).

(3) The current word's per-topic distribution vector is affected. For example, if the training data is split in two, the two halves are generally sampling two different words at any given moment, and both update this vector; if the writes are not protected, they get interleaved ("mixed"). However, even if the whole run proceeds in this mixed-write state, the results are not necessarily wrong, for the reason below.

(4) Gibbs sampling is a probabilistic algorithm; it has been verified that even with identical parameters, each run produces a different result anyway.
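If one did want to protect the shared counters rather than rely on point (4), OpenMP's atomic directive is one option. The sketch below is hypothetical and is not part of the modification described above; it only demonstrates that atomically incremented counters (here a stand-in nwsum-style array) accumulate correctly across two threads:

// Hypothetical sketch (not in the original post): "#pragma omp atomic" makes the
// two threads' increments to a shared counter accumulate instead of overwriting
// each other. Build with: g++ -O2 -fopenmp atomic_demo.cpp
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int K = 100;                 // topics
    std::vector<int> nwsum(K, 0);      // stand-in for a shared word-topic marginal

    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < 1000000; i++) {
        int topic = i % K;
        // Without the atomic, two threads hitting the same topic could lose updates.
        #pragma omp atomic
        nwsum[topic]++;
    }

    long long total = 0;
    for (int k = 0; k < K; k++) total += nwsum[k];
    printf("total increments recorded: %lld (expected 1000000)\n", total);
    return 0;
}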


Finish.


Please cite the source when reprinting: http://blog.csdn.net/xceman1997/article/details/46582637
