Matrix operations and classification problems in text processing, by Google researcher Wu Jun

Source: Internet
Author: User
When I studied linear algebra in college, I could not see what it was good for beyond solving systems of linear equations. Many matrix concepts, such as eigenvalues, seemed far removed from everyday life. Later, in numerical analysis, I learned many approximation algorithms for matrices and still could not see where they would be applied. I took these courses purely to earn credits toward my degree, and I suspect many readers have had a similar experience. It was only after working on natural language processing for a long time that I discovered that the matrix concepts and algorithms mathematicians had proposed have real practical significance.

In natural language processing, the two most common classification problems are classifying texts by topic (for example, putting all news about the Asian Games under sports) and classifying the words of a vocabulary by meaning (for example, grouping the names of the various sports together). Both problems can be solved satisfactorily, and simultaneously, with matrix operations. To show how matrices can be used as a tool for these two problems, let us first review the method described in the earlier article on the cosine theorem and news classification.

The key to classification is computing relevance. We first build a vector of content words for each of the two texts and then compute the angle between the two vectors. When the angle is zero, the two news articles are related; when the vectors are perpendicular (orthogonal), the articles are unrelated. The cosine of the angle is, of course, the inner product of the two vectors once they are normalized to unit length. In theory this algorithm works very well, but it takes a very long time to run. The number of articles to process is usually large, at least one million, and the vocabulary is long as well, say 500,000 words (including names of people, places, and products). Finding all articles on the same topic by comparing 1 million articles pairwise requires about 500 billion comparisons. A computer today can compare at most about one thousand pairs of articles per second, so comparing all 1 million articles takes roughly 15 years. And note that this computation has to be iterated many times to actually complete the classification.
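The relevance measure and the cost estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the original work: it uses raw word counts rather than weighted frequencies, and the function name `cosine_similarity` and the three toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine of the angle between the raw word-count vectors of two token lists."""
    va, vb = Counter(words_a), Counter(words_b)
    # Inner product over the words the two texts share.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy documents, already tokenized into content words.
doc1 = "asian games swimming final".split()
doc2 = "asian games swimming record".split()
doc3 = "stock market interest rate".split()

sim_related = cosine_similarity(doc1, doc2)    # shares three of four words
sim_unrelated = cosine_similarity(doc1, doc3)  # shares no words, so orthogonal

# Cost of the brute-force approach, using the figures quoted in the text.
n_articles = 1_000_000
pairs = n_articles * (n_articles - 1) // 2     # about 5e11 pairs
pairs_per_second = 1_000
years = pairs / pairs_per_second / (365 * 24 * 3600)
```

Here `pairs` comes out just under 500 billion and `years` a little under 16, which matches the rough figures in the text.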

Another approach to text classification is to use Singular Value Decomposition (SVD) from matrix computation. Let us look at how it works. First, we describe the associations between the 1 million articles and the 500,000 words with one large matrix A. In this matrix, each row corresponds to an article and each column to a word.

In this matrix (the figure from the original post is not reproduced here), M = 1,000,000 and N = 500,000. The element in row i and column j is the weighted frequency (for example, TF-IDF) of the j-th dictionary word in the i-th article. Readers may have noticed that this matrix is enormous: 1 million times 500,000, that is, 500 billion elements.
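As a small illustration of such an article-by-word matrix, the sketch below builds one for three toy documents. The weighting shown (term frequency times log of inverse document frequency) is one common TF-IDF variant and is an assumption here, since the text does not say which weighting is used; the function name `tfidf_matrix` and the documents are also hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, A) where A[i][j] is the TF-IDF weight of word j in document i."""
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for d in docs for w in set(d))
    rows = []
    for d in docs:
        counts = Counter(d)
        total = len(d)
        rows.append([
            (counts[w] / total) * math.log(n_docs / df[w]) if total else 0.0
            for w in vocab
        ])
    return vocab, rows

# Hypothetical toy corpus: one row per article, one column per word.
docs = [
    ["asian", "games", "swimming"],
    ["asian", "games", "football"],
    ["stock", "market", "market"],
]
vocab, A = tfidf_matrix(docs)
```

Each row of `A` is one article and each column one vocabulary word, mirroring the layout described above at a much smaller scale.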

Singular value decomposition factors this large matrix into the product of three smaller matrices (the figure from the original post is not reproduced here). In the example above, A is decomposed into a matrix X of 1 million by 100, a matrix B of 100 by 100, and a matrix Y of 100 by 500,000. Together these three matrices have only about 150 million elements, roughly 1/3000 of the original. Storage and computation are correspondingly reduced by more than three orders of magnitude.
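The element counts above, and the shape of the three factors, can be checked with a short NumPy sketch. The helper name `truncated_svd` and the random toy matrix are assumptions for illustration; `numpy.linalg.svd` computes a full dense decomposition and would not scale to the matrix sizes discussed in the text.

```python
import numpy as np

# Storage arithmetic for the sizes quoted in the text.
M, N, k = 1_000_000, 500_000, 100
full_elements = M * N                       # 5e11 elements in A
factored_elements = M * k + k * k + k * N   # about 1.5e8 elements in X, B, Y
ratio = full_elements / factored_elements   # a little over 3000

def truncated_svd(A, k):
    """Keep the k largest singular values and vectors, so that A is approximately X @ B @ Y."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Toy rank-2 stand-in for the article-by-word matrix: 6 articles, 5 words.
rng = np.random.default_rng(0)
A = rng.random((6, 2)) @ rng.random((2, 5))
X, B, Y = truncated_svd(A, k=2)
```

Because the toy matrix has rank 2, keeping two singular values reproduces it up to rounding; for real data the product is only an approximation of A.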

The three matrices have clear physical meanings. With articles as the rows of A and words as its columns, each row of the first matrix X corresponds to an article, and each element in that row gives the relevance of the article to one class of articles on the same topic: the larger the value, the more relevant. Each column of the last matrix Y corresponds to a word, and each element in that column gives the importance (or relevance) of the word within one class of words related in meaning. The middle matrix B gives the correlation between the word classes and the article classes. Therefore, a single singular value decomposition of the association matrix A accomplishes both the classification of synonyms and the classification of articles, and also yields the correlation between each class of articles and each class of words.
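To make this reading of the factors concrete, here is a toy sketch in which the rows of A are articles and the columns are words, as set up earlier. The left factor then has one row per article and the right factor one column per word. The data are made up, and assigning each article or word to the class with the largest absolute weight is a simple heuristic assumed for this example, not a rule stated in the text.

```python
import numpy as np

# Hypothetical counts: 3 sports articles and 2 finance articles over 6 words.
words = ["swimming", "football", "medal", "stock", "market", "bank"]
A = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # number of classes to keep
X, B, Y = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Assumed heuristic: pick the class with the largest absolute weight.
article_class = np.argmax(np.abs(X), axis=1)   # one label per article (row of X)
word_class = np.argmax(np.abs(Y), axis=0)      # one label per word (column of Y)

for w, c in zip(words, word_class):
    print(f"{w}: class {c}")
```

On this block-structured example the sports articles and sports words fall into one class and the finance articles and finance words into the other. Real data are far less clean, so a clustering step over the rows of X and the columns of Y is often a better choice than a plain argmax.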

The only remaining problem is how to carry out singular value decomposition on a computer. This is where many concepts from linear algebra, such as the eigenvalues of a matrix, and the various algorithms of numerical analysis all come into play. For a long time, singular value decomposition could not be parallelized. (Although Google has long had parallel computing tools such as MapReduce, it is hard to split singular value decomposition into independent suboperations, so even at Google it was previously impossible to exploit parallel computing to decompose matrices.) Recently, Dr. Zhang Zhiwei of Google China, together with several Chinese engineers and interns, implemented a parallel algorithm for singular value decomposition. I consider this a contribution from Google China to the world.
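One way eigenvalues enter the picture is that the singular values of A are the square roots of the eigenvalues of the symmetric matrix AᵀA. The sketch below checks this on a small random matrix. It only illustrates the relationship: it is not the parallel algorithm mentioned above, and forming AᵀA explicitly is usually avoided in practice because it worsens numerical conditioning.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((5, 3))   # small random stand-in for an article-by-word matrix

# Eigenvalues of the symmetric matrix A^T A; eigvalsh returns them in
# ascending order, so reverse to get descending order.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]
# Clip tiny negative values caused by rounding before taking the square root.
singular_from_eig = np.sqrt(np.clip(eigvals, 0.0, None))

# Singular values computed directly, also in descending order.
singular_direct = np.linalg.svd(A, compute_uv=False)

print(np.allclose(singular_from_eig, singular_direct))
```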
