Classic Statistical Language Models


HAL, LSA, and COALS

This article introduces three classic statistical language models: HAL, LSA, and COALS.

Thinking off the top of your head, how might you represent a word?
1. By a hierarchy of increasingly general categories
E.g., "lily":
lily < flower < plant < object
2. By synonyms
E.g., "good":
good, fine, OK, nice, great, ...

These representations bring several problems:

    • For adjectives, sets of synonyms cannot express degree (e.g., "good" vs. "great").
    • Newly coined words have no definition.
    • The definitions are subjective.
    • It is difficult to quantify the similarity of two words.

To address these problems, Firth proposed in 1957 an idea that is now common throughout statistical NLP: represent a word by the neighborhood of words around it in a sentence. Specifically:

  1. Hyperspace Analogue to Language (HAL)
    The HAL method (Lund & Burgess, 1996) builds a co-occurrence matrix that represents the association between any two words. With window size = 1, each entry of the matrix counts how often the two words appear directly next to each other.

    Here window size is the span of context used in the counting. A window size of 5, for example, means that the 5 words around a target word are counted, with the weight decreasing from 5 down to 1 as the distance between the two words increases. From the co-occurrence matrix, each word gets a vector representation, and the similarity of any two words can then be measured by the reciprocal of the Euclidean distance, the cosine, or the correlation coefficient (a sketch follows at the end of this item).
    But there are several issues:

    • As the vocabulary grows, the matrix grows quadratically and consumes more and more memory.
    • The matrix is very sparse, so any downstream model (e.g., a classifier) must be designed to handle sparse input.

    So the question arises: could we project down to a low dimension and turn the co-occurrence matrix X into a dense representation?
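    Before turning to that question, here is a minimal sketch of HAL-style counting and comparison, assuming a left-context-only variant, a toy whitespace tokenizer, and cosine as the similarity measure (the function names are ours, not from the paper):

        import numpy as np

        def hal_vectors(tokens, window=5):
            # HAL-style weighted counting: a context word at distance d
            # (1 <= d <= window) to the left of the target contributes
            # weight (window - d + 1), so adjacent words weigh the most.
            vocab = sorted(set(tokens))
            index = {w: i for i, w in enumerate(vocab)}
            M = np.zeros((len(vocab), len(vocab)))
            for i, target in enumerate(tokens):
                for d in range(1, window + 1):
                    if i - d < 0:
                        break
                    M[index[target], index[tokens[i - d]]] += window - d + 1
            return vocab, M

        def cosine(u, v):
            # Cosine similarity between two word vectors.
            return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

        vocab, M = hal_vectors("the quick fox saw the lazy fox".split(), window=3)
        print(cosine(M[vocab.index("fox")], M[vocab.index("lazy")]))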

  2. Latent Semantic Analysis (LSA)
    In LSA (Deerwester et al., 1990; Landauer, Foltz, & Laham, 1998), the co-occurrence matrix is a word-document matrix whose entries are the frequency of each word in each document. After counting, the entries are normalized (log-entropy normalization):

    Take the log of the raw count and divide it by the word's entropy across documents. If a word appears evenly across the documents, its entropy is large, so its normalized weight is small, indicating that we are not interested in such words.
    After this preprocessing, we apply SVD to the co-occurrence matrix, C = USV^T, and keep only the k largest singular values. The rows of the truncated U then serve as the word vectors, and the rows of the truncated V as the document vectors. Because the matrix is sparse, the SVD is not too slow even when the matrix is large. A sketch of the pipeline follows.
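    As a minimal sketch of this pipeline, assuming a tiny hypothetical word-document count matrix and the common log-entropy variant in which the global weight is 1 - H/log(n) rather than a literal division by the entropy (the intuition, down-weighting evenly spread words, is the same):

        import numpy as np
        from scipy.sparse import csr_matrix
        from scipy.sparse.linalg import svds

        # Hypothetical toy word-document counts (rows = words, columns = documents).
        counts = np.array([
            [2.0, 0.0, 1.0, 0.0],
            [0.0, 3.0, 0.0, 1.0],
            [1.0, 1.0, 4.0, 1.0],
            [0.0, 2.0, 1.0, 3.0],
        ])

        # Log-entropy weighting: local weight log(1 + count); global weight
        # 1 - H_i / log(n_docs), which is near zero for words whose counts
        # are spread evenly over the documents (high entropy).
        n_docs = counts.shape[1]
        p = counts / counts.sum(axis=1, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            plogp = np.where(p > 0, p * np.log(p), 0.0)
        entropy = -plogp.sum(axis=1)
        X = np.log1p(counts) * (1.0 - entropy / np.log(n_docs))[:, None]

        # Truncated SVD of the (normally sparse) weighted matrix: C ~= U S V^T.
        U, S, Vt = svds(csr_matrix(X), k=2)
        word_vectors = U * S    # one k-dimensional row per word
        doc_vectors = Vt.T * S  # one k-dimensional row per document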

  3. COALS (Rohde et al., 2009)
    COALS makes a small change to HAL: the co-occurrence matrix obtained by HAL is normalized by converting the counts into correlations.

    Then, because negative correlations are unreliable, all negative entries are set to zero, giving the new co-occurrence matrix. Experiments show that this cleaning works well: the matrix remains high-dimensional and sparse, which allows a fast SVD. The original paper includes a figure showing that the clustering quality of the resulting vectors is also good.
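    As a minimal sketch of this normalization (the correlation formula below is our reading of the description above, and the final square-root transform of the positive values follows the COALS paper):

        import numpy as np

        def coals_normalize(counts):
            # counts: square word-by-word co-occurrence matrix of raw counts.
            counts = np.asarray(counts, dtype=float)
            T = counts.sum()
            row = counts.sum(axis=1, keepdims=True)  # marginal count per row word
            col = counts.sum(axis=0, keepdims=True)  # marginal count per column word
            # Pearson-style correlation of each pairwise count against the marginals.
            num = T * counts - row * col
            den = np.sqrt(row * (T - row) * col * (T - col))
            corr = np.divide(num, den, out=np.zeros_like(counts), where=den > 0)
            corr[corr < 0] = 0.0   # negative correlations are unreliable: zero them out
            return np.sqrt(corr)   # square-root transform of the remaining values

        # Toy usage on a 3-word co-occurrence matrix.
        M = np.array([[0, 4, 1],
                      [4, 0, 2],
                      [1, 2, 0]])
        print(coals_normalize(M))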

References:
1. HAL: Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203-208.
2. LSA: Deerwester, S. C., Dumais, S. T., Landauer, T. K., et al. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391-407.
3. COALS: Rohde, D. L., Gonnerman, L. M., & Plaut, D. C. An improved model of semantic similarity based on lexical co-occurrence. Cognitive Science (submitted).
