HAL, LSA, and COALS
This post introduces three classical statistical language models: HAL, LSA, and COALS.
Thinking off the top of your head, how could you represent a word?
1. By a hierarchy that becomes progressively broader
E.g., the lily:
Lilies < Flowers < Plants < Objects
2. By synonyms
E.g., "good":
good, fine, OK, nice, great, ...
These representations raise several problems:
- For adjectives, synonyms cannot express differences in degree.
- New words have no definition.
- They are subjective.
- It is hard to quantify the similarity between words.
To solve these problems, Firth proposed in 1957 an idea that has since become common throughout statistical NLP: represent a word by the words that appear around it in sentences. Specifically,
- "You shall know a word by the company it keeps." (Firth, 1957)
Hyperspace Analogue to Language (HAL)
HAL (Lund & Burgess, 1996) builds a word-word co-occurrence matrix that represents the association between any two words; with window size = 1, for instance, only immediately adjacent words are counted.
Here window size refers to the span of context that is counted: window size = 5 means that the 5 words on each side of a word are counted, and the weight decreases from 5 to 1 as the distance from the center word increases. The co-occurrence matrix gives each word a vector representation, and the similarity of any two words can then be measured with the reciprocal of the Euclidean distance, cosine similarity, or a correlation coefficient.
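As an illustration, here is a minimal sketch of this idea (my own toy example, not the original HAL implementation): build a weighted co-occurrence matrix and compare two words with cosine similarity. The corpus, the symmetric window, and the linear weighting are all illustrative assumptions.

```python
# A toy HAL-style co-occurrence matrix (illustrative corpus and weighting).
import numpy as np

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
window = 5  # words within this distance are counted; weight falls from 5 to 1

vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # closer neighbours get a larger weight (window .. 1)
                X[idx[w], idx[words[j]]] += window - abs(i - j) + 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

print(cosine(X[idx["deep"]], X[idx["nlp"]]))  # similarity of two row vectors
```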
But there are several issues:
- As the vocabulary grows, the matrix grows with it and consumes a lot of memory.
- The matrix is very sparse, so sparse models have to be considered for the corresponding classification problems.
So the question is: could we project down to a low dimension and obtain a dense version of the co-occurrence matrix X?
Latent Semantic Analysis (LSA)
In LSA (Deerwester et al., 1990; Landauer, Foltz, & Laham, 1998), the co-occurrence matrix is a word-document matrix whose entries are the frequency of each word in each document. After counting, the matrix is normalized (entropy normalization):
take the log of the raw count and divide by the word's entropy across the documents. If a word appears evenly across the documents, its entropy is large, so its normalized weight becomes small, indicating that we are not interested in such words.
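Written out, the weighting just described looks roughly like the following (a sketch based on the description above, with a +1 inside the log so that zero counts stay finite; actual LSA implementations use slightly different log-entropy variants, e.g. multiplying by a global weight 1 - H_i / log n instead of dividing by H_i):

```latex
% Hypothetical log-entropy weighting following the description above
% (c_{ij} = count of word i in document j).
w_{ij} = \frac{\log\left(1 + c_{ij}\right)}{H_i},
\qquad
H_i = -\sum_{j} p_{ij}\,\log p_{ij},
\qquad
p_{ij} = \frac{c_{ij}}{\sum_{j'} c_{ij'}}
```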
After this preprocessing, we apply SVD to the co-occurrence matrix, C = U S V^T, keep the k largest singular values, and take the corresponding k columns of U and V: each row of the truncated U is then the vector representing a word, and each row of the truncated V is the vector representing a document. Because the matrix is sparse, the SVD is not too slow even when the matrix is large.
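Below is a minimal sketch of this pipeline, assuming a tiny dense toy matrix for readability (a real word-document matrix would stay sparse); the exact weighting and the scaling of the vectors by the singular values are illustrative choices, not the paper's exact recipe.

```python
# A minimal LSA sketch: log-entropy weighting, then truncated SVD.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from scipy.stats import entropy

# Toy word-document count matrix C (rows = words, columns = documents).
C = np.array([[3, 0, 1, 0],
              [0, 2, 0, 2],
              [1, 1, 1, 1],
              [0, 0, 4, 1]], dtype=float)

# Entropy of each word's distribution over documents; the +1 below avoids
# dividing by zero for words that occur in a single document.
H = entropy(C, axis=1)
W = np.log(1.0 + C) / (1.0 + H)[:, None]

# Truncated SVD keeping the k largest singular values of the sparse matrix.
k = 2
U, S, Vt = svds(csr_matrix(W), k=k)

word_vectors = U * S    # one k-dimensional row vector per word
doc_vectors = Vt.T * S  # one k-dimensional row vector per document
print(word_vectors.shape, doc_vectors.shape)  # (4, 2) (4, 2)
```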
COALS (Rohde et al., 2009)
COALS makes a small change to HAL: the co-occurrence matrix obtained by HAL is normalized by correlation,
and then, because negative correlations are unreliable, all negative values are set to zero to obtain the new co-occurrence matrix. Experiments show that this data cleaning works better; the matrix remains high-dimensional and sparse, so a fast SVD can still be used. The figure in the paper shows that the clustering of the resulting vectors is also good.
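Here is a sketch of the normalization step described above (an illustration under my own assumptions, not the authors' code): convert the raw counts to correlation-style values using the row, column, and grand totals, then zero out the negatives.

```python
# A sketch of COALS-style correlation normalization of a HAL count matrix.
import numpy as np

def coals_normalize(X):
    """X: raw word-by-word co-occurrence counts."""
    total = X.sum()
    row = X.sum(axis=1, keepdims=True)
    col = X.sum(axis=0, keepdims=True)
    # Correlation of each cell relative to what the row/column totals predict.
    num = total * X - row * col
    den = np.sqrt(row * (total - row) * col * (total - col))
    corr = num / np.maximum(den, 1e-12)
    corr[corr < 0] = 0.0  # negative correlations are unreliable, so drop them
    return corr           # (the paper additionally square-roots the positives)

X = np.array([[0., 3., 1.],
              [3., 0., 2.],
              [1., 2., 0.]])
print(coals_normalize(X))
```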
References:
1. HAL: Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203-208.
2. LSA: Deerwester, S. C., Dumais, S. T., Landauer, T. K., et al. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391-407.
3. COALS: Rohde, D. L., Gonnerman, L. M., & Plaut, D. C. An improved model of semantic similarity based on lexical co-occurrence. Cognitive Science. Submitted.