(deep) Neural Networks (deep learning), NLP and Text Mining
I recently went through a number of papers on applying deep learning / neural networks to NLP and text mining, including Word2vec, and extracted their key ideas into a list. If you are interested, you can download it here:
http://pan.baidu.com/s/1sjNQEfz
I did not put my own thoughts into it; if you have your own views, let's discuss them.
Here is a brief summary of some of these papers:
- Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A Neural Probabilistic Language Model." Journal of Machine Learning Research 3 (2003): 1137-1155.
One of the seminal works on neural network language models; later work, including Word2vec, is essentially an optimization of it. The softmax step in this model is too expensive: computing the conditional probability of a single word requires scoring every word in the dictionary just to normalize. So there are many papers, including Word2vec, that optimize this step.
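To make that cost concrete, here is a minimal sketch (the vocabulary size, dimensions, and matrix names are made up for illustration): producing one conditional probability already requires scoring all |V| words.

```python
import numpy as np

V, d = 50_000, 300               # vocabulary size, embedding dimension
W_out = np.random.randn(V, d)    # output word vectors, one row per word
h = np.random.randn(d)           # hidden/context representation for one position

def softmax_prob(word_idx):
    """P(word | context): scoring ALL |V| words is needed just to normalize."""
    scores = W_out @ h           # O(|V| * d) work for a single prediction
    scores -= scores.max()       # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[word_idx] / exp_scores.sum()

print(softmax_prob(42))
```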
- Morin, Frederic, and Yoshua Bengio. "Hierarchical Probabilistic Neural Network Language Model." In AISTATS, vol. 5, pp. 246-252. 2005.
- Mnih, Andriy, and Geoffrey E. Hinton. "A Scalable Hierarchical Distributed Language Model." In Advances in Neural Information Processing Systems, pp. 1081-1088. 2009.
These two papers, one by Yoshua Bengio and one by Geoffrey Hinton, reduce the time complexity of the original model from a hierarchical angle. This post explains them well: http://blog.csdn.net/mytestmy/article/details/26969149
- Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space." ICLR (2013).
This paper uses a Huffman tree to do hierarchical softmax: because frequent words get shorter codes, the number of vector-product operations is greatly reduced. How do you use a tree to represent P(w | context(w))? It is simple: each left/right branch is treated as a binary classification problem. The probability of taking one branch can be defined as σ(θᵀx), so the probability of the other branch is 1 − σ(θᵀx); the final conditional probability is the product of these branch probabilities along the path from the root to the word. The paper presents two models, CBOW and Skip-gram, with different structures but similar training: maximize the likelihood, take derivatives, and run SGD.
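A minimal sketch of that path-product idea, with a made-up toy tree, codes, and node vectors (not the actual word2vec implementation): each internal node is a binary decision σ(θᵀx), and P(w | context) is the product of those decisions along the word's path.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hs_probability(x, path_nodes, code, node_vectors):
    """P(w | context) under hierarchical softmax.

    x            : context representation (e.g. averaged input vectors in CBOW)
    path_nodes   : indices of the internal tree nodes on the root-to-word path
    code         : Huffman code of the word, one 0/1 decision per node
    node_vectors : parameter vector theta for every internal node
    """
    p = 1.0
    for node, bit in zip(path_nodes, code):
        s = sigmoid(node_vectors[node] @ x)
        # one branch gets probability sigma(theta^T x), the other 1 - sigma(theta^T x)
        p *= s if bit == 0 else (1.0 - s)
    return p

# toy usage with made-up numbers
d = 4
node_vectors = np.random.randn(7, d)   # 7 internal nodes in a toy tree
x = np.random.randn(d)
print(hs_probability(x, path_nodes=[0, 2, 5], code=[0, 1, 0], node_vectors=node_vectors))
```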
When should you use CBOW and when Skip-gram? The experiments show that Skip-gram does better on word semantic tasks, while CBOW does better on syntactic tasks.
- Mikolov, Tomas, et al. "Distributed Representations of Words and Phrases and Their Compositionality." NIPS (2013).
This is a continuation of the optimization in the previous paper: the original problem is recast directly as a binary classification problem (negative sampling). How are the negative samples drawn? The method is very simple: weighted sampling by word frequency. The author describes it as an engineering trick used at implementation time for higher efficiency; you can precompute a table that maps random numbers to words.
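A rough sketch of that table trick, assuming unigram counts and the 3/4 damping power from the paper: each word occupies a number of slots proportional to its damped frequency, so drawing a negative sample is just indexing the table with a random integer.

```python
import numpy as np

def build_unigram_table(word_counts, table_size=1_000_000, power=0.75):
    """Map random integers -> word ids, proportional to count**power."""
    words = list(word_counts)
    probs = np.array([word_counts[w] for w in words], dtype=float) ** power
    probs /= probs.sum()
    slots = (probs * table_size).astype(int)          # slots per word
    table = np.repeat(np.arange(len(words)), slots)   # word id repeated per slot
    return words, table

def sample_negatives(table, k, rng=np.random):
    """Draw k negative samples by uniform random indexing into the table."""
    return table[rng.randint(0, len(table), size=k)]

words, table = build_unigram_table({"the": 500, "cat": 30, "sat": 20, "mat": 10})
neg_ids = sample_negatives(table, k=5)
print([words[i] for i in neg_ids])
```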
Tomas Mikolov has since moved to Facebook; Joey says he is a very smart guy.
Word2vec has many useful properties. One is simply that every word becomes a vector, and on top of that you can do things such as machine translation:
- Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. "Exploiting Similarities among Languages for Machine Translation." arXiv preprint arXiv:1309.4168 (2013).
A Word2vec model is trained on the corpus of each language, say English and Spanish, giving two vector spaces; the vector dimensions can be the same or different. The goal is simple: learn a mapping matrix W that minimizes ‖Wx − y‖². What is this matrix? It is just a linear transformation.
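A minimal sketch of learning such a mapping with ordinary least squares; the vector pairs below are random placeholders for real (source word, translated word) embedding pairs, and the solver returns W in a form that acts on the right (x @ W), i.e. the transpose of the W in the text.

```python
import numpy as np

rng = np.random.RandomState(0)
n, d_src, d_tgt = 5000, 300, 200           # dimensions can differ between languages
X = rng.randn(n, d_src)                    # e.g. English word vectors (source)
Y = rng.randn(n, d_tgt)                    # Spanish vectors of their translations

# Minimize sum_i ||W x_i - y_i||^2; lstsq solves it column by column.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # shape (d_src, d_tgt)

def nearest_translation(x, target_vectors):
    """Map a source vector into the target space and return the closest word index."""
    z = x @ W
    sims = target_vectors @ z / (np.linalg.norm(target_vectors, axis=1) * np.linalg.norm(z) + 1e-8)
    return int(np.argmax(sims))

print(nearest_translation(X[0], Y))
```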
Another interesting property of Word2vec is that v(king) − v(queen) + v(woman) ≈ v(man). There are a lot of things you can do with this property, such as:
- Fu, Ruiji, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. "Learning Semantic Hierarchies via Word Embeddings." ACL, 2014.
After Word2vec gives you word vectors, this paper learns the semantic hypernym-hyponym (is-a) relations between words. The method is very simple, but the paper first applies clustering, which makes good sense: the offset vectors y − x are clustered with k-means, the idea being that pairs within the same cluster should share a similar relation.
Each cluster then learns its own mapping matrix W for that word relation. The paper uses the Word2vec Skip-gram model to produce the word features.
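A rough sketch of my reading of that pipeline, with random vectors standing in for real hypernym-hyponym pairs (x, y): cluster the offsets y − x with k-means, then fit one linear mapping per cluster by least squares.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
n, d, k = 2000, 100, 5
X = rng.randn(n, d)                 # hyponym word vectors x
Y = rng.randn(n, d)                 # corresponding hypernym word vectors y

# Cluster the offset vectors y - x: pairs in one cluster should share a relation.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y - X)

# Fit one mapping matrix per cluster: minimize ||x W - y||^2 over that cluster's pairs.
mappings = {}
for c in range(k):
    idx = labels == c
    W, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
    mappings[c] = W
```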
With regard to CNNs applied to sentences, there are several papers. First:
- Kalchbrenner, Nal, Edward Grefenstette, and Phil Blunsom. "A Convolutional Neural Network for Modelling Sentences." ACL 2014.
This is work from Blunsom's group. Applying a CNN to a sentence is very simple: the convolutional layer slides a window over the sentence, so that several adjacent words are convolved together and word order and context are taken into account.
A CNN also has a pooling step; in the paper above the authors propose k-max pooling, which keeps the k largest values instead of only the single maximum.
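A minimal numpy sketch of both ideas, with random embeddings and a single random filter as placeholders: convolve a 3-word window over the sentence, then keep the k largest activations in their original order, which is the essence of k-max pooling.

```python
import numpy as np

rng = np.random.RandomState(0)
sentence = rng.randn(10, 50)       # 10 words, 50-dimensional word vectors
filt = rng.randn(3, 50)            # one convolution filter spanning 3-word windows

# Convolve the filter over every 3-word window: one activation per window position.
conv = np.array([np.sum(sentence[i:i + 3] * filt) for i in range(len(sentence) - 2)])

def k_max_pooling(values, k):
    """Keep the k largest activations, preserving their original left-to-right order."""
    top = np.sort(np.argsort(values)[-k:])
    return values[top]

print(k_max_pooling(conv, k=3))
```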
- Kim, Yoon. "Convolutional Neural Networks for Sentence Classification." arXiv preprint, 2014.
The idea of this paper is relatively simple; there is not much to say.
There are many sentence-classification tasks, such as sentiment classification and so on.
- Zeng, Daojian, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. "Relation Classification via Convolutional Deep Neural Network." COLING 2014 (Best Paper).
This year's COLING best paper. To describe the relation between words, it extracts many features; the sentence-level features are extracted with a CNN, in just a few steps: convolution, pooling, softmax.
- Le, Quoc V., and Tomas Mikolov. "Distributed Representations of Sentences and Documents." ICML (2014).
An extension of the Word2vec model from words to sentences and documents.
In fact, we all feel that since deep models can extract the latent variables of images and other signals, it should be very natural for them to extract the topics of text as well. LDA and its relatives describe a topic as nothing more than a distribution over the whole dictionary; a deep model should be able to describe a latent topic with a much shorter vector. Of course, there is already similar work; here is a first list:
- Wan, Li, Leo Zhu, and Rob Fergus. "A Hybrid Neural Network-Latent Topic Model." AISTATS 2012.
- Larochelle, Hugo, and Stanislas Lauly. "A Neural Autoregressive Topic Model." NIPS 2012.
- Hinton, Geoffrey E., and Ruslan Salakhutdinov. "Replicated Softmax: An Undirected Topic Model." NIPS 2009.
- Srivastava, Nitish, Ruslan R. Salakhutdinov, and Geoffrey E. Hinton. "Modeling Documents with Deep Boltzmann Machines." UAI (2013).
- Hinton, Geoffrey, and Ruslan Salakhutdinov. "Discovering Binary Codes for Documents by Learning Deep Generative Models." Topics in Cognitive Science 3, no. 1 (2011): 74-91.
- Salakhutdinov, Ruslan, Joshua B. Tenenbaum, and Antonio Torralba. "Learning to Learn with Compound HD Models." NIPS (2011).