Keras Chinese documentation notes 16: Using pre-trained word vectors

What is a word vector?

"Word vector" (word embedding) is a kind of natural language processing technology that maps the semantics of words into vector space. A word is represented by a specific vector, and the distance between the vectors (for example, the L2 paradigm distance between any two vectors, or the more commonly used cosine distance), partly characterizes the semantic relationship between words. The geometric space formed by these vectors is called an embedded space.

Ideally, in a good embedding space, the "path" vector from the "kitchen" vector to the "dinner" vector accurately captures the semantic relationship between the two concepts. In this case the path represents "place of occurrence", so you would expect the difference between the "kitchen" and "dinner" vectors to capture that relationship. Basically, we should have an approximate vector equation: dinner + place of occurrence = kitchen. If that is the case, we can use such a relation vector to answer questions. For example, applying the same relation to a new vector such as "work" should give a meaningful equation, work + place of occurrence = office, answering "where does work happen?".
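The analogy amounts to a nearest-neighbour lookup. The sketch below uses a toy, hand-built embedding matrix in which the relation holds exactly; real embeddings would only satisfy it approximately.

```python
import numpy as np

words = ["kitchen", "dinner", "office", "work"]
idx = {w: i for i, w in enumerate(words)}
# Toy 4x3 embedding matrix, laid out so the analogy holds exactly.
E = np.array([[1.0, 1.0, 0.0],   # kitchen = dinner + "place" offset
              [0.0, 1.0, 0.0],   # dinner
              [1.0, 0.0, 1.0],   # office = work + "place" offset
              [0.0, 0.0, 1.0]])  # work

def nearest(query, E, words, exclude=()):
    # cosine similarity of the query against every row of E
    sims = E @ query / (np.linalg.norm(E, axis=1) * np.linalg.norm(query))
    for i in np.argsort(-sims):
        if words[i] not in exclude:
            return words[i]

offset = E[idx["kitchen"]] - E[idx["dinner"]]  # the "place of occurrence" vector
print(nearest(E[idx["work"]] + offset, E, words, exclude={"work"}))  # -> "office"
```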

Word vectors characterize the co-occurrence of words in text datasets through dimensionality reduction. The methods include neural networks (the "word2vec" technique) and matrix factorization.

GloVe word vectors

This article uses GloVe word vectors. GloVe is short for "Global Vectors for Word Representation", a word vector model based on factorizing a co-occurrence matrix. The GloVe vectors used here were trained on the 2014 English Wikipedia; the vocabulary contains 400k distinct words, each represented by a 100-dimensional vector.
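The vectors ship as a plain-text file, one word followed by its vector per line. A common way to parse it into a dictionary looks like the sketch below; the file name glove.6B.100d.txt assumes the archive from the Stanford NLP site has been unpacked locally.

```python
# Parse the GloVe text file into a {word: vector} dictionary.
import numpy as np

embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]                                  # the token itself
        vector = np.asarray(values[1:], dtype="float32")  # its 100-d vector
        embeddings_index[word] = vector

print(f"Loaded {len(embeddings_index)} word vectors.")  # roughly 400k
```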

The Newsgroup dataset

The dataset used in this article is the famous 20 Newsgroup dataset, which contains news texts in 20 categories; we will implement a text classification task on it.
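The original post reads the corpus from a directory of text files; purely as an assumption, to keep the sketch self-contained, the snippet below fetches the same corpus through scikit-learn's built-in loader instead.

```python
# Fetch the 20 Newsgroups corpus via scikit-learn (an assumed shortcut;
# the original article reads the raw text files from disk).
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
texts, labels = data.data, data.target
print(len(texts), "documents in", len(data.target_names), "categories")
```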

Experimental Methods

Here are the steps we take to solve the classification problem (a runnable sketch follows the list):

1. Convert all news samples into sequences of word indices. A word index simply assigns an integer ID to each word. Traversing all the news texts, we keep only the 20,000 most frequent words, and each news text retains at most 1,000 words.
2. Generate a word vector matrix, in which row i is the word vector for the word with index i.
3. Load the word vector matrix into a Keras Embedding layer and set the layer's weights to be non-trainable (that is, the word vectors will not change during training).
4. Connect the Embedding layer to 1D convolution layers and output the news category with a softmax fully-connected layer.
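Below is a minimal end-to-end sketch of these steps, assuming texts and labels from the previous snippet and embeddings_index from the GloVe-loading snippet. The constants mirror the article (20,000-word vocabulary, 1,000-token sequences, 100-dimensional vectors, 20 classes), while the exact layer sizes are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models, utils, initializers

MAX_NUM_WORDS = 20000      # keep only the 20,000 most frequent words
MAX_SEQUENCE_LENGTH = 1000 # each news text keeps at most 1,000 tokens
EMBEDDING_DIM = 100        # dimensionality of the GloVe vectors
NUM_CLASSES = 20           # 20 news categories

# Step 1: turn each news text into a sequence of word indices.
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)  # pad/truncate to 1000
y = utils.to_categorical(labels, NUM_CLASSES)

# Step 2: build the embedding matrix; row i holds the GloVe vector for
# the word with index i (rows stay zero for words without a vector).
num_words = min(MAX_NUM_WORDS, len(tokenizer.word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    if i < num_words:
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

# Steps 3-4: a frozen Embedding layer feeding 1D convolutions, ending
# in a softmax over the 20 news categories.
model = models.Sequential([
    layers.Embedding(num_words, EMBEDDING_DIM,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     trainable=False),  # word vectors stay fixed during training
    layers.Conv1D(128, 5, activation="relu"),
    layers.MaxPooling1D(5),
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="rmsprop",
              metrics=["accuracy"])
model.fit(x, y, batch_size=128, epochs=10, validation_split=0.2)
```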

Using pre-trained word vectors as features is very effective. In general, in natural language processing tasks, when the number of training samples is very small, using pre-trained word vectors is a good choice (in fact, pre-trained word vectors introduce external semantic information, which is often useful to the model).

Word2vec and GloVe

In China, Rachel-zhang used scikit-learn to run experiments on the same dataset with traditional machine learning algorithms. In the paper that introduced the GloVe vectors, Richard Socher and his co-authors reported that GloVe outperforms word2vec. Later studies, however, point the other way: Schnabel et al.'s evaluation of word vectors shows that word2vec beats GloVe and C&W vectors on most of the evaluation metrics. A follow-up to this article could run a comparison experiment with the Google News word2vec vectors.
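Such a comparison would only change how the embedding dictionary is built. A hedged sketch with gensim, assuming the publicly distributed GoogleNews-vectors-negative300.bin file is available locally:

```python
# Load the Google News word2vec vectors with gensim and use them in
# place of embeddings_index in the earlier snippets. Note these
# vectors are 300-dimensional, so EMBEDDING_DIM would become 300.
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Look up a vector the same way as with the GloVe dictionary.
vector = w2v["kitchen"] if "kitchen" in w2v else None
```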
