Keras Chinese documentation notes 16: Using pre-trained word vectors

What is a word vector?

"Word vector" (word embedding) is a kind of natural language processing technology that maps the semantics of words into vector space. A word is represented by a specific vector, and the distance between the vectors (for example, the L2 paradigm distance between any two vectors, or the more commonly used cosine distance), partly characterizes the semantic relationship between words. The geometric space formed by these vectors is called an embedded space.

Ideally, in a good embedding space, the "path" vector from the "kitchen" vector to the "dinner" vector accurately captures the semantic relationship between the two concepts. In this case the path represents "place of occurrence", so you would expect the difference between the "kitchen" and "dinner" vectors to capture that relationship. Basically, we should have an approximate vector equation: dinner + place of occurrence = kitchen. If that is the case, we can use such a relation vector to answer questions. For example, applying the same relation to a new vector such as "work" should give a meaningful equation, work + place of occurrence = office, answering "where does work happen?".
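The analogy amounts to a nearest-neighbour lookup. The sketch below uses a toy, hand-built embedding matrix in which the relation holds exactly; real embeddings would only satisfy it approximately.

```python
import numpy as np

words = ["kitchen", "dinner", "office", "work"]
idx = {w: i for i, w in enumerate(words)}
# Toy 4x3 embedding matrix, laid out so the analogy holds exactly.
E = np.array([[1.0, 1.0, 0.0],   # kitchen = dinner + "place" offset
              [0.0, 1.0, 0.0],   # dinner
              [1.0, 0.0, 1.0],   # office = work + "place" offset
              [0.0, 0.0, 1.0]])  # work

def nearest(query, E, words, exclude=()):
    # cosine similarity of the query against every row of E
    sims = E @ query / (np.linalg.norm(E, axis=1) * np.linalg.norm(query))
    for i in np.argsort(-sims):
        if words[i] not in exclude:
            return words[i]

offset = E[idx["kitchen"]] - E[idx["dinner"]]  # the "place of occurrence" vector
print(nearest(E[idx["work"]] + offset, E, words, exclude={"work"}))  # -> "office"
```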

Word vectors characterize the co-occurrence of words in text datasets through dimensionality reduction. The methods include neural networks (the "word2vec" technique) and matrix factorization.

GloVe word vectors

This article uses GloVe word vectors. GloVe is short for "Global Vectors for Word Representation", a word vector model based on factorizing a co-occurrence matrix. The GloVe vectors used here were trained on the 2014 English Wikipedia; the vocabulary contains 400k distinct words, each represented by a 100-dimensional vector.
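The vectors ship as a plain-text file, one word followed by its vector per line. A common way to parse it into a dictionary looks like the sketch below; the file name glove.6B.100d.txt assumes the archive from the Stanford NLP site has been unpacked locally.

```python
# Parse the GloVe text file into a {word: vector} dictionary.
import numpy as np

embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]                                  # the token itself
        vector = np.asarray(values[1:], dtype="float32")  # its 100-d vector
        embeddings_index[word] = vector

print(f"Loaded {len(embeddings_index)} word vectors.")  # roughly 400k
```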

The Newsgroup dataset

The dataset used in this article is the famous 20 Newsgroup dataset, which contains news texts in 20 categories; we will implement a text classification task on it.
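The original post reads the corpus from a directory of text files; purely as an assumption, to keep the sketch self-contained, the snippet below fetches the same corpus through scikit-learn's built-in loader instead.

```python
# Fetch the 20 Newsgroups corpus via scikit-learn (an assumed shortcut;
# the original article reads the raw text files from disk).
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
texts, labels = data.data, data.target
print(len(texts), "documents in", len(data.target_names), "categories")
```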

Experimental Methods

Here are the steps we take to solve the classification problem (a runnable sketch follows the list):

1. Convert all news samples into sequences of word indices. A word index simply assigns an integer ID to each word. Traversing all the news texts, we keep only the 20,000 most frequent words, and each news text retains at most 1,000 words.
2. Generate a word vector matrix, in which row i is the word vector for the word with index i.
3. Load the word vector matrix into a Keras Embedding layer and set the layer's weights to be non-trainable (that is, the word vectors will not change during training).
4. Connect the Embedding layer to 1D convolution layers and output the news category with a softmax fully-connected layer.
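Below is a minimal end-to-end sketch of these steps, assuming texts and labels from the previous snippet and embeddings_index from the GloVe-loading snippet. The constants mirror the article (20,000-word vocabulary, 1,000-token sequences, 100-dimensional vectors, 20 classes), while the exact layer sizes are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models, utils, initializers

MAX_NUM_WORDS = 20000      # keep only the 20,000 most frequent words
MAX_SEQUENCE_LENGTH = 1000 # each news text keeps at most 1,000 tokens
EMBEDDING_DIM = 100        # dimensionality of the GloVe vectors
NUM_CLASSES = 20           # 20 news categories

# Step 1: turn each news text into a sequence of word indices.
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)  # pad/truncate to 1000
y = utils.to_categorical(labels, NUM_CLASSES)

# Step 2: build the embedding matrix; row i holds the GloVe vector for
# the word with index i (rows stay zero for words without a vector).
num_words = min(MAX_NUM_WORDS, len(tokenizer.word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    if i < num_words:
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

# Steps 3-4: a frozen Embedding layer feeding 1D convolutions, ending
# in a softmax over the 20 news categories.
model = models.Sequential([
    layers.Embedding(num_words, EMBEDDING_DIM,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     trainable=False),  # word vectors stay fixed during training
    layers.Conv1D(128, 5, activation="relu"),
    layers.MaxPooling1D(5),
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="rmsprop",
              metrics=["accuracy"])
model.fit(x, y, batch_size=128, epochs=10, validation_split=0.2)
```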

Using pre-trained word vectors as features is very effective. In general, in natural language processing tasks, when the number of training samples is very small, using pre-trained word vectors is a good choice (in fact, pre-trained word vectors introduce external semantic information, which is often useful to the model).

Word2vec and GloVe

In China, Rachel-zhang used scikit-learn to run experiments on the same dataset with traditional machine learning algorithms. In the paper that introduced the GloVe vectors, Richard Socher and his co-authors reported that GloVe outperforms word2vec. Later studies, however, point the other way: Schnabel et al.'s evaluation of word vectors shows that word2vec beats GloVe and C&W vectors on most of the evaluation metrics. A follow-up to this article could run a comparison experiment with the Google News word2vec vectors.
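Such a comparison would only change how the embedding dictionary is built. A hedged sketch with gensim, assuming the publicly distributed GoogleNews-vectors-negative300.bin file is available locally:

```python
# Load the Google News word2vec vectors with gensim and use them in
# place of embeddings_index in the earlier snippets. Note these
# vectors are 300-dimensional, so EMBEDDING_DIM would become 300.
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Look up a vector the same way as with the GloVe dictionary.
vector = w2v["kitchen"] if "kitchen" in w2v else None
```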
