Lecture 2: Simple Word Vector Representations: word2vec, GloVe
When reposting, please credit the source and keep this link: "I Love Natural Language Processing": http://www.52nlp.cn
Permalink for this article: Stanford Deep Learning and Natural Language Processing, Lecture 2: Word Vectors
Recommended reading:
- Paper 1: [Distributed Representations of Words and Phrases and their Compositionality]
- Paper 2: [Efficient Estimation of Word Representations in Vector Space]
- Slides for the second lecture [slides]
- Video of the second lecture
The following notes for the second lecture are based mainly on the course slides, the lecture video, and other related material.
How to represent the meaning of a word (meaning)
- The English definition of "meaning" (from Webster's Dictionary):
- the idea that is represented by a word, phrase, etc.
- the idea that a person wants to express by using words, signs, etc.
- the idea that is expressed in a work of writing, art, etc.
How to represent the meaning of a word on a computer
- A semantic dictionary such as WordNet is commonly used; it encodes hypernym ("is-a") relationships and synonym sets
- Example: the hypernyms of "panda", demonstrated through NLTK's WordNet interface (a sketch follows this list)
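A minimal sketch of the demo mentioned above, assuming NLTK is installed and the WordNet corpus has been downloaded (`nltk.download('wordnet')`):

```python
# List the hypernyms ("is-a" ancestors) of "panda" via NLTK's WordNet interface.
from nltk.corpus import wordnet as wn

panda = wn.synset('panda.n.01')
hyper = lambda s: s.hypernyms()
# closure() walks the hypernym relation transitively up to the root.
print(list(panda.closure(hyper)))
```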
Problems with semantic dictionaries
- They are great resources but miss nuances; for example, are adept, expert, good, practiced, proficient, and skillful really exact synonyms?
- They miss new words and are nearly impossible to keep up to date: wicked, badass, nifty, crack, ace, wizard, genius, ninja
- They are somewhat subjective
- They require a great deal of human effort to build and maintain
- They make it hard to compute accurate similarity between two words
One-hot representation
- Traditional rule-based and statistical NLP treats every word as an atomic symbol: hotel, conference, walk
- In vector-space terms, each word is a vector with a single 1 and a great many 0s: [0,0,0,0,...,0,1,0,...,0,0,0]
- Dimensionality: 20K (speech), 50K (PTB), 500K (big vocabulary), 13M (Google 1T)
- This is the "one-hot" representation. Its major problem is the "lexical gap": any two words are isolated, and the two vectors alone reveal no relationship between the words (see the small illustration below)
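A tiny illustration of the lexical gap, using a hypothetical five-word vocabulary (the word list is only for demonstration):

```python
# One-hot vectors for related words are orthogonal, so their similarity is 0.
import numpy as np

vocab = ["hotel", "motel", "conference", "walk", "the"]
hotel = np.eye(len(vocab))[vocab.index("hotel")]  # [1, 0, 0, 0, 0]
motel = np.eye(len(vocab))[vocab.index("motel")]  # [0, 1, 0, 0, 0]
print(hotel @ motel)  # 0.0 -- no relationship can be read off the vectors
```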
Distributional Similarity based representations
- Much of a word's meaning can be learned from the contexts in which it appears
- This is one of the most successful ideas of modern statistical NLP
How to use context to represent a word
- Answer: use a co-occurrence matrix X
- Two options: full documents or a context window
- A word-document co-occurrence matrix yields general topics (e.g., sports terms get similar entries), which leads to Latent Semantic Analysis (LSA)
- A context window captures both syntactic (POS) and semantic information
Window-based co-occurrence matrix: a simple example
- Window length of 1 (more commonly 5-10)
- Symmetric (whether the context is to the left or right does not matter)
- Example corpus (a sketch of building the matrix follows this list):
- I like deep learning.
- I like NLP.
- I enjoy flying.
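A minimal sketch of building the window-based co-occurrence matrix for the three sentences above (window length 1, symmetric). The tokenization and word ordering are illustrative choices:

```python
import numpy as np

sentences = [["I", "like", "deep", "learning", "."],
             ["I", "like", "NLP", "."],
             ["I", "enjoy", "flying", "."]]

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in sentences:
    for i, word in enumerate(sent):
        # Count every word within `window` positions to the left or right.
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                X[index[word], index[sent[j]]] += 1

print(vocab)
print(X)
```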
Problems with this approach
- The matrix grows with vocabulary size
- Very high dimensional: requires a lot of storage
- Classification models built on top of it suffer from sparsity
- The resulting models are not robust enough
Solution: low-dimensional vectors
- Idea: store the most important information in a fixed, small number of dimensions: a dense vector
- Usually 25-1000 dimensions
- Question: how do we reduce the dimensionality?
Method 1: SVD (Singular Value Decomposition)
- Perform a singular value decomposition of the co-occurrence matrix X
Simple SVD word vectors in Python
- Corpus: I like deep learning. I like NLP. I enjoy flying.
- Print the first two columns of the U matrix, which correspond to the two largest singular values (a sketch follows this list)
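A sketch of this SVD example (the co-occurrence matrix below is the window-1 matrix for the corpus above; the word ordering is an illustrative choice):

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([[0, 2, 1, 0, 0, 0, 0, 0],
              [2, 0, 0, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 1, 1, 0]])

U, s, Vh = np.linalg.svd(X, full_matrices=False)
# The first two columns of U give each word a 2-dimensional "embedding".
for word, row in zip(words, U[:, 0:2]):
    print(f"{word:10s} {row}")
```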
Using vectors to define the meaning of a word:
- In related models, including deep learning models, a word is usually represented as a dense vector
Hacks to X
- Function words (the, he, has) are too frequent, so syntax has too much impact; one fix is to cap their counts or to ignore function words altogether
- Use ramped windows that weight closer words more heavily
- Use Pearson correlations instead of raw counts, setting negative values to 0
- Etc.
Interesting semantic patterns emerge in the resulting word vectors
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
Problems with SVD
- Computational cost scales badly: for an n×m matrix it is O(mn^2) (when n < m), which is infeasible for millions of words or documents
- It is hard to incorporate new words or new documents
- Its learning regime differs from that of other deep learning models
Solution: directly learn low-dimensional word vectors
- Some methods (relevant to this lecture and to deep learning):
- Learning Representations by Back-propagating Errors (Rumelhart et al., 1986)
- A Neural Probabilistic Language Model (Bengio et al., 2003)
- Natural Language Processing (Almost) from Scratch (Collobert & Weston, 2008)
- word2vec (Mikolov et al.), introduced in this lecture
The main idea of word2vec
- Instead of using general co-occurrence counts, word2vec predicts the words surrounding each word
- GloVe follows a similar idea: GloVe: Global Vectors for Word Representation
- Both make it easy and fast to incorporate new sentences and documents, or to add new words to the vocabulary
The main idea of word2vec
- Predict the probability of the surrounding words for every word within a window of radius c
- Objective function: maximize the log probability of any context word given the current center word:
$$J(\theta) = \frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$
- The simplest form for $p(w_{t+j} \mid w_t)$ is the softmax:
$$p(w_{t+j} \mid w_t) = \frac{\exp\left({v'_{w_{t+j}}}^{\top} v_{w_t}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_t}\right)}$$
- Here $v$ and $v'$ are the "input" and "output" vectors of a word $w$ (so every word has two vector representations)
- This is essentially a "dynamic" logistic regression (a sketch of the softmax computation follows this list)
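A minimal numpy sketch of the softmax above. The vocabulary size, dimensionality, and the random vectors are illustrative stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 5                       # vocabulary size, embedding dimension
v = rng.normal(size=(V, d))        # "input" (center-word) vectors
v_prime = rng.normal(size=(V, d))  # "output" (context-word) vectors

def p_outside_given_center(o, c):
    """Probability of outside word o given center word c (softmax over the vocab)."""
    scores = v_prime @ v[c]        # one score per vocabulary word
    scores -= scores.max()         # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_outside_given_center(o=3, c=7))
```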
Cost/objective function
- Our goal is to optimize (maximize or minimize) the cost/objective function
- A common method: gradient descent
- An example (from Wikipedia): find a local minimum of the function $f(x) = x^4 - 3x^3 + 2$, whose derivative is $f'(x) = 4x^3 - 9x^2$
- Python code:
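The original snippet is not reproduced in these notes; below is a sketch of the standard Wikipedia example, with the starting point, step size, and tolerance as illustrative choices:

```python
# Gradient descent on f(x) = x^4 - 3x^3 + 2, using f'(x) = 4x^3 - 9x^2.
x_old = 0.0
x_new = 6.0          # start the search at x = 6
gamma = 0.01         # step size (learning rate)
precision = 1e-5     # stop when updates become smaller than this

def f_derivative(x):
    return 4 * x**3 - 9 * x**2

while abs(x_new - x_old) > precision:
    x_old = x_new
    x_new = x_old - gamma * f_derivative(x_old)

print("Local minimum occurs near x =", x_new)  # expected to be close to 2.25
```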
Deriving the gradient
- Done on the whiteboard (students who could not attend in person are encouraged to watch the whiteboard derivation in the course video)
- A useful formula
Linear relationships in word2vec
- These representations are very good at encoding similarity between words
- Dimensions of similarity can be probed with vector subtraction in the embedding space; for example, $x_{king} - x_{man} + x_{woman} \approx x_{queen}$
Count-based methods vs. direct prediction
GloVe: combining the advantages of both types of methods
- Fast training
- Scales to very large corpora
- Good performance even with small corpora and small vector sizes (the paper's objective is quoted below for reference)
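These notes do not write out the GloVe cost; for reference, the weighted least-squares objective from the GloVe paper named above is:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $f$ is a weighting function that down-weights very frequent pairs, and $w$, $\tilde{w}$, $b$, $\tilde{b}$ are the word vectors and biases being learned.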
GloVe results
- The words most similar to the English word frog
Word analogies
- Test for linear relationships between words (Mikolov et al., 2014); a sketch of the vector arithmetic follows
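A minimal sketch of the analogy test "a is to b as c is to ?" using vector arithmetic and cosine similarity. The tiny embedding table here is a random stand-in; in practice you would load pre-trained GloVe or word2vec vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "paris", "france"]
E = {w: rng.normal(size=50) for w in vocab}   # hypothetical embeddings

def analogy(a, b, c):
    """Return the word d maximizing cosine similarity to b - a + c (a : b :: c : d)."""
    target = E[b] - E[a] + E[c]
    best, best_sim = None, -np.inf
    for w, vec in E.items():
        if w in (a, b, c):            # exclude the query words themselves
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# With real pre-trained vectors this query is expected to return "queen".
print(analogy("man", "woman", "king"))
```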
GloVe visualization I
GloVe visualization II: company - CEO
GloVe visualization III: superlatives
Word embedding matrix
- A word embedding matrix is pre-trained
- Also called a look-up table (a sketch of the lookup follows this list)
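A minimal sketch of using the embedding matrix as a look-up table: each column stores one word's vector, and looking a word up is just an index into the matrix. The shapes and the word-to-index map below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, V = 50, 4                         # embedding dimension, vocabulary size
L = rng.normal(size=(n, V))          # embedding matrix, one column per word
word2idx = {"I": 0, "like": 1, "deep": 2, "learning": 3}

def lookup(word):
    return L[:, word2idx[word]]      # the word's n-dimensional vector

# Equivalently, multiplying L by a one-hot vector selects the same column.
one_hot = np.zeros(V)
one_hot[word2idx["deep"]] = 1.0
assert np.allclose(lookup("deep"), L @ one_hot)
```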
Advantages of low-dimensional word vectors
- What is the biggest advantage of deep-learned word vectors?
- Any information can be represented as word vectors and then propagated through neural networks
- Word vectors will be the foundation of the later lectures
- All of our semantic representations will be in vector form
- Longer phrases and sentences can also be composed from word vectors into more complex representations to solve harder tasks -> next lecture
Course notes index:
Stanford Deep Learning and Natural Language Processing, Lecture 1: Introduction
Resources:
Deep Learning in NLP (1): Word Vectors and Language Models
Singular value decomposition: "We Recommend a Singular Value Decomposition"
Stanford Deep Learning and Natural Language Processing, Lecture 2