Lecture 2: Simple Word Vector Representations: word2vec, GloVe
When reposting, please credit the source and keep this link: "I Love Natural Language Processing": http://www.52nlp.cn
Permalink for this article: Stanford Deep Learning and Natural Language Processing, Lecture 2: Word Vectors
Recommended reading:
- Paper 1: [Distributed Representations of Words and Phrases and their Compositionality]
- Paper 2: [Efficient Estimation of Word Representations in Vector Space]
- Slides for the second lecture [slides]
- Video of the second lecture
The following notes for the second lecture are based mainly on the course slides, the lecture video, and other related material.
How to represent the meaning of a word (meaning)
- The English definition of "meaning" (from Webster's Dictionary):
- the idea that is represented by a word, phrase, etc.
- the idea that a person wants to express by using words, signs, etc.
- the idea that is expressed in a work of writing, art, etc.
How to represent the meaning of a word on a computer
- A semantic dictionary such as WordNet is commonly used; it encodes hypernym ("is-a") relationships and synonym sets
- Example: the hypernyms of "panda", demonstrated through NLTK's WordNet interface (a sketch follows this list)
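A minimal sketch of the demo mentioned above, assuming NLTK is installed and the WordNet corpus has been downloaded (`nltk.download('wordnet')`):

```python
# List the hypernyms ("is-a" ancestors) of "panda" via NLTK's WordNet interface.
from nltk.corpus import wordnet as wn

panda = wn.synset('panda.n.01')
hyper = lambda s: s.hypernyms()
# closure() walks the hypernym relation transitively up to the root.
print(list(panda.closure(hyper)))
```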
Problems with semantic dictionaries
- They are great resources but miss nuances; for example, are adept, expert, good, practiced, proficient, and skillful really exact synonyms?
- They miss new words and are nearly impossible to keep up to date: wicked, badass, nifty, crack, ace, wizard, genius, ninja
- They are somewhat subjective
- They require a great deal of human effort to build and maintain
- They make it hard to compute accurate similarity between two words
One-hot representation
- Traditional rule-based and statistical NLP treats every word as an atomic symbol: hotel, conference, walk
- In vector-space terms, each word is a vector with a single 1 and a great many 0s: [0,0,0,0,...,0,1,0,...,0,0,0]
- Dimensionality: 20K (speech), 50K (PTB), 500K (big vocabulary), 13M (Google 1T)
- This is the "one-hot" representation. Its major problem is the "lexical gap": any two words are isolated, and the two vectors alone reveal no relationship between the words (see the small illustration below)
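A tiny illustration of the lexical gap, using a hypothetical five-word vocabulary (the word list is only for demonstration):

```python
# One-hot vectors for related words are orthogonal, so their similarity is 0.
import numpy as np

vocab = ["hotel", "motel", "conference", "walk", "the"]
hotel = np.eye(len(vocab))[vocab.index("hotel")]  # [1, 0, 0, 0, 0]
motel = np.eye(len(vocab))[vocab.index("motel")]  # [0, 1, 0, 0, 0]
print(hotel @ motel)  # 0.0 -- no relationship can be read off the vectors
```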
Distributional Similarity based representations
- Much of a word's meaning can be learned from the contexts in which it appears
- This is one of the most successful ideas of modern statistical NLP
How to use context to represent a word
- Answer: use a co-occurrence matrix X
- Two options: full documents or a context window
- A word-document co-occurrence matrix yields general topics (e.g., sports terms get similar entries), which leads to Latent Semantic Analysis (LSA)
- A context window captures both syntactic (POS) and semantic information
Window-based co-occurrence matrix: a simple example
- Window length of 1 (more commonly 5-10)
- Symmetric (whether the context is to the left or right does not matter)
- Example corpus (a sketch of building the matrix follows this list):
- I like deep learning.
- I like NLP.
- I enjoy flying.
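A minimal sketch of building the window-based co-occurrence matrix for the three sentences above (window length 1, symmetric). The tokenization and word ordering are illustrative choices:

```python
import numpy as np

sentences = [["I", "like", "deep", "learning", "."],
             ["I", "like", "NLP", "."],
             ["I", "enjoy", "flying", "."]]

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in sentences:
    for i, word in enumerate(sent):
        # Count every word within `window` positions to the left or right.
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                X[index[word], index[sent[j]]] += 1

print(vocab)
print(X)
```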
Problems with this approach
- The matrix grows with vocabulary size
- Very high dimensional: requires a lot of storage
- Classification models built on top of it suffer from sparsity
- The resulting models are not robust enough
Solution: low-dimensional vectors
- Idea: store the most important information in a fixed, small number of dimensions: a dense vector
- Usually 25-1000 dimensions
- Question: how do we reduce the dimensionality?
Method 1: SVD (Singular Value Decomposition)
- Perform a singular value decomposition of the co-occurrence matrix X
Simple SVD word vectors in Python
- Corpus: I like deep learning. I like NLP. I enjoy flying.
- Print the first two columns of the U matrix, which correspond to the two largest singular values (a sketch follows this list)
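A sketch of this SVD example (the co-occurrence matrix below is the window-1 matrix for the corpus above; the word ordering is an illustrative choice):

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([[0, 2, 1, 0, 0, 0, 0, 0],
              [2, 0, 0, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 1, 1, 0]])

U, s, Vh = np.linalg.svd(X, full_matrices=False)
# The first two columns of U give each word a 2-dimensional "embedding".
for word, row in zip(words, U[:, 0:2]):
    print(f"{word:10s} {row}")
```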
Using vectors to define the meaning of a word:
- In related models, including deep learning models, a word is usually represented as a dense vector
Hacks to X
- Function words (the, he, has) are too frequent, so syntax has too much impact; one fix is to cap their counts or to ignore function words altogether
- Use ramped windows that weight closer words more heavily
- Use Pearson correlations instead of raw counts, setting negative values to 0
- Etc.
Interesting semantic patterns emerge in the resulting word vectors
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
Problems with SVD
- Computational cost scales badly: for an n×m matrix it is O(mn^2) (when n < m), which is infeasible for millions of words or documents
- It is hard to incorporate new words or new documents
- Its learning regime differs from that of other deep learning models
Solution: directly learn low-dimensional word vectors
- Some methods (relevant to this lecture and to deep learning):
- Learning Representations by Back-propagating Errors (Rumelhart et al., 1986)
- A Neural Probabilistic Language Model (Bengio et al., 2003)
- Natural Language Processing (Almost) from Scratch (Collobert & Weston, 2008)
- word2vec (Mikolov et al.), introduced in this lecture
The main idea of word2vec
- Instead of using general co-occurrence counts, word2vec predicts the words surrounding each word
- GloVe follows a similar idea: GloVe: Global Vectors for Word Representation
- Both make it easy and fast to incorporate new sentences and documents, or to add new words to the vocabulary
The main idea of word2vec
- Predict the probability of the surrounding words for every word within a window of radius c
- Objective function: maximize the log probability of any context word given the current center word:
$$J(\theta) = \frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$
- The simplest form for $p(w_{t+j} \mid w_t)$ is the softmax:
$$p(w_{t+j} \mid w_t) = \frac{\exp\left({v'_{w_{t+j}}}^{\top} v_{w_t}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_t}\right)}$$
- Here $v$ and $v'$ are the "input" and "output" vectors of a word $w$ (so every word has two vector representations)
- This is essentially a "dynamic" logistic regression (a sketch of the softmax computation follows this list)
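A minimal numpy sketch of the softmax above. The vocabulary size, dimensionality, and the random vectors are illustrative stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 5                       # vocabulary size, embedding dimension
v = rng.normal(size=(V, d))        # "input" (center-word) vectors
v_prime = rng.normal(size=(V, d))  # "output" (context-word) vectors

def p_outside_given_center(o, c):
    """Probability of outside word o given center word c (softmax over the vocab)."""
    scores = v_prime @ v[c]        # one score per vocabulary word
    scores -= scores.max()         # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_outside_given_center(o=3, c=7))
```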
Cost/objective function
- Our goal is to optimize (maximize or minimize) the cost/objective function
- A common method: gradient descent
- An example (from Wikipedia): find a local minimum of the function $f(x) = x^4 - 3x^3 + 2$, whose derivative is $f'(x) = 4x^3 - 9x^2$
- Python code:
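The original snippet is not reproduced in these notes; below is a sketch of the standard Wikipedia example, with the starting point, step size, and tolerance as illustrative choices:

```python
# Gradient descent on f(x) = x^4 - 3x^3 + 2, using f'(x) = 4x^3 - 9x^2.
x_old = 0.0
x_new = 6.0          # start the search at x = 6
gamma = 0.01         # step size (learning rate)
precision = 1e-5     # stop when updates become smaller than this

def f_derivative(x):
    return 4 * x**3 - 9 * x**2

while abs(x_new - x_old) > precision:
    x_old = x_new
    x_new = x_old - gamma * f_derivative(x_old)

print("Local minimum occurs near x =", x_new)  # expected to be close to 2.25
```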
Deriving the gradient
- Done on the whiteboard (students who could not attend in person are encouraged to watch the whiteboard derivation in the course video)
- A useful formula
Linear relationships in word2vec
- These representations are very good at encoding similarity between words
- Dimensions of similarity can be probed with vector subtraction in the embedding space; for example, $x_{king} - x_{man} + x_{woman} \approx x_{queen}$
Count-based methods vs. direct prediction
GloVe: combining the advantages of both types of methods
- Fast training
- Scales to very large corpora
- Good performance even with small corpora and small vector sizes (the paper's objective is quoted below for reference)
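These notes do not write out the GloVe cost; for reference, the weighted least-squares objective from the GloVe paper named above is:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $f$ is a weighting function that down-weights very frequent pairs, and $w$, $\tilde{w}$, $b$, $\tilde{b}$ are the word vectors and biases being learned.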
GloVe results
- The words most similar to the English word frog
Word analogies
- Test for linear relationships between words (Mikolov et al., 2014); a sketch of the vector arithmetic follows
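A minimal sketch of the analogy test "a is to b as c is to ?" using vector arithmetic and cosine similarity. The tiny embedding table here is a random stand-in; in practice you would load pre-trained GloVe or word2vec vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "paris", "france"]
E = {w: rng.normal(size=50) for w in vocab}   # hypothetical embeddings

def analogy(a, b, c):
    """Return the word d maximizing cosine similarity to b - a + c (a : b :: c : d)."""
    target = E[b] - E[a] + E[c]
    best, best_sim = None, -np.inf
    for w, vec in E.items():
        if w in (a, b, c):            # exclude the query words themselves
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# With real pre-trained vectors this query is expected to return "queen".
print(analogy("man", "woman", "king"))
```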
GloVe visualization I
GloVe visualization II: company - CEO
GloVe visualization III: superlatives
Word embedding matrix
- A word embedding matrix is pre-trained
- Also called a look-up table (a sketch of the lookup follows this list)
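A minimal sketch of using the embedding matrix as a look-up table: each column stores one word's vector, and looking a word up is just an index into the matrix. The shapes and the word-to-index map below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, V = 50, 4                         # embedding dimension, vocabulary size
L = rng.normal(size=(n, V))          # embedding matrix, one column per word
word2idx = {"I": 0, "like": 1, "deep": 2, "learning": 3}

def lookup(word):
    return L[:, word2idx[word]]      # the word's n-dimensional vector

# Equivalently, multiplying L by a one-hot vector selects the same column.
one_hot = np.zeros(V)
one_hot[word2idx["deep"]] = 1.0
assert np.allclose(lookup("deep"), L @ one_hot)
```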
Advantages of low-dimensional word vectors
- What is the biggest advantage of deep-learned word vectors?
- Any information can be represented as word vectors and then propagated through neural networks
- Word vectors will be the foundation of the later lectures
- All of our semantic representations will be in vector form
- Longer phrases and sentences can also be composed from word vectors into more complex representations to solve harder tasks -> next lecture
Course notes index:
Stanford Deep Learning and Natural Language Processing, Lecture 1: Introduction
Resources:
Deep Learning in NLP (1): Word Vectors and Language Models
Singular value decomposition: "We Recommend a Singular Value Decomposition"
Stanford Deep Learning and Natural Language Processing, Lecture 2