The second lecture on deep learning and natural language processing at Stanford University


Lecture 2: Simple Word Vector Representations: word2vec, GloVe

When reprinting, please credit the source and keep the link to "I Love Natural Language Processing": http://www.52nlp.cn

Permanent link to this article: Stanford Deep Learning and Natural Language Processing, Lecture 2: Word Vectors

Recommended reading materials:

    1. Paper 1: Distributed Representations of Words and Phrases and their Compositionality
    2. Paper 2: Efficient Estimation of Word Representations in Vector Space
    3. Lecture 2 slides [slides]
    4. Lecture 2 video

The following notes for Lecture 2 are based mainly on the course slides, the lecture video, and other related materials.

How to represent the meaning of a word (meaning)

      • The definition of the English word "meaning" (from Webster's Dictionary)
        • the idea that is represented by a word, phrase, etc.
        • the idea that a person wants to express by using words, signs, etc.
        • the idea that is expressed in a work of writing, art, etc.

How to represent the meaning of a word in a computer

        • The usual answer is a semantic dictionary such as WordNet, which encodes hypernym (is-a) relationships and synonym sets
        • Hypernyms of "panda", demonstrated through the WordNet interface in NLTK (see the sketch after this list)

        • The synonym sets for "good"
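
A minimal sketch of the two WordNet lookups above, using NLTK's wordnet corpus reader; the synset name `panda.n.01` is an assumption about how WordNet indexes the entry, and the corpus must be downloaded first.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# Hypernym (is-a) chain for "panda"
panda = wn.synset('panda.n.01')
hyper = lambda s: s.hypernyms()
for synset in panda.closure(hyper):
    print(synset)

# Synonym sets containing "good"
for synset in wn.synsets('good'):
    print(synset.name(), '->', ', '.join(lemma.name() for lemma in synset.lemmas()))
```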

Problems with semantic dictionaries

        • Semantic dictionaries are great resources, but they miss nuances; for example, are adept, expert, good, practiced, proficient, and skillful really exact synonyms?
        • They miss new words and are almost impossible to keep up to date: wicked, badass, nifty, crack, ace, wizard, genius, ninja
        • They reflect subjective judgments
        • They require a great deal of manual effort to build and maintain
        • They make it hard to compute fine-grained similarity between words



One-hot representation

        • Traditional rule-based and statistical NLP treats each word as an atomic symbol: hotel, conference, walk
        • In vector-space terms, this is a vector with a single 1 and many 0s: [0,0,0,0,...,0,1,0,...,0,0,0]
        • Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocabulary) – 13M (Google 1T corpus)
        • This is the "one-hot" representation. Its key problem is the "lexical gap": any two words are isolated, and nothing about their relationship can be read off the two vectors (see the sketch below)
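
A minimal numpy sketch of the lexical gap, using the three atomic symbols listed above as a toy vocabulary: any two distinct one-hot vectors are orthogonal, so their dot product carries no similarity signal.

```python
import numpy as np

vocab = ["hotel", "conference", "walk"]

def one_hot(word):
    # Vector of zeros with a single 1 at the word's index.
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Distinct one-hot vectors are always orthogonal: no notion of relatedness.
print(one_hot("hotel") @ one_hot("conference"))  # 0.0
```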

Distributional Similarity based representations

        • Much of what a word means can be learned from the contexts in which it appears.

        • This is one of the most successful ideas of modern statistical NLP.

How to use context to represent words

        • Answer: use a co-occurrence matrix X
          • Two options: full documents or a local window
          • A word-document co-occurrence matrix yields general topics (e.g. sports terms get similar entries), which leads to latent semantic analysis (LSA)
          • A local window captures both syntactic (POS) and semantic information



Window-based co-occurrence matrix: A simple example

        • Window length of 1 (more typically 5-10)
        • Symmetric (left and right context treated the same)
        • Sample corpus (the matrix it produces is sketched below)
          • I like deep learning.
          • I like NLP.
          • I enjoy flying.
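
A sketch of how the window-1 co-occurrence matrix for this toy corpus can be built; the tokenization, including treating "." as a token, is an assumption.

```python
import numpy as np

sentences = [["I", "like", "deep", "learning", "."],
             ["I", "like", "NLP", "."],
             ["I", "enjoy", "flying", "."]]
vocab = sorted({w for sent in sentences for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Count, for every word, how often each other word appears one position away.
X = np.zeros((len(vocab), len(vocab)), dtype=int)
window = 1
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[index[w], index[sent[j]]] += 1

print(vocab)
print(X)
```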

Problems that exist

        • The matrix grows with the size of the vocabulary
        • Very high dimensional: requires a lot of storage
        • Subsequent classification models run into sparsity problems
        • Models are less robust



Solution: low-dimensional vectors

        • Idea: store the most important information in a fixed-size, low-dimensional dense vector
        • Usually 25-1000 dimensions
        • Question: how do we reduce the dimensionality?



Method 1: SVD (singular value decomposition)

        • Apply singular value decomposition to the co-occurrence matrix X

Simple word vector SVD decomposition in Python

        • Corpus: I like deep learning. I like NLP. I enjoy flying.

        • Print the first two columns of the U matrix, which correspond to the two largest singular values (see the sketch below)
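
The original code listing did not survive extraction; the following is a sketch of the same demo, rebuilding the window-1 counts for the toy corpus and plotting each word at the coordinates given by the first two columns of U. The plotting details are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

sentences = [["I", "like", "deep", "learning", "."],
             ["I", "like", "NLP", "."],
             ["I", "enjoy", "flying", "."]]
vocab = sorted({w for sent in sentences for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Window-1 co-occurrence counts.
X = np.zeros((len(vocab), len(vocab)))
for sent in sentences:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                X[index[w], index[sent[j]]] += 1

# SVD: the first two columns of U correspond to the two largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

for word, i in index.items():
    plt.scatter(U[i, 0], U[i, 1])
    plt.annotate(word, (U[i, 0], U[i, 1]))
plt.show()
```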

Use vectors to define the meaning of a word:

        • In most related models, including deep learning models, a word is represented as a dense vector

Hacks to X

        • Function words (the, he, has) are too frequent and have too much influence; the fix is to cap their counts or ignore function words altogether
        • Use ramped windows that count nearby words more heavily than distant ones
        • Use Pearson correlations instead of raw counts, setting negative values to 0
        • Etc. (see the sketch below)



Some interesting semantic patterns emerge in the word vectors

          The following figures are from:

An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence


Problems with the use of SVD

        • The computational cost for an n×m matrix is O(mn^2) (for n < m), which is prohibitive when the number of words or documents runs into the millions
        • It is hard to incorporate new words or new documents
        • Its learning regime is different from that of other deep learning models



Solution: Direct learning of low-dimensional word vectors

        • Several methods, related to this lecture and to deep learning:
          • Learning Representations by Back-Propagating Errors (Rumelhart et al., 1986)
          • A Neural Probabilistic Language Model (Bengio et al., 2003)
          • Natural Language Processing (Almost) from Scratch (Collobert & Weston, 2008)
          • word2vec (Mikolov et al.) – introduced in this lecture



The main idea of Word2vec

        • Instead of counting co-occurrences directly, word2vec predicts the words surrounding each word
        • GloVe pursues a similar idea: GloVe: Global Vectors for Word Representation
        • New sentences and documents are incorporated easily and quickly, and new words can be added to the vocabulary



The main idea of Word2vec

        • For every word, predict the probability of the surrounding words within a window of length c
        • Objective function: maximize the log probability of the context words given each center word

      • The simplest expression for p(w_{t+j} | w_t) is a softmax (written out after this list)
      • Here v and v′ are the "input" and "output" vector representations of w (so every word has two vector representations)
      • This is essentially "dynamic" logistic regression
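
The formulas referenced above did not survive in the text; written in the standard notation of the word2vec papers listed in the recommended readings (an assumption about exactly how the slide writes them), the objective and the softmax are:

```latex
% Skip-gram objective: average log probability of the context words
% within a window of radius c around each center word w_t.
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p\left(w_{t+j} \mid w_{t}\right)

% Simplest form of p(w_O | w_I): a softmax over the vocabulary of size W,
% where v_w and v'_w are the "input" and "output" vectors of word w.
p\left(w_{O} \mid w_{I}\right) =
  \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}
       {\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```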



Cost/Objective function

        • Our goal is to optimize (maximize or minimize) the cost/objective function
        • Common methods: Gradient descent

      • An example (from Wikipedia): find a local minimum of the function f(x) = x^4 - 3x^3 + 2, whose derivative is f'(x) = 4x^3 - 9x^2
      • Python code:
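
The original code block did not survive extraction; below is a sketch of the Wikipedia gradient-descent example referenced above, with an illustrative step size and starting point.

```python
# Gradient descent on f(x) = x**4 - 3*x**3 + 2, whose derivative is
# f'(x) = 4*x**3 - 9*x**2; the local minimum is at x = 9/4 = 2.25.
def f_prime(x):
    return 4 * x**3 - 9 * x**2

x_new = 6.0            # starting point (illustrative)
x_old = float("inf")
step_size = 0.01       # learning rate
precision = 1e-5

while abs(x_new - x_old) > precision:
    x_old = x_new
    x_new = x_old - step_size * f_prime(x_old)

print("Local minimum occurs at", x_new)   # ~2.25
```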

Derivation of the gradient

        • Done on the whiteboard (students who did not attend in person should watch the whiteboard derivation in the lecture video)
        • A useful formula

        • Chain rule
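
The formulas on these slides are not reproduced in the text; two standard identities that such a derivation typically relies on (an assumption about what the slide actually shows) are:

```latex
% Gradient of a linear form, and the chain rule.
\frac{\partial}{\partial \mathbf{x}} \left( \mathbf{a}^{\top} \mathbf{x} \right) = \mathbf{a}
\qquad\qquad
\frac{\partial}{\partial x} f\big(g(x)\big) = f'\big(g(x)\big)\, g'(x)
```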

The linear relationship in Word2vec

        • These representations encode the similarity between words very well
          • Dimensions of similarity can be tested using vector subtraction in the embedding space

Count-based methods vs. direct prediction

GloVe: Combining the advantages of two types of methods

        • Faster training
        • Scales to very large corpora
        • Good performance even with small corpora or small vector sizes

The effect of GloVe

        • The words most similar to the English word frog (see the gensim sketch after the Word Analogies section below)

Word Analogies

        • Testing the linear relationships between words (Mikolov et al. (2014))
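
A hedged sketch of both experiments (nearest neighbours of frog, and the linear analogy test) using pretrained GloVe vectors through gensim; the library, the model name glove-wiki-gigaword-100, and the exact neighbours returned are assumptions, not the course's setup.

```python
import gensim.downloader as api

# Download/load pretrained GloVe vectors (fetched over the network on first use).
model = api.load("glove-wiki-gigaword-100")

# Nearest neighbours of "frog" by cosine similarity.
print(model.most_similar("frog", topn=5))

# Linear analogy: king - man + woman ~= ?
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```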

GloVe visualization I

GloVe visualization II: company – CEO

GloVe visualization III: superlatives

Word Embedding matrix (Word embedding matrices)

        • A pre-trained word embedding matrix

        • Also called a look-up table (see the sketch below)
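
A small numpy sketch of the look-up: column i of the embedding matrix L holds the vector for word i, so looking a word up is just indexing, which is equivalent to multiplying L by the word's one-hot vector. The vocabulary and dimension here are illustrative.

```python
import numpy as np

vocab = {"hotel": 0, "conference": 1, "walk": 2}
d = 4                                    # embedding dimension (illustrative)
L = np.random.randn(d, len(vocab))       # in practice, a pre-trained matrix

word = "conference"
vector = L[:, vocab[word]]               # look-up by column index

one_hot = np.zeros(len(vocab))
one_hot[vocab[word]] = 1.0
assert np.allclose(vector, L @ one_hot)  # same result as a matrix product
```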

Advantages of low-dimensional word vectors

        • What are the biggest advantages of deep learning word vectors?
        • Any information can be represented as a word vector and then propagated through a neural network.

        • Word vectors will form the basis of the later chapters
        • All of our semantic representations will be vectors
        • Longer phrases and sentences can also be composed from word vectors into more complex representations, in order to solve more complex tasks –> next lecture

Course Notes Index:
Stanford Deep Learning and Natural Language Processing, Lecture 1: Introduction

Resources:
Deep Learning in NLP (1): Word Vectors and Language Models
Singular Value Decomposition (a recommended reference on SVD)

