TensorFlow: Deep Models for Text and Sequences


TensorFlow Deep Learning notes: Deep Models for Text and Sequences

If you reprint, please credit Dream Wind Forest
GitHub Project Address: https://github.com/ahangchen/GDLnotes
Stars are welcome, and you can discuss it in the issue area.
Official Tutorial Address
Video/subtitle Download

Rare Event

Unlike many other machine learning problems, in text analysis the unfamiliar things (rare events) are often the most important, while the most common words often matter the least.

Ambiguity
    • The same thing may have several names; it is best if the model can share parameters across such related text
    • Recognizing words as well as the relationships between them with supervision would require an excessive amount of labeled data
Unsupervised learning
    • There is a huge amount of unlabeled text available for training; the key is finding what is worth training on
    • Follow the idea that similar words appear in similar contexts
    • There is no need to know the true meaning of a word; its meaning is determined by the contexts in which it has appeared
Embeddings
    • Map each word to a vector (word2vec); the more similar two words are, the closer their vectors
    • New words can obtain shared parameters from their context
Word2vec

    • Map each word to an entry in a vector list (that is, an embedding matrix), start at random, and use this embedding to make predictions
    • A word's context is simply its neighbors in the text
    • The goal is to place words that appear in the same window close to each other, i.e., to predict a word's neighbors
    • The model used to predict these neighboring words is just a logistic regression, a simple linear model (see the sketch after this list)
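
A minimal sketch of this idea in plain numpy; the vocabulary size, embedding dimension, and example word id are arbitrary placeholders, not values from the course:

```python
import numpy as np

vocab_size, embed_dim = 5000, 128   # arbitrary toy sizes

# Embedding matrix and the linear (logistic-regression) layer, both initialized randomly.
embeddings = np.random.randn(vocab_size, embed_dim) * 0.01
W = np.random.randn(embed_dim, vocab_size) * 0.01
b = np.zeros(vocab_size)

def neighbor_probs(word_id):
    """Look up the word's embedding and predict a distribution over possible neighbors."""
    v = embeddings[word_id]              # embedding lookup
    logits = v @ W + b                   # the simple linear model wx + b
    exp = np.exp(logits - logits.max())  # softmax over the vocabulary
    return exp / exp.sum()

probs = neighbor_probs(42)               # probability of each word appearing near word 42
```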

Comparing embeddings
    • To judge how close two embeddings are, compare the angle between the two vectors, using the cosine rather than the L2 distance, because the length of a vector is irrelevant to the comparison (see the sketch after this list)

    • It is best to normalize the vectors before comparing them
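
A minimal sketch of the cosine comparison, with two arbitrary random vectors standing in for word embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # Compare direction only; the lengths of the vectors do not matter.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.random.randn(128), np.random.randn(128)
print(cosine_similarity(a, b))                    # close to 1.0 means "very similar"
print(cosine_similarity(a / np.linalg.norm(a),    # normalizing first changes nothing,
                        b / np.linalg.norm(b)))   # which is why normalized vectors are convenient
```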
Predict Words

    • A word is turned into a vector through the embedding
    • That vector is fed into a linear model wx + b
    • The output, after a softmax, gives the probability of each vocabulary word being the label, i.e., a neighbor of the input word
    • The problem is that the output of wx + b covers far too many labels (the whole vocabulary), so computing the full softmax is inefficient
    • The solution is to sample away labels that cannot be the target and compute the probability only over a small local set of labels: sampled softmax (see the sketch after this list)
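
A TF 1.x-style sketch of sampled softmax; the sizes, the placeholder tensors, and the choice of 64 sampled negatives are assumptions for illustration:

```python
import tensorflow as tf  # TF 1.x-style API assumed

vocab_size, embed_dim, batch_size, num_sampled = 50000, 128, 128, 64

train_inputs = tf.placeholder(tf.int32, shape=[batch_size])      # center words
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])   # their neighbors

embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

softmax_weights = tf.Variable(tf.truncated_normal([vocab_size, embed_dim], stddev=0.1))
softmax_biases = tf.Variable(tf.zeros([vocab_size]))

# Only num_sampled negative classes are scored per step instead of the full vocabulary.
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                               labels=train_labels, inputs=embed,
                               num_sampled=num_sampled, num_classes=vocab_size))
```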
t-SNE
    • Looking at a word's nearest neighbors in the embedding space shows the semantic proximity between words
    • Reducing the dimensionality of the embedding space makes it more efficient to find the nearest words, but the neighbor relationships must be preserved during the reduction (words that are close should stay close after the reduction)
    • t-SNE is such an effective method (see the sketch after this list).
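
A minimal sketch of projecting embeddings to 2-D with scikit-learn's t-SNE; the random matrix stands in for trained word vectors and all sizes are arbitrary:

```python
import numpy as np
from sklearn.manifold import TSNE

word_vectors = np.random.randn(400, 128)   # stand-in for 400 trained embeddings

# Reduce 128-D vectors to 2-D while trying to keep nearby words nearby.
two_d = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(word_vectors)
print(two_d.shape)                          # (400, 2), ready to scatter-plot
```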
Analogy
    • In fact, we get more than just adjacency: because words are vectorized, we can do arithmetic on them
    • This allows semantic addition and subtraction as well as syntactic addition and subtraction (see the sketch after this list)
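
A minimal sketch of such arithmetic; the tiny vocabulary and random vectors are placeholders, and the analogy only tends to hold with real trained embeddings:

```python
import numpy as np

words = ['king', 'queen', 'man', 'woman']
vectors = {w: np.random.randn(128) for w in words}   # stand-in for trained embeddings

def nearest(query):
    # Pick the vocabulary word whose vector points in the most similar direction (cosine).
    return max(words, key=lambda w: np.dot(vectors[w], query)
               / (np.linalg.norm(vectors[w]) * np.linalg.norm(query)))

# Semantic subtraction and addition; real word2vec vectors tend to answer 'queen'.
print(nearest(vectors['king'] - vectors['man'] + vectors['woman']))
```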

Sequence

Text is a sequence of words. Its key feature is variable length, which means it cannot be directly turned into a fixed-length vector.

CNN and RNN

A CNN shares parameters in space; an RNN shares parameters in time (over the sequence).

    • In each round of training we must judge what has happened so far; everything seen in the past affects the current classification
    • One idea is to memorize the state of the previous classifier and train a new classifier on top of it, thereby folding in the historical influence
    • But that would require a huge number of historical classifiers
    • Instead, reuse the classifier: a single classifier summarizes the state, is applied at every time step, and passes the state along (see the sketch after this list)
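
A minimal numpy sketch of this state passing; the dimensions and random inputs are placeholders:

```python
import numpy as np

input_dim, state_dim = 64, 128   # arbitrary toy sizes

# One set of weights, reused at every time step: parameter sharing in time.
W_x = np.random.randn(input_dim, state_dim) * 0.01
W_h = np.random.randn(state_dim, state_dim) * 0.01
b = np.zeros(state_dim)

def rnn_step(x_t, h_prev):
    """The single reused classifier: summarize everything seen so far into one state."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

h = np.zeros(state_dim)
for x_t in np.random.randn(10, input_dim):   # a sequence of 10 inputs
    h = rnn_step(x_t, h)                      # the same weights process every step
```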

RNN gradients
    • Backpropagation Through Time (BPTT)
    • For the same weight parameter, many derivative contributions are applied as updates at the same time
    • This is unfriendly to SGD, because SGD relies on many uncorrelated updates to keep training stable
    • Because these gradients are correlated with each other, they tend to explode or vanish, so training cannot find a direction to optimize in and fails
Clip Gradient

When the gradient explodes, rescale the weight update by a ratio instead of applying it directly (gradients flow backwards through time, so read the horizontal axis from right to left), as sketched below.

    • A hack, but cheap and effective
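
A TF 1.x-style sketch of gradient clipping; the toy loss, learning rate, and the 1.25 cap are illustrative assumptions:

```python
import tensorflow as tf  # TF 1.x-style API assumed

w = tf.Variable(5.0)
loss = tf.square(w)      # toy loss so the sketch is self-contained

global_step = tf.Variable(0, trainable=False)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0)

# Compute raw gradients, rescale them if their global norm exceeds the cap,
# then apply the clipped gradients instead of the raw ones.
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, clip_norm=1.25)
train_op = optimizer.apply_gradients(zip(gradients, variables), global_step=global_step)
```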
LSTM (Long Short-Term Memory)

Vanishing gradients make the classifier react only to recent inputs and downplay what was learned from earlier ones, and this cannot be solved by the ratio trick above.

    • An RNN model has two inputs, the past state and new data, and two outputs, a prediction and the future state

    • In the middle is a simple neural network
    • Replacing that middle part with an LSTM cell solves the gradient vanishing problem
    • Our aim is to improve the memory capacity of the RNN
    • Memory Cell

Three gates decide whether to write, read, or forget (write back).

    • At each gate, instead of making a simple yes/no decision, use a weight to determine how much of the input to accept
    • This weight is a continuous function, so it is differentiable and trainable; this is the core of the LSTM

    • Use a logistic regression to train these gates, with normalization at the output

    • This model lets the whole cell remember and forget better
    • Since the whole model is linear, it is easy to differentiate and train (see the sketch after this list)
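
A minimal numpy sketch of the gated memory cell; biases are omitted and all shapes are placeholders, so this shows the general LSTM idea rather than the exact course code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_dim, state_dim = 64, 128
concat_dim = input_dim + state_dim

# One weight matrix per gate plus one for the candidate update (biases omitted for brevity).
W_i, W_f, W_o, W_c = (np.random.randn(concat_dim, state_dim) * 0.01 for _ in range(4))

def lstm_cell(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(z @ W_i)                   # input gate: how much new information to write
    f = sigmoid(z @ W_f)                   # forget gate: how much old memory to keep
    o = sigmoid(z @ W_o)                   # output gate: how much memory to read out
    c = f * c_prev + i * np.tanh(z @ W_c)  # the memory cell is updated additively
    h = o * np.tanh(c)                     # prediction/state passed to the next step
    return h, c

h, c = lstm_cell(np.random.randn(input_dim), np.zeros(state_dim), np.zeros(state_dim))
```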
LSTM regularization
    • L2 regularization works
    • Dropout on the input or output data works (see the sketch after this list)
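
A minimal sketch of where dropout is applied; keep_prob is a placeholder, and the commented lines refer to the hypothetical lstm_cell from the sketch above:

```python
import numpy as np

def dropout(x, keep_prob):
    # Inverted dropout: randomly zero units and rescale the survivors.
    mask = (np.random.rand(*x.shape) < keep_prob) / keep_prob
    return x * mask

keep_prob = 0.7                          # placeholder value
x_t = np.random.randn(64)                # input at one time step
x_t = dropout(x_t, keep_prob)            # dropout on the data flowing into the cell
# h, c = lstm_cell(x_t, h_prev, c_prev)  # the cell itself (see the LSTM sketch above)
# h = dropout(h, keep_prob)              # dropout on the output, not on the memory c
```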
Beam Search

With the model above we can predict what comes next, and even generate text: predict, pick the most probable word, feed it back in, and keep predicting...

    • We could predict one letter at a time, but that is greedy: it always picks the single best option at each step
    • We could instead predict several steps at a time and then pick the candidate with the highest overall probability, reducing the influence of chance
    • But the number of candidate sequences grows exponentially
    • So when predicting several steps ahead, we only expand the few candidates with higher probability; that is beam search (see the sketch after this list)
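
A minimal sketch of beam search over characters; the stand-in model returns made-up probabilities, and the vocabulary size, beam width, and number of steps are placeholders:

```python
import numpy as np

vocab_size, beam_width, steps = 27, 3, 5   # placeholder sizes

def next_char_probs(sequence):
    # Stand-in for the trained RNN: a distribution over the next character.
    rng = np.random.RandomState(len(sequence))
    p = rng.rand(vocab_size)
    return p / p.sum()

beams = [([], 0.0)]                        # each candidate: (sequence, log-probability)
for _ in range(steps):
    candidates = []
    for seq, score in beams:
        probs = next_char_probs(seq)
        for c in range(vocab_size):
            candidates.append((seq + [c], score + np.log(probs[c])))
    # Keep only the few highest-probability candidates instead of every extension.
    candidates.sort(key=lambda sc: sc[1], reverse=True)
    beams = candidates[:beam_width]

best_sequence, best_score = beams[0]
```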
Translation and image captioning
    • An RNN turns the variable-length sequence problem into a fixed-length vector problem; and since we can make predictions from vectors, we can also turn vectors back into sequences

    • We can exploit this by feeding a sequence into one RNN and feeding its output into a second, reversed RNN to produce another sequence, for example for language translation (see the sketch after this list)
    • If we connect a CNN's output to an RNN, we can build an image-captioning system
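
A minimal numpy sketch of the encoder-decoder idea; the dimensions and the greedy decoding are placeholders, and a real translation model would use LSTM cells rather than this toy RNN:

```python
import numpy as np

in_dim, state_dim, out_vocab = 64, 128, 1000   # placeholder sizes

enc_Wx = np.random.randn(in_dim, state_dim) * 0.01
enc_Wh = np.random.randn(state_dim, state_dim) * 0.01
dec_Wh = np.random.randn(state_dim, state_dim) * 0.01
dec_Wy = np.random.randn(state_dim, out_vocab) * 0.01

def encode(sequence):
    # Fold a variable-length input sequence into one fixed-length vector.
    h = np.zeros(state_dim)
    for x_t in sequence:
        h = np.tanh(x_t @ enc_Wx + h @ enc_Wh)
    return h

def decode(h, steps):
    # Unroll a second RNN from that vector to emit an output sequence of word ids.
    outputs = []
    for _ in range(steps):
        h = np.tanh(h @ dec_Wh)
        outputs.append(int(np.argmax(h @ dec_Wy)))
    return outputs

translated = decode(encode(np.random.randn(7, in_dim)), steps=5)
```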

Practice of Recurrent Neural Networks

