Google Deep Learning Notes: Recurrent Neural Network Practice


Please credit Dream Wind Forest (梦里风林) when reprinting.
GitHub Project Address: https://github.com/ahangchen/GDLnotes
Stars are welcome, and you can discuss in the issues area.
Official Tutorial Address
Video/subtitle Download

Loading data
    • Use text8 as the text dataset for training

text8 contains only 27 characters: lowercase a to z plus the space character. If you print it out, it reads like Wikipedia with all punctuation removed.

    • Directly call maybe_download from Lesson 1 to download text8.zip
    • Use zipfile to read the zip content as a string and split it into a list of words
    • Use the collections module to count word frequencies and find the most common words (a sketch of these steps follows below)
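A minimal sketch of these loading steps, assuming the standard names from the course word2vec example (read_data, build_dataset and the vocabulary_size of 50,000 are illustrative assumptions, not necessarily the exact names in the repo):

import zipfile
import collections

def read_data(filename):
    # read the zip content as one string and split it into a list of words
    with zipfile.ZipFile(filename) as f:
        return f.read(f.namelist()[0]).decode('utf-8').split()

vocabulary_size = 50000  # assumed, as in the course example

def build_dataset(words):
    # count word frequencies with collections.Counter and keep the most common words
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: i for i, (word, _) in enumerate(count)}
    data = [dictionary.get(word, 0) for word in words]  # unknown words map to UNK (id 0)
    count[0][1] = data.count(0)
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary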

This achieves the goal of being able to fetch data at random.

Constructing the computation units
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    • Construct a vocabulary_size x embedding_size matrix as the embeddings container,
    • i.e. vocabulary_size vectors of length embedding_size, each vector representing one word,
    • The components of each vector are uniformly distributed between -1 and 1
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    • Call tf.nn.embedding_lookup to retrieve the vectors indexed by train_dataset; this is equivalent to using train_dataset as IDs and fetching the embedding vectors corresponding to those IDs from the matrix
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
                               train_labels, num_sampled, vocabulary_size))
    • Compute the training loss with sampled softmax
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
    • The Adagrad adaptive-gradient optimizer adjusts the embedding matrix to minimize the loss

    • For prediction, use the cosine of the angle between the predicted vector and the actual data as the accuracy (similarity) metric; for example:
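A sketch of that cosine-similarity check (valid_dataset, a small set of sample word IDs, is an assumed placeholder; this follows the standard course code rather than word2vec.py verbatim):

norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
# cosine similarity: dot products between unit-length vectors
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))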

Feeding data in for training
    • Slice the data for training, where:
  data_index = (data_index + 1) % len(data)
    • a random portion of the data is still fed in each time
      • A small piece of text is taken at equal intervals
      • Construct the training set: the middle position of each window becomes a train_data
      • Construct the labels: in each window, randomly take a few of the words other than train_data to form a list, used as the label (here only one is taken)
      • This creates a mechanism for predicting the context from the target word, i.e. Skip-gram (see the batch-generation sketch after this list)
    • Train for 100,001 steps, outputting the average loss over the last 2,000 steps every 2,000 steps
    • Compute the similarity every 10,000 steps and output the list of words closest to each word in the validation set
    • Use t-SNE to reduce dimensionality and visualize how close words are to each other
    • Plot the results with matplotlib
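A sketch of the batch generation described above, following the standard TensorFlow word2vec example (data is the list of word IDs from the loading sketch earlier; the exact code in word2vec.py may differ):

import collections
import random
import numpy as np

data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    # skip_window words on each side of the center word form the window
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window, target, skip_window ]
    buf = collections.deque(maxlen=span)
    for _ in range(span):
        buf.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window            # center word of the window
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buf[skip_window]   # train data: the center word
            labels[i * num_skips + j, 0] = buf[target]    # label: a random context word
        buf.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels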

Implementation code: see word2vec.py

CBOW

The Skip-gram model is trained to predict the context from the target word; word2vec also has CBOW, which predicts the target word from the context.

In effect, the input and output of Skip-gram are swapped.

    • Modify how the data is sliced

      • Construct the labels: the middle position of each window becomes a train_label
      • Construct the training set: in each window, the words other than train_label become the train_data (here only one is randomly taken)
      • This creates a mechanism for predicting the target word from the context, i.e. CBOW
    • Look up the vector for each word of train_data in the embeddings, sum them with tf.reduce_sum, and compare the result with train_label

# Look up embeddings for inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# sum up vectors on first dimensions, as context vectors
embed_sum = tf.reduce_sum(embed, 0)
    • Training still adjusts the embeddings as parameters to optimize the loss
    • From the training results you can see how close different words are to each other

Implementation code: see cbow.py

RNN sentence generation

The overall idea is to use a word in the text as train data and the words that follow it as train labels, so that a subsequent fragment can be predicted from a given word.

Training data
    • BatchGenerator
      • text: all of the text data
      • text_size: the length of the whole text string
      • batch_size: the size of each training batch
      • num_unrollings: the number of training batches to generate per call
      • segment: the whole training data set is divided into several training segments
      • cursor: important,
      • it records the starting position of each training segment, i.e. at which index of the text that segment currently sits
      • when next_batch generates a batch of training data, each cursor advances from its starting position until batch_size pieces of data have been collected
      • last_batch: the previous training batch
      • Each call to next generates an array num_unrollings long, starting with last_batch and followed by num_unrollings new batches
      • Each batch serves as a train_input, and the batch after it serves as its train_label; each step trains on num_unrollings batches (a sketch of this generator follows below)
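A sketch of this generator, following the course code (char2id and vocabulary_size, the 27-character alphabet, are assumed to be defined as in the lesson; the repo's version may differ in details):

import numpy as np

class BatchGenerator(object):
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        # one cursor per segment, starting at the segment's first character
        self._cursor = [offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        # one batch: a one-hot character vector for each cursor position
        batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float32)
        for b in range(self._batch_size):
            batch[b, char2id(self._text[self._cursor[b]])] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch

    def next(self):
        # num_unrollings new batches, preceded by the last batch of the previous call
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches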
LSTM cell
    • To address the vanishing gradient problem, an LSTM cell is introduced to strengthen the model's memory.
    • The LSTM cell is designed according to this paper: http://arxiv.org/pdf/1402.1128v1.pdf
    • There are three gates: an input gate, a forget gate and an output gate, which together form a cell
      • The cell has num_nodes nodes, and an input word can be any of vocabulary_size words.
    • Input gate:
  input_gate = sigmoid(i * ix + o * im + ib)
- Multiply the input by a vocabulary_size * num_nodes matrix, and the output by a num_nodes * num_nodes matrix;
- Use these two matrices to regulate how much of the input data is kept or discarded
- Activate with the nonlinear sigmoid function
    • Forget gate:
  forget_gate = sigmoid(i * fx + o * fm + fb)

Same idea as the input gate: it decides what to keep from the historical data (the state).

    • Output Gate:
  output_gate = sigmoid(i * ox + o * om + ob)

Same idea as the input gate: it decides what of the state to output.

    • Combination:
  update = i * cx + o * cm + cb
  state = forget_gate * state + input_gate * tanh(update)
  lstm_cell = output_gate * tanh(state)
- Construct the new state update in the same way as the gates
- Process the historical state with the forget gate
- Activate the new state update with tanh
- Process the new state update with the input gate
- Combine the old and new states, then activate the state with tanh
- Process the state with the output gate (a sketch of the whole cell follows below)
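Putting the gate equations together, a sketch of the cell as in the course code (the weight matrices ix, im, fx, fm, cx, cm, ox, om and biases ib, fb, cb, ob are assumed to be tf.Variables defined elsewhere):

def lstm_cell(i, o, state):
    # i: current input, o: previous output, state: previous cell state
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state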
LSTM optimization

In the cell above, update, output_gate, forget_gate and input_gate are computed in the same way,
so the four sets of parameters can be combined, computed in one go, and then split apart:

values = tf.split(1, gate_count, tf.matmul(i, input_weights) + tf.matmul(o, output_weights) + bias)
input_gate = tf.sigmoid(values[0])
forget_gate = tf.sigmoid(values[1])
update = values[2]
output_gate = tf.sigmoid(values[3])  # the output gate is extracted the same way

The output of the LSTM cell is then fed through a w*x + b transformation to produce the final output; for example:
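A one-line sketch of that step (outputs is assumed to be the list of per-step LSTM outputs, and w, b the classifier's weight and bias variables):

logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)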

Implementation code: see singlew_lstm.py

Optimizer
    • Use one-hot encoding of the labels for prediction
    • Compute the loss with cross-entropy
    • Introduce learning rate decay (a sketch of this setup follows after this list)
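A sketch of this setup in the older TensorFlow API used elsewhere in these notes (the decay values 10.0, 5000 and 0.1 are the course defaults, an assumption here; logits and train_labels are as in the sketch above):

global_step = tf.Variable(0)
# learning rate decays by a factor of 0.1 every 5000 steps
learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
# cross-entropy between the softmax of the logits and the one-hot labels
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits, tf.concat(0, train_labels)))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)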
Flow
    • Feed the training data into the placeholders
    • Measure performance on the validation set with logprob, i.e. the logarithm of the probability
    • Every 10 training rounds, randomly pick 5 letters as starting characters and run sentence-generation tests.
    • You may notice that the output sentences are built from characters drawn by sample rather than always taking the most probable character, because always picking the highest-probability character tends to end up repeating that character over and over. (A sketch of the helpers involved follows below.)
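A sketch of the two helpers this relies on, following the course code (the names logprob and sample_distribution are the lesson's; they are assumed, not verified, to match lstm.py):

import random
import numpy as np

def logprob(predictions, labels):
    # log-probability of the true labels in a batch of predictions (numpy arrays)
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
    # sample one index i with probability proportional to distribution[i]
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1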

Implementation code: see lstm.py

Beam Search

In the process above, each prediction is made one character at a time; using a few more characters per prediction and taking the candidate with the highest overall probability guards against misjudgments caused by special cases.

Here we group 2 characters into a bigram; see bigram_lstm.py for the code.

This is implemented mainly through the BigramBatchGenerator class.

Embedding lookup

In the bigram case, vocabulary_size becomes 27*27; using one-hot encoding for prediction would produce a very sparse matrix, wasting compute and slowing the calculation down.

So embedding_lookup is introduced; see embed_bigram_lstm.py for the code.

    • Data input: BatchGenerator no longer generates one-hot-encoded vectors as input, but directly generates the list of indices corresponding to each bigram
    • The embedding lookup adjusts the embeddings so that each bigram corresponds to a vector
    • Feed the result of the embedding lookup to the LSTM cell
    • For the output, both the labels and the outputs have to be converted to one-hot encoding in order to compute the loss with cross-entropy and softmax
    • Converting data to one-hot encoding inside a tensor relies mainly on the tf.gather function (see the sketch after this list)
    • Converting the validation data relies mainly on the one_hot_voc function
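For instance, one way to do the tf.gather conversion (a sketch under the assumption that bigram IDs are gathered as rows of an identity matrix; embed_bigram_lstm.py may do it differently):

import numpy as np

# rows of the identity matrix are the one-hot vectors of each bigram id
one_hot_matrix = tf.constant(np.eye(vocabulary_size, dtype=np.float32))
labels_one_hot = tf.gather(one_hot_matrix, train_labels)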
Dropout
    • Apply dropout to the input and output of the LSTM cell (a sketch follows below)
    • Refer to this article
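A sketch of where the dropout goes, for a single unrolled step with input i, previous output o and state state, reusing the lstm_cell sketch above (keep_prob is an assumed hyperparameter; the reference article and the repo code may place it slightly differently):

keep_prob = 0.8  # assumed value
# dropout on the cell's input ...
i_drop = tf.nn.dropout(i, keep_prob)
output, state = lstm_cell(i_drop, o, state)
# ... and on its output
output = tf.nn.dropout(output, keep_prob)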
Seq2seq
    • The final problem is to convert each word in a sentence into its reversed string, which is a sequence-to-sequence conversion
    • The proper way to do this would be word → vector → LSTM → vector → word
    • But TensorFlow already has a model for this: Seq2SeqModel, which can be studied through this analysis
      and TensorFlow's own example.
    • All that is needed is to generate the target sequences from each batch by reversing the strings and feed them into Seq2SeqModel; this relies mainly on the rev_id function
    • Implementation: see seq2seq.py
    • Note that when using Seq2SeqModel, if size and num_layers are too small, the model converges before it learns the right rule, so I turned them up a bit.
 def create_model(sess, forward_only):
     model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_size,
                                        target_vocab_size=vocabulary_size,
                                        buckets=[( -, +)],
                                        size= the,
                                        num_layers=4,
                                        max_gradient_norm=5.0,
                                        batch_size=batch_size,
                                        learning_rate=1.0,
                                        learning_rate_decay_factor=0.9,
                                        use_lstm=True,
                                        forward_only=forward_only)
     return model
  • Parameter meanings
    • source_vocab_size: size of the source vocabulary.
    • target_vocab_size: size of the target vocabulary.
    • buckets: a list of pairs (I, O), where I specifies the maximum input length
      that will be processed in that bucket, and O specifies the maximum output
      length. Training instances that have inputs longer than I or outputs
      longer than O will be pushed to the next bucket and padded accordingly.
      We assume the list is sorted, e.g., [(2, 4), (8, 16)].
    • size: number of units in each layer of the model.
    • num_layers: number of layers in the model.
    • max_gradient_norm: gradients will be clipped to maximally this norm.
    • batch_size: the size of the batches used during training;
      the model construction is independent of batch_size, so it can be
      changed after initialization if this is convenient, e.g., for decoding.
    • learning_rate: learning rate to start with.
    • learning_rate_decay_factor: decay learning rate by this much when needed.
    • use_lstm: if true, we use LSTM cells instead of GRU cells.
    • num_samples: number of samples for sampled softmax.
    • forward_only: if set, we do not construct the backward pass in the model.
Reference links
    • Lin Zhuhan - Zhihu
    • Word vectors
    • Rudolfix - udacity_deeplearn
    • EDWARDBI - Analysis of TensorFlow's official English-French translator demo

If you find this article helpful, could you give it a star?

