Preface
There are few practical projects that apply deep learning directly to build an end-to-end chatbot, so here we look at how to use a deep-learning seq2seq model to implement a simple one. This article uses TensorFlow to train a seq2seq chatbot so that it can answer questions based on the training corpus.

Seq2seq
The mechanism of seq2seq is described in the previous article "In-depth Study of the Seq2Seq Model."

Recurrent neural networks
Recurrent neural networks are used in the seq2seq model. Several variants are currently popular, including the vanilla RNN, LSTM, and GRU; their mechanisms are described in the previous articles "Recurrent Neural Networks," "LSTM Neural Networks," and "GRU Neural Networks."

Training Sample Set
The training data consists mainly of QA pairs; open datasets can also be downloaded, but here we use only a small selection of questions and answers. The storage format is: the first line is a question, the second line is its answer, the third line is a question, the fourth line is its answer, and so on.

Data preprocessing
To train, the data must first be converted into numbers. We can use the values 0 to N to represent the entire vocabulary, each value standing for one word; the vocabulary size is defined by vocab_size. We also define maximum and minimum lengths for questions and for answers. In addition, we define the UNK, GO, EOS, and PAD symbols: UNK stands for unknown words (any word falling outside the vocab_size range is treated as unknown), GO marks the start of the decoder input, EOS marks the end of an answer, and PAD is used for padding, because all QA pairs fed into the same seq2seq model must have the same input and output lengths, so shorter questions and answers are filled up with PAD.
limit = {
    'maxq': 10,
    'minq': 0,
    'maxa': 8,
    'mina': 3
}

UNK = 'unk'
GO = '<go>'
EOS = '<eos>'
PAD = '<pad>'
vocab_size = 1000
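Before filtering, the corpus file described above has to be read into a flat list of lines, with questions on the odd lines and answers on the even lines. A minimal sketch, assuming a hypothetical corpus file data/chat.txt (the helper name, the path, and the import list used by the snippets in this article are assumptions, not taken from the original code):

import itertools
import pickle

import nltk
import numpy as np
import tensorflow as tf

def read_lines(filename):
    # the corpus alternates line by line: question, answer, question, answer, ...
    with open(filename, encoding='utf-8') as f:
        return [line.strip().lower() for line in f if line.strip()]

sequences = read_lines('data/chat.txt')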
Next, filter the QA pairs according to the length limits.
def filter_data(sequences):
    filtered_q, filtered_a = [], []
    raw_data_len = len(sequences) // 2
    for i in range(0, len(sequences), 2):
        qlen, alen = len(sequences[i].split(' ')), len(sequences[i + 1].split(' '))
        if qlen >= limit['minq'] and qlen <= limit['maxq']:
            if alen >= limit['mina'] and alen <= limit['maxa']:
                filtered_q.append(sequences[i])
                filtered_a.append(sequences[i + 1])
    filt_data_len = len(filtered_q)
    filtered = int((raw_data_len - filt_data_len) * 100 / raw_data_len)
    print(str(filtered) + '% filtered from original data')
    return filtered_q, filtered_a
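After filtering, the questions and answers are tokenized into lists of words, which is the form the index_ function below expects. A minimal usage sketch (whitespace tokenization and the variable names are assumptions):

filtered_q, filtered_a = filter_data(sequences)
qtokenized = [q.split(' ') for q in filtered_q]
atokenized = [a.split(' ') for a in filtered_a]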
We also need word frequency statistics over the whole corpus, and we take the top-n most frequent words as the vocabulary, which corresponds to the vocab_size defined earlier. In addition, we need an index-to-word mapping and the corresponding word-to-index mapping.
def index_(tokenized_sentences, vocab_size):
    freq_dist = nltk.FreqDist(itertools.chain(*tokenized_sentences))
    vocab = freq_dist.most_common(vocab_size)
    index2word = [GO] + [EOS] + [UNK] + [PAD] + [x[0] for x in vocab]
    word2index = dict([(w, i) for i, w in enumerate(index2word)])
    return index2word, word2index, freq_dist
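A usage sketch (the variable names are assumptions): the four special symbols are prepended, so the full vocabulary ends up with vocab_size + 4 entries.

idx2w, w2idx, freq_dist = index_(qtokenized + atokenized, vocab_size)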
As mentioned before, in our seq2seq model the questions fed to the encoder have different lengths, so shorter ones are filled up with pad symbols; for example, with a maximum question length of 10, "how are you" becomes "how are you" followed by seven pads. The decoder input starts with go and ends with eos, and is likewise padded: "fine thank you" is processed into "go fine thank you eos" followed by five pads. The third sequence to handle is the target. The target is the same as the decoder input but shifted by one position, i.e. with the go removed, giving "fine thank you eos" followed by six pads.
def zero_pad(qtokenized, atokenized, w2idx):
    data_len = len(qtokenized)
    # the +2 is due to '<go>' and '<eos>'
    idx_q = np.zeros([data_len, limit['maxq']], dtype=np.int32)
    idx_a = np.zeros([data_len, limit['maxa'] + 2], dtype=np.int32)
    idx_o = np.zeros([data_len, limit['maxa'] + 2], dtype=np.int32)
    for i in range(data_len):
        q_indices = pad_seq(qtokenized[i], w2idx, limit['maxq'], 1)
        a_indices = pad_seq(atokenized[i], w2idx, limit['maxa'], 2)
        o_indices = pad_seq(atokenized[i], w2idx, limit['maxa'], 3)
        idx_q[i] = np.array(q_indices)
        idx_a[i] = np.array(a_indices)
        idx_o[i] = np.array(o_indices)
    return idx_q, idx_a, idx_o

def pad_seq(seq, lookup, maxlen, flag):
    # flag 1: encoder question, flag 2: decoder input (<go> ... <eos>), flag 3: target (... <eos>)
    if flag == 1:
        indices = []
    elif flag == 2:
        indices = [lookup[GO]]
    elif flag == 3:
        indices = []
    for word in seq:
        if word in lookup:
            indices.append(lookup[word])
        else:
            indices.append(lookup[UNK])
    if flag == 1:
        return indices + [lookup[PAD]] * (maxlen - len(seq))
    elif flag == 2:
        return indices + [lookup[EOS]] + [lookup[PAD]] * (maxlen - len(seq))
    elif flag == 3:
        return indices + [lookup[EOS]] + [lookup[PAD]] * (maxlen - len(seq) + 1)
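The arrays produced by zero_pad, together with the vocabulary, can then be written to disk; a minimal sketch (the file names and the use of numpy/pickle are assumptions):

idx_q, idx_a, idx_o = zero_pad(qtokenized, atokenized, w2idx)
np.save('idx_q.npy', idx_q)
np.save('idx_a.npy', idx_a)
np.save('idx_o.npy', idx_o)
with open('metadata.pkl', 'wb') as f:
    pickle.dump({'idx2w': idx2w, 'w2idx': w2idx, 'limit': limit}, f)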
The structures processed above are then persisted so they can be used during training.

Building the graph
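The snippets below reference several hyperparameters that have to be defined before the graph is built. The following values are illustrative assumptions only, not taken from the original article:

batch_size = 64
sequence_length = 10                   # here maxq and maxa + 2 both happen to equal 10
hidden_size = 256
num_layers = 2
num_encoder_symbols = vocab_size + 4   # vocabulary plus <go>, <eos>, unk, <pad>
num_decoder_symbols = vocab_size + 4
embedding_size = 128
learning_rate = 0.001
model_dir = './model'                  # hypothetical checkpoint directory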
encoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
decoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
targets = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
weights = tf.placeholder(dtype=tf.float32, shape=[batch_size, sequence_length])
Create four placeholders: the encoder input placeholder, the decoder input placeholder, the decoder target placeholder, and the weights placeholder. batch_size is the number of samples per batch, and sequence_length is the sequence length we defined.
cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers)
Create the recurrent network structure. An LSTM is used here; hidden_size is the number of hidden units. MultiRNNCell is used because we want a deeper network, and num_layers is the number of stacked LSTM layers.
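Note that, depending on the TensorFlow version, reusing the same cell object for every layer can raise a variable-reuse error; a common workaround (an adjustment, not part of the original article) is to create one cell per layer:

cell = tf.nn.rnn_cell.MultiRNNCell(
    [tf.nn.rnn_cell.BasicLSTMCell(hidden_size) for _ in range(num_layers)])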
results, states = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
    tf.unstack(encoder_inputs, axis=1),
    tf.unstack(decoder_inputs, axis=1),
    cell,
    num_encoder_symbols,
    num_decoder_symbols,
    embedding_size,
    feed_previous=False
)
We build the seq2seq structure with the embedding_rnn_seq2seq function that TensorFlow already provides. We could, of course, build the encoder and decoder ourselves from LSTM cells, but for convenience we use embedding_rnn_seq2seq directly. The tf.unstack function unpacks encoder_inputs and decoder_inputs into lists of per-time-step tensors; num_encoder_symbols and num_decoder_symbols correspond to the number of words in our vocabulary, and embedding_size is the dimensionality of the embedding layer. The feed_previous argument is important: setting it to False means we are in the training phase, where decoder_inputs is used as the decoder's input; when feed_previous is True we are in the prediction phase, where decoder_inputs is not needed because the decoder's output at the previous time step is fed back as its input at the current time step.
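For inference, the same structure is typically built a second time with feed_previous=True so that it shares the trained weights; a rough sketch (the reuse wrapper and variable names are assumptions, not part of the original code):

with tf.variable_scope(tf.get_variable_scope(), reuse=True):
    infer_results, _ = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
        tf.unstack(encoder_inputs, axis=1),
        tf.unstack(decoder_inputs, axis=1),  # only the leading <go> column matters here
        cell,
        num_encoder_symbols,
        num_decoder_symbols,
        embedding_size,
        feed_previous=True
    )
infer_pred = tf.argmax(tf.stack(infer_results, axis=1), axis=2)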
logits = tf.stack(results, axis=1)
loss = tf.contrib.seq2seq.sequence_loss(logits, targets=targets, weights=weights)
pred = tf.argmax(logits, axis=2)
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
Then sequence_loss is used to create the loss, which is computed from the output of embedding_rnn_seq2seq. That same output can also be used for prediction: the index with the largest value corresponds to a vocabulary word. The optimizer used is AdamOptimizer.
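The weights placeholder is usually fed with a mask so that pad positions do not contribute to the loss; a minimal sketch (the masking scheme and helper name are assumptions, not from the original article):

def make_loss_weights(target_batch, pad_id):
    # 1.0 for real tokens (including <eos>), 0.0 for pad positions
    return (target_batch != pad_id).astype(np.float32)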
Creating a session

with tf.Session() as sess:
    # saver is assumed to have been created earlier with tf.train.Saver()
    ckpt = tf.train.get_checkpoint_state(model_dir)
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        sess.run(tf.global_variables_initializer())
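Inside this session, training then proceeds by repeatedly feeding batches into the placeholders; a minimal sketch of such a loop, continuing inside the with block above (the batch slicing, num_steps, and the checkpoint name are assumptions):

    num_steps = 10000  # assumed number of training iterations
    for step in range(num_steps):
        lo = (step * batch_size) % (len(idx_q) - batch_size)
        feed = {
            encoder_inputs: idx_q[lo:lo + batch_size],
            decoder_inputs: idx_a[lo:lo + batch_size],
            targets: idx_o[lo:lo + batch_size],
            weights: make_loss_weights(idx_o[lo:lo + batch_size], w2idx[PAD])
        }
        _, step_loss = sess.run([train_op, loss], feed_dict=feed)
        if step % 100 == 0:
            print('step %d, loss %f' % (step, step_loss))
            saver.save(sess, model_dir + '/chatbot.ckpt')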