Building a Chatbot with Deep Learning (Part 2): Retrieval-Based Models

Why focus on retrieval-based models?

As discussed in the previous post in this series, a retrieval-based bot has a repository of predefined responses to choose from. A generative model, by contrast, produces a completely new response without relying on any predefined set.

Let's define a retrieval-based model more formally: the input to the model is a context $c$ and a candidate response $r$. The model scores each candidate given the context, and the highest-scoring response is chosen as the output.
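To make this concrete, here is a minimal sketch of that interface; the function names and the toy word-overlap scorer below are illustrative placeholders, not the model we actually build later:

# Minimal sketch of a retrieval-based bot: score every candidate response against
# the context and return the best one. The scorer below is a toy placeholder;
# later a trained neural network plays this role.
def retrieve_response(context, candidate_responses, score):
    scored = [(score(context, r), r) for r in candidate_responses]
    scored.sort(reverse=True)   # highest score first
    return scored[0][1]

def word_overlap_score(context, response):
    # Toy scorer: count the words shared by context and response
    return len(set(context.split()) & set(response.split()))

print(retrieve_response("how do i restart the apache server",
                        ["try sudo service apache2 restart", "i like turtles"],
                        word_overlap_score))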

You might ask why we would bother with a retrieval-based model when we could build a generative one. Generative models admittedly look more flexible, since they don't need a predefined set of responses. The reason is simple: generative models don't work well in practice yet. Precisely because they are so flexible, they tend to make grammatical mistakes, produce responses unrelated to the question, give generic catch-all answers, or contradict themselves (we briefly discussed these issues in part one). Generative models also need a lot of training data. Today, most production systems are still retrieval-based or a combination of the two approaches. Generative models are an active research area, but we are not there yet. If you want to build a chatbot today, a retrieval-based model will probably leave you feeling more accomplished :)

The Ubuntu Dialogue Corpus

In this post we will use the Ubuntu Dialogue Corpus (paper, code). The Ubuntu Dialogue Corpus (UDC) is built from chat logs of the Ubuntu IRC channels and is one of the largest public dialogue datasets available.

The paper goes into detail about how the dataset was created, so we won't repeat that here. We just need a rough idea of its structure so we can use it in our model.

The training set consists of 1,000,000 examples, split evenly between positive and negative examples. Each example consists of a context and an utterance. The context is the conversation up to the current point, and the utterance is a candidate response to it; in other words, the context may span several turns of dialogue, and the utterance is a reply to those turns. A positive label means the utterance is an actual response to its context; a negative label means the utterance was sampled at random from elsewhere in the corpus. The figure below shows part of the training data:

You may notice that the examples look a little odd. That's because the script that generated the dataset uses NLTK to preprocess the text for us: it tokenizes the sentences, stems the English words, and lemmatizes them (for example, collapsing singular and plural forms). The script also replaces entities such as person names, place names, organization names, URLs, and system paths with special tokens. This preprocessing isn't strictly necessary, but it tends to improve results. Statistically, the average context is about 450 characters long and the average utterance about 80 characters.
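As a rough illustration of the kind of NLTK preprocessing described above (the dataset's actual generation script may differ, and the sentence here is made up):

# A minimal sketch of tokenizing, stemming, and lemmatizing with NLTK.
# Requires the 'punkt' and 'wordnet' resources: nltk.download('punkt'), nltk.download('wordnet')
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

sentence = "I was installing the drivers on two machines"
tokens = nltk.word_tokenize(sentence.lower())
stemmed = [PorterStemmer().stem(t) for t in tokens]
lemmatized = [WordNetLemmatizer().lemmatize(t) for t in tokens]
print(tokens)       # ['i', 'was', 'installing', 'the', 'drivers', 'on', 'two', 'machines']
print(stemmed)      # e.g. 'installing' -> 'instal', 'drivers' -> 'driver'
print(lemmatized)   # e.g. 'machines' -> 'machine'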

The generation script can also produce a test set (see the figure below). In the test set, each record consists of a context, one true response, and nine incorrect responses (distractors). The goal of the model is to give the true response the highest score and the distractors low scores, so that it can pick the right one.

With the dataset covered, let's briefly talk about how to evaluate the model. Many metrics could be used; the most common one is $recall@k$. What does it mean? The model ranks the candidate responses from highest to lowest score and takes the top k. If the correct response is among those k, the test example counts as correct. Clearly, the larger k is, the easier the task: for the test set just described, $k=10$ gives 100% accuracy, because all candidates are selected and the correct one must be among them. Conversely, with $k=1$ the model gets only one pick, which demands much higher precision.

Here it is worth pointing out how this dataset differs from real-world data. With this dataset, the model must score candidate responses that it may have seen only once, or never, during training, which means it needs to generalize well to the many unseen responses in the test set. In many real systems, however, there are only a limited number of possible responses, and the training set contains several examples for each of them, so the model is never asked to score a response it has never seen. That makes the task much easier, so a production retrieval-based bot should perform better than the model here.

A few simple baselines

Before moving on to fancier deep learning models, let's clarify the task once more and build a few simple baseline models. This helps us understand how much we can expect from our model :)

We will use the following function to evaluate our $recall@k$ metric:

# Evaluation
def evaluate_recall(y, y_test, k=1):
    num_examples = float(len(y))
    num_correct = 0
    for predictions, label in zip(y, y_test):
        if label in predictions[:k]:
            num_correct += 1
    return num_correct / num_examples

Here, $y$ is a list of ranked predictions and $y\_test$ contains the true labels. For example, one entry of $y$ might be $[0, 3, 1, 2, 5, 6, 4, 7, 8, 9]$: the candidate numbered 0 received the highest score and the candidate numbered 9 the lowest. Since each test example has 10 candidate responses, they are numbered 0 through 9. If $y\_test = 3$, i.e. the correct response is candidate number 3, then this example counts as wrong under $recall@1$, but correct under $recall@2$.
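As a quick sanity check of exactly this example, assuming the evaluate_recall function defined above:

y = [[0, 3, 1, 2, 5, 6, 4, 7, 8, 9]]   # ranked predictions for one test example
y_test = [3]                           # the true response is candidate number 3

print(evaluate_recall(y, y_test, k=1))  # 0.0 -- candidate 3 is not in the top 1
print(evaluate_recall(y, y_test, k=2))  # 1.0 -- candidate 3 is in the top 2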

Intuitively, a completely random predictor should get about 10% accuracy at $recall@1$, 20% at $recall@2$, and so on. Let's write a short program to verify this:

# Random Predictor
def predict_random(context, utterances):
    return np.random.choice(len(utterances), 10, replace=False)

# Evaluate Random Predictor
y_random = [predict_random(test_df.Context[x], test_df.iloc[x, 1:].values)
            for x in range(len(test_df))]
# Assumption: the true utterance is the first candidate in each test record, so all labels are 0
y_test = np.zeros(len(y_random))
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y_random, y_test, n)))

Recall @ (1, 10): 0.0937632
Recall @ (2, 10): 0.194503
Recall @ (5, 10): 0.49297
Recall @ (10, 10): 1

Great, the results match our expectations. Of course, a random predictor isn't satisfying. Another baseline discussed in the post mentioned above is a TF-IDF predictor. TF-IDF stands for "term frequency–inverse document frequency" and measures how important a word in a document is relative to the whole corpus. We won't go into the details of TF-IDF here (there is plenty of material about it online); in short, documents with similar content have similar TF-IDF vectors. Intuitively, if a context and a response share many important words, they are more likely to be a matching pair. At least this should be more reliable than random guessing.

Many libraries (such as scikit-learn) come with TF-IDF functions built in, so it's easy to use. Let's build a TF-IDF predictor and see how it performs:

class TFIDFPredictor:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def train(self, data):
        self.vectorizer.fit(np.append(data.Context.values, data.Utterance.values))

    def predict(self, context, utterances):
        # Convert context and utterances into TF-IDF vectors
        vector_context = self.vectorizer.transform([context])
        vector_doc = self.vectorizer.transform(utterances)
        # The dot product measures the similarity of the resulting vectors
        result = np.dot(vector_doc, vector_context.T).todense()
        result = np.asarray(result).flatten()
        # Sort by top results and return the indices in descending order
        return np.argsort(result, axis=0)[::-1]
# Evaluate TFIDF Predictor
pred = TFIDFPredictor()
pred.train(train_df)
y = [pred.predict(test_df.Context[x], test_df.iloc[x, 1:].values) for x in range(len(test_df))]
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y, y_test, n)))

Recall @ (1, 10): 0.495032
Recall @ (2, 10): 0.596882
Recall @ (5, 10): 0.766121
Recall @ (10, 10): 1

As you can see, the TF-IDF model performs much better than the random one, but it's still far from good enough. In fact, the assumption we just made is problematic: first, a suitable response does not necessarily share vocabulary with the context; second, TF-IDF ignores word order, which matters a lot. With a neural-network-based model we should be able to do better.

The Dual Encoder LSTM

In this section we will build a Dual Encoder LSTM deep learning model, also known as a Siamese network. This kind of network is only one of many possible choices for the problem, and not necessarily the best one; you can certainly let your imagination run and try all kinds of deep learning architectures, which is itself an active research area. So why pick the Dual Encoder? Because it has been reported to perform well on this dataset, and because there are published benchmarks we can compare against, which gives us a reasonable expectation for reproducing the model. Other models (such as attention-based RNNs) would also be interesting to explore. The structure of the Dual Encoder RNN we build is as follows (paper):

Roughly, it works as follows:

1. The context and the response (utterance) are both split into words, and each word is mapped to a word vector. We use Stanford's GloVe vectors as the initial word embeddings, and they are fine-tuned during training (the word embeddings are not shown in the figure).

2. The context and the response are fed word by word into the same RNN ($c_i$ and $r_i$ in the figure can be thought of as the word vectors of individual words). The RNN then produces a vector that can loosely be regarded as capturing the "meaning" of the context or the response ($c$ and $r$ in the figure). We can choose the dimensionality of this vector; let's say 256.

3. We multiply $c$ by a matrix $M$ to "predict" a response $r'$. If $c$ is a 256-dimensional vector, then $M$ is a $256 \times 256$ matrix and the result $r'$ is another 256-dimensional vector. You can think of $r'$ as the response the network generates for the context $c$. The matrix $M$ is learned during training.

4. We measure the similarity between the predicted response $r'$ and the actual response $r$ by taking their dot product. The larger the dot product, the more similar the two are, and the higher the score the candidate response $r$ receives. We then apply the sigmoid function to turn the dot product into a probability. The expression $\sigma(c^{T}Mr)$ on the right of the figure combines steps 3 and 4; a small numerical sketch of these two steps is shown below the list.
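The following is a toy NumPy sketch of steps 3 and 4 (tiny dimensions and random values, just to show the shapes involved; the real $c$, $r$, and $M$ come out of training):

import numpy as np

dim = 4                          # toy dimensionality (256 in the text above)
c = np.random.randn(dim)         # encoded context
r = np.random.randn(dim)         # encoded candidate response
M = np.random.randn(dim, dim)    # prediction matrix (learned during training; random here)

r_prime = c.dot(M)                   # "predicted" response r' (step 3)
score = r_prime.dot(r)               # dot-product similarity c^T M r (step 4)
prob = 1.0 / (1.0 + np.exp(-score))  # sigmoid turns the score into a probability
print(prob)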

To train the network, we need to define a loss function. We will use binary cross-entropy, which is standard for classification problems. Let $y$ be the true label of a context–response pair: $y$ is either 1 (actual response) or 0 (incorrect response), and let $y'$ be the predicted probability, $y' \in [0,1]$. The cross-entropy loss is then $L = -y \cdot \ln(y') - (1-y) \cdot \ln(1-y')$. The intuition behind this formula is simple: if $y=1$, then $L = -\ln(y')$, which penalizes predictions $y'$ far from 1; if $y=0$, then $L = -\ln(1-y')$, which penalizes predictions far from 0.
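A quick numerical check of this formula (illustrative values only):

import numpy as np

def binary_cross_entropy(y, y_pred):
    # L = -y * ln(y') - (1 - y) * ln(1 - y')
    return -y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)

print(binary_cross_entropy(1, 0.9))  # small loss: prediction is close to the true label 1
print(binary_cross_entropy(1, 0.1))  # large loss: prediction far from 1 is penalized
print(binary_cross_entropy(0, 0.1))  # small loss: prediction is close to the true label 0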

Our implementation uses NumPy, pandas, TensorFlow, and TF Learn (a high-level library that ships as part of TensorFlow and provides many convenient functions).

Before building the model, we need to define some hyper-parameters:

# The maximum number of words to consider for the contexts
max_context_length = 80

# The maximum number of words to consider for the utterances
max_utterance_length = 40

# Word embedding dimensionality
embedding_size = 300

# LSTM cell dimensionality
lstm_cell_size = 256

We limit the lengths of the contexts and utterances so the model trains faster. Based on the dataset statistics above, 80 words capture most of a context, and 40 words are usually enough for an utterance. We use 300-dimensional word vectors because the common pre-trained embeddings (both word2vec and GloVe) come in 300 dimensions, which lets us use them directly.

Next we use TF Learn's library functions to preprocess the data. This includes building a vocabulary that maps words to indices (vocab_processor) and converting the datasets from words to index sequences. We also load the GloVe word vectors and initialize the word-index-to-vector table (initial_embeddings): words present in GloVe are initialized with their GloVe vector, and words not in GloVe are initialized uniformly at random in $(-0.25, 0.25)$.

# Preprocessing
# ==================================================

# Create vocabulary mapping
all_sentences = np.append(train_df.Context, train_df.Utterance)
vocab_processor = skflow.preprocessing.VocabularyProcessor(max_context_length, min_frequency=5)
vocab_processor.fit(all_sentences)

# Transform contexts and utterances
x_train_context = np.array(list(vocab_processor.transform(train_df.Context)))
x_train_utterance = np.array(list(vocab_processor.transform(train_df.Utterance)))

# Generate training tensor
x_train = np.stack([x_train_context, x_train_utterance], axis=1)
y_train = train_df.Label

n_words = len(vocab_processor.vocabulary_)
print("Total words: {}".format(n_words))

# Load glove vectors
# ==================================================
vocab_set = set(vocab_processor.vocabulary_._mapping.keys())
glove_vectors, glove_dict = load_glove_vectors(os.path.join(FLAGS.data_dir, "glove.840B.300d.txt"), vocab_set)

# Build initial word embeddings
# ==================================================
initial_embeddings = np.random.uniform(-0.25, 0.25, (n_words, embedding_size)).astype("float32")
for word, glove_word_idx in glove_dict.items():
    word_idx = vocab_processor.vocabulary_.get(word)
    initial_embeddings[word_idx, :] = glove_vectors[glove_word_idx]

Before building the model, we need to cover one more detail. The usual "truncate the long, zero-pad the short" normalization can cost us some accuracy: imagine a sentence with only 5 words padded out to 80, or a sentence of 150 words cut down to 80; neither is ideal. Still, as mentioned above, fixed lengths speed up training, so we need a trade-off. For truncation we keep things as they are; for padding we first compute the original length of each sequence (that is, the position of the last non-zero index) before feeding it to the RNN. We can do this because TensorFlow's RNN module supports variable-length inputs. The function that returns the actual length of a sequence, capped at the maximum length, is defined as follows:

def get_sequence_length(input_tensor, max_length):
    """
    If a sentence is padded, returns the index of the first 0 (the padding symbol).
    If the sentence has no padding, returns the max length.
    """
    zero_tensor = np.zeros_like(input_tensor)
    comparison = tf.equal(input_tensor, zero_tensor)
    zero_positions = tf.argmax(tf.to_int32(comparison), 1)
    position_mask = tf.to_int64(tf.equal(zero_positions, 0))
    sequence_lengths = zero_positions + (position_mask * max_length)
    return sequence_lengths
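To see what this computes, here is a rough NumPy analogue on toy data (assuming, like the function above, that the padding id is 0 and that a real sentence never starts with it):

import numpy as np

batch = np.array([
    [4, 7, 2, 0, 0],   # 3 real tokens, then padding
    [9, 1, 5, 3, 8],   # no padding: full length
])
max_length = batch.shape[1]

zero_positions = np.argmax(batch == 0, axis=1)          # index of the first 0 (0 if there is none)
position_mask = (zero_positions == 0).astype(np.int64)  # 1 where no padding was found
sequence_lengths = zero_positions + position_mask * max_length
print(sequence_lengths)  # [3 5]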

Now we can start building the model. All of the following operations are done in batches. The basic steps are:

1. Call the get_sequence_length function to obtain the actual lengths of the context and the utterance;

2. Use the word-index-to-vector table built earlier to map the context and the utterance to word vectors;

3. Feed the context and the utterance through the same RNN, taking the RNN's final state as the encoding of each;

4. Make the prediction, i.e. compute the probability, and compute the loss.

These steps correspond one to one with the diagram above; the code is as follows:

def rnn_encoder_model(X, y):
    # Split input tensor into separate context and utterance tensors
    context, utterance = tf.split(1, 2, X, name='split')
    context = tf.squeeze(context, [1])
    utterance = tf.squeeze(utterance, [1])
    utterance_truncated = tf.slice(utterance, [0, 0], [-1, max_utterance_length])

    # Calculate the sequence length for RNN calculation
    context_seq_length = get_sequence_length(context, max_context_length)
    utterance_seq_length = get_sequence_length(utterance, max_utterance_length)

    # Embed context and utterance into the same space
    with tf.variable_scope("shared_embeddings") as vs, tf.device('/cpu:0'):
        embedding_tensor = tf.convert_to_tensor(initial_embeddings)
        embeddings = tf.get_variable("word_embeddings", initializer=embedding_tensor)
        # Embed the context
        word_vectors_context = skflow.ops.embedding_lookup(embeddings, context)
        word_list_context = skflow.ops.split_squeeze(1, max_context_length, word_vectors_context)
        # Embed the utterance
        word_vectors_utterance = skflow.ops.embedding_lookup(embeddings, utterance_truncated)
        word_list_utterance = skflow.ops.split_squeeze(1, max_utterance_length, word_vectors_utterance)

    # Run context and utterance through the same RNN
    with tf.variable_scope("shared_rnn_params") as vs:
        cell = tf.nn.rnn_cell.LSTMCell(lstm_cell_size, forget_bias=2.0)
        cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=0.5)
        context_outputs, context_state = tf.nn.rnn(
            cell, word_list_context, dtype=dtypes.float32, sequence_length=context_seq_length)
        encoding_context = tf.slice(context_state, [0, cell.output_size], [-1, -1])
        vs.reuse_variables()
        utterance_outputs, utterance_state = tf.nn.rnn(
            cell, word_list_utterance, dtype=dtypes.float32, sequence_length=utterance_seq_length)
        encoding_utterance = tf.slice(utterance_state, [0, cell.output_size], [-1, -1])

    with tf.variable_scope("prediction") as vs:
        W = tf.get_variable("W",
                            shape=[encoding_context.get_shape()[1], encoding_utterance.get_shape()[1]],
                            initializer=tf.random_normal_initializer())
        b = tf.get_variable("b", [1])

        # We can interpret this as a "generated context"
        generated_context = tf.matmul(encoding_utterance, W)
        # Batch multiply contexts and utterances (batch_matmul only works with 3-d tensors)
        generated_context = tf.expand_dims(generated_context, 2)
        encoding_context = tf.expand_dims(encoding_context, 2)
        scores = tf.batch_matmul(generated_context, encoding_context, True) + b
        # Go from [15, 1, 1] to [15, 1]: we want a vector of scores
        scores = tf.squeeze(scores, [2])
        # Convert scores into probabilities
        probs = tf.sigmoid(scores)
        # Calculate loss
        loss = tf.contrib.losses.logistic(scores, tf.expand_dims(y, 1))
        tf.scalar_summary("mean_loss", tf.reduce_mean(loss))

    return [probs, loss]

After defining the model function, we wrap it with TF Learn's estimator, where we can also set parameters such as the optimizer, the learning rate, and the learning-rate decay function. Then we start the classifier and training begins:

def evaluate_rnn_predictor(df):
    y_test = np.zeros(len(df))
    y = predict_rnn_batch(df.Context, df.iloc[:, 1:].values)
    for n in [1, 2, 5, 10]:
        print("Recall @ ({}): {:g}".format(n, evaluate_recall(y, y_test, n)))

class ValidationMonitor(tf.contrib.learn.monitors.BaseMonitor):
    def __init__(self, print_steps=100, early_stopping_rounds=None, verbose=1, val_steps=1000):
        super(ValidationMonitor, self).__init__(
            print_steps=print_steps,
            early_stopping_rounds=early_stopping_rounds,
            verbose=verbose)
        self.val_steps = val_steps

    def _modify_summary_string(self):
        if self.steps % self.val_steps == 0:
            evaluate_rnn_predictor(validation_df)

def learning_rate_decay_func(global_step):
    return tf.train.exponential_decay(
        FLAGS.learning_rate,
        global_step,
        decay_steps=FLAGS.learning_rate_decay_every,
        decay_rate=FLAGS.learning_rate_decay_rate,
        staircase=True)

classifier = tf.contrib.learn.TensorFlowEstimator(
    model_fn=rnn_encoder_model,
    n_classes=1,
    continue_training=True,
    steps=FLAGS.num_steps,
    learning_rate=learning_rate_decay_func,
    optimizer=FLAGS.optimizer,
    batch_size=FLAGS.batch_size)

monitor = ValidationMonitor(print_steps=100, val_steps=1000)
classifier.fit(x_train, y_train, logdir='./tmp/tf/dual_lstm_chatbot/', monitor=monitor)

For evaluation on the test set, we can easily make predictions by calling the predict_proba method of the classifier defined above. Of course, before that the data still has to be converted from words to indices. In fact, testing and training can share essentially the same model definition; the remaining differences are handled for us by the classifier.

def predict_rnn_batch(contexts, utterances, n=1):
    num_contexts = len(contexts)
    num_records = np.multiply(*utterances.shape)
    input_vectors = []
    for context, utterance_list in zip(contexts, utterances):
        cvec = np.array(list(vocab_processor.transform([context])))
        for u in utterance_list:
            uvec = np.array(list(vocab_processor.transform([u])))
            stacked = np.stack([cvec, uvec], axis=1)
            input_vectors.append(stacked)
    batch = np.vstack(input_vectors)
    result = classifier.predict_proba(batch)[:, 0]
    result = np.split(result, num_contexts)
    return np.argsort(result, axis=1)[:, ::-1]

That is roughly the whole code framework. If you are interested in this retrieval-based dialogue model, feel free to experiment with it and visit the original author's GitHub repository for more details.
