Deep Learning for Chatbots, Part 2: Implementing a Retrieval-Based Model in TensorFlow

Retrieval-Based Bots

In this post we'll implement a retrieval-based bot. Retrieval-based models have a repository of pre-defined responses they can use, unlike generative models that can generate responses they've never seen before. A bit more formally, the input to a retrieval-based model is a context (the conversation up to this point) and a potential response. The model output is a score for the response. To find a good response you would calculate the score for multiple candidate responses and choose the one with the highest score.
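
To make that setup concrete, here is a minimal sketch in Python. The score_fn argument is a hypothetical stand-in for whatever trained model produces the score; nothing below comes from the original article's code.

    # Minimal sketch of retrieval-based response selection.
    # score_fn is a hypothetical model: it maps a (context, response)
    # pair to a real-valued score; any trained scorer could fill this role.
    def select_response(score_fn, context, candidate_responses):
        scored = [(score_fn(context, r), r) for r in candidate_responses]
        best_score, best_response = max(scored, key=lambda pair: pair[0])
        return best_response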

But why would you want to build a retrieval-based model if you can build a generative model? Generative models seem more flexible because they don't need this repository of predefined responses, right?

The problem is that generative models don't work that well in practice. At least not yet. Because they have so much freedom in how they can respond, generative models tend to make grammatical mistakes and produce irrelevant, generic, or inconsistent responses. They also need huge amounts of training data and are hard to train. The vast majority of production systems today are retrieval-based, or a combination of retrieval-based and generative. Google's Smart Reply is a good example. Generative models are an active area of research, but we're not quite there yet. If you want to build a conversational agent today, your best bet is most likely a retrieval-based model.

The Ubuntu Dialog Corpus

In this post we'll work with the Ubuntu Dialog Corpus (paper, GitHub). The Ubuntu Dialog Corpus (UDC) is one of the largest public dialog datasets available. It's based on chat logs from the Ubuntu channels on a public IRC network. The paper goes into detail on how exactly the corpus was created, so I won't repeat that here. However, it's important to understand what kind of data we're working with, so let's do some exploration first.

The training data consists of 1,000,000 examples, 50% positive (label 1) and 50% negative (label 0). Each example consists of a context, the conversation up to this point, and an utterance, a response to the context. A positive label means that the utterance was an actual response to the context; a negative label means that it wasn't, i.e. it was picked randomly from somewhere in the corpus. Let's take a look at some of the data.
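
As a rough sketch of how you might peek at the training data with pandas: the Context/Utterance/Label column names match the CSVs distributed with the corpus, but the file path below is an assumption about where you unpacked it.

    import pandas as pd

    # Each row: a context, a candidate utterance, and a binary label
    # (1 = actual response, 0 = randomly sampled distractor).
    train_df = pd.read_csv("data/train.csv")  # path is an assumption
    print(train_df.columns.tolist())          # ['Context', 'Utterance', 'Label']
    print(train_df["Label"].value_counts())   # roughly a 50/50 split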

Note that the dataset generation script has already done a bunch of preprocessing for us: it has tokenized, stemmed, and lemmatized the output using the NLTK tool. The script also replaced entities like names, locations, organizations, URLs, and system paths with special tokens. This preprocessing isn't strictly necessary, but it's likely to improve performance by a few percent. Check out the Jupyter notebook for the data analysis, which also covers statistics such as the average context and utterance lengths.
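
The authoritative preprocessing lives in the corpus's generation script; the snippet below is only an illustrative sketch of the same kind of NLTK tokenization, stemming, and lemmatization, not the script's actual code.

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("punkt")    # tokenizer models
    nltk.download("wordnet")  # lemmatizer data

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Tokenize, stem, and lemmatize a raw string. (The real script also
    # replaces names, locations, URLs, and system paths with special tokens.)
    def preprocess(text):
        tokens = nltk.word_tokenize(text.lower())
        return [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]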

The dataset comes with test and validation sets. Their format is different from that of the training data: each record in the test/validation set consists of a context, a ground truth utterance (the real response), and 9 incorrect utterances called distractors. The goal of the model is to assign the highest score to the true utterance and lower scores to the wrong ones.
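
A hedged sketch of reading one test example: the column names ("Ground Truth Utterance", "Distractor_0" through "Distractor_8") reflect the CSVs shipped with the corpus, and the path is again an assumption.

    import pandas as pd

    test_df = pd.read_csv("data/test.csv")  # path is an assumption
    row = test_df.iloc[0]
    # Candidate 0 is the real response; candidates 1-9 are the distractors.
    candidates = [row["Ground Truth Utterance"]]
    candidates += [row["Distractor_%d" % i] for i in range(9)]
    # A good model scores candidates[0] above the other nine.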

There are various ways to evaluate how well our model does. A commonly used metric is recall@k: we let the model pick the k best responses out of the 10 possible responses (1 true and 9 distractors). If the correct one is among the picked responses, we mark that test example as correct. A larger k means the task becomes easier. If we set k=10 we get a recall of 100% because we only have 10 responses to pick from; if we set k=1 the model has only one chance to pick the right response.
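
Here is a toy recall@k computation with invented rankings, just to make the metric concrete. Each inner list is a model's ranking of the 10 candidate indices (best first), with index 0 standing for the true response.

    # Toy example: three test cases, candidates ranked best-first.
    rankings = [
        [0, 4, 2, 7, 1, 3, 5, 6, 8, 9],  # true response ranked 1st
        [3, 0, 5, 1, 2, 4, 6, 7, 8, 9],  # true response ranked 2nd
        [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],  # true response ranked last
    ]
    for k in (1, 2, 5, 10):
        correct = sum(1 for r in rankings if 0 in r[:k])
        print("recall@%d = %.2f" % (k, correct / len(rankings)))
    # recall@1 = 0.33, recall@2 = 0.67, recall@5 = 0.67, recall@10 = 1.00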

At this point you may be wondering how the 9 distractors were chosen. In this dataset the 9 distractors were picked at random. However, in the real world you may have millions of possible responses and you don't know which ones are correct. You can't possibly evaluate a million potential responses to pick the one with the highest score; that would be too expensive. Google's Smart Reply uses clustering techniques to come up with a set of possible responses to choose from first. Or, if you only have a few hundred potential responses in total, you could just evaluate all of them.

Baselines

Before starting with fancy neural network models, let's build some simple baseline models to help us understand what kind of performance we can expect. We'll use the following function to evaluate our recall@k metric:

    # y: for each example, the candidate indices ranked by model score
    #    (best first); y_test: the index of the true candidate.
    def evaluate_recall(y, y_test, k=1):
        num_examples = float(len(y))
        num_correct = 0
        for predictions, label in zip(y, y_test):
            if label in predictions[:k]:
                num_correct += 1
        return num_correct / num_examples
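
As a usage sketch, here is a random baseline run through evaluate_recall. The predict_random helper is my own stand-in, not the article's code; with 10 candidates and the true response at index 0, random guessing should score close to k/10.

    import numpy as np

    np.random.seed(42)

    # Random baseline: return the candidate indices in random order.
    def predict_random(context, utterances):
        return np.random.permutation(len(utterances))

    # Simulate 10,000 test examples with 10 candidates each; the
    # ground-truth candidate is always at index 0.
    y = [predict_random("context", ["utterance"] * 10) for _ in range(10000)]
    y_test = [0] * len(y)
    for k in (1, 2, 5, 10):
        print("recall@%d: %.2f" % (k, evaluate_recall(y, y_test, k)))
    # Expect roughly 0.10, 0.20, 0.50, 1.00.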
