SQuAD (Stanford Question Answering Dataset), the text-understanding challenge launched by Stanford University, should be familiar to most readers; it is often called the "ImageNet of the machine reading comprehension community". Many research teams from academia and industry around the world are actively working on it, and machine reading comprehension has seen a number of breakthroughs recently, so I finally had a couple of days to look into some of the open-source frameworks involved.
The SQuAD dataset contains 100,000 (question, passage, answer) triples. The passages come from 536 Wikipedia articles, and the questions and answers were constructed mainly through crowdsourcing: annotators were asked to pose up to 5 questions based on the content of an article and provide the correct answers, with each answer appearing in the original text. In other words, given a document D and a question Q, the answer A is a text span in D.
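To make the triple structure concrete, here is a minimal Python sketch that flattens a SQuAD v1.1 JSON file into (question, passage, answer) triples. The file path and the helper function name are my own placeholders, but the JSON layout (data → paragraphs → qas → answers) is the standard SQuAD format.

```python
import json

# Minimal sketch: read a SQuAD v1.1 JSON file and flatten it into
# (question, passage, answer) triples. "train-v1.1.json" is a placeholder path.
def load_squad_triples(path="train-v1.1.json"):
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)["data"]

    triples = []
    for article in dataset:
        for paragraph in article["paragraphs"]:
            passage = paragraph["context"]          # the document D
            for qa in paragraph["qas"]:
                question = qa["question"]           # the question Q
                answer = qa["answers"][0]           # an annotated answer span A
                # the answer text appears in the passage, starting at
                # character offset answer["answer_start"]
                triples.append((question, passage, answer["text"]))
    return triples
```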
In Microsoft's official article ("From Short to Long: How Computers Learn Reading Comprehension"), machine reading comprehension is described as one of the more difficult tasks in natural language processing; the questions and answers in SQuAD, for example, are highly diverse. A question may ask about a person, a place, or a time mentioned in the article, or it may be a "why" or "how" question. The answer may be a phrase or even a whole sentence, which makes the problem even trickier.
1. Basic principle
The algorithm practiced today is R-NET, proposed by Microsoft; its architecture is as follows:
The model is divided into four layers.
(1) The bottom layer performs representation learning: every word in the question and the passage is mapped to a vector, the standard representation in deep learning. The research group uses a multi-layer bidirectional recurrent neural network. As the architecture diagram shows, the embedding layer combines word-level and character-level embeddings to enrich the input features, which helps the model handle different surface forms of the same expression in questions (a minimal sketch of this word-plus-character representation follows this list).
(2) The word vectors of the question are compared with those of the passage, so the model can find which parts of the passage are most relevant to the question.
(3) The comparison is then carried out globally over the passage itself. Both steps are implemented through the attention mechanism.
Steps (2) and (3) are implemented with a gated recurrent network plus attention: for each word in the passage, an attention distribution over the question is computed and used to summarize the question; the word representation and its corresponding question summary are then fed into an RNN encoder to obtain a question-aware representation of the word. The difference is that R-NET adds an extra gate to filter out unimportant information before the word representation and its question summary enter the RNN (see the gated-attention sketch after this list).
(4) For each word in the candidate answer region, predict whether it is the start of the answer and which word is the end. The system then picks the most likely span of text and outputs it as the answer (a small span-selection sketch also follows this list). The whole pipeline is an end-to-end system built from the four layers of neural networks above.
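As a rough illustration of step (1), the sketch below combines a word-level embedding with a character-derived embedding for a single word. The lookup tables, dimensions, and the mean-pooling over characters are all placeholder assumptions for illustration; the paper itself uses a character-level RNN.

```python
import numpy as np

# Sketch of step (1): a word is represented by concatenating a word-level
# embedding with a character-derived embedding (here a simple mean over
# character vectors). The embedding tables are random placeholders.
rng = np.random.default_rng(1)
word_emb = {"paris": rng.normal(size=(4,))}
char_emb = {c: rng.normal(size=(3,)) for c in "paris"}

def represent(word):
    chars = np.mean([char_emb[c] for c in word], axis=0)   # char-level part
    return np.concatenate([word_emb[word], chars])         # word ++ char

print(represent("paris").shape)   # (7,) = word dim 4 + char dim 3
```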
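The following numpy sketch illustrates the gated attention idea behind steps (2) and (3) for a single passage word: attend over the question, summarize it, concatenate the summary with the word representation, and gate the result before it would enter the matching RNN. This is not the repository's TensorFlow code; all weight matrices here are random placeholders.

```python
import numpy as np

# Illustrative sketch of gated attention for one passage word u_p over the
# encoded question words u_q. Dimensions and weights are arbitrary.
d = 8
rng = np.random.default_rng(0)

u_q = rng.normal(size=(5, d))            # 5 encoded question words
u_p = rng.normal(size=(d,))              # one encoded passage word
W_q, W_p, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d,))
W_g = rng.normal(size=(2 * d, 2 * d))    # gate weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# attention distribution of this passage word over the question
scores = np.tanh(u_q @ W_q + u_p @ W_p) @ v      # shape (5,)
alpha = softmax(scores)
c = alpha @ u_q                                   # attended question summary

# concatenate the word with its question summary, then gate it
x = np.concatenate([u_p, c])                      # shape (2d,)
gate = 1.0 / (1.0 + np.exp(-(W_g @ x)))           # sigmoid gate
x_gated = gate * x                                # input to the matching RNN
```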
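And a minimal sketch of step (4): given per-word start and end probabilities (the numbers below are made up purely for illustration), pick the span (i, j) with i <= j that maximizes the product of start and end probability.

```python
import numpy as np

# Step (4) sketch: choose the answer span with the highest start*end score.
p_start = np.array([0.05, 0.60, 0.20, 0.10, 0.05])
p_end   = np.array([0.05, 0.10, 0.15, 0.60, 0.10])

best, best_span = -1.0, (0, 0)
for i in range(len(p_start)):
    for j in range(i, len(p_end)):
        score = p_start[i] * p_end[j]
        if score > best:
            best, best_span = score, (i, j)

print(best_span)   # (1, 3): the predicted answer runs from word 1 to word 3
```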
2. Hands-on practice
I found an open-source implementation at https://github.com/unilight/R-NET-in-Tensorflow. My environment is TensorFlow 1.3 and Python 3.6, so the source code had to be adapted to run under Python 3.
(1) Data preprocessing
Run preprocess.py --gen_seq True
After it finishes, the preprocessed results can be found in the data directory.
(2) Training
Modify the code as needed, then run training.
(3) Testing
This code is not Microsoft's official release, and it is only a single model: its Exact Match (EM) and F1 scores are 57% and 71% respectively, still some distance from Microsoft's final ensemble model.
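For reference, the sketch below shows how EM and F1 are typically computed for SQuAD-style answers, following the spirit of the official evaluation script (lowercasing, stripping punctuation and articles, then exact string match and token-level F1). It is a simplified illustration, not the official script itself.

```python
import re
import string
from collections import Counter

# Normalize an answer string: lowercase, drop punctuation and articles,
# collapse whitespace (as the SQuAD evaluation does).
def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    pred_tokens, gold_tokens = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))   # 1.0 after normalization
print(f1_score("in the Eiffel Tower", "Eiffel Tower"))   # token-overlap F1
```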