Automatic question answering (QA) is a very active research direction in natural language processing. It combines techniques from knowledge representation, information retrieval, and natural language processing. A QA system lets users state their information needs as natural-language questions rather than as combinations of keywords; the system analyzes the question and automatically finds an accurate answer in a variety of data resources. In terms of functionality, QA is divided into open-domain and restricted-domain question answering. Open-domain means the question domain is unrestricted: users may ask anything, and the system searches for answers in massive amounts of data. Restricted-domain means the system declares in advance that it can only answer questions in a particular field and cannot answer questions from other fields.
To test the feasibility of this approach, I recently ran a small experiment using a question-and-answer corpus from Baidu Zhidao (Baidu Knows).
Specific steps:
(1) Data preprocessing: The raw Baidu Zhidao data is preprocessed into a normalized format and stored in a database for later use, forming the original data set from which the training data is drawn.
(2) Classifier construction: The data is used to train a text classification model. When a user submits a test question, the model assigns it a category label, which narrows down the range of knowledge in which to look for the answer.
(3) Similarity search: Text similarity is computed between the test question and the other questions in the same category of the training corpus; the questions with high similarity are collected as the set of similar questions.
(4) Answer extraction: All answers in the set of similar questions are ranked, and the best one is returned to the user.
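Steps (3) and (4) can be sketched roughly as follows. This is only a minimal illustration, not the code used in the experiment: the `QaPair` record layout, the `rankAnswers` method, and the similarity threshold are hypothetical, cosine similarity over raw term counts is just one possible text-similarity measure, and the naive tokenizer would have to be replaced by a proper word segmenter for Chinese text such as the Baidu Zhidao corpus.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One stored question-answer pair with its category label (hypothetical layout). */
class QaPair {
    final String category;
    final String question;
    final String answer;

    QaPair(String category, String question, String answer) {
        this.category = category;
        this.question = question;
        this.answer = answer;
    }
}

class SimilarQuestionSearch {
    /** Counts terms after a naive split on non-letter, non-digit characters (assumption). */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity between two term-count vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Step (3): keep questions of the given category whose similarity to the
     * query is at least minSimilarity. Step (4): return their answers, most
     * similar question first, so the first element is the answer to show.
     */
    static List<String> rankAnswers(List<QaPair> corpus, String category,
                                    String query, double minSimilarity) {
        Map<String, Integer> queryCounts = termCounts(query);
        List<Map.Entry<Double, String>> scored = new ArrayList<>();
        for (QaPair pair : corpus) {
            if (!pair.category.equals(category)) {
                continue; // the classifier's label restricts the search space
            }
            double sim = cosine(queryCounts, termCounts(pair.question));
            if (sim >= minSimilarity) {
                scored.add(new AbstractMap.SimpleEntry<>(Double.valueOf(sim), pair.answer));
            }
        }
        scored.sort((x, y) -> Double.compare(y.getKey(), x.getKey()));
        List<String> answers = new ArrayList<>();
        for (Map.Entry<Double, String> e : scored) {
            answers.add(e.getValue());
        }
        return answers;
    }
}
```

Calling `SimilarQuestionSearch.rankAnswers(corpus, predictedCategory, userQuestion, 0.1)` returns the candidate answers in ranked order; an empty list means no stored question in that category was similar enough. Here the ranking simply reuses the question similarity, whereas a real system might also weigh answer quality signals.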
The core technology here is the construction of the classifier. Because I have not applied deep learning methods, so far I have only tested an SVM classifier, and it turned out to be feasible. As for computing the similarity between questions, plenty of ready-made tools exist.
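To make the classifier step more concrete, here is a minimal sketch of a linear SVM for question categories. It is not the classifier used in the experiment (an existing SVM library would be the practical choice); it is a from-scratch illustration that trains one binary linear SVM per category (one-vs-rest) by subgradient descent on the hinge loss, with bag-of-words features. The class names, the naive tokenizer, and the hyperparameters are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

/** Binary linear SVM over bag-of-words features (illustrative sketch). */
class BinaryLinearSvm {
    private final Map<String, Double> weights = new HashMap<>();
    private double bias = 0.0;

    /** Naive tokenizer (assumption): Chinese text would need a word segmenter. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Decision value w.x + b; positive means "belongs to the class". */
    double score(String text) {
        double s = bias;
        for (String t : tokens(text)) {
            s += weights.getOrDefault(t, 0.0);
        }
        return s;
    }

    /** labels[i] must be +1 or -1; lambda is the L2 regularization strength. */
    void train(List<String> texts, int[] labels, int epochs,
               double learningRate, double lambda) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < texts.size(); i++) {
                double margin = labels[i] * score(texts.get(i));
                // Regularization step: shrink every weight toward zero.
                for (Map.Entry<String, Double> e : weights.entrySet()) {
                    e.setValue(e.getValue() * (1.0 - learningRate * lambda));
                }
                // Hinge-loss step: only examples inside the margin move the weights.
                if (margin < 1.0) {
                    for (String t : tokens(texts.get(i))) {
                        weights.merge(t, learningRate * labels[i], Double::sum);
                    }
                    bias += learningRate * labels[i];
                }
            }
        }
    }
}

/** One-vs-rest wrapper: one binary SVM per category, highest score wins. */
class CategoryClassifier {
    private final Map<String, BinaryLinearSvm> models = new LinkedHashMap<>();

    void train(List<String> questions, List<String> categories) {
        for (String category : new LinkedHashSet<>(categories)) {
            int[] labels = new int[questions.size()];
            for (int i = 0; i < labels.length; i++) {
                labels[i] = categories.get(i).equals(category) ? 1 : -1;
            }
            BinaryLinearSvm model = new BinaryLinearSvm();
            model.train(questions, labels, 20, 0.1, 0.001); // illustrative values
            models.put(category, model);
        }
    }

    /** Returns the category whose model scores highest, or null if untrained. */
    String predict(String question) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, BinaryLinearSvm> e : models.entrySet()) {
            double s = e.getValue().score(question);
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

After `train(questions, categories)` on the labeled corpus, `predict(userQuestion)` yields the category label that restricts the later similarity search. Shrinking every weight on each step keeps the sketch short but is slow for large vocabularies, which is one more reason to rely on a mature SVM library for real data.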
The implementation is written in Java; the test results are as follows: