Question Answering (QA) System

Source: Internet
Author: User
Tags: ming, sort, knowledge base
Existing search systems, whether restricted-domain search or Internet search engines, are generally keyword-based. This has three drawbacks: 1. they return many related results rather than a single answer; 2. keywords express the user's intent poorly; 3. matching stays at the surface language level and never reaches the semantic level.

Frequently Asked Questions (FAQ) systems return a ranked list of answers by extracting features from the question and computing its similarity to stored questions.


Question analysis: This mainly includes word segmentation, POS tagging, syntactic analysis, named entity recognition, question classification, question expansion, and so on.
Word segmentation: Chinese and English differ greatly here. English uses spaces as natural delimiters between words, whereas Chinese uses the character as its basic writing unit and has no explicit marker between words. The most common segmentation method is rule-based dictionary matching. When a segmentation is ambiguous, the available strategies include maximum matching (forward, backward, or bidirectional), minimum segmentation, and full segmentation, each of which has shortcomings. In a restricted domain, we need to build our own domain dictionary to improve segmentation accuracy. Available segmenters include:
1. word segmenter
2. Ansj segmenter
3. mmseg4j segmenter
4. IK Analyzer segmenter
5. Jcseg segmenter
6. FudanNLP segmenter (Fudan University)
7. smartcn segmenter
8. jieba segmenter
9. Stanford segmenter
10. HanLP segmenter
The speed, segmentation quality, and restricted-domain performance of these segmenters have not been tested here.
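The forward maximum matching strategy mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `forward_max_match` and the toy dictionary are hypothetical, and the segmenters listed above are far more elaborate.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary: "research", "graduate student", "life", "origin".
toy_dictionary = {"研究", "研究生", "生命", "起源"}
tokens = forward_max_match("研究生命起源", toy_dictionary)
print(tokens)
```

On this input the greedy strategy picks "研究生" (graduate student) first and leaves "命" stranded, whereas the intended reading is "研究 / 生命 / 起源" (research / life / origin). This is one of the shortcomings of pure dictionary matching, and one reason a domain dictionary or a combination of strategies is used.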
POS tagging: The Stanford POS Tagger (stanford-postagger) performs relatively well on Chinese POS tagging.
Syntactic analysis: Syntactic analysis refers to the formal definition of the grammatical structure of natural language, which can be divided into phrase-structure grammar and dependency grammar.
Named entity recognition: The task of named entity recognition is to identify named entities in the text being processed, in three major categories (entity, time, and number) and seven subcategories (person, organization, location, time, date, currency, and percentage).
Question classification: Semantic dictionaries such as WordNet (22) and HowNet (23) are used to expand the question with hypernyms, hyponyms, and synonyms. Question classification methods are mainly either based on pattern-rule matching or based on statistical learning, with machine learning approaches dominant. The advantage of pattern matching is that it needs no corpus, avoids the error rate and workload of manual labeling, and can still give good results. However, because Chinese expression is flexible and many questions do not even contain an interrogative word, rule-based methods adapt poorly.
Work on Chinese includes:
1. A new feature extraction method that relies on the results of syntactic analysis and uses the main words, the interrogative words, and their subsidiary components as the feature input to a BN classifier.
2. Using the HowNet semantic dictionary, the interrogative word, the syntactic structure, and the first sememe in HowNet of the word expressing the question's intent are taken as feature input to an EM classifier.
3. Extracting six kinds of features from the question (the interrogative word, the core keywords, the first sememes of the core keywords, the main semantics of the main predicate, named entities, and the singular or plural form of nouns) and using an SVM classifier to compare classification results for different feature combinations on factoid questions.
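The pattern-rule approach to question classification described above can be illustrated with a small sketch. The rules, class labels, and the function name `classify_question` are hypothetical and use English patterns for readability; a real rule set would be much larger and language-specific.

```python
import re

# Hypothetical pattern rules: (regular expression, question class).
RULES = [
    (re.compile(r"^\s*who\b", re.IGNORECASE), "PERSON"),
    (re.compile(r"^\s*where\b", re.IGNORECASE), "LOCATION"),
    (re.compile(r"^\s*when\b", re.IGNORECASE), "TIME"),
    (re.compile(r"^\s*how (many|much)\b", re.IGNORECASE), "NUMBER"),
]


def classify_question(question):
    """Return the class of the first rule that matches, or UNKNOWN."""
    for pattern, label in RULES:
        if pattern.search(question):
            return label
    return "UNKNOWN"


print(classify_question("Where was Yao Ming born?"))
print(classify_question("Yao Ming's birthplace?"))
```

The second question has no interrogative word, so no rule fires. This is the adaptability problem noted above, and it is the kind of case where a statistically trained classifier is usually preferred.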
Question expansion: There are currently two main approaches. One expands the question using external text, for example from a search engine; the other uses a knowledge base such as WordNet or Wikipedia to mine the underlying links between words.
The next issue is how to represent the question after the processing above.
Question representation is particularly important in knowledge-base question answering systems. There are two main methods: one is based on symbolic representation, the other on distributed representation learned with deep learning.
The symbolic method represents the question as a formal query, such as a logical expression, lambda calculus, a DCS tree, or FunQL, which is then translated into a corresponding query language such as SQL, SPARQL, Prolog, or FunQL.
Information retrieval: Information retrieval takes the output of the question analysis module as input and returns a ranked list of relevant documents from the underlying knowledge base. Commonly used retrieval models are the Boolean model, the vector space model, and the probabilistic model.
1. The Boolean model is a simple retrieval model based on set theory and Boolean algebra. Its queries are built from keywords joined by the operators AND, OR, and NOT, and it returns relevant documents to the user by taking the intersection, union, or complement of the inverted-index posting lists of the keywords.
Example: Here are 2 documents:
Document 1: a b c f g h;
Document 2: a f b x y z;
The user wants documents that contain a or b and that must contain z. By inspection, only document 2 meets this need. How does the Boolean model compute it? The query is written as the Boolean expression q = (a ∨ b) ∧ z and converted to disjunctive normal form over the triple (a, b, z): q_dnf = (1,0,1) ∨ (0,1,1) ∨ (1,1,1). The triples for document 1 and document 2 are (1,1,0) and (1,1,1) respectively. Only the triple of document 2 matches one of the conjuncts, so document 2 is returned.
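The same example can be worked through with an inverted index and set operations. This is a minimal sketch; the helper names `build_inverted_index` and `postings` are made up for illustration.

```python
# The two example documents, as lists of terms.
documents = {
    1: "a b c f g h".split(),
    2: "a f b x y z".split(),
}


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


index = build_inverted_index(documents)
all_ids = set(documents)


def postings(term):
    """Posting set of a term; empty if the term never occurs."""
    return index.get(term, set())


# q = (a OR b) AND z: union for OR, intersection for AND.
result = (postings("a") | postings("b")) & postings("z")
# NOT is the complement with respect to the whole collection.
without_z = all_ids - postings("z")

print(result)
print(without_z)
```

Here `result` contains only document 2, and `without_z` contains only document 1. Note that the output is an unordered set: the model says which documents match, but gives no basis for ranking them.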
Advantages of the Boolean model: complex Boolean expressions make it easy to control the query results.
Problems with the Boolean model: 1. Partial matching is not supported; exact matching returns either too many or too few results and is very rigid ("and" means every term must match). 2. It is hard to control the number of documents retrieved; in principle, all matching documents are returned. 3. It is difficult to rank the output.
2. The vector space model is the basis of current text retrieval systems and web search engines. It represents documents and the user's query as points in a vector space and uses the cosine of the angle between them as the measure of similarity.
Example: Suppose a document contains k terms, and let w_kj be the weight of term k in document j. Assume the document collection has size n, and let f_ij be the number of times term i occurs in document j.
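The weighting formula itself does not appear in the text. As a hedged illustration, the sketch below assumes the common TF-IDF weight, w_ij = f_ij · log(n / df_i), where df_i (the number of documents containing term i) is notation introduced here, and ranks documents by cosine similarity. The function names are hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """Inverse document frequency log(n / df_i) for each term (assumed weighting)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def weight(terms, idf):
    """TF-IDF vector as a dict; terms unseen in the collection get weight 0."""
    tf = Counter(terms)
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["a b c f g h".split(), "a f b x y z".split()]
idf = idf_table(docs)
vectors = [weight(doc, idf) for doc in docs]
query = weight(["a", "z"], idf)
scores = [cosine(query, vec) for vec in vectors]
print(scores)
```

Unlike the Boolean model, every document receives a graded score, so partial matches are possible and the output can be sorted. In this two-document toy collection the term a occurs everywhere and gets weight 0, so only z distinguishes the documents.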

3. Probabilistic retrieval models usually use keywords as clues. They estimate the probability that each keyword appears or does not appear in the set of relevant documents, and the probability that it appears or does not appear in the set of documents not relevant to the query, and then compute the similarity between question and document from these probabilities.
Notation: R: the set of relevant documents; NR: the set of non-relevant documents; q: the user query; d_j: document j.
PRP (probability ranking principle): use a probabilistic model to estimate, for each document, the probability that it is relevant to the information need, and then rank the results by that probability.
Bayesian optimal decision principle: make the decision that minimizes the expected loss, returning the documents whose probability of being relevant is greater than their probability of being non-relevant.
Formula for conditional probabilities: P(AB) = P(A) P(B|A) = P(B) P(A|B)

The Bayesian formula is deduced from the conditional probability formula: P(B|A) = P(A|B) P(B) / P(A)
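As a sketch of how the Bayesian formula connects to the decision rule above, apply it to both sides of the comparison between the relevance and non-relevance probabilities of a document d_j. This assumes P(d_j), P(R), and P(d_j | NR) are all positive, and leaves the conditioning on the query q implicit:

```latex
P(R \mid d_j) > P(NR \mid d_j)
\;\iff\; \frac{P(d_j \mid R)\,P(R)}{P(d_j)} > \frac{P(d_j \mid NR)\,P(NR)}{P(d_j)}
\;\iff\; \frac{P(d_j \mid R)}{P(d_j \mid NR)} > \frac{P(NR)}{P(R)}
```

The right-hand side does not depend on the document, so under these assumptions documents can be ranked by the likelihood ratio P(d_j | R) / P(d_j | NR), which is estimated from the keyword probabilities described above.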


Answer Extraction:
1) Pattern matching: For example, for the question "Where was Yao Ming born?", the answer usually has the form "Yao Ming was born in Shanghai", so the answer pattern can be set to "<Person> was born in <Location>".
2) Relation predefinition: An effective way to implement relation extraction is to convert the sentence into a triple consisting of subject, predicate, and object, and then extract the answer from it.

3) Semantic similarity: For two words W1 and W2, let their similarity be Sim(W1, W2) and their word distance be Dis(W1, W2), and let α be an adjustable parameter. Then:


Define two sentences A and B, where A contains the m words a1, a2, a3, ..., am and B contains the n words b1, b2, b3, ..., bn. Let Sim(ai, bj) denote the similarity between words ai and bj. The similarity between any pair of words is then:
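The formulas for this step do not appear in the text, so the following is only a sketch under explicit assumptions. For word similarity it assumes the commonly used distance-based form Sim(W1, W2) = α / (Dis(W1, W2) + α). For sentence similarity it assumes each word is matched to its most similar word in the other sentence, and that the two directional averages are then averaged. The default value of `alpha`, the function names, and the toy exact-match similarity are all hypothetical.

```python
def word_similarity(distance, alpha=1.6):
    """Assumed form: similarity falls from 1 toward 0 as word distance grows."""
    return alpha / (distance + alpha)


def sentence_similarity(words_a, words_b, sim):
    """Average best-match word similarity, symmetrised over both sentences."""
    if not words_a or not words_b:
        return 0.0
    # For each word in A, its best match in B, and vice versa.
    best_a = [max(sim(a, b) for b in words_b) for a in words_a]
    best_b = [max(sim(a, b) for a in words_a) for b in words_b]
    return 0.5 * (sum(best_a) / len(best_a) + sum(best_b) / len(best_b))


def exact_match(a, b):
    """Toy word similarity used only for this demonstration."""
    return 1.0 if a == b else 0.0


sentence_a = ["yao", "ming", "born"]
sentence_b = ["yao", "ming", "shanghai"]
score = sentence_similarity(sentence_a, sentence_b, exact_match)
print(word_similarity(0.0), word_similarity(1.6), score)
```

With the assumed word formula, a distance of 0 gives similarity 1 and a distance equal to α gives 0.5, so α is the distance at which similarity is halved. In practice `sim` would be backed by a semantic dictionary such as HowNet rather than exact matching.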






