The work of a basic search engine can be divided roughly into three parts:
1. use a web crawler to download web pages, analyze their keywords, and build an index;
2. understand the user's input and determine the search keywords;
3. list the search results, sorted by the relevance between the keywords and the indexed pages.
The first part mainly involves web crawlers, graph theory, natural language processing, and so on. The second and third parts mainly involve natural language processing.
Natural language is the language that human beings use to communicate. Natural language processing (NLP) is therefore a core component of a modern search engine; its ultimate goal is to convert natural language into a form that is easy for computers to handle.
Grammatical analysis and statistical models, from the perspective of word segmentation
Word segmentation is a basic problem that NLP needs to solve, and the quality of the segmentation algorithm directly affects the quality of everything built on top of it. Let us start with a simple example and work our way toward a reasonable segmentation algorithm.
Start with a simple sentence
Consider a sentence such as:
I went to the computer city to buy a computer.
If you want a computer to segment this sentence into words and then understand it, how would you approach the problem?
Most people will first think about how they themselves understand the sentence. For a native Chinese speaker, such a simple sentence hardly requires any conscious thought: the written form and the meaning behind it register in an instant. A reader with a little knowledge of grammar might reason as follows. First, divide the sentence into several parts:
I - the subject
went to the computer city to buy a computer - the predicate
    to the computer city - adverbial
    buy - verb
    a computer - object (a noun phrase)
. - the end-of-sentence mark
Then, understand the meaning of each part separately.
Finally, put the meanings together to form the complete sentence.
In other words, the reader first performs a grammatical analysis that divides the sentence into a two-dimensional syntax tree, then understands the meaning of each part, and finally stitches the meanings together.
Such a scheme (or algorithm) is based on grammatical rules; it is clear and easy to implement (on a computer, a few loops and conditionals suffice). For programmers, such an algorithm also feels especially familiar, because the syntax rules of the high-level programming languages they use (such as C++) are parsed in much the same way.
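To make this concrete, here is a minimal sketch in Python of a toy, bottom-up, rule-based parser. The lexicon and grammar rules are invented for this one example sentence and are not the author's grammar; a realistic grammar would need vastly more rules, which is exactly the problem discussed below.

```python
# Toy rule-based parser: hand-written lexicon and rules that cover only the
# example sentence, purely for illustration.
LEXICON = {
    "i": "pronoun", "went": "verb", "buy": "verb",
    "to": "preposition", "the": "determiner", "a": "determiner",
    "computer": "noun", "city": "noun",
}

RULES = [
    # (parent label, sequence of child labels it is built from)
    ("noun_phrase", ["noun", "noun"]),                  # computer city
    ("noun_phrase", ["determiner", "noun_phrase"]),     # the computer city
    ("noun_phrase", ["determiner", "noun"]),            # a computer
    ("prep_phrase", ["preposition", "noun_phrase"]),    # to the computer city
    ("verb_phrase", ["verb", "prep_phrase"]),           # went to the computer city
    ("verb_phrase", ["verb", "noun_phrase"]),           # buy a computer
    ("verb_phrase", ["preposition", "verb_phrase"]),    # to buy a computer (infinitive)
    ("sentence", ["pronoun", "verb_phrase", "verb_phrase"]),
]

def parse(nodes):
    """Bottom-up parsing: repeatedly merge the leftmost run of adjacent nodes
    that matches a rule, until no rule applies."""
    changed = True
    while changed:
        changed = False
        for label, pattern in RULES:
            for i in range(len(nodes) - len(pattern) + 1):
                if [n[0] for n in nodes[i:i + len(pattern)]] == pattern:
                    merged = (label, nodes[i:i + len(pattern)])
                    nodes = nodes[:i] + [merged] + nodes[i + len(pattern):]
                    changed = True
                    break
            if changed:
                break
    return nodes

def show(node, depth=0):
    """Print the resulting syntax tree with indentation."""
    label, body = node
    if isinstance(body, str):                # leaf: a single word
        print("  " * depth + f"{label}: {body}")
    else:                                    # internal node of the tree
        print("  " * depth + label)
        for child in body:
            show(child, depth + 1)

sentence = "I went to the computer city to buy a computer."
leaves = [(LEXICON[w], w) for w in sentence.lower().rstrip(".").split()]
for root in parse(leaves):
    show(root)
```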
Because such algorithms are intuitive and easy to implement, people once believed that with a more comprehensive set of grammatical rules and more computing power, the problem of natural language processing could be solved completely.
The dilemma of grammatical analysis
However, if you look closely at the analysis above, you will find that even such a simple sentence had to be divided into a fairly complex two-dimensional tree structure, at the cost of six annotations. Handling such a process with a computer is certainly not hard, but the sentences encountered in real life are rarely this easy to deal with. Consider:
Because understanding natural language requires extensive knowledge about the outside world and the ability to manipulate that knowledge, natural language cognition is also regarded as an artificial intelligence complete (AI-complete) problem.
This sentence can still be handled in the same way:
first split the sentence as a whole into a subject part and a predicate part, and then annotate each part. For example:
Natural language cognition - the subject (a modifier-head phrase)
    natural language - noun used as an attributive modifier
    cognition - noun
because understanding natural language requires extensive knowledge about the outside world and the ability to manipulate that knowledge, ... is also regarded as an artificial intelligence complete (AI-complete) problem - the predicate
    because understanding natural language requires extensive knowledge about the outside world and the ability to manipulate that knowledge - adverbial of reason ...
    ... is also regarded as an artificial intelligence complete (AI-complete) problem - predicate verb phrase
        also - adverbial
        is regarded as - predicate verb
        an artificial intelligence complete (AI-complete) problem - object
            an - attributive
            artificial intelligence complete (AI-complete) - attributive
            problem - noun
. - the end-of-sentence mark
I did not finish the parse tree of this sentence, because it is simply too complicated. Clearly, a parser based solely on grammatical analysis has a hard time dealing with real-life sentences.
So where does the problem lie? I think there are at least two problems.
1. The number of grammatical rules is enormous: tens of thousands of rules cover only about 20% of real sentences, and some of the rules written to handle special cases contradict other rules.
2. Natural language differs from programming languages: the specific meaning of a word in natural language depends on its context, while programming languages have no such ambiguity.
From the point of view of algorithmic complexity, using a grammar-based parser to analyze natural language is about four orders of magnitude more expensive than parsing a programming language. To give an intuitive impression: parsing the sentence above by grammatical analysis would take a modern computer at least a minute. Such inefficiency is unacceptable.
Word segmentation by dictionary lookup
In the grammar-analysis approach above, word segmentation depended on the result of grammatical analysis: the program had to produce the syntax tree before it could obtain the segmentation. And that method has been shown to be inefficient.
The inefficiency comes from the complicated process of grammatical analysis. To improve efficiency, it is natural to ask: can we bypass grammatical analysis and attempt word segmentation directly? For Chinese word segmentation, Professor Liang Nanyuan of Beihang University proposed a dictionary-lookup method. The procedure is quite simple. Take the following sentence, for example:
The School of Mathematics of Shandong University is one of the best bases of basic mathematics education in China.
We let the computer scan the whole sentence from left to right, and every time the scanned characters form a word found in the dictionary, that word is marked off. The whole sentence is then divided like this (rendering the Chinese words one by one):
Shandong | university | mathematics | school | is | China | best | mathematics | basic | education | base | one of.
This looks like a decent result. But an attentive reader will quickly notice that "Shandong University" and "basic education" are each complete words and should not be split. This is not surprising: we asked the computer to scan from left to right, so as soon as it encounters the two characters for "Shandong" it decides this is a word and never looks at the following characters for a longer match. The same happens with "basic education".
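Here is a minimal sketch of this naive left-to-right scan in Python. The Chinese sentence and the dictionary below are my own toy reconstruction of the example, purely for illustration:

```python
# Naive left-to-right dictionary scan: grow a candidate word character by
# character and cut as soon as the candidate appears in the dictionary.
DICTIONARY = {
    "山东", "山东大学", "大学", "数学", "学院", "是", "中国",
    "最好", "的", "基础", "教育", "基础教育", "基地", "之一",
}

def naive_segment(text, dictionary):
    words, start = [], 0
    while start < len(text):
        match = None
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:    # first (i.e. shortest) match wins
                match = text[start:end]
                break
        if match is None:                        # unknown character: emit it alone
            match = text[start]
        words.append(match)
        start += len(match)
    return words

sentence = "山东大学数学学院是中国最好的数学基础教育基地之一"
print(" | ".join(naive_segment(sentence, DICTIONARY)))
# -> 山东 | 大学 | ... | 基础 | 教育 | ...
# "山东大学" and "基础教育" are in the dictionary, but the scan cuts at the
# first match and never even considers them.
```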
Professor Liang proposed a remedy: always look for the longest possible match. In computer science this is called a greedy method. With greedy matching, the segmentation of the sentence above becomes:
Shandong University | mathematics | school | is | China | best | mathematics | basic education | base | one of.
There seems to be no problem.
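Greedy (longest-match) scanning differs from the naive scan only in that it keeps extending the candidate and remembers the longest dictionary word it finds. A minimal sketch, again on my own reconstruction of the example:

```python
# Greedy forward maximum matching: at each position, take the LONGEST
# dictionary word that starts there. Toy dictionary, for illustration only.
DICTIONARY = {
    "山东", "山东大学", "大学", "数学", "学院", "是", "中国",
    "最好", "的", "基础", "教育", "基础教育", "基地", "之一",
}

def greedy_segment(text, dictionary):
    words, start = [], 0
    while start < len(text):
        match = text[start]                      # fall back to a single character
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:    # remember the longest match so far
                match = text[start:end]
        words.append(match)
        start += len(match)
    return words

sentence = "山东大学数学学院是中国最好的数学基础教育基地之一"
print(" | ".join(greedy_segment(sentence, DICTIONARY)))
# -> 山东大学 | 数学 | 学院 | ... | 基础教育 | ... : the long words now come out whole.
# The ambiguous phrases discussed next are exactly where this longest-match
# rule starts to fail.
```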
However, Chinese is vast and subtle, and this method is not a cure-all. For example:
university living area
The correct segmentation is:
university | living area
But greedy matching, having found a longer word first, produces something like:
university student | live | area
which is wrong. (In the Chinese original, the characters for "university" plus the first character of "living" themselves form the word "university student", and that is what the greedy match latches onto.)
Another example:
developing countries
The correct segmentation is:
developing | countries
rather than:
development | China | home
(Again, in the Chinese original the middle characters of "developing countries" happen to spell the word "China", which misleads the greedy match.)
It is clear that the dictionary-lookup method is very efficient, but it makes mistakes and is not reliable.
The trouble with dictionary lookup comes from the ambiguity of natural language. When people read, they use context to determine the intended meaning of an ambiguous word, but the computer has no such ability. In fact, the traditional practice of "judou" (marking where sentences and phrases break) in classical Chinese texts was aimed precisely at eliminating ambiguity by separating the words. So how do we give computers this ability?
The long-awaited statistical model
At this point, mathematics is finally going to show its power and beauty for the first time.
As mentioned above, whether a run of characters should be treated as one word is closely tied to the ambiguity of the sentence. Because the computer cannot judge a word's meaning from the overall context, it cannot resolve this ambiguity, and the dictionary-lookup method runs into trouble.
Mathematics has the so-called proof by contradiction. We are not going to use proof by contradiction here, but we will borrow its spirit. Its core idea is "when the direct route is hard, go around": if a frontal assault is too difficult, do not take that road; find a back gate into the city. Here, since the computer lacks the ability to resolve lexical ambiguity from context on its own, we stop relying on the computer's intelligence and turn to human effort instead. Of course, I do not mean having a human operator intervene in the program in real time to help it make the right decisions; rather, we let the computer absorb human "segmentation experience" by training it on large amounts of text. This approach is the statistical model.
Suppose a sentence $S$ can be segmented in several different ways, for example the following three:

    $A_1, A_2, A_3, \ldots, A_j$    (1)
    $B_1, B_2, B_3, \ldots, B_k$    (2)
    $C_1, C_2, C_3, \ldots, C_l$    (3)

where $A_1, A_2, B_1, B_2, C_1, C_2$ and so on are all Chinese words. If (1) is the best segmentation, then (1) should be the segmentation with the largest probability of occurrence. In other words, segmentation (1) should satisfy (4):

    $P(A_1, A_2, A_3, \ldots, A_j) > P(B_1, B_2, B_3, \ldots, B_k)$
    $P(A_1, A_2, A_3, \ldots, A_j) > P(C_1, C_2, C_3, \ldots, C_l)$    (4)
The answer is so simple.
Of course, evaluating (4) requires a bit of statistical knowledge and technique; finding the best segmentation efficiently also relies on dynamic programming (otherwise the amount of computation is too large); and details such as the granularity of segmentation still need to be handled. These are discussed in the following sections. For now, the reader only needs to know that segmentation based on the statistical method is both effective and efficient.
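As a preview of how this can work, here is a minimal sketch in Python assuming the simplest possible statistical model: a unigram model in which the probability of a segmentation is the product of the probabilities of its words, maximized by dynamic programming. The probabilities below are invented for illustration; in practice they would be estimated from a large training corpus, and this is not the exact model developed in the later sections.

```python
import math

# Toy unigram word probabilities (invented for illustration).
WORD_PROB = {
    "大学": 0.08, "大学生": 0.02, "生活": 0.06, "生活区": 0.01,
    "活": 0.005, "区": 0.01,
}
UNKNOWN_PROB = 1e-8        # tiny probability for characters the model has not seen
MAX_WORD_LEN = 6           # longest word the model will consider

def best_segmentation(text):
    """Pick the segmentation whose word probabilities multiply to the largest
    value, using dynamic programming over end positions (in log space)."""
    n = len(text)
    best = [float("-inf")] * (n + 1)   # best[i]: best log-probability of text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)               # back[i]: start index of the last word in that split
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            word = text[j:i]
            if word in WORD_PROB:
                logp = math.log(WORD_PROB[word])
            elif len(word) == 1:
                logp = math.log(UNKNOWN_PROB)   # unseen single character
            else:
                continue                         # not a word the model knows
            if best[j] + logp > best[i]:
                best[i] = best[j] + logp
                back[i] = j
    words, i = [], n                   # walk the back-pointers to recover the words
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

# "大学生活区" (university living area): the probabilities favour 大学 | 生活区
# over 大学生 | 活 | 区, so the statistical model avoids the greedy mistake.
print(" | ".join(best_segmentation("大学生活区")))   # -> 大学 | 生活区
```

With nothing more than word frequencies and a short dynamic program, the ambiguity that defeated the greedy dictionary lookup is resolved in a single pass.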
Summary
For word segmentation, the statistical-model approach is both more efficient than grammar-based analysis and better in quality; the gain in efficiency is very significant.
In addition, we find that the mathematical model behind a good algorithm is often concise and elegant: the statistical model can be described by a single group of probability inequalities, while the grammar-analysis approach can hardly be given a readable mathematical model at all. When designing algorithms, we should strive for simple and beautiful mathematical models, starting simple and rough and refining them step by step. As Isaac Newton said, "Truth is ever to be found in simplicity, and not in the multiplicity and confusion of things."
Finally, the grammar-analysis method is the easy, natural approach to think of, yet that very "naturalness" can lead people astray. This reminds us not to cling stubbornly to intuition or to trust experience blindly.