Chinese word segmentation and search engines

Last Update:2016-01-22 Source: Internet

Author: User


You can probably tell from the title what I want to say. This topic seems to have been discussed countless times. The Yahoo Search Blog published a series of articles on it back in 2006 (http://ysearchblog.cn/2006/07/post_16.html) that explain in detail what Chinese word segmentation means, the algorithms, how it relates to search engines, and so on. Personally I think those articles are very good. What I write here covers little beyond them, so why write it at all? Because I spent nearly a week studying Chinese word segmentation and collecting material, and I want to summarize it so that the effort is not wasted.

One. Why Chinese word segmentation?

Yes, why segment at all? Can't we do without it? To answer that, we first need to understand how a search engine works. How can a search engine retrieve results for a query keyword so quickly? It comes down to its data storage mechanism: the inverted index. Here is an example of what an inverted index is (a rough one, since my own understanding is not deep).

Suppose I have 10 articles, which may discuss the same or different topics. If I want to know which articles contain the term "Chinese word segmentation", I could loop over every article, check whether its content contains the term, and return the articles that do. Obviously this means opening all 10 articles and scanning each one from beginning to end. That is very inefficient, and absolutely unacceptable for a search engine that has to answer in milliseconds.

So I build a "catalogue" for the articles in advance. You want to query "Chinese word segmentation"? Fine, I have already found the articles that contain it. If articles 1, 3, 5 and 7 contain "Chinese word segmentation" and articles 2, 4, 6 and 7 contain "search engine", I set up a correspondence table: "Chinese word segmentation-->1,3,5,7", "search engine-->2,4,6,7". Now a search for "Chinese word segmentation" no longer opens each article. It looks the term up in the table and immediately gets articles 1, 3, 5 and 7. Likewise, a search for "search engine" returns 2, 4, 6 and 7. What if I search for both "Chinese word segmentation" and "search engine"? Take the intersection of (1,3,5,7) and (2,4,6,7): article 7 contains both terms. What about other search terms? Put them in the table as well.

This correspondence table is the so-called inverted index. A real inverted index may hold more information, such as not only which articles contain a word but also where in each article it occurs. Obviously, we need to build an inverted index over all the articles. Here the problem arises: how does the computer know what counts as a word? It does not know that "Chinese word segmentation" is a term. Without words, how can we build an index? So we need Chinese word segmentation. Segmentation happens in two places: when the user's query is processed and when the server builds the index.
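The correspondence table described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `build_inverted_index` and `search_all` and the ten-article corpus are made up for the example, and each article is assumed to be already split into terms.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of article ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, terms):
    """Return the ids of articles that contain every query term."""
    postings = [set(index.get(term, ())) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical corpus: ten articles, each reduced to the terms it contains.
docs = {doc_id: [] for doc_id in range(1, 11)}
for doc_id in (1, 3, 5, 7):
    docs[doc_id].append("Chinese word segmentation")
for doc_id in (2, 4, 6, 7):
    docs[doc_id].append("search engine")

index = build_inverted_index(docs)
print(index["Chinese word segmentation"])  # [1, 3, 5, 7]
print(index["search engine"])              # [2, 4, 6, 7]
print(search_all(index, ["Chinese word segmentation", "search engine"]))  # [7]
```

A lookup is now a dictionary access plus a set intersection, instead of a scan over every article.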

Two. What kind of Chinese word segmentation suits a search engine?

Because a Chinese sentence is a continuous sequence of characters with no visible separator between words (in English, words are separated by spaces), segmentation is considerably harder. Take the most extreme approach: treat every character as a word, the so-called unigram (1-gram) segmentation. For example, "山东经济学院" (Shandong Economic College) is split into "山_东_经_济_学_院", and then an inverted index is built:

山-->1,3

东-->2,3

经-->3,4

济-->3,5,4

院-->3,5,6

So when I search for "山东经济学院" (Shandong Economic College), I look up each character in the inverted index and intersect the results. The answer is 3: article 3 contains "山东经济学院". This looks perfect: it sidesteps the segmentation problem entirely and still builds a working inverted index. But Chinese is too rich for the problem to be that simple. Suppose I index "主板和服务器" (motherboards and servers) with unigram segmentation and then search for the Japanese garment "和服" (kimono). The characters 和 and 服 happen to sit next to each other in that phrase, so the search returns the completely unrelated "motherboards and servers" article, which leaves the user baffled.

Besides search quality there is also search efficiency. For "山东经济学院" I have to look up 6 index entries and then intersect them. That costs performance, which a millisecond-level search engine can hardly tolerate. If "山东经济学院" were a single word with the index entry "山东经济学院-->3", I would find it in one lookup with no intersection at all. So unigram segmentation clearly cannot meet the need.

Bigram (2-gram) segmentation and overlapping bigram segmentation are similar. For "主板和服务器", bigram segmentation gives "主板_和服_务器" and overlapping bigram segmentation gives "主板_板和_和服_服务_务器". Both improve search efficiency, but neither does much for the search experience: a search for "和服" (kimono) still returns a pile of results about computers.
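The three character-based schemes are easy to try out. The following Python sketch uses the phrase "主板和服务器" (motherboards and servers); the function names are made up for this illustration.

```python
def unigrams(text):
    """Unigram segmentation: every character is its own term."""
    return list(text)


def bigrams(text):
    """Non-overlapping pairs; a trailing single character is kept as-is."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


def overlapping_bigrams(text):
    """Every pair of adjacent characters (empty for one-character text)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


text = "主板和服务器"  # "motherboards and servers"
print(unigrams(text))             # ['主', '板', '和', '服', '务', '器']
print(bigrams(text))              # ['主板', '和服', '务器']
print(overlapping_bigrams(text))  # ['主板', '板和', '和服', '服务', '务器']

# The unwanted match: "和服" (kimono) appears as a term in both bigram outputs.
print("和服" in bigrams(text))              # True
print("和服" in overlapping_bigrams(text))  # True
```

Since "和服" ends up in the index as a term, a kimono query matches an article about computer hardware.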

Is treating "山东经济学院" (Shandong Economic College) as a single word, as suggested above, the perfect strategy then? This is where segmentation for search engines differs from segmentation in other fields. For machine translation, "Shandong Economic College" is clearly a proper noun and should not be split. But for a search engine, I want a query for "经济学院" (economic college) to find Shandong Economic College as well, so "山东经济学院" cannot be a single word.

Therefore, segmentation for a search engine must keep the granularity reasonable!

The question is, what counts as reasonable granularity? In my view it is the smallest word that still expresses a complete meaning. That is a mouthful, so let me explain. "山东经济学院" (Shandong Economic College) can be split into "山东_经济学院" or "山东_经济_学院". I prefer the latter, because "经济学院" can itself be split into "经济_学院" (economy_college). "经济" and "学院" are both understandable words that express a complete meaning and cannot be divided further. This guarantees recall: queries such as "经济学院" or "山东_学院" can both retrieve this result. How well the results satisfy the user then depends on relevance ranking, which puts the results most relevant to the query first. For a search on "山东经济" (Shandong economy), for example, an article about the economic development of Shandong Province should rank near the top, and an article about Shandong Economic College should rank after it.

Three. Chinese word segmentation algorithms

There are three families of Chinese word segmentation methods, each corresponding to its own research field.

1. Dictionary-based segmentation

2. Statistics-based segmentation

3. Syntax-based segmentation

As far as I know, the first method is the one most widely used by search engines today. The dictionary holds a large number of words. You take a run of adjacent characters and look it up in the dictionary, and if it matches, it counts as a word. Take "山东经济学院" (Shandong Economic College): first try "山". The dictionary has it, so that is a word. Then try "山东". The dictionary has it too, so that is also a word. Suppose the dictionary also contains "山东经济". Now I have matched three words, "山", "山东" and "山东经济". Which one should I choose? To settle this, people adopted a rule: choose the longest word, because practice shows it is usually (though not always) correct. Algorithms built on this rule include forward maximum matching, inverse maximum matching, and MMSEG. None of them is perfect and each will sometimes be wrong. We just want to be right as often as possible. (In fact, for a search engine, an incorrect segmentation sometimes does not affect the search results.)

The following uses "研究生命起源" (study the origin of life) as an example to show how these algorithms segment a sentence:

First we need a dictionary. Suppose it contains these words:

研究 (study)

研究生 (graduate student)

生命 (life)

起源 (origin)

1. Forward maximum matching

As the name implies, matching proceeds in the forward direction, from the start of the sentence.

First take the first character, "研", and look it up in the dictionary: no match. Extend by one character to get "研究", look it up, and get a word: "研究" (study). Don't stop there. Extend to "研究生", look it up, and get another word: "研究生" (graduate student). Still don't stop. Take "研究生命" and look it up: no match. Keep going until you reach the longest length you consider reasonable. If you decide a word cannot be longer than 5 characters, you take "研究生命起", look it up, find nothing, and stop. Now choose among the matches: the longest is "研究生", so the first word is "研究生".

Then, starting from the next character, take "命", "命起" and "命起源" in turn and look them up. None is in the dictionary, so the single character "命" is taken as a word.

Next, starting from the following character, take "起" and "起源", look them up, and get "起源" (origin).

That completes the segmentation. The result is "研究生_命_起源", which is clearly wrong.
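The walkthrough above can be written as a short Python sketch. The function name `forward_max_match` and the `max_len` parameter are choices made for this illustration. The code tries the longest candidate first, which picks the same word as extending one character at a time and keeping the longest match.

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted even if the dictionary lacks it.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", dictionary))  # ['研究生', '命', '起源']
```

The output reproduces the wrong segmentation from the text: "研究生" (graduate student) swallows the character that "生命" (life) needed.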

2. Inverse (backward) maximum matching

The process is basically the same, except that "研究生命起源" is scanned from the end backward.

First take "源", "起源", "命起源", "生命起源" and "究生命起源" and look them up in the dictionary. The match is "起源" (origin).

Then take "命", "生命", "究生命" and "研究生命" and look them up. The match is "生命" (life).

Next take "究" and "研究", look them up, and get "研究" (study).

That completes the segmentation. The result is "研究_生命_起源", which is correct.

It is generally believed that inverse maximum matching is more accurate than forward maximum matching.
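The same idea scanning from the end of the sentence, again as a minimal Python sketch with made-up names (`backward_max_match`, `max_len`):

```python
def backward_max_match(text, dictionary, max_len=5):
    """Greedy right-to-left segmentation that prefers the longest dictionary word."""
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end`, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is accepted even if the dictionary lacks it.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from right to left
    return words


dictionary = {"研究", "研究生", "生命", "起源"}
print(backward_max_match("研究生命起源", dictionary))  # ['研究', '生命', '起源']
```

With the same four-word dictionary, scanning backward yields the correct "研究_生命_起源".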

3. MMSEG algorithm

The MMSEG algorithm is more complex than the two above, but its accuracy is relatively high, so it is widely adopted and has implementations in many languages, such as pymmseg for Python, mmseg4j for Java, and libmmseg for C++.

There are many introductions to this algorithm online, so to save effort I will not repeat them here. Interested readers can Google it.
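For readers who want a rough idea before searching, here is a simplified Python sketch of the chunk-based idea behind MMSEG as I understand its published description: at each position, enumerate every chunk of up to three consecutive words, prefer the chunk with the greatest total length, then the largest average word length, then the smallest variance of word lengths, and emit only the first word of the winning chunk. The original algorithm has a fourth rule based on single-character word frequencies, which is omitted here. All names are made up for this illustration; this is not the API of pymmseg, mmseg4j, or libmmseg.

```python
def candidate_words(text, start, dictionary, max_len):
    """Dictionary words beginning at `start`; a single character always qualifies."""
    found = [text[start:start + 1]]
    for size in range(2, min(max_len, len(text) - start) + 1):
        if text[start:start + size] in dictionary:
            found.append(text[start:start + size])
    return found


def chunks(text, start, dictionary, max_len, depth=3):
    """All sequences of up to `depth` consecutive words beginning at `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in candidate_words(text, start, dictionary, max_len):
        next_start = start + len(word)
        for rest in chunks(text, next_start, dictionary, max_len, depth - 1):
            result.append([word] + rest)
    return result


def chunk_key(chunk):
    """Ranking key: total length, then average length, then low variance."""
    lengths = [len(word) for word in chunk]
    total = sum(lengths)
    average = total / len(lengths)
    variance = sum((n - average) ** 2 for n in lengths) / len(lengths)
    return (total, average, -variance)


def mmseg_simplified(text, dictionary, max_len=5):
    """Emit the first word of the best chunk, then repeat from the next position."""
    words = []
    i = 0
    while i < len(text):
        best = max(chunks(text, i, dictionary, max_len), key=chunk_key)
        words.append(best[0])
        i += len(best[0])
    return words


dictionary = {"研究", "研究生", "生命", "起源"}
print(mmseg_simplified("研究生命起源", dictionary))  # ['研究', '生命', '起源']
```

On this example the chunks "研究_生命_起源" and "研究生_命_起源" tie on total length and average word length, and the variance rule picks the first one, so the forward scan no longer makes the "graduate student" mistake.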

Four. Problems facing Chinese word segmentation

Chinese word segmentation currently faces two main problems:

1. Ambiguity resolution

2. New word discovery

About ambiguity resolution:

Take "研究生命" (study life): should it be split into "研究_生命" (study_life) or "研究生_命" (graduate student_命)? The question is where the character "生" belongs. It can join "研究" to form "研究生" (graduate student), or join "命" to form "生命" (life). "研究生" and "生命" overlap on the character "生". This is crossing (overlapping) ambiguity. Another example is "中国人" (Chinese people), which can be split into "中国_人" (China_people) or kept whole as "中国人". Here the different ways of combining "中国" and "人" cause the ambiguity. This is combination ambiguity.
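Both kinds of ambiguity can be spotted mechanically by looking at how dictionary words overlap inside a piece of text. The Python sketch below is one simple way to do that, with made-up function names. It only detects ambiguity and does not resolve it: two words that partially overlap are reported as crossing ambiguity, and a word contained inside another is reported as combination ambiguity.

```python
def dictionary_spans(text, dictionary):
    """All (start, end, word) occurrences of dictionary words, ordered by start then end."""
    spans = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                spans.append((start, end, text[start:end]))
    return spans


def find_ambiguities(text, dictionary):
    """Return (crossing, combination) lists of word pairs."""
    crossing, combination = [], []
    spans = dictionary_spans(text, dictionary)
    for i, (s1, e1, w1) in enumerate(spans):
        for s2, e2, w2 in spans[i + 1:]:
            if s1 < s2 < e1 < e2:
                # Partial overlap: the two words fight over shared characters.
                crossing.append((w1, w2))
            elif (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2):
                # One word lies entirely inside the other.
                combination.append((w1, w2))
    return crossing, combination


print(find_ambiguities("研究生命", {"研究", "研究生", "生命"}))
# ([('研究生', '生命')], [('研究', '研究生')])
print(find_ambiguities("中国人", {"中国", "人", "中国人"}))
# ([], [('中国', '中国人'), ('中国人', '人')])
```

Note that "研究" inside "研究生" is also reported as a combination-type pair, because the dictionary used here contains both words.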

The algorithms above are all, in the end, attempts to resolve these two kinds of ambiguity.

By the principle from section Two, the granularity of search engine segmentation should be the smallest word that expresses a complete meaning. Combination ambiguity therefore should not arise, and the effort should go into resolving crossing ambiguity.

Ambiguity resolution should be adapted to the situation at hand. Different segmentation methods, such as syntactic analysis and statistics, can be combined.

For example, take "武夷山路" (Wuyishan Road). The dictionary contains the two words "武夷" and "山路" (mountain road), so both MMSEG and inverse maximum matching segment it as "武夷_山路". When a user types "武夷山路", the query is segmented the same way, so it matches the right results whether or not the segmentation is correct. However, people living in this city may expect a search for "武夷山" (Wuyi Mountain) to retrieve Wuyishan Road too (everyone there knows the city has no Wuyi Mountain, so someone who enters "武夷山" presumably wants results about the road). Why does the segmentation go wrong here? Because the word "山路" is in the dictionary! The interference comes from "路" (road), a character that acts only as a suffix. So I think suffix characters like this, along with similar ones such as "省" (province), "市" (city), "县" (county) and "镇" (town), should not take part in word formation during segmentation.

About new word discovery:

My time and energy are limited, so I have not yet looked into this area. In general, it can be approached by mining the users' search logs.

Five. Summary

1. A search engine needs Chinese word segmentation. Segmentation is applied both to the documents being indexed and to the queries users submit, and the two must be segmented consistently for the same text.

2. Segmentation granularity should be the smallest word that expresses a complete meaning. This avoids combination ambiguity.

3. Ambiguity resolution should be adapted to the specific situation, combining different segmentation methods where appropriate.

4. If the user's query is segmented the same way as the indexed text, a result can still be retrieved even when the segmentation is wrong. Some people therefore argue that once segmentation accuracy reaches a certain level, further gains have little visible effect on search quality, so there is no need to pursue accuracy blindly.
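Point 4 can be checked with a small experiment. The Python sketch below indexes one document with forward maximum matching, which segments it wrongly, and then queries it with the same segmenter and with a different one. The helper names (`max_match`, `index_docs`, `search`) and the one-document corpus are made up for this illustration.

```python
def max_match(text, dictionary, backward=False, max_len=5):
    """Greedy longest-match segmentation, scanning forward or backward."""
    words = []
    remaining = text
    while remaining:
        for size in range(min(max_len, len(remaining)), 0, -1):
            piece = remaining[-size:] if backward else remaining[:size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                remaining = remaining[:-size] if backward else remaining[size:]
                break
    return words[::-1] if backward else words


def index_docs(docs, segment):
    """Build term -> set of doc ids using the given segmenter."""
    index = {}
    for doc_id, text in docs.items():
        for term in segment(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, segment):
    """Ids of documents that contain every term of the segmented query."""
    postings = [index.get(term, set()) for term in segment(query)]
    return sorted(set.intersection(*postings)) if postings else []


dictionary = {"研究", "研究生", "生命", "起源"}
forward = lambda text: max_match(text, dictionary)
backward = lambda text: max_match(text, dictionary, backward=True)

docs = {1: "研究生命起源"}
index = index_docs(docs, forward)  # indexed as 研究生 / 命 / 起源 (wrong)

print(search(index, "研究生命起源", forward))   # [1]: same (wrong) segmentation
print(search(index, "研究生命起源", backward))  # []: segmenters disagree
print(search(index, "生命", forward))           # []: the error still costs recall
```

The full-phrase query succeeds only because it repeats the indexing error exactly. The shorter query "生命" (life) shows that wrong segmentation can still hurt recall, which is why accuracy matters up to a point.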

Six. Resources

1. Chinese word segmentation and search engines (I)

2. Chinese word segmentation and search engines (II)

3. Chinese word segmentation and search engines (III)

4. An introduction to Chinese word segmentation in search engines

5. Google


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
