SWJ Small Talk: The Function and Outline of Chinese Word Segmentation

Last Update:2014-12-22 Source: Internet

Author: User

Hello everyone, this is Shanghai SEO (SWJ) again. Some time ago a reader asked me about word segmentation, in particular Chinese word segmentation and how Baidu uses it. A while back SWJ wrote two articles on segmentation technology; if you have not read them yet, I recommend them.

The two articles are:

1. "What is Chinese word segmentation, and how does it help SEO?" http://www.seo-sh.cn/seo/196.html

2. "The role of word segmentation in SEO applications" http://www.seo-sh.cn/zhishi/jishu/103.html

Next, let us go through word segmentation technology in detail. This article is one SWJ found online, with some changes and additions of my own.

With the rapid growth of information, search engines have become the tool of choice for finding it. Large search engines such as Google, Baidu, Yahoo, and NetEase's recently launched Youdao are constantly under discussion.

As the search market grows in value, more and more companies are developing their own search engines, such as Alibaba's business-opportunity search and 8848's shopping search. Naturally, search engine technology has become one of the hot spots of technical attention.

Research on search engine technology began nearly ten years earlier abroad than in China. From the earliest Archie to the later Excite, AltaVista, Overture, Google, and others, search engines have more than ten years of history, while research in China only began around the turn of the century. In many fields foreign products and technologies dominate, especially where a technology was studied abroad years before it was studied in China: operating systems, word processors, browsers, and so on. Search engines are an exception. Although research began much earlier abroad, China has still produced excellent search engines, such as Baidu (http://www.baidu.com) and the recently launched Youdao (http://www.youdao.com). In Chinese-language search, domestic engines now perform about as well as foreign ones.

SWJ believes their technical capability still trails the advanced level abroad in some respects, but the gap is slowly narrowing. One important reason search engine technology has developed this way is word segmentation: Chinese and English are written differently.

What is Chinese word segmentation?

As we all know, English is written in units of words, and words are separated by spaces. Chinese is written in units of characters, and the characters of a sentence must be read together to convey a meaning.

For example, the English sentence "I am a student" is 我是一个学生 in Chinese. A computer can easily tell from the spaces that "student" is a word, but it cannot easily tell that the two characters 学 and 生 together form one word (学生, "student"). Dividing a sequence of Chinese characters into meaningful words is Chinese word segmentation, which some people also call word cutting.

For 我是一个学生, the segmentation result is: 我 / 是 / 一个 / 学生 (I / am / a / student).
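The contrast can be seen in a few lines of Python. This is only an illustration: the Chinese sentence 我是一个学生 is the usual rendering of "I am a student", and the hand-written word list is the segmentation a human reader would produce, not the output of any segmenter.

```python
# Splitting on whitespace recovers English words but not Chinese ones.
english = "I am a student"
chinese = "我是一个学生"  # "I am a student"

print(english.split())  # ['I', 'am', 'a', 'student']
print(chinese.split())  # ['我是一个学生'] (one undivided chunk)

# The segmentation a human reader would produce, written out by hand.
expected_words = ["我", "是", "一个", "学生"]
assert "".join(expected_words) == chinese
```

Everything that follows is about producing a list like `expected_words` automatically.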

Chinese word segmentation and its impact on search engines

What impact does Chinese word segmentation have on search engines? For a search engine, the most important thing is not to find every result, because finding every result among tens of billions of pages means little when nobody can read them all. The most important thing is to put the most relevant results first, which is called relevance ranking. The accuracy of Chinese word segmentation often directly affects the relevance ranking of search results. The author recently helped a friend look for information on the Japanese kimono, typed 和服 ("kimono") into a search engine, and found many problems in the results.

Small Talk: Chinese word segmentation technology

Chinese word segmentation belongs to the field of natural language processing. Given a sentence, a person can use their own knowledge to tell which character sequences are words and which are not. But how can a computer be made to understand this too? The procedure that does so is the word segmentation algorithm.

Existing segmentation algorithms fall into three categories: segmentation based on string matching, segmentation based on understanding, and segmentation based on statistics.

1. Segmentation method based on string matching

This method is also called mechanical segmentation. It matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to some strategy; if a string is found in the dictionary, the match succeeds and a word is identified. By scanning direction, string-matching methods divide into forward matching and reverse matching. By length preference, they divide into maximum (longest) matching and minimum (shortest) matching. By whether they are combined with part-of-speech tagging, they divide into pure segmentation methods and integrated methods that combine segmentation with tagging. Several commonly used mechanical segmentation methods are:

1) Forward maximum matching (scanning from left to right);

2) Reverse maximum matching (scanning from right to left);

3) Minimum segmentation (minimizing the number of words cut from each sentence).
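The first two methods can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a production segmenter: the function names and the five-entry `TOY_DICT` are made up for illustration, and the test string 化妆和服装 ("makeup and clothing") is chosen because the two scanning directions disagree on it.

```python
def forward_max_match(text, dictionary):
    """Scan left to right, always taking the longest dictionary word."""
    max_len = max(map(len, dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary):
    """Scan right to left, always taking the longest dictionary word."""
    max_len = max(map(len, dictionary), default=1)
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]  # words were collected from the end of the sentence


# Toy dictionary (an assumption for this example, not a real lexicon).
TOY_DICT = {"化妆", "和", "和服", "服装", "装"}

print(forward_max_match("化妆和服装", TOY_DICT))  # ['化妆', '和服', '装']
print(reverse_max_match("化妆和服装", TOY_DICT))  # ['化妆', '和', '服装']
```

On this input the forward scan greedily takes 和服 ("kimono") and strands 装, while the reverse scan recovers 化妆 / 和 / 服装 ("makeup / and / clothing").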

These methods can also be combined; for example, forward maximum matching and reverse maximum matching can be combined into a bidirectional matching method. Because a single Chinese character can itself be a word, forward minimum matching and reverse minimum matching are seldom used. Generally speaking, reverse matching is slightly more accurate than forward matching and produces fewer ambiguities. Statistics show that forward maximum matching alone has an error rate of 1/169, and reverse maximum matching alone has an error rate of 1/245. This accuracy is still far from meeting practical needs. Practical segmentation systems use mechanical segmentation as a first pass and then use various other linguistic information to improve accuracy further.
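A bidirectional matcher might look like the sketch below. The article does not say how the two results are reconciled, so the tie-breaking rules here (fewer words, then fewer single-character words, then the reverse result) are an assumption based on a commonly used heuristic; the function names and `TOY_DICT` are likewise hypothetical.

```python
def max_match(text, dictionary, reverse=False):
    """Greedy maximum matching in either scanning direction."""
    max_len = max(map(len, dictionary), default=1)
    words = []
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            piece = text[-size:] if reverse else text[:size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                text = text[:-size] if reverse else text[size:]
                break
    return words[::-1] if reverse else words


def bidirectional_match(text, dictionary):
    """Run both directions and pick one result with simple heuristics."""
    fwd = max_match(text, dictionary)
    rev = max_match(text, dictionary, reverse=True)
    if fwd == rev:
        return fwd

    def cost(words):
        # Assumed heuristic: fewer words, then fewer single-character words.
        return (len(words), sum(len(w) == 1 for w in words))

    # On a full tie, prefer the reverse result, which the text above
    # describes as slightly more accurate in general.
    return fwd if cost(fwd) < cost(rev) else rev


# Toy dictionary (an assumption for this example, not a real lexicon).
TOY_DICT = {"化妆", "和", "和服", "服装", "装"}
print(bidirectional_match("化妆和服装", TOY_DICT))  # ['化妆', '和', '服装']
```

Here both directions yield three words with one single-character word each, so the tie falls to the reverse result.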

One approach is to improve the scanning method, known as feature scanning or marker segmentation: first identify and cut out words with obvious features in the string to be analyzed, use them as breakpoints to divide the original string into smaller strings, and then apply mechanical segmentation to each, which reduces the matching error rate. Another approach is to combine segmentation with part-of-speech tagging: use the rich part-of-speech information to help segmentation decisions, and in turn use the tagging process to check and adjust the segmentation results, which greatly improves accuracy.

A general model can be established for mechanical segmentation methods. There are academic papers devoted to this, so it is not elaborated here.

2. Segmentation method based on understanding

This method has the computer simulate a person's understanding of the sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis during segmentation and to use the syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Coordinated by the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguities; that is, it simulates the process by which a person understands a sentence. This kind of segmentation requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is so general and complex, it is difficult to organize it into a form that machines can read directly, so understanding-based segmentation systems are still at the experimental stage.

3. Segmentation method based on statistics

Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently therefore reflects fairly well how credible it is that they form a word. One can count how often each combination of adjacent characters occurs in a corpus and compute their mutual information. The mutual information of two Chinese characters X and Y is defined from the probability that they appear adjacently, and it reflects how tightly the characters are bound together. When the tightness exceeds a certain threshold, the character group can be considered a possible word. This method only needs to count character-group frequencies in the corpus and needs no segmentation dictionary, so it is also called dictionary-free segmentation or the statistical word extraction method.
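The text does not reproduce the formula, so the sketch below assumes the usual pointwise mutual information, I(X, Y) = log2( P(XY) / (P(X) P(Y)) ), with probabilities estimated from raw counts. The four-sentence corpus, the threshold value, and the function names are all made up for illustration.

```python
import math
from collections import Counter


def mutual_information(corpus):
    """Pointwise mutual information for every adjacent character pair."""
    chars, pairs = Counter(), Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(zip(sentence, sentence[1:]))
    n_chars, n_pairs = sum(chars.values()), sum(pairs.values())
    scores = {}
    for (x, y), count in pairs.items():
        p_xy = count / n_pairs
        p_x, p_y = chars[x] / n_chars, chars[y] / n_chars
        scores[x + y] = math.log2(p_xy / (p_x * p_y))
    return scores


def candidate_words(scores, threshold):
    """Character pairs bound tightly enough to be treated as possible words."""
    return sorted(pair for pair, score in scores.items() if score >= threshold)


# Toy corpus (an assumption; a real system would use a large corpus).
corpus = ["我是学生", "学生来了", "他是老师", "学生很多"]
scores = mutual_information(corpus)
for pair in ("学生", "是学"):
    print(pair, round(scores[pair], 2))
print(candidate_words(scores, threshold=2.5))
```

In this corpus 学生 ("student") scores higher than the accidental neighbor pair 是学. On a corpus this small, pairs seen only once also get inflated scores, which is one reason real systems need large corpora and carefully chosen thresholds.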

But this method also has limitations. It often extracts character groups that co-occur frequently but are not words, such as 这一 ("this one"), 之一 ("one of"), 有的 ("some"), 我的 ("my"), and 许多的 ("many"); its recognition accuracy for common words is poor; and its time and space overhead is large. Practical statistical segmentation systems use a basic dictionary of common words for string-matching segmentation and at the same time use statistical methods to identify new words. Combining string frequency statistics with string matching exploits the speed and efficiency of matching-based segmentation as well as the ability of dictionary-free segmentation to recognize new words in context and resolve ambiguity automatically.

There is currently no conclusion as to which segmentation algorithm is most accurate. No mature segmentation system can rely on a single algorithm; all need to combine different algorithms. As far as the author knows, the segmentation algorithm of Hylanda Technology (海量科技) uses a "compound segmentation method". "Compound" here has the same sense as a compound prescription in traditional Chinese medicine, which combines different drugs to treat a disease; likewise, recognizing Chinese words requires several algorithms to handle different problems.

Problems in word segmentation

With mature segmentation algorithms, can the problem of Chinese word segmentation be solved easily? Far from it. Chinese is a very complex language, and it is even harder for a computer to understand. Two major problems in Chinese word segmentation have still not been fully solved.

1. Ambiguity recognition

Ambiguity means that the same sentence may be segmented in two or more ways. For example, in 表面的, both 表面 ("surface") and 面的 (a colloquial word for a minivan taxi) are words, so the phrase can be divided into 表面 / 的 or 表 / 面的. This is called cross ambiguity. Cross ambiguity like this is very common; the 和服 ("kimono") example above is in fact an error caused by cross ambiguity.

化妆和服装 ("makeup and clothing") can be divided into 化妆 / 和 / 服装 ("makeup / and / clothing") or 化妆 / 和服 / 装 ("makeup / kimono / wear"). Without human knowledge to interpret it, a computer has difficulty knowing which segmentation is right.

Cross ambiguity is relatively easy to handle compared with combinatorial ambiguity, which must be judged from the whole sentence. For example, in the sentence 这个门把手坏了 ("This door handle is broken"), 把手 ("handle") is a word, but in 请把手拿开 ("Please take your hand away"), 把手 is not a word. In 将军任命了一名中将 ("The general appointed a lieutenant general"), 中将 ("lieutenant general") is a word, but in 产量三年中将增长两倍 ("Output will grow twofold within three years"), 中将 is no longer a word. How can a computer tell these apart?

Even if a computer can resolve cross ambiguity and combinatorial ambiguity, one more problem remains: true ambiguity. True ambiguity means that even a person, given the sentence, cannot tell which parts should be words. For example, 乒乓球拍卖完了 can be segmented as 乒乓 / 球拍 / 卖 / 完 / 了 ("the table tennis paddles are sold out") or as 乒乓球 / 拍卖 / 完 / 了 ("the table tennis auction is over"). Without other sentences for context, probably nobody can say whether 拍卖 ("auction") counts as a word here.
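Ambiguity can be made concrete by listing every segmentation that a dictionary allows. The sketch below does this for the table tennis sentence, 乒乓球拍卖完了; the function name and the small `TOY_DICT` are assumptions for illustration, and the exhaustive recursion is only practical for short strings.

```python
def all_segmentations(text, dictionary):
    """Return every way to split `text` into words from `dictionary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            for tail in all_segmentations(text[end:], dictionary):
                results.append([head] + tail)
    return results


# Toy dictionary (an assumption for this example, not a real lexicon).
TOY_DICT = {"乒乓", "乒乓球", "球拍", "拍卖", "卖", "完", "了"}

for words in all_segmentations("乒乓球拍卖完了", TOY_DICT):
    print(" / ".join(words))
# 乒乓 / 球拍 / 卖 / 完 / 了
# 乒乓球 / 拍卖 / 完 / 了
```

Both outputs use only dictionary words, so the dictionary alone cannot decide between them; a real system needs context or statistics to choose.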

2. Recognition of new words

New words are called unregistered words in technical terms: words that are not included in the dictionary but really are words. The most typical are personal names. A person easily understands that in the sentence 王军虎去广州了 ("Wang Junhu went to Guangzhou"), 王军虎 is a word because it is a person's name, but it is hard for a computer to recognize. If 王军虎 were entered in the dictionary as a word, there are so many names in the world, with new ones appearing all the time, that including them all would be an enormous project. Even if that work could be completed, problems would remain: in the sentence 王军虎头虎脑的 ("Wang Jun is sturdy and lively"), can 王军虎 still count as a word?

Besides personal names, new words include names of organizations, places, products, and trademarks, as well as abbreviations and elliptical forms. All of these are hard to handle, and they are exactly the words people use often, so for a search engine, new word recognition in the segmentation system is very important. The accuracy of new word recognition has become one of the key indicators for judging a segmentation system.
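One simple way to see the problem, and a crude first step toward handling it, is to look at what dictionary matching leaves behind. When a name such as 王军虎 (Wang Junhu) is missing from the dictionary, a matcher falls back to single characters. The sketch below groups runs of such leftover characters into candidate new words. This heuristic is an assumption for illustration, not a method described in the article; the function name, the dictionary, and the hand-written input list are hypothetical.

```python
def unknown_word_candidates(words, dictionary):
    """Group consecutive out-of-dictionary single characters into candidates."""
    candidates, run = [], ""
    for word in words + [""]:  # the empty sentinel flushes the final run
        if len(word) == 1 and word not in dictionary:
            run += word
        else:
            if len(run) >= 2:
                candidates.append(run)
            run = ""
    return candidates


# Toy dictionary without the personal name (an assumption for this example).
TOY_DICT = {"去", "广州", "了"}

# What greedy dictionary matching yields for 王军虎去广州了 when the name
# is not in the dictionary: the name falls apart into single characters.
segmented = ["王", "军", "虎", "去", "广州", "了"]

print(unknown_word_candidates(segmented, TOY_DICT))  # ['王军虎']
```

The heuristic is easy to fool: it cannot tell a name from any other run of unmatched characters, and in 王军虎头虎脑的 it has no way of knowing that the name stops at 王军. Practical systems add statistics and name-specific rules on top.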

The application of Chinese word segmentation

In natural language processing, Chinese processing technology currently lags far behind that for Western languages. Many Western-language processing methods cannot be applied to Chinese directly, because Chinese requires the extra step of segmentation. Chinese word segmentation is the foundation of other Chinese information processing; search engines are only one application. Others, such as machine translation (MT), speech synthesis, automatic classification, automatic summarization, and automatic proofreading, also need segmentation. Because Chinese needs segmentation, some research may be held back, but it also brings opportunities for some companies: for foreign computer processing technology to enter the Chinese market, it must first solve Chinese word segmentation. In research on the Chinese language, Chinese people have a clear advantage over foreigners.

Segmentation accuracy is very important for search engines, but if segmentation is too slow, even high accuracy is of no use to a search engine. A search engine has to process hundreds of millions of pages, and if segmentation takes too long, it seriously slows the rate at which the engine's content is updated. For search engines, therefore, both the accuracy and the speed of segmentation must meet high requirements. At present, Chinese word segmentation is studied mostly by research institutions: Tsinghua University, Peking University, the Chinese Academy of Sciences, Beijing Language Institute, Northeastern University, IBM Research, Microsoft Research China, and others have their own teams. Among commercial companies, apart from Hylanda Technology (海量科技), almost none specialize in Chinese word segmentation. Most of the technology developed by research institutions cannot be turned into products quickly, and a single specialized company has limited capacity, so Chinese word segmentation technology still has a long way to go before it can serve more products.

All right, that is all from SWJ for now. Once you fully understand word segmentation technology, you will do better in every aspect of SEO.

But it is neither decisive nor the only factor, so please do not get hung up on it. Build more sites in earnest, accumulate experience, and keep learning!

Original from: Shanghai SEO http://www.seo-sh.cn/seo/223.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
