Chinese Word Segmentation (reprint, part 2) | A Beijing Programmer

Source: Internet
Author: User
2. New Word Recognition
A new word is also called an unregistered (out-of-vocabulary) word, that is, a word that has not been included in the dictionary but can legitimately be called a word. The most typical case is a person's name. In the sentence "Wang junhu has gone to Guangzhou", "Wang junhu" is a word because it is a person's name, but it is difficult for a computer to identify. Adding "Wang junhu" to the dictionary as a word does not scale: there are countless names in the world and new ones appear every moment, so recording them all would be a huge project. Even if that work could be completed, problems would remain. For example, in a sentence where the same characters occur in sequence but the name is only "Wang Jun" (rendered in this translation as "Wang Jun and Hu HU"), can "Wang junhu" still be regarded as a word?
Besides personal names, organization names, place names, product names, trademark names, abbreviations, and elided forms are also new words that are hard to handle. These are exactly the words people use most often, so new word recognition in the segmentation system is very important for search engines. At present, the accuracy of new word recognition has become one of the important indicators for evaluating the quality of a word segmentation system.
Application of Chinese Word Segmentation
In natural language processing, Chinese-language processing has long lagged behind the processing of Western languages. Many methods developed for Western languages cannot be applied to Chinese directly, because Chinese text must first be split into words. Chinese word segmentation is the foundation of other Chinese information processing, and search engines are only one of its applications. Others, such as machine translation (MT), speech synthesis, automatic classification, automatic summarization, and automatic proofreading, also require word segmentation. The need for segmentation may hold back some research, but it also creates opportunities for some enterprises, because any foreign computer processing technology that wants to enter the Chinese market must first solve the problem of Chinese word segmentation. In Chinese-language research, Chinese people have an obvious advantage over foreigners.
Segmentation accuracy is very important for search engines, but a segmenter that is too slow is unusable no matter how accurate it is. A search engine has to process hundreds of millions of webpages, and if segmentation takes too long, the speed at which the engine updates its content suffers badly. For search engines, therefore, both accuracy and speed must meet high requirements. At present, most of those studying Chinese word segmentation are research institutions and universities. Tsinghua, Peking University, the Chinese Academy of Sciences, Beijing Language Institute, Northeastern University, IBM Research Institute, Microsoft Chinese Research Institute, and others all have their own research teams, whereas almost no commercial company truly specializes in Chinese word segmentation apart from Massive Technologies. Most of what the research institutions develop cannot be turned into products quickly, and the capacity of a single specialized company is limited. It seems there is still a long way to go before Chinese word segmentation technology can better serve more products.

Writing a simple Chinese word segmentation program

A few months ago, I found a Chinese lexicon (several hundred KB) on the Internet and wanted to write a word segmentation program with it. I have not studied Chinese word segmentation, so I wrote it based on my own ideas. If any experts are reading, comments are welcome.

I. Lexicon

The lexicon contains more than 50 thousand words (similar lexicons can be found through Google). An excerpt follows:

Region 82
Important 81
Xinhua News Agency 80
Technology 80
Meeting 80
Self 79
Cadre 78
Employee 78
Masses 77
No 77
Today 76
Comrade 76
Department 75
Enhancement 75
Organization 75

The first column is the word, and the second column is the weight. The word segmentation algorithm I wrote does not currently use the weight.
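As a rough illustration of the format described above, the following sketch reads "word weight" lines into a mapping. The function name `parse_lexicon` and the choice to split on the last run of whitespace are assumptions made for this example; the original lexicon file and its exact delimiter are not shown in the text.

```python
def parse_lexicon(lines):
    """Parse 'word weight' lines into a dict mapping word -> integer weight.

    Hypothetical helper: the exact file format is assumed, not taken
    from the original program.
    """
    lexicon = {}
    for raw in lines:
        # Split once from the right so entries that contain spaces,
        # such as "Xinhua News Agency 80", keep the word intact.
        parts = raw.strip().rsplit(None, 1)
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        word, weight = parts
        lexicon[word] = int(weight)
    return lexicon


sample = ["Region 82", "Important 81", "Xinhua News Agency 80", ""]
lexicon = parse_lexicon(sample)
print(lexicon)
```

Since the segmentation algorithm described here does not use the weight, only the keys of this mapping would be needed to build the dictionary.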

II. Design Ideas

Brief Algorithm Description:

Scan a string S from front to back and, at each character reached, look up the longest match in the dictionary. For example, suppose S = "我是中华人民共和国公民" ("I am a citizen of the People's Republic of China"), and the dictionary contains "中华人民共和国" (People's Republic of China), "中华" (China), "公民" (citizen), "人民" (people), "共和国" (republic), and other words. When the scanner reaches the character "中", it takes 1, 2, 3, ... characters starting from that position ("中", "中华", "中华人", "中华人民", "中华人民共", "中华人民共和", "中华人民共和国", "中华人民共和国公"). The longest of these strings that appears in the dictionary is "中华人民共和国", so the text is split there and the scanner advances to the character "公".
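The scan described above is commonly known as forward maximum matching. The sketch below is a minimal version, not the author's program: the function name `segment` and the `max_len` parameter are hypothetical, the example sentence is written in Chinese characters (我是中华人民共和国公民, "I am a citizen of the People's Republic of China"), and the dictionary holds only the five words named in the example. It tries the longest candidate first rather than growing from one character, which yields the same longest match.

```python
def segment(text, dictionary, max_len=None):
    """Forward maximum matching over `text` using a set of known words.

    Characters that start no dictionary word are emitted on their own.
    """
    if max_len is None:
        # Longest word in the dictionary bounds the candidate length.
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        # Try candidates from longest to shortest (length >= 2).
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)  # advance the scanner past the match
    return tokens


words = {"中华人民共和国", "中华", "公民", "人民", "共和国"}
print(segment("我是中华人民共和国公民", words))
# ['我', '是', '中华人民共和国', '公民']
```

Because the match is greedy, the shorter entries "中华", "人民", and "共和国" are never emitted here; they are shadowed by the longer word that starts at the same position.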

Data structure:

The choice of data structure has a significant impact on performance. I use a hashtable, _roottable, to record the dictionary. The key-value pairs are (key, number of inserts). For each word, if the word has N characters, then its first 1, 1~2, 1~3, ..., 1~N characters are each inserted into _roottable as keys. If the same key is inserted repeatedly, its value is incremented each time.
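A minimal sketch of such a prefix table follows, using Python's `collections.Counter` in place of the hashtable. The text here does not show how the table is queried, so the `longest_match` helper is only a guess at one plausible use: it extends the candidate one character at a time and stops as soon as the candidate is no longer a prefix of any word. Both function names, and the separate `words` set used to tell complete words from mere prefixes, are assumptions made for this example.

```python
from collections import Counter


def build_prefix_table(words):
    """Map every prefix of every word to the number of times it was inserted."""
    table = Counter()
    for word in words:
        for end in range(1, len(word) + 1):
            table[word[:end]] += 1  # prefixes 1, 1~2, ..., 1~N
    return table


def longest_match(text, start, table, words):
    """Return the longest dictionary word starting at `start`.

    Falls back to the single character at `start` when nothing matches.
    Assumed usage: the scan stops once the candidate is not a known prefix.
    """
    best = text[start]
    end = start + 1
    while end <= len(text) and text[start:end] in table:
        if text[start:end] in words:
            best = text[start:end]  # remember the longest complete word
        end += 1
    return best


words = {"中华人民共和国", "中华", "公民", "人民", "共和国"}
table = build_prefix_table(words)
print(table["中华"])     # 2: inserted for both 中华 and 中华人民共和国
print(table["中华人"])   # 1
print(longest_match("我是中华人民共和国公民", 2, table, words))
```

The benefit of storing prefixes is the early exit: at a character such as "我", which begins no word, the lookup fails on the first probe instead of testing every candidate length.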
