Introduction to Chinese Word Segmentation in search engines

Source: Internet
Author: User

In "Differences and relationships between full-text search and search engines" we mentioned Chinese word segmentation, and "An analysis of the implementation principle of the double-array trie" described an efficient implementation of it. Next, let's move away from the formulas of the double-array trie and look at word segmentation at the conceptual level. Because English word segmentation is relatively simple, we focus on Chinese word segmentation here.

English is written in units of words, and words are separated by spaces. Chinese is written in units of characters, and all the characters in a sentence run together with no delimiter between words. Take the English sentence "I am a student", which in Chinese is "我是一个学生". Thanks to the spaces, a computer can easily tell that "student" is a word, but it cannot easily tell that the characters "学" and "生" only make sense when combined into the word "学生" (student). Cutting a Chinese character sequence into meaningful words is Chinese word segmentation. For example, "研究生命" can be segmented as "研究生/命" (graduate student / life) or as "研究/生命" (research / life). A human reader sees at once that the latter is correct, but this is quite difficult for a computer.

Existing word segmentation algorithms fall into three categories: string-matching-based, comprehension-based, and statistics-based.


What is Chinese Word Segmentation

What is word segmentation, and how does Chinese word segmentation differ from segmentation in other languages? Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a given specification. As the example above shows, English words are naturally delimited by spaces, whereas Chinese has explicit delimiters only for characters, sentences, and paragraphs, not for words. English also faces the problem of dividing phrases, but at the word level Chinese is much more complex and difficult than English.

Meanings and functions of Chinese Word Segmentation

To understand the meaning and function of Chinese word segmentation, we must mention intelligent computing. Research on intelligent computing draws on physics, mathematics, computer science, electronics and mechanics, communications, physiology, evolutionary theory, and psychology. Simply put, intelligent computing aims to make machines "think, hear, and talk" like humans, so that a computer could, for instance, decide the segmentation of the "research life" example above as quickly as a person does. To achieve this goal, machines must first understand human language; only then is communication between people and machines possible. In human language, "the word is the smallest meaningful language unit that can be used independently". For Chinese, therefore, identifying words is the first step toward understanding natural language. Only after this step can Chinese, like English, move on to phrase division, concept extraction, and topic analysis, and eventually to natural language understanding, the highest level of intelligent computing.

Mainstream search engines such as Google, Baidu, and Yahoo still match results by keyword. Fortunately, many companies have invested heavily in natural language search, and as research progresses, real human-machine interaction in which computers understand human language will no longer be out of reach. At this stage, English has already crossed the word segmentation step and leads in making use of words, with better results than Chinese in both information retrieval and topic analysis. The root cause is that Chinese must be segmented first. Only by overcoming this obstacle can Chinese hope to catch up with and surpass English in the information field. Chinese word segmentation therefore matters a great deal, and it affects everyone who uses Chinese.

Application of Chinese Word Segmentation

Chinese word segmentation is mainly used in information retrieval, human-computer interaction, information extraction, text mining, translation between Chinese and other languages, Chinese proofreading, automatic summarization, and automatic classification. The following uses information retrieval as an example.
The Internet has become part of everyday life over the past few years, and the information on it is expanding rapidly, with all kinds of content mixed together. Organizing this mass of information by hand is impossible. If Chinese text is processed without word segmentation, the results are too coarse to be usable. A classic example: "the manufacturing and service industries are two different industries" (制造业和服务业是两个不同的行业) and "the kimonos we export to Japan have increased compared with last year" (我们出口日本的和服比去年有所增长) are treated alike, because the characters of "kimono" (和服) also appear in the first sentence, spanning "and" (和) and "service" (服务). A search for "kimono" then retrieves both sentences. This may be tolerable when there is little information, but with a large amount of information such results become annoying. With word segmentation, machines can organize information more accurately and reasonably.
In "the manufacturing and service industries are two different industries", "kimono" is not treated as a word, so a search for "kimono" does not retrieve that sentence. The results become more accurate and efficiency improves greatly.
The application of Chinese word segmentation will therefore improve our lives and let people feel that technology truly serves them. Most current research on word segmentation focuses on general-purpose segmentation algorithms and on improving segmentation accuracy.
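
The kimono example can be sketched in a few lines of Python. This is only an illustration: the two Chinese sentences are assumed renderings of the English glosses above, the token lists are segmented by hand rather than by a real segmenter, and the names `docs`, `tokens`, `substring_search`, and `token_search` are made up for this sketch.

```python
# Two toy documents. The token lists are hand-segmented for illustration;
# a real system would obtain them from a word segmenter.
docs = {
    "d1": "制造业和服务业是两个不同的行业",
    "d2": "我们出口日本的和服比去年有所增长",
}
tokens = {
    "d1": ["制造业", "和", "服务业", "是", "两个", "不同", "的", "行业"],
    "d2": ["我们", "出口", "日本", "的", "和服", "比", "去年", "有所", "增长"],
}

def substring_search(query):
    """Retrieval without segmentation: match the raw character sequence."""
    return sorted(doc_id for doc_id, text in docs.items() if query in text)

def token_search(query):
    """Retrieval with segmentation: match whole words only."""
    return sorted(doc_id for doc_id, words in tokens.items() if query in words)

print(substring_search("和服"))  # ['d1', 'd2']
print(token_search("和服"))      # ['d2']
```

Matching on raw characters returns both documents, while matching on segmented words returns only the sentence that is actually about kimonos.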

Among current word segmentation algorithms, those with high accuracy tend to be slow, while the fast ones skip some of the more laborious language processing and are therefore less accurate.

Speed: from tens of KB up to the MB range

Segmentation accuracy: 80% ~ 98%

Chinese Word Segmentation overview and difficulties

Chinese word segmentation (Chinese Word Segmentation): splits a Chinese character sequence into separate words. A bad split can change the meaning entirely: if "one-time full 100 yuan" is divided into "one time / sexual intercourse / full / 100 / yuan", the segmentation is clearly not what we want. Another difficulty is recognizing unregistered (out-of-vocabulary) words. For example, if the dictionary does not contain "Arnold", how can the computer correctly identify this word? This challenge is not exclusive to Chinese; other languages such as English face it too.

Word segmentation specification: the concept of a word, and the splitting requirements of different applications

Word segmentation algorithm: ambiguity elimination and unregistered word recognition

Difficulties in the word segmentation specification

The definition of a word in Chinese, which is itself ambiguous

"Mayor of Changchun" (长春市长): "长春市长" as one word? "长春市/长" (Changchun City / head)? Or "长春/市长" (Changchun / mayor)?

Which words should be included in the core vocabulary?

Words with variant structures: "look / not / see", "do not believe"

Difficulties in word segmentation algorithms

● Elimination of segmentation ambiguity

-Intersection ambiguity (cross ambiguity): "combined"

"we / group / synthesize / hydrogen"; "combine / into / molecules" (in Chinese both sentences contain the same three-character string 组合成, which must be cut as 组/合成 in the first and as 组合/成 in the second)

-Combination ambiguity (covering ambiguity): "immediately"


-"Student union organizing activities": "student / meeting / organize / performance / activity" or "student union / organize / performance / activity"?

● Unregistered word recognition

-Named entities: numerals, person names, place names, institution names, transliterated names, times, and currency amounts

-Abbreviations and terminology: "female" and "SARS"

-New words: "Soy Sauce purple" and "star Tray"

● Whether to identify known words or unregistered words first

-Known words first: "Netanya / nonsense" (the last character of the name "Netanyahu" is wrongly attached to the following word)

-Unregistered words first: "victory depends / Yu Yong / qi" (part of "depends on courage" is wrongly taken as the person name "Yu Yong")

Common evaluation indicators

Recall rate (recall)
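
As a rough illustration of how recall can be computed for a segmenter, the sketch below compares a predicted segmentation with a reference ("gold") one by character span. Precision and F1 are included because they are the usual companions of recall, although the list above names only recall. The function names and the sample sentence are assumptions made for this sketch.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def evaluate(gold, predicted):
    """Return (precision, recall, f1) of a predicted segmentation.

    precision = correct words / words produced by the segmenter
    recall    = correct words / words in the reference segmentation
    """
    g, p = to_spans(gold), to_spans(predicted)
    correct = len(g & p)
    precision = correct / len(p)
    recall = correct / len(g)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

gold = ["研究", "生命", "的", "起源"]
pred = ["研究生", "命", "的", "起源"]
print(evaluate(gold, pred))  # (0.5, 0.5, 0.5)
```

A word counts as correct only when both of its boundaries agree with the reference, so one misplaced boundary costs two words here.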

Dictionary-based and rule-based methods

● Maximum matching

-Forward maximum match, reverse maximum match, and bidirectional maximum match

-Easy to implement and fast. However, covering ambiguity cannot be detected, and some complicated cross ambiguities are also missed.

The actual test results show that the accuracy of reverse maximum matching is higher than that of forward maximum matching.
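
A minimal sketch of forward and reverse maximum matching follows. The tiny dictionary, the `max_len` limit, and the sample sentence are assumptions chosen for illustration; unknown characters fall back to single-character words.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words

dictionary = {"研究", "研究生", "生命", "起源"}
text = "研究生命的起源"
print(forward_max_match(text, dictionary))   # ['研究生', '命', '的', '起源']
print(backward_max_match(text, dictionary))  # ['研究', '生命', '的', '起源']
```

On this sentence the forward scan greedily takes "研究生" and goes wrong, while the reverse scan gives the intended result. Bidirectional matching runs both and applies some tie-breaking rule when they disagree; that step is not shown here.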

● Full splitting

-Use dictionary matching to obtain all possible splitting results of a sentence.

-Very high time and space overhead.
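
Full splitting can be sketched as a simple recursion. The dictionary below is an assumption for illustration, and single characters are always allowed as words so that every input has at least one result.

```python
def all_segmentations(text, dictionary):
    """Enumerate every split of text into dictionary words or single characters."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if end == 1 or piece in dictionary:
            for rest in all_segmentations(text[end:], dictionary):
                results.append([piece] + rest)
    return results

dictionary = {"研究", "研究生", "生命"}
for seg in all_segmentations("研究生命", dictionary):
    print("/".join(seg))
# 研/究/生/命
# 研/究/生命
# 研究/生/命
# 研究/生命
# 研究生/命
```

Even this four-character input yields five candidates. The number of candidates can grow exponentially with sentence length, which is the time and space overhead mentioned above.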

● Comprehension-Based Word Segmentation Algorithm

-Simulates a person's understanding process and adds syntactic and semantic analysis to the word segmentation process to deal with ambiguity.

-It is difficult to organize the various kinds of linguistic information into a form that machines can read directly, so this approach is still at the experimental stage.

Rule-based disambiguation and unregistered word recognition

This step can also be handled during stemming, for example with Snowball. It is a good filter, but unfortunately the error rate of the current version is relatively high.

-Rule-based disambiguation

Condition find (R, next, x) {% x. ccat = ~ W} select 1

Condition find (L, near, x) {% x. YX = listen | believe | agree} select 1

Condition find (L, near, x) {% x. YX = if | if} select 2

Otherwise select 1

-Use rules to identify unregistered words

LocationName → PersonName LocationNameKeyword

LocationName → LocationName LocationNameKeyword

OrganizationName → OrganizationName OrganizationNameKeyword

OrganizationName → CountryName {d | dd} OrganizationNameKeyword
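
One way to read the rewrite rules above is as a table that merges adjacent tagged tokens. The sketch below assumes the input is already a list of (word, category) pairs; the tagging step, the `RULES` table layout, and the sample sentence are hypothetical, and the fourth rule with its {d | dd} part is left out.

```python
# Hypothetical rule table: (left category, right category) -> merged category.
RULES = {
    ("PersonName", "LocationNameKeyword"): "LocationName",
    ("LocationName", "LocationNameKeyword"): "LocationName",
    ("OrganizationName", "OrganizationNameKeyword"): "OrganizationName",
}

def apply_rules(tagged):
    """Merge adjacent (word, category) pairs left to right wherever a rule matches."""
    out = []
    for word, cat in tagged:
        if out and (out[-1][1], cat) in RULES:
            prev_word, prev_cat = out.pop()
            out.append((prev_word + word, RULES[(prev_cat, cat)]))
        else:
            out.append((word, cat))
    return out

tagged = [
    ("中山", "PersonName"),
    ("路", "LocationNameKeyword"),
    ("很", "Other"),
    ("长", "Other"),
]
print(apply_rules(tagged))
# [('中山路', 'LocationName'), ('很', 'Other'), ('长', 'Other')]
```

Here a person name followed by the keyword "路" (road) is merged into a place name that the dictionary does not need to contain.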

● N-gram model

● Hidden Markov Model (HMM)

For a random process there is a state sequence $\{x_1, x_2, \ldots, x_n\}$ and a sequence of observed values $\{y_1, y_2, \ldots, y_n\}$. A hidden Markov model can be written formally as a quintuple $(S, O, A, B, \pi)$, where:

$S = \{q_1, q_2, \ldots, q_N\}$: a finite set of state values

$O = \{v_1, v_2, \ldots, v_M\}$: a finite set of observed values

$A = \{a_{ij}\}$, $a_{ij} = P(x_{t+1} = q_j \mid x_t = q_i)$: transition probabilities

$B = \{b_{ik}\}$, $b_{ik} = P(y_t = v_k \mid x_t = q_i)$: output probabilities

$\pi = \{\pi_i\}$, $\pi_i = P(x_1 = q_i)$: initial state distribution
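
A common way to apply an HMM to segmentation, sketched here as an assumption rather than as the method the text has in mind, is to treat each character as an observation and tag it with a hidden state B, M, E, or S (begin, middle, end of a word, or single-character word), then decode the best state sequence with the Viterbi algorithm. All probabilities below are invented toy values; a real model would estimate them from a segmented corpus.

```python
import math

STATES = ["B", "M", "E", "S"]  # begin / middle / end of a word, single-character word

def log(p):
    """Logarithm that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, pi, A, B, default_emit=0.01):
    """Return the most probable state sequence for a non-empty observation list."""
    # score[t][s]: best log-probability of any path that ends in state s at time t
    score = [{s: log(pi.get(s, 0)) + log(B[s].get(obs[0], default_emit)) for s in STATES}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda r: score[t - 1][r] + log(A[r].get(s, 0)))
            score[t][s] = (score[t - 1][prev] + log(A[prev].get(s, 0))
                           + log(B[s].get(obs[t], default_emit)))
            back[t][s] = prev
    # A sentence must end with a complete word, so the last tag is E or S.
    state = max(["E", "S"], key=lambda s: score[-1][s])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]

def tags_to_words(text, tags):
    """Cut the text after every character tagged E or S."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

# Toy parameters (assumed values, not estimated from data).
pi = {"B": 0.6, "S": 0.4}
A = {
    "B": {"M": 0.2, "E": 0.8},
    "M": {"M": 0.3, "E": 0.7},
    "E": {"B": 0.5, "S": 0.5},
    "S": {"B": 0.5, "S": 0.5},
}
B = {
    "B": {"学": 0.6, "我": 0.1},
    "M": {},
    "E": {"生": 0.6, "们": 0.2},
    "S": {"我": 0.4, "是": 0.4},
}

text = "我是学生"
tags = viterbi(list(text), pi, A, B)
print(tags)                       # ['S', 'S', 'B', 'E']
print(tags_to_words(text, tags))  # ['我', '是', '学生']
```

The dictionaries `pi`, `A`, and `B` correspond to $\pi$, $A$, and $B$ in the quintuple; `STATES` plays the role of $S$, and the characters of the sentence are the observations drawn from $O$.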

N-gram splitting: splits a string sequence in units of N characters.

-For example, bigram splitting: "abcdefg" → "ab/cd/ef/g"

-Overlapping bigram: "abcdefg" → "ab/bc/cd/de/ef/fg"

-Simple and fast, but it produces a large number of meaningless index terms, which inflates the index files and increases both indexing and retrieval time. In addition, because the splitting unit is not a word in the linguistic sense, retrieval precision also drops.
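
Both splitting schemes fit in a few lines. The function names are made up for this sketch.

```python
def ngram_split(text, n=2):
    """Cut text into consecutive, non-overlapping chunks of n characters."""
    return [text[i:i + n] for i in range(0, len(text), n)]

def overlapping_ngrams(text, n=2):
    """Slide a window of n characters over text one character at a time."""
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print("/".join(ngram_split("abcdefg")))         # ab/cd/ef/g
print("/".join(overlapping_ngrams("abcdefg")))  # ab/bc/cd/de/ef/fg
```

The overlapping variant produces almost one index term per character, which shows where the extra index size comes from.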

Segmentation accuracy and retrieval performance are not tightly coupled, for three reasons:

1. The same word segmentation algorithm is used for queries and for documents. A word that is split incorrectly in a document is split incorrectly in the same way in the query, so even though the segmentation is wrong, the two identical errors still match and the document is retrieved correctly;

2. Some words are wrongly split into several parts. This lowers segmentation accuracy, but for retrieval the correct results can still be obtained by merging the results for the parts, so the error does not affect search performance;

3. Measured segmentation accuracy is not absolute, because it depends on the standard answer and therefore on the definition of a word. Some strings that the standard answer says should be split are in fact better left unsplit for accurate retrieval. For example, "national / internal" vs "domestic", or "DPP group" vs "Minjin / party group" vs "DPP / group"

1. Word segmentation algorithms must be fast.
Web search in particular demands real-time responses, so word segmentation, as the basis of Chinese information processing, must take as little time as possible.

2. Higher segmentation accuracy does not necessarily improve search performance.
Once segmentation reaches a certain level of accuracy, its impact on Chinese information retrieval (CIR) is no longer significant, and it is not the performance bottleneck of CIR. One-sided pursuit of a highly accurate segmentation algorithm is therefore not well suited to large-scale Chinese information retrieval. When speed and precision cannot both be satisfied, we need to find a proper balance between the two.

3. Splitting granularity can still follow the longest-word-first rule, but it needs follow-up work at the query expansion level.
In information retrieval, the segmentation algorithm only needs to focus on eliminating cross ambiguity. Covering ambiguity can be handled with a secondary index of the dictionary and with query expansion.

4. For unregistered word recognition, precision is more important than recall.
Make sure as far as possible that the unregistered words that are combined are correct, and avoid producing wrong ones. If single characters are wrongly combined into an unregistered word, the corresponding documents may not be retrieved.
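
Point 3 can be pictured with a small sketch. It assumes that the "secondary index of the dictionary" maps each short word to the longer dictionary words that contain it; the text does not spell out the structure, so this layout, the toy inverted index, and all names below are hypothetical.

```python
from collections import defaultdict

dictionary = ["北京", "大学", "北京大学"]

# Assumed secondary index: short word -> longer dictionary words that contain it.
contained_in = defaultdict(set)
for long_word in dictionary:
    for short_word in dictionary:
        if short_word != long_word and short_word in long_word:
            contained_in[short_word].add(long_word)

# Toy inverted index built with longest-word-first segmentation:
# document d1 contains the single token "北京大学", d2 contains "大学".
inverted = {"北京大学": {"d1"}, "大学": {"d2"}}

def search(term):
    """Look up the term itself, then expand to longer words that contain it."""
    hits = set(inverted.get(term, set()))
    for longer in contained_in.get(term, ()):
        hits |= inverted.get(longer, set())
    return sorted(hits)

print(search("大学"))  # ['d1', 'd2']
print(search("北京"))  # ['d1']
```

Documents are indexed only under the longest word, and the covering ambiguity is resolved at query time by expanding the query instead of by the segmenter.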
