Citi Cup Word Segmentation Technology
Xu Yabo
1. What is Word Segmentation technology?
Word segmentation is the technique a search engine uses to process the keyword string a user submits and then, based on that string, run the query with various matching methods.
2. Before introducing word segmentation, a brief introduction to search engine technology:
A search engine crawls a large amount of information, intelligently extracts and analyzes what it has retrieved, stores the processed information, sorts it, and answers queries quickly. "Engine" means that the system can not only store hundreds of millions of records but also handle a huge number of concurrent requests; only such a system can be called an engine. "Search" mainly involves the design of crawler programs. That work was done by Wang zhengdi and Zhou xuanyuan.
We are using Coreseek, an open-source Chinese search engine that includes the sorting algorithm, Chinese word segmentation, and other technologies. What I do is split the query text that users enter and optimize one step further on top of Coreseek.
Key technologies involved in search engines include:
1. Web Crawler
2. Chinese and English Word Segmentation Language Processing
3. Sorting Algorithm
4. Query/Storage Technology
A web crawler is a program that automatically fetches web pages; it downloads pages from the Internet for the search engine. A capable crawler uses multiple threads to crawl static pages, dynamic pages, special content, file objects, and so on. Word segmentation, as its name implies, divides user input into words and uses those words to build the index for queries. Word segmentation is therefore the prerequisite for query/storage technology.
For English, we can simply split on spaces and punctuation and the segmentation succeeds. For the East Asian languages represented by Chinese, however, word segmentation is not that simple. One might think of treating every character as a word, but the search results from that approach are obviously unsatisfactory. Finding a proper segmentation method is my main task in the Citi Cup. For a search engine, moreover, segmentation speed is more important than segmentation accuracy. Word segmentation requires a lexicon, and building the lexicon also depends on Eddie, who majors at the School of Management.
Sorting algorithms determine the order of importance of data such as web pages, images, and MP3 files. This work was handed over to Wang zhengdi; in fact Coreseek already does it, and he is responsible for that part.
Last is query/storage, which is also Coreseek's job. It mainly concerns how the system stores tens of billions of web pages and still provides sound, efficient search results. Data storage has two main parts: one is the web page data; the other is the dictionary index data, which covers a Chinese dictionary and English words. Each word corresponds to a queue of web page numbers.
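The word-to-page mapping described above can be pictured as a small inverted index. The following is only a minimal sketch, assuming the pages have already been segmented into words; the `build_index` name and the sample pages are hypothetical and do not reflect Coreseek's actual storage format.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to a sorted list ("queue") of page numbers containing it.

    `pages` is a dict of page number -> list of already-segmented words.
    """
    index = defaultdict(set)
    for page_id, words in pages.items():
        for word in words:
            index[word].add(page_id)
    # Sorted lists keep each page queue in a stable order.
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample data: three tiny pages, already split into words.
pages = {
    1: ["search", "engine", "crawler"],
    2: ["word", "segmentation", "engine"],
    3: ["search", "word", "index"],
}
index = build_index(pages)
print(index["engine"])  # [1, 2]
```

A real engine would keep these queues on disk and compress them; the dictionary here only shows the shape of the data.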
Seen this way, the search engine technology in front of me is quite simple. In fact, all I need to know about is the right half: there is an index database in front of me, a word-to-webpage index, while the user inputs a sentence rather than a word. Therefore, to look up specific web pages in the index, we must first segment the input. Of course, I can also use the Chinese lexicon, the English lexicon, and the other databases in the left half to support my segmentation algorithm. As for how the index database is built, although that is the most important part of a search engine, it has nothing to do with me, since I am studying word segmentation. (This database was in fact mentioned earlier: it is built mainly by crawling web pages and other objects and then applying content extraction and other techniques.)
In addition, I need to keep an eye on the sorting algorithm and the lexicon, but only on their updates, because those two parts are mainly the responsibility of Wang zhengdi and Eddie.
The following briefly illustrates a search for "Qu Bo search engine": first segment the query, then find the index queue of each word in the index database and perform a logical operation on those queues. The resulting sequence of web pages is the search result.
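The lookup just described can be sketched as follows. The text does not say which logical operation is applied, so this sketch assumes a logical AND (intersection) over the page queues; the `search` function, the sample index, and the way the example query is segmented are all hypothetical.

```python
def search(index, words):
    """Return the page numbers that contain every word (logical AND)."""
    result = None
    for word in words:
        queue = set(index.get(word, []))
        result = queue if result is None else result & queue
        if not result:
            break  # no page can match once the intersection is empty
    return sorted(result) if result else []


# Hypothetical word -> page-number index and an already-segmented query.
index = {
    "Qu Bo": [2, 5, 9],
    "search": [1, 2, 5, 7],
    "engine": [2, 3, 5],
}
print(search(index, ["Qu Bo", "search", "engine"]))  # [2, 5]
```

An OR over the queues (set union) would return more pages with looser matching; which one fits depends on how the results are ranked afterwards.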
3. Word Segmentation Algorithm
Basic operations for Chinese Word Segmentation:
1. If the user submits a string of no more than three characters, it is looked up in the index database directly. For strings of four or more characters, use separators such as spaces and punctuation marks to split the query string into several subquery strings.
2. Next, check whether the words provided by the user contain repeats; if so, the duplicates are discarded, keeping a single copy by default. Then check whether the submitted string contains letters and digits; if it does, treat the run of letters and digits as one word.
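The two basic operations above can be sketched roughly as below. This is an illustrative reading of the steps, not the project's actual code: the `preprocess` name, the `SHORT_QUERY_LIMIT` constant, and the choice to treat only ASCII letters and digits as "letters and numbers" are assumptions.

```python
import re

SHORT_QUERY_LIMIT = 3  # assumed: strings this short go straight to the index


def preprocess(query):
    """Split a query into subquery strings following the basic operations."""
    query = query.strip()
    if len(query) <= SHORT_QUERY_LIMIT:
        return [query] if query else []
    parts = []
    # Split on whitespace and punctuation; \w matches letters, digits,
    # underscore, and CJK characters, so everything else is a separator.
    for chunk in re.split(r"[^\w]+", query):
        # Treat each run of ASCII letters/digits as one word; the remaining
        # text stays as a subquery string for dictionary-based segmentation.
        parts.extend(p for p in re.split(r"([A-Za-z0-9]+)", chunk) if p)
    # Drop repeated words, keeping the first occurrence of each.
    return list(dict.fromkeys(parts))


print(preprocess("mp3 下载,mp3 搜索引擎"))  # ['mp3', '下载', '搜索引擎']
```

The Chinese pieces that come out of this step still need real segmentation; this only handles separators, duplicates, and alphanumeric runs.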
Generally, you can perform Chinese Word Segmentation in three ways:
1. String Matching
2. Semantic Analysis
3. Statistical Word Segmentation
Because the string matching method is mature, we use it for Chinese word segmentation, and we may combine it with statistical word segmentation. Because semantic analysis is not yet mature, we will probably not use that method.
String matching methods can be divided into:
1. Forward maximum matching
2. Reverse maximum matching
3. Shortest Path
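The first two methods in this list can be sketched in a few lines. This is a minimal illustration that assumes the lexicon is a plain Python set and that an unknown character falls back to a one-character word; the function names, the `max_len` default, and the sample lexicon are hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, always taking the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted as a fallback word.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, lexicon, max_len=4):
    """Scan right to left, always taking the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


# Hypothetical lexicon showing how the two directions can disagree.
lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", lexicon))  # ['研究生', '命', '起源']
print(reverse_max_match("研究生命起源", lexicon))  # ['研究', '生命', '起源']
```

The example shows why the direction matters: forward matching greedily takes the three-character word first and strands a single character, while reverse matching happens to recover the intended split.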
Here I will also systematically study the MMSeg word segmentation method, which is the Chinese word segmentation method adopted by Coreseek. I asked them about it, and they said they had made small changes on top of MMSeg. I am currently trying to contact the person responsible for word segmentation, but since he is not always on gtalk, I have not reached him yet. =.= Still, I have a lot to learn right now anyway, so it is not very urgent.
The MMSeg algorithm is easy to implement, and many implementations are available on the Internet. Moreover, it would be hard for me to create a new algorithm that is both faster and more accurate than MMSeg; after all, it is the result of sustained effort by many scholars, even though it is open source. I therefore plan to use the MMSeg algorithm as the basis. The basic goal is a Python implementation that meets our basic needs. Of course, I will try to read more articles and see whether I can find a way to improve on the MMSeg algorithm.
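As a starting point for such a Python implementation, here is a simplified sketch of the chunk-based idea behind MMSeg: at each position, enumerate chunks of up to three words and prefer the chunk with the greatest total length, then the greatest average word length, then the smallest variance in word length. It leaves out MMSeg's fourth rule (which needs character frequency data) and is not Coreseek's modified version; all function names and the sample lexicon are hypothetical.

```python
def candidates(text, start, lexicon, max_len):
    """All lexicon words beginning at `start`, plus the single character."""
    found = [text[start:start + 1]]
    for size in range(2, min(max_len, len(text) - start) + 1):
        if text[start:start + size] in lexicon:
            found.append(text[start:start + size])
    return found


def chunks(text, start, lexicon, max_len, depth=3):
    """Enumerate chunks of up to `depth` consecutive words from `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in candidates(text, start, lexicon, max_len):
        nxt = start + len(word)
        for rest in chunks(text, nxt, lexicon, max_len, depth - 1):
            result.append([word] + rest)
    return result


def score(chunk):
    """Rank by total length, then average length, then low variance."""
    lengths = [len(w) for w in chunk]
    total = sum(lengths)
    mean = total / len(lengths)
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return (total, mean, -variance)


def mmseg_sketch(text, lexicon, max_len=4):
    """Segment `text`, committing one word per best-scoring chunk."""
    words, i = [], 0
    while i < len(text):
        best = max(chunks(text, i, lexicon, max_len), key=score)
        words.append(best[0])  # keep only the first word of the best chunk
        i += len(best[0])
    return words


# Hypothetical lexicon; plain forward matching would pick "研究生" first.
lexicon = {"研究", "研究生", "生命", "起源"}
print(mmseg_sketch("研究生命起源", lexicon))  # ['研究', '生命', '起源']
```

Looking three words ahead is what lets this sketch avoid the greedy mistake of plain forward maximum matching, at the cost of enumerating more candidates per step.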
I'll stop here for now. Midterm exams are coming.. T..T
References:
Chinese Word Segmentation http://www.matrix67.com/blog/archives/4212
Research on the MMSeg Word Segmentation Algorithm http://www.0ssifrage.com/blog/archives/78
Mantalk semantic analysis http://www.matrix67.com/blog/archives/4870