Dictionary-Based Reverse Chinese Word Segmentation in a Search System

Source: Internet
Author: User
Full-text search
A full-text search DD was created some time ago.
It uses Oracle Text to create the full-text indexes and perform full-text retrieval,
with a paging stored procedure on the database side that returns the retrieval results page by page.

Information Extraction
With retrieval handled in the database, only a few presentation-layer issues remain:
extracting and displaying the parts of a document that are most relevant to what the user asked for,
highlighting the search keywords, and so on.
The text the user most wants to see is the text that carries the most information.
So what should decide which excerpt is shown and which keywords are highlighted?
Obviously, it should be decided by the smallest unit that carries information.
In a Chinese system, that smallest unit is the Chinese word.
Chinese Word Segmentation
There has been a great deal of research, with many results, on Chinese word segmentation in China.
In our system, we use a very simple method:
dictionary-based reverse Chinese word segmentation.
For the dictionary file, an existing Chinese word dictionary can be used;
it can be obtained from its website.
Match Algorithm
The matching algorithm is very simple.
A search system has strict time requirements,
so only words of two or three characters are matched (we make up for the error of ignoring longer words later).
Three characters are tested each time; if they do not form a word, two characters are tested.
Then an auxiliary-word check is added for the case where the three characters (ABC) are not a word
but both (AB) and (BC) are words.
In this case, the parts of speech of A and C can decide which split to use.
My solution is to create an auxiliary word dictionary
and look up the word class in it.
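
The matching described above can be sketched in Python roughly as follows. This is a minimal sketch, not the original implementation: the function name `segment_reverse`, the use of in-memory sets for both dictionaries, and the exact tie-breaking rule (if C is an auxiliary word and A is not, split as (AB)(C); otherwise keep (A)(BC)) are assumptions made for illustration.

```python
def segment_reverse(text, dictionary, auxiliary=frozenset()):
    """Segment text from right to left, trying 3 characters, then 2, then 1."""
    tokens = []
    end = len(text)
    while end > 0:
        if end >= 3 and text[end - 3:end] in dictionary:
            size = 3
        elif end >= 2 and text[end - 2:end] in dictionary:
            size = 2  # BC is a word
            # Ambiguity: ABC is not a word, but AB and BC both are.
            if end >= 3 and text[end - 3:end - 1] in dictionary:
                a, c = text[end - 3], text[end - 1]
                # Assumed rule: an auxiliary word C stands alone, leaving AB
                # to be matched on the next pass.
                if c in auxiliary and a not in auxiliary:
                    size = 1
        else:
            size = 1  # unknown character: emit it on its own
        tokens.append(text[end - size:end])
        end -= size
    tokens.reverse()  # tokens were collected right to left
    return tokens


# Hypothetical toy dictionaries, for illustration only.
words = {"研究", "研究生", "生命", "起源", "学生", "生的"}
aux = {"的"}

print(segment_reverse("研究生命起源", words))   # ['研究', '生命', '起源']
print(segment_reverse("学生的", words))         # ['学', '生的']
print(segment_reverse("学生的", words, aux))    # ['学生', '的']
```

The last two calls show the effect of the auxiliary word dictionary: without it, the right-to-left scan takes (BC) first and strands A; with it, the auxiliary word is split off and (AB) survives.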
Reverse Word Combination
The segmentation step above trades accuracy for speed and therefore loses words longer than three characters.
To make up for this, we recombine the segmentation result:
adjacent segmentation results are combined, and the final word result is discarded.
After combination, the words are ordered in descending order by the amount of information they carry.
For example, the segmentation result (AB) (C) (DE)
is combined into (ABC) (CDE) (AB) (DE).
This also places long words (which generally carry more information) at the front.
In this way, the accuracy of word segmentation is greatly improved.
The performance gain is also obvious:
in a search system, the keywords are generally not long,
and the time needed to combine about 10 keywords
is far less than the cost of matching words of four or more characters against the dictionary.
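
The (AB) (C) (DE) example can be reproduced with a short Python sketch. Two points are assumptions here rather than statements from the text: single-character segments are dropped (the example output omits the lone C), and word length is used as a stand-in for "amount of information". The function name `recombine` is hypothetical.

```python
def recombine(segments):
    """Merge adjacent segments, then rank all candidates longest first."""
    # Join each segment with its right-hand neighbour:
    # (AB)(C) -> (ABC), (C)(DE) -> (CDE).
    merged = [left + right for left, right in zip(segments, segments[1:])]
    # Assumption: single-character segments are dropped, as in the example.
    kept = [s for s in segments if len(s) > 1]
    # Assumption: length approximates information content. sorted() is stable,
    # so candidates of equal length keep their left-to-right order.
    return sorted(merged + kept, key=len, reverse=True)


print(recombine(["AB", "C", "DE"]))  # ['ABC', 'CDE', 'AB', 'DE']
```

For n segments this performs only n - 1 string joins and one sort, which is consistent with the claim that recombining a handful of keywords is cheap compared with extra dictionary lookups.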
