Full-text search
I built a full-text search feature some time ago.
It uses Oracle Text to create the full-text indexes and perform full-text retrieval.
A paging stored procedure on the database side serves the full-text queries.
Information Extraction
With retrieval handled in the database, only a few issues remain for the presentation layer:
extracting and displaying the parts of each result that are most relevant to what the user is looking for,
highlighting the search keywords, and so on.
The text the user most wants to see is the text that carries the most information
and contains the highlighted search keywords.
How should we decide which text that is?
Clearly, by the smallest unit that carries information.
In a Chinese system, that smallest unit is the Chinese word.
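
As an illustration of the presentation-layer work described above, here is a minimal Python sketch. It is not the original implementation: the function names `best_snippet` and `highlight`, the fixed-width window, the scoring rule (total length of keyword hits) and the `<b>` tags are all assumptions made for the example.

```python
import re


def best_snippet(text, keywords, width=40):
    """Return the `width`-character window with the highest keyword score."""
    best_start, best_score = 0, -1
    for start in range(0, max(1, len(text) - width + 1)):
        window = text[start:start + width]
        # Assumed scoring rule: total length of all keyword hits in the
        # window, so longer keywords count for more.
        score = sum(window.count(k) * len(k) for k in keywords if k)
        if score > best_score:
            best_start, best_score = start, score
    return text[best_start:best_start + width]


def highlight(snippet, keywords, open_tag="<b>", close_tag="</b>"):
    """Wrap every keyword occurrence in the snippet with highlight tags."""
    # Longest keywords first, so a short keyword never splits a longer one.
    ordered = sorted((k for k in keywords if k), key=len, reverse=True)
    if not ordered:
        return snippet
    pattern = "|".join(re.escape(k) for k in ordered)
    # A single pass avoids re-matching inside tags that were already added.
    return re.sub(pattern, lambda m: open_tag + m.group(0) + close_tag, snippet)
```

The keywords passed in would be the words produced by the segmentation step, which is why the smallest information-carrying unit matters here.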
Chinese Word Segmentation
Chinese word segmentation has been studied extensively in China, with many published results.
Our system uses a very simple method:
dictionary-based reverse Chinese word segmentation.
The dictionary file can come from an existing Chinese dictionary,
which can be obtained from its website.
The matching algorithm is very simple.
Search systems are time-sensitive,
so only two- and three-character words are matched (the error from ignoring longer words is made up for later).
At each position, three characters are checked first; if they do not form a word, two characters are checked.
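
The matching loop described above can be sketched in Python as follows. This is an illustrative sketch, not the original code: the name `reverse_segment`, the use of a plain `set` as the dictionary, and the fallback of emitting a single character when nothing matches are assumptions.

```python
def reverse_segment(text, dictionary):
    """Segment `text` from right to left, trying 3-character words, then 2."""
    result = []
    end = len(text)
    while end > 0:
        for size in (3, 2):
            start = end - size
            if start >= 0 and text[start:end] in dictionary:
                break  # found a dictionary word ending at `end`
        else:
            # Assumed fallback: no dictionary word ends here, so emit
            # one character and move on.
            start = end - 1
        result.append(text[start:end])
        end = start
    result.reverse()  # segments were collected right to left
    return result
```

For example, with a dictionary containing 数据库, 全文 and 检索, the call `reverse_segment("数据库全文检索", dictionary)` splits the string into those three words.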
An auxiliary-word check is then added for the case where the three characters (ABC) do not form a word,
but both (AB) and (BC) do.
In this case, the split can be decided from the parts of speech of A and C.
My solution is to build an auxiliary-word dictionary
and look up the word class there.
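
One possible reading of this check, as a hedged Python sketch. The source does not spell out the decision rule, so the rule used here is an assumption: if A is an auxiliary word the split is A + BC, and if C is an auxiliary word the split is AB + C. The name `resolve_overlap` and the modelling of the auxiliary-word dictionary as a set of single characters are also hypothetical.

```python
def resolve_overlap(abc, dictionary, auxiliary):
    """Split a 3-character string when both AB and BC are dictionary words."""
    a, b, c = abc[0], abc[1], abc[2]
    ab, bc = abc[:2], abc[1:]
    if abc in dictionary:
        return [abc]
    if ab in dictionary and bc in dictionary:
        # Assumed rule: an auxiliary word stands alone, and the other
        # two characters form the word.
        if a in auxiliary:
            return [a, bc]
        if c in auxiliary:
            return [ab, c]
    # Otherwise behave like plain reverse matching: prefer the word on the right.
    if bc in dictionary:
        return [a, bc]
    if ab in dictionary:
        return [ab, c]
    return [a, b, c]
```

For instance, with 和服 and 服装 both in the dictionary and 和 listed as an auxiliary word, `resolve_overlap("和服装", dictionary, auxiliary)` yields 和 followed by 服装.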
Reverse word combination
To make up for the multi-character words lost to the speed-oriented segmentation above,
we reverse-combine the segmentation result.
Adjacent segments are combined, and the leftover single-character results are discarded.
After combination, the words are sorted in descending order of information content.
For example, the segmentation result (AB) (C) (DE)
is combined into (ABC) (CDE) (AB) (DE).
This also places long words, which generally carry more information, at the front.
In this way, the accuracy of the word segmentation is greatly improved.
The performance gain is also obvious:
In a search system, the keywords are generally not long.
The time needed to combine about 10 keywords
is far less than the cost of matching words longer than 4 characters against the dictionary.
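
The combination step can be sketched in Python as below. This is a sketch under assumptions, not the original code: word length stands in for the amount of information, single-character segments are dropped to match the (AB) (C) (DE) example, and the name `combine_segments` is hypothetical.

```python
def combine_segments(segments):
    """Combine adjacent segments and order candidates longest first."""
    # Join every pair of neighbours: (AB)(C)(DE) -> ABC, CDE.
    combined = [left + right for left, right in zip(segments, segments[1:])]
    # Keep the original multi-character segments; drop single characters.
    kept = [s for s in segments if len(s) > 1]
    candidates = list(dict.fromkeys(combined + kept))  # dedupe, keep order
    if not candidates:
        # Assumed edge case: a lone single-character keyword is kept as is.
        return list(segments)
    # Assumption: length approximates information content. sorted() is
    # stable, so equal-length words keep their original order.
    return sorted(candidates, key=len, reverse=True)


print(combine_segments(["AB", "C", "DE"]))  # ['ABC', 'CDE', 'AB', 'DE']
```

Because only neighbouring pairs are joined, the work grows linearly with the number of segments, which fits the point that combining about 10 keywords is cheap.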