In Ruby, how does one perform Chinese Word Segmentation search?
I am currently working through this problem. The goal is to use Ferret, the Ruby port of the Lucene search engine. Having a little Lucene experience, I like Ferret very much: building on Lucene-style word segmentation, indexing, and search, many interesting features can be made. But Ferret presents two difficulties.
- Chinese word segmentation is not supported, and integrating a third-party segmenter is difficult for me (I have no experience in C development, and the Chinese word segmentation algorithms I know of are implemented in Java).
- Indexing errors sometimes occur, apparently because a segment address is incorrect. Because Ferret builds the index in the front end, the error also surfaces in the front-end FastCGI process.
Original idea
Take advantage of the index compatibility between Ferret and Lucene: write a separate Java class library that indexes resources on a regular schedule. However, to improve efficiency (its indexing is faster than Lucene's), the new version of Ferret is no longer compatible with Lucene, so this idea is not feasible.
Updated ideas
Use the MySQL Chinese word segmentation plug-in. It is a good tool, but it only supports MySQL 4.0 and MySQL 5.1 beta, not the MySQL 5.0.x I currently use. Because no ready-made version is available, I had to give it up. (The plug-in can also only be used for full-text search, unlike Ferret, which has other interesting features.)
Current ideas:
- In Ferret, use rjb to call Lucene's Chinese word segmentation, then index the result.
- Add an attribute named indexed to every model that should be indexed, and set indexed to false whenever a record is created or updated.
- Wrap the Java Chinese and foreign-language word segmentation as a DRb service through rjb, or simply deploy it on Resin and call it over HTTP.
- In the background, a separate program periodically selects all records that are not yet indexed (indexed = false) and calls the word segmentation service described above on them one by one. If an error occurs during this process, the front-end page is not affected. With this approach, only Ferret's Ruby code needs to be modified.
- If possible, the search program also uses the remote service to segment the query keywords before searching.
Disadvantages
This solution looks ugly, like a piece of clothing with patches everywhere. I currently have no time to study the C code in Ferret, so I have to take a detour through Java.
If indexing runs in the background, a newly posted article cannot be searched in real time; there is a delay. Still, I think this is an acceptable trade-off. In addition, even if an error occurs in the background indexer, the front-end page is not affected; the only consequence is that some records cannot be searched.
Segmenting keywords through a remote call may slow down searches, but the impact should be very small.