Ideas for supporting Chinese search in Ruby

Source: Internet
Author: User

In Ruby, how does one implement search with Chinese word segmentation?

I am still working through this problem. The goal is to use Ferret, a Ruby port of the Lucene search engine. With a little Lucene experience behind me, I like Ferret very much: with Lucene-style word segmentation, indexing, and search functions, many interesting features can be built. However, Ferret presents two difficulties.

  1. Chinese word segmentation is not supported, and integrating a third-party segmenter is difficult for me (I have no experience with C development, and the Chinese word segmentation algorithms I know of are written in Java).
  2. Indexing errors sometimes occur, apparently because a segment address is incorrect. Because Ferret builds the index in the front end, the error also surfaces in the front-end FastCGI process.

Original idea

The original idea was to exploit the compatibility between Ferret and Lucene: write a separate Java class library that indexes resources on a regular schedule. However, to improve efficiency (its indexing is faster than Lucene's), the new version of Ferret is no longer compatible with Lucene, so this idea is not feasible.

Updated idea

The next idea was to use the MySQL Chinese word segmentation plug-in. It is a good tool, but it supports only MySQL 4.0 and the MySQL 5.1 beta, not the MySQL 5.0.x that I currently use. Because no ready-made version was available, I had to give this up. (The plug-in can also only be used for full-text search; unlike Ferret, it offers no other interesting functions.)

Current idea

  1. Keep Ferret, and use rjb to call Lucene's Chinese word segmentation before indexing.
  2. Add an attribute, indexed, to every model that is indexed, and set indexed to false whenever a record is created or updated.
  3. Package the Chinese and foreign-language word segmentation written in Java as a DRb service through rjb, or simply deploy it on Resin and call it over HTTP.
  4. In the background, a separate program periodically selects all records that have not been indexed (indexed = false) and calls the service from step 3 to segment them one by one. Even if errors occur during this process, the front-end pages are not affected. Only Ferret's Ruby code needs to be modified here.
  5. If possible, the search program also uses the remote protocol to segment the query keywords before searching.

Disadvantages

  1. This solution looks ugly, like a piece of clothing covered in patches. I currently have no time to study the C code in Ferret, so I have to take a detour through Java.

  2. Because indexing runs in the background, a new article cannot be searched in real time; there is a delay. Still, I think this is a good solution. In addition, even if an error occurs in the background indexer, the front-end pages are not affected; the only consequence is that some records cannot be found.

  3. It may slow down keyword segmentation at search time, but the impact should be very small.
