MongoDB Chinese Word Segmentation search

Source: Internet
Author: User

IDocument is a document-sharing platform developed for the studio. Because the back-end database is MongoDB, it originally had no site search of its own; it only embedded a Baidu site-search tool. It later turned out that Baidu (BAKA!) had not indexed any of the site's links, so the Chairman strongly urged me to build a proper site search.

MongoDB's word-segmentation search does not support Chinese, and exact matching is basically useless for search. So I considered a couple of approaches.

  • Synonym matching, the high-end approach. The initial idea is to add a tag field and search on keywords. For example, the "Advanced Mathematics" document can carry the tag "high number" (the common abbreviation of the course name), so that it is found by searching for either "Advanced Mathematics" or "high number".
  • Character-by-character ("word-based") matching may return slightly imprecise results, but it reduces the burden on developers and site operators because no tags have to be maintained by hand.
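As a rough illustration of the tag idea (which the article does not end up using), a lookup could compare the query against both the title and a hand-maintained list of tags. The tags field, the findByTitleOrTag helper, and the sample data are hypothetical names for this sketch, and everything runs in memory rather than against MongoDB.

```javascript
// Hypothetical in-memory documents; `tags` would be maintained by hand.
const documents = [
    { title: 'Advanced Mathematics', tags: ['high number'] },
    { title: 'Linear Algebra', tags: [] }
];

// Return the documents whose title or any tag equals the query.
function findByTitleOrTag(docs, query) {
    return docs.filter(function (doc) {
        return doc.title === query || doc.tags.indexOf(query) !== -1;
    });
}

console.log(findByTitleOrTag(documents, 'high number'));
console.log(findByTitleOrTag(documents, 'Advanced Mathematics'));
```

Both calls return the same document, which is the point of the approach; the cost is that someone has to keep the tags up to date.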

Considering that the database already holds some existing data (and the Chairman said that uploading the documents again would be too tiring, so he strongly asked me to keep the original data), I chose the character-by-character ("word-based") search method here.

The basic steps are as follows:

  1. The word-splitting function is named split(), the document title is title, and the search string is querytext.
  2. When a new document object is saved, it carries an attribute named searchIndex whose value is split(title), that is, the array obtained by splitting title.
  3. Index searchIndex with ensureIndex({searchIndex: 1}). As for index ordering, meengju brought it up earlier, but it is not considered here.
  4. When searching, pass split(querytext) to the database's find().
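The steps above can be sketched in memory to show the intended behaviour. This is only a simulation under stated assumptions: save() and find() here are hypothetical stand-ins that operate on a plain array, not MongoDB calls, and step 3 (the index) is left out because it only affects speed, not results.

```javascript
// split() as named in the steps; here it simply yields single characters.
function split(text) {
    return text.split('');
}

// Step 2: store the split title next to the document (hypothetical helper).
function save(store, doc) {
    store.push(Object.assign({}, doc, { searchIndex: split(doc.title) }));
}

// Step 4 (simulated): keep the documents whose searchIndex contains
// every character of the split query.
function find(store, querytext) {
    const wanted = split(querytext);
    return store.filter(function (doc) {
        return wanted.every(function (ch) {
            return doc.searchIndex.indexOf(ch) !== -1;
        });
    });
}

const store = [];
save(store, { title: '高等数学' });
save(store, { title: '线性代数' });

// Only the first title contains both 高 and 数.
console.log(find(store, '高数').map(function (d) { return d.title; }));
```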

JavaScript comes with the very handy String.prototype.split() method, so the word splitting is simply title.split('').
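One caveat worth knowing: split('') cuts a string into UTF-16 code units, so a character outside the Basic Multilingual Plane (some rarer CJK characters, for example) is broken into two halves. The sketch below compares it with Array.from, an ES2015 alternative that iterates by code point. The sample title is made up for illustration, and the article itself uses split('').

```javascript
// '\u{20BB7}' is a CJK character outside the Basic Multilingual Plane.
const title = '\u{20BB7}野家';

const byCodeUnit = title.split('');     // splits the first character into two surrogates
const byCodePoint = Array.from(title);  // keeps each character whole

console.log(byCodeUnit.length);
console.log(byCodePoint.length);
```

For titles made only of common Chinese characters the two give the same result, so this matters mainly if rare characters or emoji can appear in titles.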

meengju also said that when arrays are used, a case-insensitive regular expression such as /variable/i cannot be used. Therefore the index must be converted to lowercase when it is saved, and the query must be converted to lowercase when searching, in order to get case-insensitive search.
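A minimal sketch of that normalization, assuming a hypothetical helper named toSearchIndex that is applied to both the stored title and the query (the name and the sample strings are not from the article):

```javascript
// Lowercase first, then split into characters and drop the spaces.
function toSearchIndex(text) {
    return text.toLowerCase().split('').filter(function (ch) {
        return ch !== ' ';
    });
}

const stored = toSearchIndex('MongoDB 入门');
const query = toSearchIndex('MONGO');

// Simulated lookup: every query character must appear in the stored index.
const found = query.every(function (ch) {
    return stored.indexOf(ch) !== -1;
});

console.log(stored);
console.log(found);
```

Because both sides pass through the same function, a query typed in capitals still finds a mixed-case title.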

Modify the route handler and add a new property, searchIndex:

    // Helper: remove every element equal to deleteValue from the array, in place.
    // It must be defined before the route handler below runs.
    Array.prototype.clean = function (deleteValue) {
        for (var i = 0; i < this.length; i++) {
            if (this[i] == deleteValue) {
                this.splice(i, 1);
                i--;
            }
        }
        return this;
    };

    var newdoc = {
        title: req.body.title,
        updateTime: Math.round((new Date()).getTime() / 1000),
        fileType: req.body.fileType,
        belongs: req.body.belongs,
        course: req.body.courseId,
        type: req.body.type,
        link: req.body.link,
        downloads: 0,
        // Split the lowercased title into single characters, then drop the spaces.
        searchIndex: req.body.title.toLowerCase().split('').clean(" ")
    };

Modify the document model function so that searchIndex is indexed:

    exports.addnew = function(newdoc, callback) {
        collection.insert(newdoc, {safe: true}, function(err) {
            if (err) {
                return callback(err);
            }
            // Make sure the searchIndex array is indexed.
            collection.ensureIndex({
                searchIndex: 1
            }, function(err) {
                if (err) {
                    return callback(err);
                }
                callback(null);
            });
        });
    }

Create the document search function. Pitfall: if the array of split characters is handed directly to find(), the effect is the same as an exact match. I am not very familiar with MongoDB...

The correct way to match is db.documents.find({ searchIndex: {$all: splittedTextArray} }).

    exports.searchdoc = function(splittedTextArray, callback) {
        // $all matches documents whose searchIndex contains every element of the array.
        collection.find({
            searchIndex: {$all: splittedTextArray}
        })
            .sort({ downloads: -1 })   // most-downloaded documents first
            .toArray(function(err, docs) {
                if (err) {
                    return callback(err, null);
                }
                callback(null, docs);
            });
    }
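The difference between handing the array to find() directly and wrapping it in $all can be simulated in plain JavaScript. The two functions below are hypothetical stand-ins that mimic the two matching rules on a single array field; they do not call MongoDB.

```javascript
// Mimics { searchIndex: queryArray }: same elements in the same order.
function exactArrayMatch(field, queryArray) {
    return field.length === queryArray.length &&
        field.every(function (value, i) {
            return value === queryArray[i];
        });
}

// Mimics { searchIndex: { $all: queryArray } }: every queried element
// must be present, in any order.
function allMatch(field, queryArray) {
    return queryArray.every(function (value) {
        return field.indexOf(value) !== -1;
    });
}

const searchIndex = ['高', '等', '数', '学'];

console.log(exactArrayMatch(searchIndex, ['高', '数']));              // false
console.log(allMatch(searchIndex, ['高', '数']));                     // true
console.log(exactArrayMatch(searchIndex, ['高', '等', '数', '学']));  // true
```

Note that the $all rule ignores order, so a query such as "数高" matches as well. That is one source of the slight imprecision of character-by-character matching.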

Then, as the Chairman required, I wrote a script to update the existing data. The method is to retrieve all the original data, traverse it, add searchIndex to each record, and write it back. After looking at the original data, there are about 100 million data records, so I did not consider processing it in segments.
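The author's script is not shown, so the following is only a sketch of how such a backfill might look. It assumes a callback-style collection object like the one used in the article's model code, with find().toArray() and an update() that accepts a $set document. The names backfill, buildSearchIndex, and fakeCollection are hypothetical, and the fake collection exists only so the sketch can run without a database.

```javascript
// Same normalization as at save time: lowercase, split, drop spaces.
function buildSearchIndex(title) {
    return title.toLowerCase().split('').filter(function (ch) {
        return ch !== ' ';
    });
}

// Read every document, then write searchIndex back one by one.
// Assumes collection.update(selector, change, callback) as in older drivers.
function backfill(collection, callback) {
    collection.find({}).toArray(function (err, docs) {
        if (err) { return callback(err); }
        if (docs.length === 0) { return callback(null, 0); }
        let pending = docs.length;
        let failed = false;
        docs.forEach(function (doc) {
            collection.update(
                { _id: doc._id },
                { $set: { searchIndex: buildSearchIndex(doc.title) } },
                function (err) {
                    if (failed) { return; }
                    if (err) { failed = true; return callback(err); }
                    pending -= 1;
                    if (pending === 0) { callback(null, docs.length); }
                }
            );
        });
    });
}

// Tiny in-memory stand-in for a collection (illustration only).
function fakeCollection(docs) {
    return {
        find: function () {
            return { toArray: function (cb) { cb(null, docs.slice()); } };
        },
        update: function (selector, change, cb) {
            docs.forEach(function (d) {
                if (d._id === selector._id) { Object.assign(d, change.$set); }
            });
            cb(null);
        }
    };
}

const oldDocs = [
    { _id: 1, title: 'Linear Algebra' },
    { _id: 2, title: '高等数学' }
];
let updatedCount = null;
backfill(fakeCollection(oldDocs), function (err, count) {
    updatedCount = err ? null : count;
    console.log(err, count);
});
```

Loading everything with toArray() keeps the script short, but it holds the whole collection in memory at once, which is the trade-off behind the decision not to process the data in segments.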

According to meengju, array processing is a strong point of MongoDB, so it is very fast ~

The source code of the entire project is here

The modification is complete and the test results are very good, but blueed (BAKA!) hasn't written the styles for the search results page yet... Apparently front-end work is rough.

Original article address: MongoDB Chinese Word Segmentation search, thanks to the original author for sharing.
