IDocument is a document sharing platform developed for the studio. Since the back-end database is MongoDB, it had no intra-site search of its own; we just embedded a Baidu site-search box. It later turned out that BAKA Baidu had not indexed a single page on the site, so the Chairman strongly demanded a proper intra-site search.
MongoDB's word-segmentation (tokenized) search does not support Chinese, and exact matching is basically useless, so I considered a few approaches.
- Synonym matching: the fancy approach. The initial idea was to give each document `tag` keywords; for example, a document on "Advanced Mathematics" (高等数学) could carry the tag "高数", so searching for either "高等数学" or "高数" would find it.
- Character-based matching: splitting into single characters can produce the occasional false match, but it spares developers and site operators the burden of maintaining tags by hand.
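To see why character-based matching removes the need for tags, here is a minimal sketch (the title and query are made-up examples, not from the platform's data):

```javascript
// Splitting titles into single characters lets a short query like
// "高数" match the full title "高等数学" with no manual tagging.
var title = '高等数学';
var query = '高数';

var index = title.split('');   // ['高', '等', '数', '学']
var terms = query.split('');   // ['高', '数']

// The document matches if every query character appears in its index,
// the same containment test that $all performs inside MongoDB later.
var matches = terms.every(function (ch) {
  return index.indexOf(ch) !== -1;
});
console.log(matches); // true
```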
Considering that the database already held a fair amount of earlier data (and the Chairman, too tired to re-upload all the documents, strongly insisted that I keep the original data), I chose the character-based search. The basic steps are as follows:
- Call the character-splitting function `split()`, the document title `title`, and the search input `querytext`.
- When a new document object is saved, give it an attribute named `searchIndex` whose value is `split(title)`, i.e. the array obtained by splitting `title` into characters.
- Index `searchIndex` with `ensureIndex({searchIndex: 1})`. (meengju previously brought up the question of ordering, but it is not considered here.)
- When searching, hand `split(querytext)` to the database's `find()`.
JavaScript ships with the wonderfully convenient `String.prototype.split()` method, so splitting is simply `title.split('')`.
meengju also pointed out that against array elements you cannot use a regex like `/variable/i` to get case-insensitive matching. So the title must be converted to lowercase when the index is saved, and the query converted to lowercase when searching, to achieve case-insensitive search.
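A quick sketch of the lowercase-both-sides idea (the sample strings are illustrative, not from the project):

```javascript
// Lowercase at index time and at query time; plain containment
// checks are then case-insensitive without needing /…/i regexes.
var indexed = 'MongoDB Guide'.toLowerCase().split('');
var queried = 'mongodb'.toLowerCase().split('');

var hit = queried.every(function (ch) {
  return indexed.indexOf(ch) !== -1;
});
console.log(hit); // true
```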
Modify the routing function to add the new `searchIndex` property:
```javascript
// Helper: remove empty/space elements left after splitting into characters.
// (Defined before use so it is in place when the route handler runs.)
Array.prototype.clean = function (deleteValue) {
  for (var i = 0; i < this.length; i++) {
    if (this[i] == deleteValue) {
      this.splice(i, 1);
      i--;
    }
  }
  return this;
};

var newdoc = {
  title: req.body.title,
  updateTime: Math.round((new Date()).getTime() / 1000),
  fileType: req.body.fileType,
  belongs: req.body.belongs,
  course: req.body.courseId,
  type: req.body.type,
  link: req.body.link,
  downloads: 0,
  // lowercase, split into single characters, then drop space elements
  searchIndex: req.body.title.toLowerCase().split('').clean(' ')
};
```
Modify the document model function to index `searchIndex`:
```javascript
exports.addnew = function (newdoc, callback) {
  collection.insert(newdoc, { safe: true }, function (err) {
    if (err) {
      return callback(err);
    }
    // Make sure searchIndex is indexed so array queries stay fast.
    collection.ensureIndex({ searchIndex: 1 }, function (err) {
      if (err) {
        return callback(err);
      }
      callback(null);
    });
  });
};
```
Next, the document search function. Pitfall: if the split-character array is handed to `find()` as-is, the effect is the same as exact matching (I'm not that familiar with MongoDB...). The correct query is `db.documents.find({ searchIndex: {$all: splittedTextArray} })`.
```javascript
exports.searchdoc = function (splittedTextArray, callback) {
  // $all: every character of the query must appear in searchIndex.
  collection.find({ searchIndex: { $all: splittedTextArray } })
    .sort({ downloads: -1 }) // most-downloaded documents first
    .toArray(function (err, docs) {
      if (err) {
        return callback(err, null);
      }
      callback(null, docs);
    });
};
```
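For completeness, the query text needs the same lowercase + split + space-removal treatment before it reaches `searchdoc()`. A sketch (the function name `prepareQuery` is mine, not from the project):

```javascript
// Mirror of the save path's toLowerCase().split('').clean(' ') steps.
function prepareQuery(querytext) {
  return querytext.toLowerCase().split('').filter(function (ch) {
    return ch !== ' ';
  });
}

console.log(prepareQuery('Advanced 高数'));
// ['a', 'd', 'v', 'a', 'n', 'c', 'e', 'd', '高', '数']
```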
Then, at the Chairman's request, I wrote a script to update the existing data: fetch all the original records, traverse them adding `searchIndex`, and write them back. There were only a hundred-odd records, so I didn't bother with batching.
According to meengju, array handling is one of MongoDB's strong points, so it should be nice and fast~
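The back-fill script itself wasn't published; here is a hedged sketch of its logic run against an in-memory array instead of the live collection (the document shapes are invented):

```javascript
// Same transformation the save path applies to new documents.
function toSearchIndex(title) {
  return title.toLowerCase().split('').filter(function (ch) {
    return ch !== ' ';
  });
}

// Stand-in for the records fetched from MongoDB; the real script
// would write each one back with {$set: {searchIndex: ...}}.
var docs = [
  { title: 'Advanced Mathematics' },
  { title: '高等数学' }
];

docs.forEach(function (doc) {
  doc.searchIndex = toSearchIndex(doc.title);
});

console.log(docs[1].searchIndex); // ['高', '等', '数', '学']
```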
The source code of the entire project is here
The modification is complete and the test results are great w. But BAKA blueed still hasn't written the styles for the search-results page... front-end work is hard, so they say.
Original article: MongoDB Chinese Word Segmentation Search. Thanks to the original author for sharing.