Implementing a simple search engine in the Go language

Last Update:2015-07-27 Source: Internet

Author: User


Project address: https://github.com/wyh267/FalconEngine

If you are interested in search engines, this book is worth a look: it is a fairly accessible yet reasonably complete introduction to everything a search engine does.

The raw data for my search engine comes from a MySQL database; you can extend the code as needed to support other databases or local files. The detail files are stored in a Redis database; this part can likewise be adapted to use local files or other databases. The inverted index and the forward index are stored locally in JSON format, which takes up a fair amount of disk space. The first version does it this way for now; optimization will come later.

How to use

The project depends on the following libraries

github.com/outmana/log4jzl logging
github.com/ewangplay/config configuration file parsing
github.com/go-sql-driver/mysql MySQL driver
github.com/garyburd/redigo/redis Redis driver
github.com/huichen/sego word segmenter. Many thanks to its author for this segmenter. His homepage also hosts a search engine; I have not looked at its implementation in detail, but anyone interested can take a look.

Compile

Run install.sh directly

Run

Get the word-segmentation dictionary file from github.com/huichen/sego
Run the indexer to generate the index files in the index directory

bin/falconengine -mode=build

Run the searcher

bin/falconengine -mode=search

Basic concepts

You need to understand the following concepts. If any of them are unfamiliar, look them up yourself first: inverted index, forward file, detail file, full index, incremental index, hash function, DocId.

Basic data structure

First of all, a search engine is not mysterious. There are only a few basic data structures; once those are settled, everything else is built on top of them like building blocks.

Suppose you have the following three documents to search

Document Number	Content
1	Hello, search engine
2	Search engine has a piece of data
3	Hello, there is a piece of test data

Inverted index

The inverted index is the foundation of a search engine: retrieval starts from it. So designing the inverted index data structure is the first step in building any search engine.

The search engine works on DocIds, that is, document IDs. DocIds are unique and continuous. The inverted index is a set of DocId linked lists, each of which corresponds to one keyword.

The basic structure of the inverted index built from the documents above is as follows

Keyword	Document Numbers
Hello	1,3
Search engine	1,2
Data	2,3
A piece of	2,3
Test	3

So when we search for the word "data", we can quickly tell that it appears in documents 2 and 3, and retrieve them.

Simple, isn't it? The key question is how, when searching for "data", to quickly locate the third row of the table. A hash table is used for this. A complete inverted index therefore consists of two parts: the first is the table above, and the second is a hash table. Through the hash table we learn that the word "data" has subscript 3, and from there we find documents 2 and 3.

The hash table implementation is not covered in detail here: there are many ways to implement a hash table and many hash functions to choose from. In short, locating a keyword takes two steps
- First, compute the hash of the keyword to get its subscript
- Then, look up that subscript in the inverted index to get the list of document IDs

In the code, InvertIndex.go holds the inverted index data structure and StringIndexDic.go holds the keyword hash table. The data produced by both is serialized to JSON files for storage.
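The two-part structure described above can be sketched roughly as follows. This is only an illustration and not the code in InvertIndex.go or StringIndexDic.go: the type InvertedIndex, its methods and buildSample are hypothetical names, Go's built-in map stands in for the keyword hash table, and the documents are assumed to be already segmented into keywords.

```go
package main

import "fmt"

// InvertedIndex is a hypothetical, simplified stand-in for the two parts
// described in the article: a keyword dictionary plus the DocId lists.
type InvertedIndex struct {
	dict     map[string]int // keyword -> subscript into postings (built-in hash table)
	postings [][]int        // postings[i] is the ascending DocId list of keyword i
}

// NewInvertedIndex returns an empty index.
func NewInvertedIndex() *InvertedIndex {
	return &InvertedIndex{dict: make(map[string]int)}
}

// Add records that docID contains keyword.
// DocIds are assumed to be added in ascending order.
func (idx *InvertedIndex) Add(keyword string, docID int) {
	pos, ok := idx.dict[keyword]
	if !ok {
		pos = len(idx.postings)
		idx.dict[keyword] = pos
		idx.postings = append(idx.postings, nil)
	}
	list := idx.postings[pos]
	if n := len(list); n > 0 && list[n-1] == docID {
		return // keyword already recorded for this document
	}
	idx.postings[pos] = append(list, docID)
}

// Lookup returns the DocId list for keyword, or nil if the keyword is unknown.
func (idx *InvertedIndex) Lookup(keyword string) []int {
	pos, ok := idx.dict[keyword]
	if !ok {
		return nil
	}
	return idx.postings[pos]
}

// buildSample indexes the three example documents, already split into keywords.
func buildSample() *InvertedIndex {
	docs := [][]string{
		{"hello", "search engine"},
		{"search engine", "a piece of", "data"},
		{"hello", "a piece of", "test", "data"},
	}
	idx := NewInvertedIndex()
	for i, words := range docs {
		for _, w := range words {
			idx.Add(w, i+1) // document numbers start at 1
		}
	}
	return idx
}

func main() {
	idx := buildSample()
	fmt.Println(idx.Lookup("data")) // prints [2 3]
}
```

Using an array of lists addressed by subscript keeps the dictionary small and makes both parts easy to serialize separately, which matches the two-file layout the article describes.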

Forward (positive row) index

Compared with the inverted index, the forward index is much simpler. It is essentially a dictionary whose key is the DocId and whose value is the content belonging to that DocId. It is mainly used to filter the result set, the so-called "retrieve with the inverted index, filter with the forward index". In which scenarios is this needed? You have surely run into the following one.

Suppose you search for "sneakers" on a certain e-commerce site. You get a whole pile of shoes, but you only want to see Nike shoes. You could put "Nike" in front and search for "Nike sneakers", but the results will not necessarily be accurate, because not every Nike shoe has "Nike" in its title. This is where the forward index comes in: it filters the results down to the Nike shoes for you.

The forward index is an array whose subscript is the DocId. NumberProfile.go and TextProfile.go are the concrete implementation files.
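A minimal sketch of forward-index filtering, under the assumption that each document has one text field (here a brand). The type TextField and its Filter method are hypothetical and are not taken from TextProfile.go.

```go
package main

import "fmt"

// TextField is a hypothetical forward index for one text field:
// the slice subscript is the DocId, the value is that document's field value.
type TextField struct {
	values []string
}

// Filter keeps only the DocIds whose field value equals want.
func (p *TextField) Filter(docIDs []int, want string) []int {
	var kept []int
	for _, id := range docIDs {
		if id >= 0 && id < len(p.values) && p.values[id] == want {
			kept = append(kept, id)
		}
	}
	return kept
}

func main() {
	// Brand field for DocIds 0..4 (DocId 0 is unused in this sketch).
	brand := &TextField{values: []string{"", "nike", "adidas", "nike", "puma"}}
	// Pretend the inverted index returned these DocIds for "sneakers".
	candidates := []int{1, 2, 3, 4}
	fmt.Println(brand.Filter(candidates, "nike")) // prints [1 3]
}
```

Because the subscript is the DocId, each check is a single array access, so filtering a candidate list costs time proportional to its length.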

Detail file

The detail file is implemented with Redis. It has no special data structure: records are simply stored with the primary key ID as the key.

Incremental update

The incremental update scans MySQL using a last_modify_time field, fetches the data, and compares it with the data in Redis. If a record has been updated, it is added to the index according to the following rules:

If a forward-index field is updated and the record is not new (existing data was only modified)
- Update the data at the subscript corresponding to the DocId directly
If a forward-index field is updated and the record is new
- Append a new DocId to the end of the file
If an inverted-index field is updated
- Mark the original DocId as deleted in the BitMap
- Assign a new DocId and append it to the end of the inverted file

Because DocIds are continuous, updating an inverted-index field means modifying the inverted list, and the inverted list is currently an array. So a BitMap is set up and the corresponding DocId is marked as deleted in it. Later, once the inverted list is changed to a linked list, entries can be deleted dynamically.
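The update rules above might look like this in a toy in-memory index. Everything here is an assumption made for illustration: the Bitmap and Index types, their method names, and the choice of one inverted field (keywords) and one forward field (price) do not come from the project.

```go
package main

import "fmt"

// Bitmap marks deleted DocIds, one bit per DocId.
type Bitmap struct{ words []uint64 }

// Set marks id as deleted, growing the bitmap if needed.
func (b *Bitmap) Set(id int) {
	for id/64 >= len(b.words) {
		b.words = append(b.words, 0)
	}
	b.words[id/64] |= uint64(1) << uint(id%64)
}

// Has reports whether id is marked as deleted.
func (b *Bitmap) Has(id int) bool {
	return id/64 < len(b.words) && b.words[id/64]&(uint64(1)<<uint(id%64)) != 0
}

// Index is a hypothetical index with one inverted field (keywords)
// and one forward field (price).
type Index struct {
	postings map[string][]int // keyword -> DocId list (append-only array)
	prices   []int            // forward index, subscript = DocId
	deleted  Bitmap
}

// Append stores a new document at the end and returns its DocId.
func (ix *Index) Append(price int, keywords []string) int {
	id := len(ix.prices)
	ix.prices = append(ix.prices, price)
	for _, k := range keywords {
		ix.postings[k] = append(ix.postings[k], id)
	}
	return id
}

// UpdateForward overwrites the forward value of an existing DocId in place.
func (ix *Index) UpdateForward(id, price int) { ix.prices[id] = price }

// UpdateInverted marks the old DocId as deleted and re-adds the document
// under a fresh DocId at the end of the lists.
func (ix *Index) UpdateInverted(oldID int, keywords []string) int {
	ix.deleted.Set(oldID)
	return ix.Append(ix.prices[oldID], keywords)
}

func main() {
	ix := &Index{postings: map[string][]int{}}
	a := ix.Append(100, []string{"hello"})      // new record: DocId 0
	ix.UpdateForward(a, 120)                    // forward field changed: in place
	b := ix.UpdateInverted(a, []string{"data"}) // inverted field changed: DocId 1
	fmt.Println(a, b, ix.deleted.Has(a), ix.deleted.Has(b), ix.prices[b])
	// prints 0 1 true false 120
}
```

Note that the old DocId stays in the "hello" list; it is the bitmap check at query time that hides it, which is why the array-based lists never need to be rewritten.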

Incremental updates use a goroutine to scan the database field. Later this could be changed to fetch incremental updates from Kafka or by other means.
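A rough sketch of such a scanning goroutine, assuming a periodic poll. The database is replaced by an in-memory slice so the example runs on its own; Row, fetchFunc, pollOnce, scanUpdates and fakeTable are hypothetical names, and last_modify_time is assumed to be stored as Unix seconds.

```go
package main

import (
	"fmt"
	"time"
)

// Row is a hypothetical record; LastModify is last_modify_time as Unix seconds.
type Row struct {
	ID         int
	Title      string
	LastModify int64
}

// fetchFunc stands in for a query that selects rows whose
// last_modify_time is greater than since; it is an assumption of this sketch.
type fetchFunc func(since int64) []Row

// pollOnce fetches rows modified after since and returns them together
// with the newest modification time seen.
func pollOnce(fetch fetchFunc, since int64) ([]Row, int64) {
	rows := fetch(since)
	latest := since
	for _, r := range rows {
		if r.LastModify > latest {
			latest = r.LastModify
		}
	}
	return rows, latest
}

// scanUpdates polls on every tick and sends changed rows to out
// until done is closed.
func scanUpdates(fetch fetchFunc, interval time.Duration, out chan<- Row, done <-chan struct{}) {
	defer close(out)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var since int64
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			var rows []Row
			rows, since = pollOnce(fetch, since)
			for _, r := range rows {
				select {
				case out <- r:
				case <-done:
					return
				}
			}
		}
	}
}

// fakeTable returns a fetchFunc over an in-memory slice, replacing MySQL here.
func fakeTable(table []Row) fetchFunc {
	return func(since int64) []Row {
		var changed []Row
		for _, r := range table {
			if r.LastModify > since {
				changed = append(changed, r)
			}
		}
		return changed
	}
}

func main() {
	fetch := fakeTable([]Row{{1, "hello", 100}, {2, "data", 200}})
	out := make(chan Row)
	done := make(chan struct{})
	go scanUpdates(fetch, 10*time.Millisecond, out, done)
	for i := 0; i < 2; i++ {
		r := <-out
		fmt.Println(r.ID, r.Title)
	}
	close(done)
}
```

Keeping the fetch step behind a function type means the consumer of the channel would not need to change if the source were later swapped for a message queue.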

Data retrieval

Data retrieval is divided into the following steps

Get the DocId chains from the inverted index according to the keywords, and take their intersection when there are multiple keywords
Filter out the deleted DocIds using the BitMap
Filter the remaining DocIds against the query conditions using the forward file to obtain the final DocId chain
Convert each DocId to the real ID of the document and fetch the document's details from Redis for display

The steps above are mainly implemented in IndexSet.go
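The retrieval steps can be sketched as one small function. This is an illustration under several assumptions and not the code in IndexSet.go: the names intersect and search are hypothetical, a map of deleted DocIds stands in for the BitMap, the forward-file condition is passed in as a function, and the final Redis lookup is left out so that only the real IDs are returned.

```go
package main

import "fmt"

// intersect returns the DocIds present in both ascending lists.
func intersect(a, b []int) []int {
	var out []int
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// search runs the steps in order: intersect the DocId lists of all keywords,
// drop deleted DocIds, apply the forward-index condition keep, and map the
// surviving DocIds to real document IDs.
func search(postings map[string][]int, deleted map[int]bool,
	keep func(docID int) bool, realID []int, keywords ...string) []int {
	if len(keywords) == 0 {
		return nil
	}
	result := postings[keywords[0]]
	for _, k := range keywords[1:] {
		result = intersect(result, postings[k])
	}
	var ids []int
	for _, d := range result {
		if deleted[d] || !keep(d) {
			continue
		}
		ids = append(ids, realID[d])
	}
	return ids
}

func main() {
	postings := map[string][]int{
		"hello":      {1, 3},
		"data":       {2, 3},
		"a piece of": {2, 3},
	}
	deleted := map[int]bool{2: true}         // DocId 2 was replaced by an update
	realID := []int{0, 1001, 1002, 1003}     // subscript = DocId, value = primary key
	all := func(int) bool { return true }    // no forward-index condition
	fmt.Println(search(postings, deleted, all, realID, "data", "a piece of"))
	// prints [1003]
}
```

The returned primary keys would then be used as keys to fetch the detail records for display.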

To be continued...


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
