Implementing a simple search engine in Go

Source: Internet
Author: User

Project address: https://github.com/wyh267/FalconEngine

If you are interested in search engines, this book is worth reading: it gives a fairly accessible yet reasonably complete introduction to all the functions of a search engine.

The raw data for my search engine comes from a MySQL database; you can extend this yourself to support other databases or local files. Detail data is stored in Redis, and this part can likewise be adapted to your own needs, using local files or another database. The inverted index and the forward index are stored locally in JSON format, which takes up a fair amount of disk space. The first version does it this way for now and will be optimized later.

Usage: the project depends on the following libraries
    • github.com/outmana/log4jzl: logging
    • github.com/ewangplay/config: configuration file parsing
    • github.com/go-sql-driver/mysql: MySQL driver
    • github.com/garyburd/redigo/redis: Redis driver
    • github.com/huichen/sego: word segmenter. Many thanks to its author for this segmenter. His homepage also has a search engine; I have not looked at its implementation in detail, but have a look if you are interested.
Compile
    • Run install.sh directly
Run
    • Get the word segmentation dictionary file from github.com/huichen/sego
    • Run the indexer to generate the index files in the index directory

bin/FalconEngine -mode=build

    • Run the searcher

bin/FalconEngine -mode=search

Basic concepts

You need to understand the following concepts; if you do not know them at all, look them up yourself: inverted index, forward (profile) file, detail file, full index, incremental index, hash function, DocId.

Basic data structure

First of all, a search engine is nothing mysterious. There are only a few basic data structures; once they are fixed, everything else is built on top of them like building blocks.

Suppose you have the following three documents to search

Document number    Content
1                  Hello, search engine
2                  The search engine has a piece of data
3                  Hello, there is a piece of test data

Inverted index

The inverted index is the foundation of a search engine: retrieval starts from the inverted index, so the first step is to design its data structure.

The search engine works on DocIds, that is, document IDs. DocIds are unique and continuous. The inverted index is a set of DocId lists, one list per keyword.

The basic inverted index built from the documents above looks like this

Keyword            Document numbers
Hello              1,3
Search engine      1,2
Data               2,3
There is a piece   2,3
Test               3

So when we search for the word "data", we can quickly see that it appears in documents 2 and 3, and the retrieval is done.

Simple, isn't it? The key question is how, when searching for "data", to quickly locate the third row. This is where a hash table comes in. A complete inverted index therefore consists of two parts: the table above, and a hash table. Through the hash table we learn that the subscript of the word "data" is 3, and from there we find documents 2 and 3.

The hash table implementation is not described in detail here; there are many ways to implement a hash table, and many hash functions to choose from. In short, to locate a keyword:
- First, compute the hash of the keyword to get its subscript
- Then, look up that subscript in the inverted index to get the list of document IDs

In the code, InvertIndex.go holds the inverted index data structure and StringIndexDic.go is the keyword hash table; the data generated by both is serialized and stored as JSON files.
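
As a rough illustration of the two-part structure described above, here is a minimal Go sketch. It is not the code in InvertIndex.go or StringIndexDic.go: the type InvertedIndex and the methods Add and Lookup are names made up for this example, Go's built-in map stands in for the keyword hash table, and the documents are assumed to be already segmented into keywords.

```go
package main

import "fmt"

// InvertedIndex is a hypothetical, simplified structure for illustration:
// a keyword dictionary (hash table) plus one DocId list per keyword.
type InvertedIndex struct {
	dict     map[string]int // keyword -> row (subscript) in postings
	postings [][]int        // row -> ascending list of DocIds
}

// NewInvertedIndex returns an empty index.
func NewInvertedIndex() *InvertedIndex {
	return &InvertedIndex{dict: make(map[string]int)}
}

// Add records that docID contains term. DocIds are assumed to arrive in
// ascending order, so every DocId list stays sorted.
func (idx *InvertedIndex) Add(term string, docID int) {
	row, ok := idx.dict[term]
	if !ok {
		// First time this keyword is seen: give it the next free row.
		row = len(idx.postings)
		idx.dict[term] = row
		idx.postings = append(idx.postings, nil)
	}
	list := idx.postings[row]
	if n := len(list); n > 0 && list[n-1] == docID {
		return // this document is already recorded for the keyword
	}
	idx.postings[row] = append(list, docID)
}

// Lookup finds the row through the dictionary, then returns its DocId list.
// It returns nil for an unknown keyword.
func (idx *InvertedIndex) Lookup(term string) []int {
	row, ok := idx.dict[term]
	if !ok {
		return nil
	}
	return idx.postings[row]
}

func main() {
	// The three example documents, already split into keywords.
	docs := [][]string{
		{"hello", "search engine"},
		{"search engine", "data", "there is a piece"},
		{"hello", "there is a piece", "test", "data"},
	}
	idx := NewInvertedIndex()
	for i, terms := range docs {
		for _, t := range terms {
			idx.Add(t, i+1) // DocIds start at 1
		}
	}
	fmt.Println(idx.Lookup("data")) // [2 3]
}
```

Keeping the dictionary separate from the DocId lists mirrors the split between the two files mentioned above, and both parts are plain maps and slices, so they could be written out with encoding/json.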

Forward index

Compared with the inverted index, the forward index is much simpler. It is really just a dictionary whose key is the DocId and whose value is the content for that DocId. It is mainly used to filter the result set, the so-called "retrieve with the inverted index, filter with the forward index". What scenario needs this? You have surely run into the following one.

You search for "sneakers" on a certain e-commerce site and get a pile of shoes, but you only want to see Nike shoes. You could add "Nike" to the query and search for "Nike sneakers", but the results are not necessarily accurate, because not every Nike shoe has "Nike" in its title. This is where the forward index comes in: it filters the results down to the Nike shoes for you.

The forward index is an array whose subscript is the DocId; NumberProfile.go and TextProfile.go are the concrete implementation files
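
The filtering idea can be sketched in a few lines of Go. This is only an illustration under assumptions, not the contents of TextProfile.go: the type ForwardIndex, its Filter method, and the brand values are all invented for the example.

```go
package main

import "fmt"

// ForwardIndex is a hypothetical forward index for a single text field.
// The slice subscript is the DocId; the element is that document's value.
type ForwardIndex struct {
	values []string
}

// Filter keeps only the DocIds whose field value equals want.
// DocIds outside the array are skipped.
func (p *ForwardIndex) Filter(docIDs []int, want string) []int {
	var kept []int
	for _, id := range docIDs {
		if id >= 0 && id < len(p.values) && p.values[id] == want {
			kept = append(kept, id)
		}
	}
	return kept
}

func main() {
	// Subscript 0 is left empty so that DocIds start at 1.
	brand := &ForwardIndex{values: []string{"", "nike", "adidas", "nike", "puma"}}

	// Pretend the inverted index returned these DocIds for "sneakers".
	hits := []int{1, 2, 3, 4}

	fmt.Println(brand.Filter(hits, "nike")) // [1 3]
}
```

Because the subscript is the DocId, each check is a single array access, which is why filtering a result set this way is cheap.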

Detail file

The detail file is implemented with Redis. It has no special data structure; it simply uses the primary key ID as the key.

Incremental update

Incremental updates scan MySQL using a last_modify_time field, fetch the data, and compare it with the data in Redis; if the data has changed, it is added to the index. The rules for updating the index are as follows:

    • If a forward-index field is updated and the data is not new, only a modification of existing data
      • Update the data at the subscript of the corresponding DocId directly
    • If a forward-index field is updated and the data is new
      • Append a new DocId at the end of the file
    • If an inverted-index field is updated
      • Mark the original DocId as deleted in the BitMap
      • Allocate a new DocId and append it to the end of the inverted file

Because DocIds are continuous, updating an inverted-index field means modifying the DocId list, and that list is currently an array. So a bitmap is set up and the corresponding DocId is marked as deleted in it; later, the list can be changed to a linked list so that entries can be deleted dynamically.

Incremental updates use a goroutine that scans a database field; this can later be changed to fetch incremental data from Kafka or by other means.
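
The inverted-field case above can be sketched as follows. This is a simplified illustration rather than the project's code: the Bitmap and Index types, the UpdateInverted method, and the nextID counter are all assumptions made for the example, and DocIds are assumed to be non-negative.

```go
package main

import "fmt"

// Bitmap marks deleted DocIds using one bit per DocId (hypothetical type).
type Bitmap struct {
	words []uint64
}

// Set marks id as deleted, growing the bitmap if needed.
func (b *Bitmap) Set(id int) {
	w := id / 64
	for len(b.words) <= w {
		b.words = append(b.words, 0)
	}
	b.words[w] |= uint64(1) << uint(id%64)
}

// Has reports whether id is marked as deleted.
func (b *Bitmap) Has(id int) bool {
	w := id / 64
	if w >= len(b.words) {
		return false
	}
	return b.words[w]&(uint64(1)<<uint(id%64)) != 0
}

// Index is a hypothetical container for the inverted lists, the delete
// bitmap, and the next unused DocId.
type Index struct {
	postings map[string][]int
	deleted  Bitmap
	nextID   int
}

// UpdateInverted handles a change to an inverted-index field: the old DocId
// is only marked as deleted, and the document gets a fresh DocId that is
// appended to the list of each of its keywords.
func (ix *Index) UpdateInverted(oldID int, terms []string) int {
	ix.deleted.Set(oldID)
	id := ix.nextID
	ix.nextID++
	for _, t := range terms {
		ix.postings[t] = append(ix.postings[t], id)
	}
	return id
}

func main() {
	ix := &Index{postings: map[string][]int{"data": {2, 3}}, nextID: 4}
	newID := ix.UpdateInverted(2, []string{"data", "updated"})
	fmt.Println(newID, ix.postings["data"], ix.deleted.Has(2)) // 4 [2 3 4] true
}
```

Note that the old DocId 2 stays in the list for "data"; it is the bitmap check at query time that hides it, which is what makes the array-based lists workable without in-place deletion.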

Data retrieval

Data retrieval is divided into the following steps

    • Get the DocId list from the inverted index for the keyword; with multiple keywords, take the intersection
    • Use the BitMap to filter out deleted DocIds
    • Filter the DocIds according to the forward-index conditions to get the final DocId list
    • Map each DocId to the document's real ID and fetch the document's details from Redis for display

The IndexSet.go file mainly implements the steps above
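
The first three steps can be sketched like this. It is an illustration under assumptions, not the code in IndexSet.go: Intersect and Search are hypothetical names, a map[int]bool stands in for the bitmap, the forward-index condition is passed in as a function, and the final Redis lookup is left out.

```go
package main

import "fmt"

// Intersect returns the DocIds present in both a and b.
// Both lists must be sorted in ascending order.
func Intersect(a, b []int) []int {
	var out []int
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default: // same DocId in both lists
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// Search intersects the DocId lists of all keywords, drops deleted DocIds,
// and keeps only those accepted by the forward-index condition keep.
func Search(postings map[string][]int, keywords []string,
	deleted map[int]bool, keep func(docID int) bool) []int {
	if len(keywords) == 0 {
		return nil
	}
	result := postings[keywords[0]]
	for _, kw := range keywords[1:] {
		result = Intersect(result, postings[kw])
	}
	var out []int
	for _, id := range result {
		if !deleted[id] && keep(id) {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	postings := map[string][]int{
		"there is a piece": {2, 3, 4},
		"data":             {2, 3, 4},
	}
	deleted := map[int]bool{2: true}            // DocId 2 was replaced by an update
	category := []string{"", "a", "a", "b", "a"} // forward index, subscript = DocId
	keep := func(id int) bool { return category[id] == "a" }

	fmt.Println(Search(postings, []string{"there is a piece", "data"}, deleted, keep)) // [4]
}
```

Intersecting sorted lists with two cursors takes time proportional to the combined length of the lists, which is one reason to keep DocIds in ascending order.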

To be continued.......

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.