Implementing a simple search engine in Go
Project address: https://github.com/wyh267/FalconEngine
If you are interested in search engines, have a look at this book; it is fairly introductory but covers all the functions of a search engine quite completely.
The raw data for my search engine comes from a MySQL database; you can customize this part to support other databases or local files as needed. Detail files are stored in Redis; this part can likewise be customized to use local files or another database. The inverted index and the forward index are stored locally in JSON format, which uses a fair amount of disk space. The first version leaves it this way for now; optimization will come later.
Usage
The project depends on the following libraries:
- github.com/outmana/log4jzl: logging
- github.com/ewangplay/config: configuration file parsing
- github.com/go-sql-driver/mysql: MySQL driver
- github.com/garyburd/redigo/redis: Redis driver
- github.com/huichen/sego: word segmenter. Many thanks to its author for this segmenter. His homepage also hosts a search engine; I have not looked at its implementation in detail, but anyone interested can take a look.
Compile
Run
- Get the word segmentation dictionary file from github.com/huichen/sego
- Run the indexer to generate the index files in the index directory
bin/falconengine -mode=build
bin/falconengine -mode=search
Basic concepts
You need to understand the following concepts; if any of them is completely unfamiliar, you will have to look it up yourself: inverted index, forward index file, Detail file, full index, incremental index, hash function, DocId.
Basic data structure
First of all, a search engine is not mysterious. There are only a few basic data structures; once they are settled, everything else is built on top of them like building blocks.
Suppose you have the following three documents to search:
| Document Number | Content |
| --- | --- |
| 1 | Hello, search engine |
| 2 | Search engines have a piece of data |
| 3 | Hello, there is a piece of test data |
Inverted index
The inverted index is the foundation of a search engine. Retrieval starts from the inverted index, so designing its data structure comes first.
A search engine is built around the DocId, that is, the document ID. DocIds are unique and continuous. The inverted index is a set of DocId lists, and each list corresponds to one keyword.
The basic structure of the inverted index built from the documents above is as follows:
| Keyword | Document Numbers |
| --- | --- |
| Hello | 1,3 |
| Search engine | 1,2 |
| Data | 2,3 |
| There is a piece | 2,3 |
| Test | 3 |
So when we search for the word "data", we can quickly see that it appears in documents 2 and 3 and return them.
Simple, isn't it? The key question is how, at retrieval time, to locate the third row (the one for "data") quickly. This is where a hash table comes in. A complete inverted index therefore consists of two parts: the table above, and a hash table. Through the hash table we learn that the subscript of the word "data" is 3, and from there we find documents 2 and 3.
The implementation of the hash table is not covered in detail here. There are many ways to implement a hash table and many possible hash functions. In short, locating a keyword works like this:
- First, compute the hash of the keyword to get its subscript
- Then, look up that subscript in the inverted index to get the list of document IDs
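The two lookup steps above can be sketched in Go as follows. This is a minimal illustration, not the code in the repository: the `InvertedIndex` type and its `Add` and `Lookup` methods are hypothetical names, and Go's built-in map (itself a hash table) stands in for the keyword hash table.

```go
package main

import "fmt"

// InvertedIndex is a hypothetical, simplified structure: a keyword dictionary
// that maps each keyword to a subscript, plus one posting list per subscript.
type InvertedIndex struct {
	dict     map[string]int // keyword -> subscript into postings (hash table)
	postings [][]int        // postings[i] holds the ascending DocIds of keyword i
}

// NewInvertedIndex returns an empty index.
func NewInvertedIndex() *InvertedIndex {
	return &InvertedIndex{dict: make(map[string]int)}
}

// Add records that docID contains term. DocIds are assumed to arrive in
// ascending order, so every posting list stays sorted.
func (idx *InvertedIndex) Add(term string, docID int) {
	i, ok := idx.dict[term]
	if !ok {
		i = len(idx.postings)
		idx.dict[term] = i
		idx.postings = append(idx.postings, nil)
	}
	list := idx.postings[i]
	if n := len(list); n > 0 && list[n-1] == docID {
		return // this document is already recorded for the term
	}
	idx.postings[i] = append(list, docID)
}

// Lookup hashes the term to its subscript and returns the posting list,
// or nil when the term is unknown.
func (idx *InvertedIndex) Lookup(term string) []int {
	i, ok := idx.dict[term]
	if !ok {
		return nil
	}
	return idx.postings[i]
}

func main() {
	// The three example documents, already split into keywords.
	docs := [][]string{
		{"hello", "search engine"},
		{"search engine", "there is a piece", "data"},
		{"hello", "there is a piece", "test", "data"},
	}
	idx := NewInvertedIndex()
	for i, terms := range docs {
		for _, t := range terms {
			idx.Add(t, i+1) // DocIds start at 1
		}
	}
	fmt.Println(idx.Lookup("data")) // [2 3]
}
```

Keeping the dictionary separate from the posting lists mirrors the two-part layout described above: the hash table only answers "which row", and the row holds the DocIds.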
In the code, InvertIndex.go holds the inverted index data structure and StringIndexDic.go holds the keyword hash table. The data produced by both is serialized and stored as JSON files.
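As a rough idea of what JSON serialization of such an index might look like, here is a small round trip using the standard `encoding/json` package. The `indexFile` struct and its field names are assumptions made for this sketch; the actual file layout used by the project may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexFile is a hypothetical on-disk layout: the keyword dictionary and the
// posting lists are stored together in one JSON document.
type indexFile struct {
	Dict     map[string]int `json:"dict"`     // keyword -> subscript
	Postings [][]int        `json:"postings"` // subscript -> DocIds
}

// encodeIndex serializes the index to JSON bytes.
func encodeIndex(f indexFile) ([]byte, error) {
	return json.Marshal(f)
}

// decodeIndex restores an index from JSON bytes.
func decodeIndex(data []byte) (indexFile, error) {
	var f indexFile
	err := json.Unmarshal(data, &f)
	return f, err
}

func main() {
	in := indexFile{
		Dict:     map[string]int{"data": 0, "test": 1},
		Postings: [][]int{{2, 3}, {3}},
	}
	raw, err := encodeIndex(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))

	out, err := decodeIndex(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Postings[out.Dict["data"]]) // [2 3]
}
```

JSON is easy to inspect and debug, at the cost of larger files than a binary format would produce.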
Forward index
Compared with the inverted index, the forward index is much simpler. It is essentially a dictionary whose key is the DocId and whose value is the content for that DocId. It is mainly used to filter the result set: the so-called "retrieve with the inverted index, filter with the forward index". When is this needed? You have probably run into the following scenario.
Suppose you search for "sneakers" on a certain online shopping site. A pile of shoes comes up, but you only want to see Nike shoes. You could append Nike to the query and search for "nike sneakers", but the results are not necessarily accurate, because not every Nike shoe has Nike in its title. This is where the forward index comes in: it filters the results down to Nike shoes for you.
The forward index is an array whose subscript is the DocId. NumberProfile.go and TextProfile.go are the concrete implementation files.
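The Nike scenario can be sketched as a filter over a forward index stored as a slice. The `textForwardIndex` type and its methods are hypothetical names for illustration and are not taken from TextProfile.go.

```go
package main

import "fmt"

// textForwardIndex is a hypothetical forward index for one string field.
// The slice subscript is the DocId, the element is the field value.
type textForwardIndex struct {
	values []string
}

// Set stores value for docID, growing the slice when docID is new.
// docID is assumed to be non-negative.
func (p *textForwardIndex) Set(docID int, value string) {
	for len(p.values) <= docID {
		p.values = append(p.values, "")
	}
	p.values[docID] = value
}

// Filter keeps only the DocIds whose field value equals want.
func (p *textForwardIndex) Filter(docIDs []int, want string) []int {
	var kept []int
	for _, id := range docIDs {
		if id >= 0 && id < len(p.values) && p.values[id] == want {
			kept = append(kept, id)
		}
	}
	return kept
}

func main() {
	brand := &textForwardIndex{}
	brand.Set(1, "nike")
	brand.Set(2, "adidas")
	brand.Set(3, "nike")

	// Suppose the inverted index returned these DocIds for "sneakers".
	candidates := []int{1, 2, 3}
	fmt.Println(brand.Filter(candidates, "nike")) // [1 3]
}
```

Because the DocId is the subscript, each check is a single array access, which is why filtering on the forward index is cheap even for long candidate lists.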
Detail file
The detail file is implemented with Redis and has no special data structure; each record is simply stored with its primary key ID as the key.
Incremental update
Incremental updates use a last_modify_time field: the program scans MySQL, fetches the data, and compares it with the data in Redis. If a record has changed, it is added to the index. The index is updated as follows:
If a forward index field is updated and the record is not new (only existing data was modified):
- Update the data at the subscript corresponding to the DocId directly
If a forward index field is updated and the record is new:
- Append a new DocId at the end of the file
If an inverted index field is updated:
- Mark the original DocId as deleted in the BitMap
- Allocate a new DocId and append it to the end of the inverted file
Because DocIds are continuous, updating an inverted index field means modifying the inverted list, and the inverted list is currently an array. So a bitmap is used to mark the corresponding DocId as deleted. Once the lists are changed to linked lists later, entries can be deleted dynamically.
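The bitmap approach for an updated inverted index field can be sketched like this. The `Bitmap` and `index` types, their methods, and the choice of a `[]uint64` backing slice are assumptions for illustration only.

```go
package main

import "fmt"

// Bitmap marks deleted DocIds, one bit per DocId (DocIds must be non-negative).
type Bitmap struct {
	words []uint64
}

// Set marks docID as deleted, growing the bitmap as needed.
func (b *Bitmap) Set(docID int) {
	w := docID / 64
	for len(b.words) <= w {
		b.words = append(b.words, 0)
	}
	b.words[w] |= uint64(1) << uint(docID%64)
}

// Has reports whether docID is marked as deleted.
func (b *Bitmap) Has(docID int) bool {
	w := docID / 64
	return w < len(b.words) && b.words[w]&(uint64(1)<<uint(docID%64)) != 0
}

// index is a hypothetical, simplified index: array-backed posting lists,
// a deletion bitmap, and a counter for the next free DocId.
type index struct {
	postings map[string][]int
	deleted  Bitmap
	nextDoc  int
}

// add indexes a document under a fresh DocId and returns that DocId.
func (ix *index) add(terms []string) int {
	id := ix.nextDoc
	ix.nextDoc++
	for _, t := range terms {
		ix.postings[t] = append(ix.postings[t], id)
	}
	return id
}

// updateTerms handles a change to an inverted field: the old DocId is only
// marked as deleted, and the document is appended again under a new DocId.
func (ix *index) updateTerms(oldDocID int, terms []string) int {
	ix.deleted.Set(oldDocID)
	return ix.add(terms)
}

func main() {
	ix := &index{postings: map[string][]int{}, nextDoc: 1}
	first := ix.add([]string{"hello", "data"})
	second := ix.updateTerms(first, []string{"hello", "test"})

	fmt.Println(first, second)                                // 1 2
	fmt.Println(ix.deleted.Has(first), ix.deleted.Has(second)) // true false
	fmt.Println(ix.postings["hello"])                         // [1 2]
}
```

Note that the stale DocId stays in the posting list for "hello"; it has to be skipped at query time by checking the bitmap, which is the trade-off of not rewriting the array.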
Incremental updates use a goroutine that scans a database field; later this can be changed to fetch incremental updates from Kafka or by other means.
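A goroutine-based scan driven by last_modify_time might look roughly like the sketch below. To keep it self-contained, an in-memory table replaces MySQL and a pre-filled channel replaces a timer; `row`, `fetchFrom`, and `scanLoop` are hypothetical names, not functions from the project.

```go
package main

import "fmt"

// row is a hypothetical record; LastModify stands for the value of the
// last_modify_time column (for example a Unix timestamp).
type row struct {
	ID         int
	LastModify int64
}

// fetchFrom returns a query function over an in-memory table. In a real
// setup this would be a SQL query filtering on last_modify_time.
func fetchFrom(table []row) func(since int64) []row {
	return func(since int64) []row {
		var changed []row
		for _, r := range table {
			if r.LastModify > since {
				changed = append(changed, r)
			}
		}
		return changed
	}
}

// scanLoop runs one scan per tick. It remembers the newest modification
// time seen so far, so each row is reported only once per change.
func scanLoop(fetch func(since int64) []row, ticks <-chan struct{}, out chan<- row) {
	defer close(out)
	var watermark int64
	for range ticks {
		for _, r := range fetch(watermark) {
			if r.LastModify > watermark {
				watermark = r.LastModify
			}
			out <- r
		}
	}
}

func main() {
	table := []row{{ID: 1, LastModify: 100}, {ID: 2, LastModify: 200}}

	// Two ticks: the first scan finds both rows, the second finds nothing new.
	ticks := make(chan struct{}, 2)
	ticks <- struct{}{}
	ticks <- struct{}{}
	close(ticks)

	out := make(chan row)
	go scanLoop(fetchFrom(table), ticks, out)
	for r := range out {
		fmt.Println("reindex document", r.ID)
	}
}
```

In a long-running service the ticks would typically come from a timer, and the consumer of `out` would apply the index update rules described above.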
Data retrieval
Data retrieval is divided into the following steps:
- Get the DocId chains from the inverted index by keyword; when there are multiple keywords, take their intersection
- Filter out the deleted DocIds using the BitMap
- Filter the DocIds by the forward index conditions to get the final DocId chain
- Map each DocId to the real ID of the document and fetch the document details from Redis for display
The steps above are mainly implemented in IndexSet.go
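The retrieval steps can be sketched end to end as below. This is a simplified illustration rather than the logic in IndexSet.go: `intersect` and `search` are hypothetical functions, a `map[int]bool` stands in for the BitMap, a callback stands in for the forward index conditions, and the final Redis lookup is left out.

```go
package main

import "fmt"

// intersect returns the DocIds present in both ascending lists.
func intersect(a, b []int) []int {
	var out []int
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// search intersects the posting lists of all keywords, drops deleted DocIds,
// applies the forward index filter keep, and maps the surviving DocIds to
// the real document IDs (primaryKey is indexed by DocId).
func search(postings map[string][]int, deleted map[int]bool,
	keep func(docID int) bool, primaryKey []int, keywords ...string) []int {
	if len(keywords) == 0 {
		return nil
	}
	docs := postings[keywords[0]]
	for _, k := range keywords[1:] {
		docs = intersect(docs, postings[k])
	}
	var ids []int
	for _, d := range docs {
		if deleted[d] || !keep(d) || d < 0 || d >= len(primaryKey) {
			continue
		}
		ids = append(ids, primaryKey[d])
	}
	return ids
}

func main() {
	postings := map[string][]int{
		"hello":            {1, 3},
		"data":             {2, 3},
		"there is a piece": {2, 3},
	}
	deleted := map[int]bool{2: true}        // DocId 2 was marked as deleted
	primaryKey := []int{0, 1001, 1002, 1003} // DocId -> real document ID
	keepAll := func(int) bool { return true }

	fmt.Println(search(postings, deleted, keepAll, primaryKey, "data", "there is a piece")) // [1003]
}
```

The returned IDs are what would then be used as keys to fetch the document details for display.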
To be continued...
Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.