Write a search engine with Golang (0x02)---inverted index technology

Source: Internet
Author: User

In this article we discuss the core technology of a search engine: the inverted index. The topic may need to be split across several articles. We will first cover the principle behind the inverted index, then look at how to implement one with some data structures and algorithms, and finally describe how an indexer generates an inverted index from documents.

Inverted index

What is an inverted index? We all know what an index is: a data structure that lets us find a document faster. For example, if each document is given a number, that number can be used to find the document quickly. An inverted index is not keyed by document number; it is an index structure that finds documents through the words they contain.

Inverted index technology is simple and efficient, and it is practically tailor-made for search engines. It is this technology that makes a search engine possible at all, letting us find what we want among a huge number of articles with a single keyword.

Let's look at an example that has the following documents:

Document number   Document content
1                 This is a Go language implementation of the search engine
2                 PHP is the best language in the world
3                 Linux is implemented in C and Assembly languages
4                 Google is one of the best search engine companies in the world

Clearly, given the numbers 1, 2, 3, 4 we can quickly find a document by its number. But we need to find documents by keyword, so we rearrange the table above slightly, and the result is an inverted index.

Inverted index (only some of the keywords are listed)

Keyword           Document numbers
Go                1
Language          1,2,3
Implement         1,3
Search engine     1,4
PHP               2
World             2,4
Best              2,4
Assembly          3
Company           4

This is easy to understand: an inverted index simply regenerates the content of the documents as a new table, and through this table we can quickly find the documents that correspond to each keyword. That is all there is to the core principle of the inverted index, and it is the most basic cornerstone of a search engine. Whether it is Google or any other search engine, the most important things are these two tables; without them, nothing can be done.

It looks pretty simple. Let's simulate a search: suppose we type the keyword "search engine".
1. In table 2 we find that the keyword "search engine" appears in row 4
2. We read the second column of row 4 and get the document numbers 1 and 4
3. We go to the first table and fetch the actual content of each document by its number
4. We show documents 1 and 4 as the results
5. The search is complete
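
The steps above can be sketched in Go roughly as follows. This is only an in-memory illustration, not the implementation developed in this series: the names docs, buildIndex, intersect and search are made up for the example, and tokenization is assumed to be a naive lowercase split on whitespace, so the keyword "search engine" is handled as the two terms "search" and "engine" whose document lists are intersected.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// docs plays the role of table 1: document number -> content.
var docs = map[int]string{
	1: "This is a Go language implementation of the search engine",
	2: "PHP is the best language in the world",
	3: "Linux is implemented in C and Assembly languages",
	4: "Google is one of the best search engine companies in the world",
}

// buildIndex plays the role of table 2: term -> sorted document numbers.
// Assumption: terms are lowercase words separated by whitespace.
func buildIndex(docs map[int]string) map[string][]int {
	index := make(map[string][]int)
	for id, text := range docs {
		seen := make(map[string]bool)
		for _, w := range strings.Fields(strings.ToLower(text)) {
			if !seen[w] {
				seen[w] = true
				index[w] = append(index[w], id)
			}
		}
	}
	for w := range index {
		sort.Ints(index[w])
	}
	return index
}

// intersect returns the numbers present in both sorted lists.
func intersect(a, b []int) []int {
	out := []int{}
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

// search returns the documents that contain every term of the query.
func search(index map[string][]int, query string) []int {
	var result []int
	for i, w := range strings.Fields(strings.ToLower(query)) {
		if i == 0 {
			result = append([]int(nil), index[w]...)
			continue
		}
		result = intersect(result, index[w])
	}
	return result
}

func main() {
	index := buildIndex(docs)
	for _, id := range search(index, "search engine") {
		fmt.Println(id, docs[id]) // step 3: back to table 1 by number
	}
}
```

With such a naive tokenizer "language" and "languages" are different terms, so a real indexer needs proper word segmentation and normalization.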

The above is the most basic technique behind a search engine. How to design data structures and algorithms that implement table 2 is the key to search engine technology.

Before implementing the data structures and algorithms, we need to recognize that a search engine searches a huge amount of data. A typical medium-sized e-commerce site has tens to hundreds of GB of data, so this data structure should be stored on local disk rather than in memory. Given that, to keep searches fast we can either cache the hot data ourselves or rely on a facility of the operating system, mmap. Since a cache I write myself is not necessarily (and is in fact unlikely to be) better than the operating system's, I use mmap.

The mmap system call

mmap maps a file or other object into memory. The file is mapped onto a number of pages; if the file size is not a whole multiple of the page size, the unused part of the last page is zeroed. Once the mapping is established, the process can read and write that memory through pointers, and the system automatically writes dirty pages back to the corresponding file on disk. File operations are thus carried out without calling system calls such as read and write.

One of the biggest benefits of mmap is that the operating system itself maps the file on disk into memory. When there is enough memory, operating on the file is as fast as operating on memory; when memory is low, the operating system evicts some pages from memory, which gives us something like a cache. This is especially suitable for reading large files. Our inverted index file is exactly such a huge file: it is basically written once and rarely modified afterwards, and every query is a read. So mmap is a good choice.

mmap is a system call, and its implementation differs between operating systems. On Linux the corresponding C call is the following; for the exact meaning of the parameters, see the man page:

Header file: <sys/mman.h>
Function prototype:
void *mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset);

After a huge file is mapped with mmap, read and write performance on the file is determined by system memory: the more memory the system has available, the better the performance, because with sufficient memory the operating system keeps more of the file in memory, which improves throughput.

In Go, the corresponding mmap call is the following (it requires importing the syscall package):

func Mmap(fd int, offset int64, length int, prot int, flags int) (data []byte, err error)

The parameters are: the file descriptor, the offset, the length to map, the desired memory protection flags (for example read-only or read-write), and the mapping flags (whether changes are written back to the file or made only to a private copy).

Because mmap is a basic building block that is needed in many places, a separate mmap type is implemented in utils.Mmap, providing some basic methods:

func NewMmap(file_name string, mode int) (*Mmap, error) // creates a new Mmap
func (this *Mmap) ReadInt64(start int64) int64 // reads an int64 value from the specified location
func (this *Mmap) WriteInt64(start, value int64) error // writes an int64 value at the specified location
func (this *Mmap) ReadDocIdsArry(start, len uint64) []DocIdNode // reads a chain of document IDs from the specified location
......
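
To give an idea of what methods like these might do, here is a minimal sketch over a plain byte slice. It is not the code of the type described above: the type name region, the little-endian byte order and the bounds check are all assumptions made for the example. A slice returned by syscall.Mmap could be used as the data field in the same way.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// region is a hypothetical stand-in for a mapped file.
type region struct {
	data []byte
}

var errOutOfRange = errors.New("region: offset out of range")

// WriteInt64 stores value as 8 little-endian bytes at offset start.
func (r *region) WriteInt64(start, value int64) error {
	if start < 0 || start+8 > int64(len(r.data)) {
		return errOutOfRange
	}
	binary.LittleEndian.PutUint64(r.data[start:start+8], uint64(value))
	return nil
}

// ReadInt64 reads the 8 bytes at offset start back as an int64.
// Like any slice access, it panics if the range is out of bounds.
func (r *region) ReadInt64(start int64) int64 {
	return int64(binary.LittleEndian.Uint64(r.data[start : start+8]))
}

func main() {
	r := &region{data: make([]byte, 16)}
	if err := r.WriteInt64(8, -42); err != nil {
		panic(err)
	}
	fmt.Println(r.ReadInt64(8))
}
```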

With reading and writing of large files taken care of, what we have mainly solved is the storage of the second column of table 2. With a large number of documents, the second column of table 2 takes up the vast majority of the disk space. We split table 2 into two data structures: the second column is stored contiguously in a file called the inverted file. For the example above, we would save the second column as:

1 1,2,3 1,3 1,4 2 2,4 2,4 3 4

For the first column we save the keyword and an offset into the inverted file. In this way table 2 is divided into two data structures. The key question now is which data structure to use for the first column so that a query can find the keyword quickly, and from it the offset that leads to the actual data of the second column.
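
The split can be sketched as follows. This is an illustration under assumptions, not the on-disk format used later in the series: document numbers are assumed to be fixed-width 4-byte little-endian values, each keyword also records how many numbers belong to it, and a Go map stands in for the first column purely as a placeholder. The names entry, build and lookup are made up for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// entry records where a keyword's document numbers live in the inverted file.
type entry struct {
	offset int // byte offset of the first document number
	count  int // how many document numbers follow
}

// build lays the posting lists out back to back (the inverted file)
// and remembers the offset of each list (the first column).
func build(keywords []string, postings map[string][]uint32) (map[string]entry, []byte) {
	dict := make(map[string]entry)
	var file []byte
	var buf [4]byte
	for _, k := range keywords {
		ids := postings[k]
		dict[k] = entry{offset: len(file), count: len(ids)}
		for _, id := range ids {
			binary.LittleEndian.PutUint32(buf[:], id)
			file = append(file, buf[:]...)
		}
	}
	return dict, file
}

// lookup finds the keyword, then reads its numbers from the inverted file.
func lookup(dict map[string]entry, file []byte, keyword string) []uint32 {
	e, ok := dict[keyword]
	if !ok {
		return nil
	}
	ids := make([]uint32, e.count)
	for i := range ids {
		ids[i] = binary.LittleEndian.Uint32(file[e.offset+4*i:])
	}
	return ids
}

func main() {
	keywords := []string{"go", "language", "implement", "search engine"}
	postings := map[string][]uint32{
		"go":            {1},
		"language":      {1, 2, 3},
		"implement":     {1, 3},
		"search engine": {1, 4},
	}
	dict, file := build(keywords, postings)
	fmt.Println(dict["search engine"], lookup(dict, file, "search engine"))
}
```

In a real index the byte slice would be the mmap'ed inverted file rather than a slice built in memory.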

Several candidates can implement the first column: the sequential list, the hash table, the search tree, and the prefix tree. Next we will look at what each of them can do.

New articles will appear at a rate of roughly 3 to 5 per week. You are welcome to subscribe to the official account, where they are published first :)
