Writing a search engine (0x06): index construction with Golang


Without quite realizing it, I've reached the seventh article in this series. At this pace, I estimate it will run to 15 or 20 articles; I hope I can keep it up. When I originally wrote the code I didn't think everything through this carefully, but now each article forces me to check some references to make sure it's accurate, which also doubles as a review for me, heh.

First, about the inverted file: there is actually a lot I haven't covered yet, which I'll add in a consolidated article later, mainly inverted-file compression techniques. Because storage space these days, whether disk or memory, is plentiful, compression isn't used all that much, so I'll set it aside for now.

Today we'll talk about how the inverted index is constructed.

Earlier, we saw that the inverted index is stored in the system like this:

The B+Tree at the top is one file, and the inverted lists (posting lists) below it are another. So how are these two files built? In this article I'll describe the common construction methods, and then explain how I built mine.
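To make the two-file layout concrete, here is a minimal sketch of the records involved. The type and field names (DictEntry, PostingList, and so on) are assumptions of mine for illustration, not the engine's actual definitions.

// DictEntry is one entry in the B+Tree dictionary file: it maps a
// term to the byte offset of that term's posting list.
type DictEntry struct {
    Term   string // the indexed word
    Offset int64  // where the posting list starts in the inverted file
}

// PostingList is what the inverted file stores at that offset, e.g.
// a length prefix followed by the ascending DocIds.
type PostingList struct {
    Count  int32
    DocIds []int32
}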

In general, a search engine assumes the index doesn't change much, so the index is split into two parts: a full index and an incremental (delta) index. The full index is usually rebuilt on a schedule of days, or even weeks or months, and imported into the engine for retrieval once built. The incremental index is fed into the engine in real time and is often kept in memory. A query retrieves data from both the full index and the delta index, merges the two result sets, and returns them to the requester. The incremental index is not the main topic of this article; in the final section, on my own index build, I'll say a bit about how I build incremental indexes. For now, let's look at the full index.
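As a rough sketch of that query flow (the Searcher interface and function names here are assumptions of mine, not this engine's actual API): a query runs against both indexes, and the two sorted DocId lists are merged.

// Searcher is an assumed interface implemented by both the full
// index and the in-memory delta index.
type Searcher interface {
    Search(term string) []int // matching DocIds, ascending
}

// searchAll queries the full index and the delta index, then merges
// the two ascending DocId lists, dropping duplicates.
func searchAll(full, delta Searcher, term string) []int {
    a, b := full.Search(term), delta.Search(term)
    out := make([]int, 0, len(a)+len(b))
    i, j := 0, 0
    for i < len(a) && j < len(b) {
        switch {
        case a[i] < b[j]:
            out = append(out, a[i])
            i++
        case a[i] > b[j]:
            out = append(out, b[j])
            j++
        default: // same DocId in both: keep one copy
            out = append(out, a[i])
            i++
            j++
        }
    }
    out = append(out, a[i:]...)
    return append(out, b[j:]...)
}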

There are two common ways to build a full index:

Building the index in one pass

A one-pass build scans all of the documents in full, accumulates the entire index in memory until every document has been scanned, and then writes the finished index to disk in one go. The approximate steps are as follows:

    • Initialize an empty map; each key holds a term, and each value is a list holding that term's DocId chain (the posting list)

    • Set DocId to 0

    • Read one document's contents and assign it the number DocId

    • Run word segmentation on the document to get all of its terms (t1, t2, t3 ...)

    • For every term, insert the <term, DocId> pair into the map: the term becomes a key, and DocId is appended to that key's value list

    • Increment DocId by 1

    • If any document remains unread, return to the third step; otherwise continue

    • Traverse every <key, value> pair in the map, write each value (the posting list) to the inverted file, record the offset of that value within the file, and write <key, offset> into the B+Tree

    • Index build complete

The figure below illustrates these steps.

In pseudo-code, it looks like this:

// Initialize the inverted map and the DocId counter
ivt := make(map[string][]int)
docid := 0

// Read each document in turn
for content := range DocumentsFileContents {
    terms := segmenter.Cut(content) // word segmentation
    for _, term := range terms {
        if _, ok := ivt[term]; !ok {
            ivt[term] = []int{docid} // first time we see this term
        } else {
            ivt[term] = append(ivt[term], docid)
        }
    }
    docid++ // one DocId per document, not per term
}

// Initialize a B+Tree as the dictionary
bt := InitBTree("./index.dic")
// Initialize the inverted file
ivtFile := InitFile("./index.ivt")

// Traverse the in-memory index
for term, docids := range ivt {
    // Append the posting list to the inverted file and get its offset [file write]
    offset := ivtFile.Append(docids)
    // Write the term and its offset into the B+Tree [file write]
    bt.Add(term, offset)
}
ivtFile.Close()
bt.Close()

With that, the inverted file is built. I used a map here simply to give a more intuitive picture of how an inverted file is constructed; in practice a different data structure may be used.
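To make this concrete, here is a minimal, self-contained Go sketch of the one-pass build that actually runs. It substitutes whitespace splitting for a real segmenter and a sorted text file for a real B+Tree; both substitutions are mine, not the article's.

package main

import (
    "encoding/binary"
    "fmt"
    "os"
    "sort"
    "strings"
)

func main() {
    docs := []string{"hello world", "hello go", "go search engine"}

    // Steps 1-7: accumulate the whole index in memory.
    ivt := map[string][]int{}
    for docid, content := range docs {
        for _, term := range strings.Fields(content) { // stand-in tokenizer
            ivt[term] = append(ivt[term], docid)
        }
    }

    // Step 8: posting lists go to the inverted file, offsets to the dictionary.
    ivtFile, _ := os.Create("index.ivt")
    dicFile, _ := os.Create("index.dic") // sorted text standing in for a B+Tree
    defer ivtFile.Close()
    defer dicFile.Close()

    terms := make([]string, 0, len(ivt))
    for t := range ivt {
        terms = append(terms, t)
    }
    sort.Strings(terms)

    var offset int64
    for _, t := range terms {
        fmt.Fprintf(dicFile, "%s\t%d\n", t, offset) // <term, offset>
        docids := ivt[t]
        binary.Write(ivtFile, binary.LittleEndian, int32(len(docids))) // length prefix
        offset += 4
        for _, d := range docids {
            binary.Write(ivtFile, binary.LittleEndian, int32(d))
            offset += 4
        }
    }
}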

Building in batches, merging afterwards

Because the one-pass build loads every document into memory, it fails when the machine doesn't have enough memory, so it's generally not used. Most index builds instead take this second approach: build in batches, then merge. It works roughly as follows:

    • Allocate a fixed-size block of memory for dictionary data and document data

    • Initialize a sortable dictionary inside the fixed memory (it can be a tree, a skip list, or a plain list; anything that can be kept sorted will do)

    • Set DocId to 0

    • Read one document's contents and assign it the number DocId

    • Run word segmentation on the document to get all of its terms (t1, t2, t3 ...)

    • Insert each term into the dictionary in order, generate the <term, DocId> key-value pairs (<t1, DocId>, <t2, DocId>, ...) in memory, and store these pairs in the in-memory document data, keeping them sorted by term

    • Increment DocId by 1

    • If the memory space is exhausted, write the document data to a disk file and empty the in-memory document data

    • If any document remains unread, return to the document-reading step; otherwise continue

    • Because the key-value pairs in every disk file are sorted by term, merge the disk files with a multi-way merge algorithm: during the merge, each term's full posting list is produced and appended to the inverted file, and the term's file offset is written into the dictionary, until all files have been merged. At that point the dictionary is complete as well

    • Index build complete

Again, the figure below illustrates the process.

In pseudo-code it looks like the following. The flow is simple; read it alongside the steps and figure above and it should be easy to follow.

// Allocate fixed memory for the dictionary and the document data
dic := NewDicMemory()
data := NewDataMemory()
docid := 0

// Read each document in turn
for content := range DocumentsFileContents {
    terms := segmenter.Cut(content) // word segmentation
    for _, term := range terms {
        // insert into the dictionary
        dic.Add(term)
        // insert into the document data
        data.Add(term, docid)
        // if the in-memory data is full, flush it to disk and empty it
        if data.IsFull() {
            data.WriteToDisk()
            data.Empty()
        }
    }
    docid++ // one DocId per document
}
data.WriteToDisk() // flush the final partial batch

// Collect a file descriptor for each flushed disk file
idxFiles := make([]*Fd, 0)
for idxFile := range ReadFromDisk {
    idxFiles = append(idxFiles, idxFile)
}

// Multi-way merge the disk files with the dictionary, writing the
// result into a new inverted file
ivtFile := InitFile("./index.ivt")
dic.SetFilename("./index.dic")
KWayMerge(idxFiles, ivtFile, dic)

// Build complete
ivtFile.Close()
dic.Close()
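The KWayMerge call is the heart of this method. Here is a hedged sketch of what it could look like, built on Go's container/heap; the Pair and stream types are stand-ins I invented, since the real version would read <term, docid> records back from the disk files.

import "container/heap"

// Pair is one <term, docid> record read back from a disk file.
type Pair struct {
    Term  string
    DocId int
}

// stream yields the pairs of one disk file in sorted order.
type stream struct {
    pairs []Pair // a real build would read these from the file
    pos   int
}

// item is one heap element: the current head pair of stream src.
type item struct {
    p   Pair
    src *stream
}

type pairHeap []item

func (h pairHeap) Len() int { return len(h) }
func (h pairHeap) Less(i, j int) bool {
    if h[i].p.Term != h[j].p.Term {
        return h[i].p.Term < h[j].p.Term
    }
    return h[i].p.DocId < h[j].p.DocId // keep DocIds ascending per term
}
func (h pairHeap) Swap(i, j int)  { h[i], h[j] = h[j], h[i] }
func (h *pairHeap) Push(x any)    { *h = append(*h, x.(item)) }
func (h *pairHeap) Pop() any {
    old := *h
    x := old[len(old)-1]
    *h = old[:len(old)-1]
    return x
}

// KWayMerge merges the sorted streams and emits each term's full
// posting list; in the real build, emit appends the list to the
// inverted file and records its offset in the dictionary.
func KWayMerge(streams []*stream, emit func(term string, docids []int)) {
    h := &pairHeap{}
    for _, s := range streams {
        if s.pos < len(s.pairs) {
            heap.Push(h, item{s.pairs[s.pos], s})
            s.pos++
        }
    }
    curTerm, cur := "", []int(nil)
    for h.Len() > 0 {
        it := heap.Pop(h).(item)
        if it.p.Term != curTerm && cur != nil {
            emit(curTerm, cur) // finished one term's posting list
            cur = nil
        }
        curTerm = it.p.Term
        cur = append(cur, it.p.DocId)
        if it.src.pos < len(it.src.pairs) { // refill from the same stream
            heap.Push(h, item{it.src.pairs[it.src.pos], it.src})
            it.src.pos++
        }
    }
    if cur != nil {
        emit(curTerm, cur)
    }
}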

Those are the two ways to build a full index. For the second method there is a special case: what if the dictionary itself grows too large for memory? In that case the dictionary can also be written to disk in steps and then merged, dictionary against dictionary. I won't go into that here; if you're interested, you can look it up yourself.

What I've described above may differ a little from what some search-engine books say, but the basic ideas should be similar. To keep the essentials intuitive, I've skipped many special cases; after all, this isn't a purely theoretical series. If you're really interested, you can certainly find plenty of material to study search engines more deeply.

The multi-way merge mentioned above is a standard external-sorting technique; material on it is everywhere, so I won't expand on it here.

There are also details worth knowing in index construction. For example, a build generally scans the documents twice: the first pass produces per-term statistics such as TF and DF, and the second pass does the actual construction. This makes it possible to compute term relevance at index-build time, so at query time the engine only sorts and doesn't compute relevance, which greatly improves retrieval efficiency.
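As a hedged sketch of that first statistics pass (the names are mine, and here TF is the term's total count across the corpus; a real build would also keep per-document counts):

// TermStats holds the statistics gathered in the first scan; the
// second scan can then fold relevance (e.g. TF-IDF) into the index.
type TermStats struct {
    TF map[string]int // total occurrences of the term in the corpus
    DF map[string]int // number of documents containing the term
}

// firstPass scans every document once and fills in TF and DF;
// tokenize stands in for the real segmenter.
func firstPass(docs []string, tokenize func(string) []string) TermStats {
    st := TermStats{TF: map[string]int{}, DF: map[string]int{}}
    for _, doc := range docs {
        seen := map[string]bool{}
        for _, term := range tokenize(doc) {
            st.TF[term]++
            if !seen[term] {
                st.DF[term]++ // count each document only once
                seen[term] = true
            }
        }
    }
    return st
}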

How I build my index

Finally, let me describe how I build the index. Since this search engine is my own, I draw no hard line between full and incremental indexes; that decision is left to the engine layer above. The bottom index layer therefore has no notion of full versus incremental, and I combine the first and second methods to build the index:

    • First set a threshold, say 10,000 documents. Within each batch of 10,000 documents, build the index with the first (one-pass) method, producing one dictionary file and one inverted file; this pair of files is called a segment

    • Every 10,000 documents yields one segment, until all documents have been built, producing multiple segments. After the search engine starts, incremental data is built the same way, so the number of segments keeps growing (see the sketch after this list)

    • Each segment is a complete piece of the index: it contains everything an inverted index needs (the dictionary and the posting lists) and can serve a normal retrieval on its own. Each query searches every segment, and merging the per-segment results gives the final result

    • If the number of segments grows too large, apply the idea of the second method and run a multi-way merge over the dictionaries and inverted files of several segments. Because the dictionaries are sorted, the segments can be merged term by term: for each term, pull out its full posting list, then generate a new dictionary and a new inverted file. When the merge finishes, delete the old segments
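Here is a minimal sketch of that threshold-driven segment lifecycle. All names (segmentBuilder, AddDocument, Flush) are mine for illustration; the flush step would write the dictionary and inverted files exactly as in the one-pass build above.

import "fmt"

// segmentBuilder flushes a new segment (one dictionary file plus one
// inverted file) every time maxDocs documents have been indexed.
type segmentBuilder struct {
    maxDocs  int
    docCount int
    ivt      map[string][]int // in-memory index for the current batch
    segments []string         // names of segments flushed so far
}

func newSegmentBuilder(maxDocs int) *segmentBuilder {
    return &segmentBuilder{maxDocs: maxDocs, ivt: map[string][]int{}}
}

// AddDocument indexes one document in memory, one-pass style, and
// flushes a segment when the threshold is reached.
func (b *segmentBuilder) AddDocument(docid int, terms []string) {
    for _, t := range terms {
        b.ivt[t] = append(b.ivt[t], docid)
    }
    b.docCount++
    if b.docCount >= b.maxDocs {
        b.Flush()
    }
}

// Flush writes the current batch out as segment files and resets the
// in-memory state. (Callers also flush once at the end of the build.)
func (b *segmentBuilder) Flush() {
    name := fmt.Sprintf("segment_%d", len(b.segments))
    // the real engine writes name+".dic" and name+".ivt" here,
    // exactly as in the one-pass build shown earlier
    b.segments = append(b.segments, name)
    b.ivt = map[string][]int{}
    b.docCount = 0
}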

The merge policy is left entirely to the engine layer, or even the business layer, above. In scenarios with little incremental data, the initially built segments can be merged together and the incremental index merged in at fixed intervals. In scenarios where data keeps flowing into the system, you can use policies such as merging a subset of segments while the system is idle, to keep retrieval efficient.

OK, that covers index construction. With this, both the data structures of the inverted index and the ways to build it have been covered, though plenty of odds and ends remain; I'll gather the things I haven't mentioned into a single later article. Next, I'll spend one or two articles on the forward index, and then we can move over to the search layer.

Finally, you're welcome to scan the QR code and follow my official account :)
