Writing a Search Engine in Golang (0x08): Index Segments


I think the title of this series should change. What I have written so far really explains how to build a search engine and involves very little that is specific to Golang. I think that is fine: once you understand the principles, the implementation language hardly matters, and explaining the principles is more useful than walking through code.

The underlying data structures, the inverted index and the forward index, were covered earlier. Today we move up one level and talk about fields and segments.

Fields were introduced in a previous article. A field is the lowest-level concept visible in a search engine index, and the lowest-level concept exposed to the outside. Inside a field are the inverted index and the forward index, both of which are hidden from the user. We can think of each field as corresponding to one forward index and one inverted index, and that is in fact the case.

Above the field is today's main topic, the segment. The segment is not unique to search engines, nor is it strictly necessary. It is something I added in this project, although it is not my invention: many search engine systems have this concept.

A segment is the most basic retrieval system. It contains all the fields and covers one contiguous range of documents, and it can answer queries on its own, so it can serve as the most basic unit of a retrieval system.

This may be a bit abstract, so here is an analogy. In a database, a row is the most basic unit, which corresponds to a document in a search engine. A table is the collection of all rows, which corresponds to an index. A segment is a part of the table: it holds a portion of the documents, and that portion can be searched. Multiple segments combined form a complete index.

Why do we need segments?

    • If a search engine's data never changes after the index is built, segments are unnecessary: just load all the data when the full index is built.

    • If incremental data keeps entering the system, segments become necessary. New data is first kept in memory, then periodically turned into a segment and persisted to disk, where it can serve queries.

    • Segments also help in a distributed system. Because a segment never changes once persisted, synchronizing the index only requires copying segments to each machine, which can then serve queries without rebuilding the index locally.

    • If one segment is damaged, queries on the other segments are unaffected, and copying that segment from another machine restores normal retrieval. With a single monolithic index, a damaged index means no retrieval service at all until a correct copy arrives.

What is stored in a segment?

A segment consists of several files:

    • indexname_{segmentnumber}.meta is the metadata of the segment: the name and type of each field, plus the starting and ending document numbers of the segment.

    • indexname_{segmentnumber}.bt is the dictionary file for the segment's inverted indexes.

    • indexname_{segmentnumber}.idx is the inverted file (the posting lists) for all fields of the segment.

    • indexname_{segmentnumber}.pfl holds the forward data: the values of all numeric fields, plus the position (offset) of each string value.

    • indexname_{segmentnumber}.dtl holds the detail data, that is, the actual contents of the string values.

Here indexname is the name of the index, which corresponds to a table name in a database, and segmentnumber is the segment number, which is generated by the system.
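
The naming scheme can be sketched in a few lines of Go. This is only an illustration: `segmentFiles` is a hypothetical helper, and the lowercase extensions and decimal segment number are assumptions, not details confirmed by the project.

```go
package main

import "fmt"

// segmentFiles returns the five file names that make up one segment.
// The helper name and the decimal number formatting are assumptions.
func segmentFiles(indexName string, segmentNumber uint64) []string {
	exts := []string{"meta", "bt", "idx", "pfl", "dtl"}
	names := make([]string, 0, len(exts))
	for _, ext := range exts {
		names = append(names, fmt.Sprintf("%s_%d.%s", indexName, segmentNumber, ext))
	}
	return names
}

func main() {
	for _, name := range segmentFiles("people", 1000) {
		fmt.Println(name)
	}
}
```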

Multiple segments together make a complete index. A query is run against each segment, and the per-segment results are merged into the final result set.

Building a segment

Let's go through these files one by one and see how a pile of forward indexes and a pile of inverted indexes form a segment.

A segment is built in the following steps. We illustrate them with a concrete example: an index with three fields, name (string), age (number), and self-introduction (string with word segmentation). The steps for building the segment and its indexes are as follows.

1. Preparation

To create a new segment we first initialize it. At initialization time we already know which fields the segment contains and the type of each field.

    • Initialize the segment's metadata with the fields and their types: name (string, forward and inverted index), age (number, forward index), and self-introduction (tokenized string, forward and inverted index).

    • Give the segment a number, such as 1000.

    • Get ready to receive data.
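
A minimal sketch of this preparation step follows. The types `FieldType`, `FieldInfo`, and `SegmentMeta` and the constructor `NewSegmentMeta` are hypothetical names chosen for illustration; they are not taken from the project's source.

```go
package main

import "fmt"

// FieldType says which index structures a field needs.
type FieldType int

const (
	TypeString    FieldType = iota // forward + inverted index, whole value is the key
	TypeNumber                     // forward index only
	TypeTokenized                  // forward + inverted index, value is split into terms
)

// FieldInfo and SegmentMeta are hypothetical types for this sketch.
type FieldInfo struct {
	Name string
	Type FieldType
}

type SegmentMeta struct {
	Number     uint64      // segment number, e.g. 1000
	Fields     []FieldInfo // schema known at initialization time
	StartDocID uint32      // first docid held by this segment
	NextDocID  uint32      // docid the next incoming document will receive
}

// NewSegmentMeta initializes an empty segment that is ready to receive data.
func NewSegmentMeta(number uint64, startDocID uint32, fields []FieldInfo) *SegmentMeta {
	return &SegmentMeta{Number: number, Fields: fields, StartDocID: startDocID, NextDocID: startDocID}
}

func main() {
	meta := NewSegmentMeta(1000, 0, []FieldInfo{
		{Name: "name", Type: TypeString},
		{Name: "age", Type: TypeNumber},
		{Name: "introduce", Type: TypeTokenized},
	})
	fmt.Printf("segment %d with %d fields\n", meta.Number, len(meta.Fields))
}
```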

2. Create an in-memory segment

The in-memory segment is the first step of building a segment. Using the field information above, we build the following data structures in memory. Here I use the language's built-in data structures.

    • name needs an inverted index, so create a map<string,list> whose key is the name and whose value is the list of docids. name also needs a forward index, so create a stringarray[] that stores the name of each document.

    • age needs a forward index, so create an integerarray[] that stores the age of each document.

    • self-introduction needs an inverted index, so create a map<string,list> whose key is a term produced by word segmentation and whose value is the list of docids. self-introduction also needs a forward index, so create a stringarray[] that stores the self-introduction of each document.

When we add a new document {"name":"张三","age":18,"introduce":"我喜欢跑步"}, we first give it a docid (say 0) and then put its data into the five structures above. If another document {"name":"李四","age":28,"introduce":"我喜欢唱歌"} arrives, we give it the next docid (say 1) and add it in the same way, so each of the five structures now holds both documents.
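
The five structures and the two example documents can be sketched as follows. `MemSegment` and `Add` are hypothetical names, docids are assumed to start at 0, and `tokenize` is only a stand-in that splits text into single characters; a real system would use a proper word segmenter.

```go
package main

import "fmt"

// MemSegment holds the five in-memory structures for the example schema.
type MemSegment struct {
	nameInverted  map[string][]uint32 // name -> docids
	nameForward   []string            // docid -> name
	ageForward    []int64             // docid -> age
	introInverted map[string][]uint32 // term -> docids
	introForward  []string            // docid -> self-introduction
}

func NewMemSegment() *MemSegment {
	return &MemSegment{
		nameInverted:  make(map[string][]uint32),
		introInverted: make(map[string][]uint32),
	}
}

// tokenize is a stand-in for a real word segmenter:
// it emits each distinct character of the text once.
func tokenize(text string) []string {
	seen := make(map[rune]bool)
	var terms []string
	for _, r := range text {
		if !seen[r] {
			seen[r] = true
			terms = append(terms, string(r))
		}
	}
	return terms
}

// Add stores one document in all five structures and returns its docid.
func (s *MemSegment) Add(name string, age int64, intro string) uint32 {
	docID := uint32(len(s.nameForward))
	s.nameInverted[name] = append(s.nameInverted[name], docID)
	s.nameForward = append(s.nameForward, name)
	s.ageForward = append(s.ageForward, age)
	for _, term := range tokenize(intro) {
		s.introInverted[term] = append(s.introInverted[term], docID)
	}
	s.introForward = append(s.introForward, intro)
	return docID
}

func main() {
	seg := NewMemSegment()
	seg.Add("张三", 18, "我喜欢跑步")
	seg.Add("李四", 28, "我喜欢唱歌")
	fmt.Println(seg.nameInverted["李四"], seg.ageForward, seg.introInverted["喜"])
}
```

Because the forward arrays are indexed by docid, looking up a document's details is a plain array access, while the maps answer "which documents contain this key".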

3. Persisting data structures to disk

As data keeps arriving, the in-memory structures keep changing and the in-memory segment keeps growing. When a certain threshold is reached, we persist the data to disk. (The strategy is discussed later. I put it in the engine layer, so the engine decides when to persist a segment.)

During persistence:

    • For a map, we traverse the whole map, append each value to the .idx file, and insert each key into the B+ tree, where the stored value is the offset in the .idx file at which the data was just written.

    • For an integerarray, we traverse the whole array and write the data to the .pfl file, 8 bytes per value.

    • For a stringarray, we traverse the whole array, append each value to the .dtl file, and write its file offset to the .pfl file.

Once these three steps are complete, persistence is finished. You can picture for yourself what the data structures look like on disk afterwards.
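
The three persistence rules can be sketched as below. Several things here are assumptions made for illustration: in-memory buffers stand in for the .idx, .pfl, and .dtl files, a sorted slice stands in for the B+ tree, and the byte layout (little-endian, a uint32 count before each posting list, a uint32 length before each string) is invented for the sketch rather than taken from the project.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"sort"
)

// DictEntry maps one key to the offset of its posting list in the .idx data.
type DictEntry struct {
	Term   string
	Offset uint64
}

// SegmentFiles stands in for the on-disk files of one segment.
type SegmentFiles struct {
	Idx, Pfl, Dtl bytes.Buffer
	Dict          []DictEntry // sorted stand-in for the B+ tree in the .bt file
}

// PersistInverted appends each posting list to Idx and records its offset in Dict.
func (f *SegmentFiles) PersistInverted(inverted map[string][]uint32) {
	terms := make([]string, 0, len(inverted))
	for term := range inverted {
		terms = append(terms, term)
	}
	sort.Strings(terms) // map iteration order is random; sort for a stable layout
	for _, term := range terms {
		f.Dict = append(f.Dict, DictEntry{Term: term, Offset: uint64(f.Idx.Len())})
		// Assumed layout: uint32 count, then the docids.
		// Writes to a bytes.Buffer cannot fail, so errors are ignored here.
		binary.Write(&f.Idx, binary.LittleEndian, uint32(len(inverted[term])))
		binary.Write(&f.Idx, binary.LittleEndian, inverted[term])
	}
}

// PersistNumbers writes every value as 8 bytes into Pfl.
func (f *SegmentFiles) PersistNumbers(values []int64) {
	binary.Write(&f.Pfl, binary.LittleEndian, values)
}

// PersistStrings appends each string to Dtl and writes its 8-byte offset into Pfl.
func (f *SegmentFiles) PersistStrings(values []string) {
	for _, v := range values {
		binary.Write(&f.Pfl, binary.LittleEndian, uint64(f.Dtl.Len()))
		binary.Write(&f.Dtl, binary.LittleEndian, uint32(len(v)))
		f.Dtl.WriteString(v)
	}
}

func main() {
	var files SegmentFiles
	files.PersistInverted(map[string][]uint32{"张三": {0}, "李四": {1}})
	files.PersistNumbers([]int64{18, 28})
	files.PersistStrings([]string{"张三", "李四"})
	fmt.Println(files.Dict, files.Idx.Len(), files.Pfl.Len(), files.Dtl.Len())
}
```

Fixed-width entries in the .pfl data are what make forward lookups cheap: the entry for a docid sits at a position that can be computed directly, and for strings that entry points into the variable-length .dtl data.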

4. After the segment is built

Once the segment has been built, it is fully persisted and will never change again. It is effectively committed to the index system and can be searched. At this point we create a new segment, continue receiving new documents, and keep persisting subsequent segments to disk.

At query time, each segment is searched in turn, and the result sets are merged and returned to the front end.
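
Searching segment by segment can be sketched as below. The `Segment` type is reduced to a single inverted field for brevity, and the sketch assumes docids are global and that segments are listed in ascending docid order, so concatenating the per-segment hits keeps the result sorted. All names are hypothetical.

```go
package main

import "fmt"

// Segment is a persisted, read-only part of the index,
// simplified here to one inverted field.
type Segment struct {
	inverted map[string][]uint32 // term -> global docids, ascending
}

// Search returns the docids in this segment that contain the term.
func (s *Segment) Search(term string) []uint32 { return s.inverted[term] }

// SearchAll queries each segment in order and concatenates the hits.
func SearchAll(segments []*Segment, term string) []uint32 {
	var result []uint32
	for _, seg := range segments {
		result = append(result, seg.Search(term)...)
	}
	return result
}

func main() {
	segments := []*Segment{
		{inverted: map[string][]uint32{"喜欢": {0, 1}}},
		{inverted: map[string][]uint32{"喜欢": {2}, "跑步": {3}}},
	}
	fmt.Println(SearchAll(segments, "喜欢"))
}
```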

Merging segments

After segments have been built, they may need to be merged. There are many ways to do this. The simplest is to create a new segment, traverse all the existing data, and add it to the new segment. This suits small data sets: the new segment is built in memory, so if there is too much existing data, memory will not hold it.

Another way is to merge the inverted and forward data separately. This method uses little memory but is heavier on disk I/O. Choose between the two according to your own business scenario. The first method is the same as building a segment, as described in the previous section, so here we discuss the second.

Merging inverted files

We store the dictionary of the inverted index in a B+ tree, and a B+ tree is naturally sorted. Merging segments therefore means merging multiple B+ trees, which can be done with merge sort (look it up if you are not familiar with it). The keys of each B+ tree are the elements to merge: while scanning the B+ trees we build a new B+ tree, and we combine the corresponding inverted data into a new .idx file. That completes the merge of the inverted files.
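
The merge step can be sketched for two segments as follows. Sorted slices stand in for in-order scans of the two B+ trees, and the sketch assumes the second input comes from the later segment, so its docids follow those of the first. `Posting` and `mergeInverted` are hypothetical names.

```go
package main

import "fmt"

// Posting is one dictionary key with its docids,
// as an in-order scan of a B+ tree would yield it.
type Posting struct {
	Term   string
	DocIDs []uint32
}

// mergeInverted merges two term-sorted scans into one,
// exactly like the merge step of merge sort.
func mergeInverted(a, b []Posting) []Posting {
	merged := make([]Posting, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i].Term < b[j].Term:
			merged = append(merged, a[i])
			i++
		case a[i].Term > b[j].Term:
			merged = append(merged, b[j])
			j++
		default: // same term in both segments: concatenate the posting lists
			docIDs := append(append([]uint32{}, a[i].DocIDs...), b[j].DocIDs...)
			merged = append(merged, Posting{Term: a[i].Term, DocIDs: docIDs})
			i++
			j++
		}
	}
	merged = append(merged, a[i:]...) // leftovers from whichever input remains
	merged = append(merged, b[j:]...)
	return merged
}

func main() {
	a := []Posting{{Term: "run", DocIDs: []uint32{0}}, {Term: "sing", DocIDs: []uint32{1}}}
	b := []Posting{{Term: "run", DocIDs: []uint32{2}}, {Term: "walk", DocIDs: []uint32{3}}}
	for _, p := range mergeInverted(a, b) {
		fmt.Println(p.Term, p.DocIDs)
	}
}
```

Only the current key from each input needs to be in memory at a time, which is why this approach trades memory for sequential disk reads and writes.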

Merging forward files

Merging the forward files is even simpler: field by field, iterate through the forward file of each segment in order and write the data into a new file as you go. When the traversal finishes, the merge is done.
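
A sketch of the forward merge, with in-memory slices standing in for the per-field forward files. `ForwardSegment` and `mergeForward` are hypothetical names, and the segments are assumed to be passed in ascending docid order.

```go
package main

import "fmt"

// ForwardSegment holds the forward (docid -> value) data of one segment.
type ForwardSegment struct {
	Names []string
	Ages  []int64
}

// mergeForward walks the segments in order, one field at a time,
// and appends each segment's values to the merged output.
func mergeForward(segments []ForwardSegment) ForwardSegment {
	var out ForwardSegment
	for _, seg := range segments { // field: name
		out.Names = append(out.Names, seg.Names...)
	}
	for _, seg := range segments { // field: age
		out.Ages = append(out.Ages, seg.Ages...)
	}
	return out
}

func main() {
	merged := mergeForward([]ForwardSegment{
		{Names: []string{"张三"}, Ages: []int64{18}},
		{Names: []string{"李四"}, Ages: []int64{28}},
	})
	fmt.Println(merged.Names, merged.Ages)
}
```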

The MergeSegments method in FalconIndex/segment/segment.go contains the detailed code, which you can consult for the simplest form of merging.

Segment strategies

Segment strategy is fairly open, and it is generally not recommended to hard-code it into the index. There are several strategies to choose from. Pick a persistence strategy that suits your business logic.

    • If your system never changes once the index is built, create a single segment during the full index build and persist it to disk when the build finishes. If the full index is large and you are worried that memory cannot hold it, persist a segment every 100,000 documents and merge all the segments into one after the full build completes. As described in the merging section, merging segments uses almost no memory and can be done at any time. If there is also incremental data, persist a segment at regular intervals, and at regular intervals merge all the segments that are not part of the full index. The system then has essentially one full segment and one incremental segment, and retrieval is very fast.

    • If your system changes at large scale in real time, such as a log system, a full index build is meaningless. Because retrieval in a log system does not have strict real-time requirements, the strategy can be to persist a segment for every 100,000 new documents and, once there are 10 segments, merge them all into one. Alternatively, merge segments by timestamp so that old data is easy to remove.

    • If your system has very strict real-time requirements, you can persist a segment at fixed time intervals (for example, every 10 seconds) and merge small segments into a large one whenever the system is idle.

In short, segment strategy is fairly open and is implemented entirely by the engine layer. You can choose, or rewrite, the persistence and merge strategy to suit your own business scenario.
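
As one illustration of an engine-layer strategy, here is a count-based policy. It reads the log-system case as "persist every 100,000 documents, merge once there are 10 segments". `Policy` and `Simulate` are hypothetical names, and the simulation only counts documents and segments without touching real data.

```go
package main

import "fmt"

// Policy is a hypothetical engine-layer strategy based on simple counts.
type Policy struct {
	MaxBufferedDocs int // persist the in-memory segment at this many documents
	MaxSegments     int // merge once this many persisted segments exist
}

func (p Policy) ShouldPersist(bufferedDocs int) bool { return bufferedDocs >= p.MaxBufferedDocs }
func (p Policy) ShouldMerge(segments int) bool       { return segments >= p.MaxSegments }

// Simulate feeds n documents through the policy and returns how many
// persisted segments remain and how many documents are still in memory.
func Simulate(p Policy, n int) (segments, buffered int) {
	for i := 0; i < n; i++ {
		buffered++
		if p.ShouldPersist(buffered) {
			segments++ // the in-memory segment is written to disk
			buffered = 0
			if p.ShouldMerge(segments) {
				segments = 1 // all persisted segments are merged into one
			}
		}
	}
	return segments, buffered
}

func main() {
	logPolicy := Policy{MaxBufferedDocs: 100000, MaxSegments: 10}
	segments, buffered := Simulate(logPolicy, 1250000)
	fmt.Println(segments, buffered)
}
```

A time-based policy for the real-time case would have the same shape, with the persistence check comparing elapsed time instead of a document count.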

A segment is a part of the index, and also a miniature index. The next article introduces the index layer. Once that is done, the data layer of the search engine is complete, and everything above it is engine strategy. With the index layer in place, you can build a search engine, a database, or a key-value store (KVDB) on top of it. Either way, the foundations will not change much.

If you want to read the previous articles, you can follow my public account :)
