Writing a search engine in Golang (0x07): the forward index

Source: Internet
Author: User


There have been far too many tech events lately, and my friends-circle feed is flooded with them: vendors posting ads, speakers posting selfies, and friends posting selfies from every event they attend. I don't know whether your feed looks the same. Last year there were nowhere near this many tech conferences, so this must be a good year, and there are plenty of online flame wars to watch as well. I would love to gossip about them, but I'm afraid of offending people, so we code monkeys had better stick to technology and write code.



If there is something you would like to know, leave me a message; discussion is welcome. My previous work was mainly search, along with some recommendation and advertising, so I will probably write more about those. Embedded topics are fine too (quite a jump, I know; by embedded I mean real embedded systems, not iOS or Android). I have no prestigious background and I'm not from a big company like BAT. I write code at a small company, so there is a lot I still don't understand. If we talk and you find I can't answer your question, that is perfectly normal, so please go easy on me.



This article is also fairly long without much hard substance, so I suggest reading it on the toilet, or on the subway while listening to music.



The previous few articles essentially finished the data structure of the inverted index, touched briefly on sorting, and then covered how the inverted index is built. This article is mainly about the forward index, and about how the inverted and forward indexes fit together to form a complete field index.



Forward index



The forward index is the counterpart of the inverted index (also called the reverse index), and it is simpler than the inverted index. The second article in this series had the following two tables (table 1 and table 2).



This is table 1.


Document number | Document content
1 | This is a search engine implemented in the Go language
2 | PHP is the best language in the world
3 | Linux is implemented in C and assembly language
4 | Google is one of the best search engine companies in the world


This is table 2.


Keyword | Document numbers
Go | 1
Language | 1,2,3
Implement | 1,3
Search engine | 1,4
PHP | 2
World | 2,4
Best | 2,4
Assembly | 3
Company | 4


So far we have treated table 2 as the inverted index and table 1 as the detail data, used at the very end to display document content. In a search engine that supports only full-text search, that really is all table 1 is used for. But if the engine needs more complex features, table 1 becomes a forward index. An engine that supports both an inverted index and a forward index can loosely be thought of as a database system (although it is, of course, far from a real database).



First, let's see when we need a forward index



Obviously, a forward index is needed whenever the inverted index alone cannot satisfy the search requirements. For an e-commerce search engine it is essential. Suppose we have the following products to list:


Product number | Product title | Release date | Price | Brand
10001 | Hammer Phone T9 | 2026-06-06 | 5000 | Hammer
10002 | Xiaomi Mobile 10 | 2020-02-02 | 1999 | Xiaomi
10003 | Mobile P20 | 2022-12-12 | 3999 |


When searching, we may want mobile phones whose price falls in a certain range, and a full-text inverted index alone has a hard time with that. Also, when we use an e-commerce search engine we often see aggregated information above the search results, such as a summary by brand, model, and price. That part is also done with a forward index.

[Figure: aggregated brand, model, and price information shown above e-commerce search results]






So if our search needs go beyond keyword matching and include filtering (such as a price range filter) or aggregation (such as counting how many results belong to each brand), we have to introduce a forward index.



Second, let's look at how to implement a forward index



There are two ways to implement a forward index.



The first is still based on the inverted index. Recall that the inverted index described earlier is built on a B+ tree, and a B+ tree is naturally sorted, so it can serve range searches. Using the product table above, suppose we search for the keyword "mobile phone" with a price between 1500 and 4000.


    • We build one inverted index on the price field and one on the product title.

    • First, through the inverted index on the title, we retrieve all products matching the keyword "mobile phone". Numbering the three products 0, 1, 2 in table order, the result is "0,1,2".

    • Then we search the price range. Because the leaf nodes at the bottom of a B+ tree are linked by pointers, we only need to walk the leaf nodes through those pointers to visit the inverted list of every price in the range, and then take the union of those lists. The resulting set is "1,2", the products that satisfy the price range.

    • Finally, we intersect this set with the products found by keyword to get the final result.


That is the first way. Aggregation can also be done with it (you can think about how), though it is a bit more trouble. This approach has several characteristics:


    • There is no separate forward file; everything lives in the inverted files, so no extra space is taken.

    • But it restricts the inverted index to a sorted dictionary such as a B+ tree. It cannot be done if the inverted file is implemented with a hash table.

    • A range search requires multiple lookups and set operations, so efficiency needs to be optimized.

    • Because there are only inverted files, displaying the final results requires an auxiliary detail file or an attached database.


Besides the approach above, a forward index can also be implemented with an array. The array subscript is the document number (docID, not the product number; the product number is the primary key). In a search engine the docID auto-increments and is never deleted, so it is unique and maps one-to-one onto the subscripts of a one-dimensional array. We can therefore store the forward index in arrays, as in the table below, which builds one forward index for price and one for brand. In effect, the product data above is split up by column for storage. (To save space, I have written the two side by side.)


DOCID | Price | DOCID | Brand
0 | 5000 | 0 | Hammer
1 | 1999 | 1 | Xiaomi
2 | 3999 | 2 |


With this layout, how does retrieval work? Take the same search: the keyword "mobile phone" with a price between 1500 and 4000.


    • Only the title gets an inverted index; the price field gets a one-dimensional forward array.

    • First, through the inverted index on the title, we retrieve the result set of all products matching the keyword "mobile phone". Their docIDs are "0,1,2".

    • We iterate through the result set. For each docID we look up the value directly in the forward array and compare it with the condition. Documents that satisfy it are kept; the rest are discarded.

    • When the traversal is complete, we have the final result set "1,2".


Aggregation works much the same way: while traversing the result set in the third step, we can count as we go, and when the traversal ends the statistics are done.



All roads lead to Rome: two different data structures give the same result in the end. The second way has the following characteristics:


    • A separate forward index must be created for every field that needs range search; it cannot be merged into the inverted data structure.

    • The result set produced by the inverted index has to be traversed once to build a new result set as the final result. If that result set is very large, the traversal takes time.

    • Because the forward index is a one-dimensional array, when there are so many documents that the forward files no longer fit in memory, the array has to be implemented on disk.

    • If we create a forward index for every field, there is no need for a separate detail file or a connection to a database. The forward files together hold the complete document, so there are fewer external dependencies.


Those are the two ways to implement a forward index; which one to use depends on the business requirements. For a full-text search engine such as Baidu, the main need is keyword lookup, and filtering and aggregation are rarely used, so there is no need for a separate forward index and the first way is enough. An e-commerce search engine does a lot of filtering and aggregation, so implementing the forward index the second way is more necessary.



My code uses the second way. The forward index is implemented on disk through mmap; if memory is large enough, it can be loaded into memory to speed up retrieval.



Index design and management



The forward index and the inverted index are finally both covered. They are the most critical data structures of a search engine, and everything else is built on top of them. Now that we have both structures, here is how I go about building an index system.



First we need to set the rules, meaning which operations the search engine supports and which it does not. For example, if for simplicity I support only full-text search and nothing else, then implementing a good inverted index is nearly all the data structure design there is. When I built this search engine, I wanted the following features:


    • Inverted indexes on keywords are supported, and so are inverted indexes of the exact-match type.

    • Filtering is supported, but only on integer types (floating-point numbers are converted to integers according to the number of decimal places kept) and date types. Strings support retrieval only, not filtering.

    • The supported filter operations are greater than, less than, equal to, not equal to, and range.

    • Aggregation on fields is supported.

    • Detail data is displayed without relying on an external database system.
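
The filter rules in this list can be sketched as follows. The names `op`, `match`, and `toFixed` are hypothetical, and rounding to the nearest integer when scaling a float is an assumption; the text only says that floats are converted according to the number of decimal places kept.

```go
package main

import (
	"fmt"
	"math"
)

// op enumerates the filter operations listed above.
type op int

const (
	opEq op = iota
	opNe
	opGt
	opLt
	opRange
)

// match reports whether v satisfies the operation.
// lo is the comparison operand; hi is used only by opRange (inclusive bounds).
func match(o op, v, lo, hi int64) bool {
	switch o {
	case opEq:
		return v == lo
	case opNe:
		return v != lo
	case opGt:
		return v > lo
	case opLt:
		return v < lo
	case opRange:
		return v >= lo && v <= hi
	}
	return false
}

// toFixed converts a float to an integer, keeping the given number of
// decimal places (rounding to nearest is an assumption of this sketch).
func toFixed(f float64, decimals int) int64 {
	return int64(math.Round(f * math.Pow10(decimals)))
}

func main() {
	price := toFixed(19.99, 2)
	fmt.Println(price)                              // 1999
	fmt.Println(match(opGt, price, 1500, 0))        // true
	fmt.Println(match(opRange, 5000, 1500, 4000))   // false
}
```

Storing scaled integers means the forward array only ever compares integers, and the query side applies the same scale to its operands.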


Given those goals, the possible types for each field are:


Field type | Behavior | Remark | Example
Full-match string | Inverted index and forward index | Forward index is for display only, no filtering | Primary key, model
Keyword string | Inverted index and forward index | Forward index is for display only, no filtering | Title, description
Number | Forward index only | | Price, stock
Date | Forward index only | | Listing date
Display only | Forward index only | Forward index is for display only, no filtering | Product detail description


So in the implementation we first build an inverted index (src/falconindex/segment/invert.go), then a forward index (src/falconindex/segment/profile.go), and then a field class (src/falconindex/segment/field.go) that manages the inverted and forward indexes. With that, the most basic data structure of the search engine is in place. The inverted and forward indexes are hidden from the outside; only the field class is exposed. For retrieval it mainly provides these interface methods:


    • AddDocument: adds a document (building the forward or inverted index)

    • Query: retrieves documents through the inverted index

    • Filter: filters documents through the forward index

    • GetValue: gets the value of this field from the forward file


In these articles I try to show little or no code. The main goal is to understand the principles of a search engine; once you understand them you can write the code yourself, and if you get stuck you can refer to mine. After all, in programming, once you know the principle and the algorithm, the implementation is not the hard part.



Afterword



I have always done C++ development, and my earlier search code was in C++ too. Switching to Golang has not been especially difficult. Of course, because I'm not very familiar with Golang's features, I have used almost none of its advanced ones, and the code I write is certainly not idiomatic Golang. But that doesn't get in the way of the implementation.



OK, that finishes the field part, and with it the core data structure of the search engine. The following articles will keep moving up: first to the segment layer, then to the index layer, and then on to the search logic, the merge logic, and so on. After the index, I will continue with the engine part of the search engine, which involves a few more data structures such as bitmaps. I will also write one or two articles introducing word segmentation, and sorting and index structure optimization will get articles of their own.



In addition, my code is basically complete, including the distributed part, and will be pushed to GitHub soon, so there will be several articles on the distributed implementation of the search engine. It still follows the original principle of using no third-party libraries, so the distributed part has no high-end theory such as Paxos and no high-end tools such as ZooKeeper. You will see when it comes out.



Here are my preliminary test results so far. On a machine with 8 GB of RAM and 24 cores, with 10 million documents (Weibo posts, each no longer than 140 characters; I don't work at Weibo and there was no data leak, the data was crawled by someone else, a PhD, and I just downloaded it), the average retrieval time for a single term is about 5 ms. Testing a single URL with ab gives about 7000 QPS; with random keywords the QPS is approximately [figure missing]. That basically meets my own goals, and there is still room for optimization. Next I will test Elasticsearch. My current impression is that my engine is faster than the numbers reported for it, but the environments are not the same, so I will do a proper comparison in a later article. In terms of features it is nowhere near the level of Elasticsearch, but implementing that feature set is not a problem; what it takes is persistence. I will keep maintaining this project, but I have been too busy recently. Such is life...



Finally, here is the QR code again. You know what to do: follow along :)

