Writing a search engine in Golang (0x01): basic concepts


Writing a search engine from scratch (0x01)

Part 0 gave an outline of the series; this article begins the detailed discussion.

Basic Search Engine Concepts

Before discussing how the search engine architecture is layered, let us first define several search engine concepts.

    • Document: the basic data unit of a search engine, such as a web page or a product. All the documents together make up the search engine's complete data set.

    • Inverted index and forward index: the data structures stored inside the search engine. They are its lowest-level data structures.

    • Indexer: the program that turns document data into a searchable inverted index and forward index.

    • Searcher: the program that finds documents by looking them up in the inverted index and the forward index.

    • Field: each document may have multiple fields. An article, for example, has a title, author, summary, body, and publication time. These all belong to one document, but they need to be treated differently at search time.

    • Index: the indexer turns multiple documents into a set of inverted indexes and forward indexes, and we call that set an index. From here on, "index" refers to this collection of forward and inverted indexes. An index can also be thought of as a table in a database.

Those are the most basic search engine concepts. Put simply, a search engine has two parts: the indexer, which turns documents into inverted and forward index files, and the searcher, which uses the inverted and forward index files to recover documents.
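To make these concepts concrete, here is a minimal sketch in Go. It is not the FalconEngine code: the type names, the in-memory layout, and the whitespace tokenizer are all assumptions chosen for illustration. Add plays the role of the indexer and Search plays the role of the searcher.

```go
package main

import (
	"fmt"
	"strings"
)

// Document is the basic data unit; its fields are kept as name -> text.
type Document struct {
	Fields map[string]string
}

// Index holds the two structures described above. Both layouts are
// assumptions made for this sketch.
type Index struct {
	forward  []Document       // forward index: document ID -> document
	inverted map[string][]int // inverted index: term -> document IDs, ascending
}

func NewIndex() *Index {
	return &Index{inverted: make(map[string][]int)}
}

// Add plays the role of the indexer: it stores the document in the forward
// index and records each distinct lowercase term in the inverted index.
func (ix *Index) Add(doc Document) int {
	id := len(ix.forward)
	ix.forward = append(ix.forward, doc)
	seen := make(map[string]bool) // avoid listing a document twice per term
	for _, text := range doc.Fields {
		for _, term := range strings.Fields(strings.ToLower(text)) {
			if !seen[term] {
				seen[term] = true
				ix.inverted[term] = append(ix.inverted[term], id)
			}
		}
	}
	return id
}

// Search plays the role of the searcher: it looks the term up in the
// inverted index, then restores the documents from the forward index.
func (ix *Index) Search(term string) []Document {
	var hits []Document
	for _, id := range ix.inverted[strings.ToLower(term)] {
		hits = append(hits, ix.forward[id])
	}
	return hits
}

func main() {
	ix := NewIndex()
	ix.Add(Document{Fields: map[string]string{"title": "Golang search engine"}})
	ix.Add(Document{Fields: map[string]string{"title": "Database design"}})
	ix.Add(Document{Fields: map[string]string{"title": "Search basics"}})
	for _, doc := range ix.Search("search") {
		fmt.Println(doc.Fields["title"])
	}
}
```

Running it prints the titles of the first and third documents. Note that this sketch mixes all fields into one inverted index; treating fields differently is exactly what the layering below is for.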

Search Engine Design Layers

A database is also a kind of search engine, but the two have different priorities: a search engine aims to be simple and fast, while a database pursues stability and complex processing logic over its data, so their application scenarios differ.

Now that you know the basic concepts of a search engine, how should we design one?

Following general software design practice, for a system that is not very complex you first design the data structures, then wrap them in algorithms and classes, and then combine the classes into the finished system. So I designed this engine around the following layers.

    • First, a data layer is needed to store the inverted index and the forward index. Each field has its own inverted index and forward index, implemented by a set of related modules;

    • The inverted and forward indexes of all the fields together make up the entire data file, so some modules are needed to manage them;

    • Next, as we said in the first article, we want the indexer and the searcher to live in the same place, similar to the ElasticSearch implementation. So the document index is also managed in segments: whenever a certain condition is met, the index is flushed to disk or segments are merged together.

    • We also need an engine to manage multiple indexes. The engine is responsible for the complex query and ranking strategies. It can be replaced or modified; you only need to implement the standard interface. You can replace the default engine with your own to get more functionality, or even write an engine that supports SQL queries.

    • Finally, a layer is needed to interact with the outside world. I implemented an HTTP server that exchanges data with the outside in JSON; this layer can also be rewritten into whatever you want.
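The last two points, a replaceable engine behind a standard interface and an HTTP layer that speaks JSON, can be sketched as follows. The Engine interface, SubstringEngine, SearchHandler, the `q` query parameter, and the `hits` response key are all hypothetical names invented for this example; the real FalconEngine interface may look quite different.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Engine is a hypothetical "standard interface": anything that can answer a
// query can be plugged in behind the network layer.
type Engine interface {
	Search(query string) []string
}

// SubstringEngine is a toy default engine: it returns every title that
// contains the query, ignoring case.
type SubstringEngine struct {
	Titles []string
}

func (e SubstringEngine) Search(query string) []string {
	hits := []string{} // non-nil so that JSON encodes [] rather than null
	for _, t := range e.Titles {
		if strings.Contains(strings.ToLower(t), strings.ToLower(query)) {
			hits = append(hits, t)
		}
	}
	return hits
}

// SearchHandler is the outer layer. It depends only on the Engine interface,
// so the engine can be swapped without touching the HTTP code.
type SearchHandler struct {
	Engine Engine
}

func (h SearchHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	result := map[string][]string{"hits": h.Engine.Search(r.URL.Query().Get("q"))}
	if err := json.NewEncoder(w).Encode(result); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func main() {
	h := SearchHandler{Engine: SubstringEngine{
		Titles: []string{"Golang search engine", "Database design"},
	}}
	// Exercise the handler in-process instead of binding a real port.
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/search?q=golang", nil))
	fmt.Print(rec.Body.String()) // {"hits":["Golang search engine"]}
}
```

In a real server the handler would be registered with http.Handle and served with http.ListenAndServe; an engine that parses SQL would only need to satisfy the same interface.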

Based on the modules above, the overall architecture of the search engine divides roughly into the following layers:

    • At the bottom is the data module layer, which describes all the data in the engine. For a search engine, the data is divided into the inverted index (倒排索引) and the forward index (正排索引), also called 逆向索引 (reverse index) and 正向索引 (forward index) in Chinese. For consistency, this series always says inverted index and forward index.

    • Above this layer is the field layer. Each field corresponds to one forward index and one inverted index (optional), because some fields only need to be displayed, not searched, and so do not need an inverted index.

    • Above the field layer is the segment layer, which manages the fields. Some segments are in memory and some have been flushed to disk.

    • Above the segment layer is the index module layer, which exposes the basic add, delete, update, and lookup interfaces.

    • Above the index module layer is the engine layer, which implements the specific business lookup logic.

    • At the top is a network layer, which is responsible for interacting with the outside world.
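The lower layers of this stack can be sketched in a few dozen lines of Go. Everything here is an assumption made for illustration rather than the FalconEngine design: the type names, indexing whole field values instead of tokens, the "flush after maxDocs documents" rule, and the flushed flag that merely stands in for writing to disk. Delete and update are left out to keep the sketch short.

```go
package main

import "fmt"

// Field is the field layer: a forward index always, an inverted index only
// when the field is searchable. Values are indexed whole, not tokenized.
type Field struct {
	forward  []string         // document number within the segment -> value
	inverted map[string][]int // value -> document numbers; nil if display-only
}

func (f *Field) add(value string) {
	doc := len(f.forward)
	f.forward = append(f.forward, value)
	if f.inverted != nil {
		f.inverted[value] = append(f.inverted[value], doc)
	}
}

// Segment is the segment layer: the fields for one batch of documents.
type Segment struct {
	fields  map[string]*Field
	docs    int
	flushed bool // stands in for "flushed to disk"; this sketch does no file I/O
}

// Index is the index layer. The searchable map and the maxDocs flush rule
// are assumptions made for this sketch.
type Index struct {
	searchable map[string]bool // field name -> build an inverted index?
	maxDocs    int             // a segment is flushed once it holds this many documents
	segments   []*Segment
}

func NewIndex(searchable map[string]bool, maxDocs int) *Index {
	return &Index{searchable: searchable, maxDocs: maxDocs}
}

// Add appends a document to the current in-memory segment, opening a new
// segment first if there is none or the last one has been flushed.
func (ix *Index) Add(doc map[string]string) {
	if n := len(ix.segments); n == 0 || ix.segments[n-1].flushed {
		seg := &Segment{fields: make(map[string]*Field)}
		for name, searchable := range ix.searchable {
			f := &Field{}
			if searchable {
				f.inverted = make(map[string][]int)
			}
			seg.fields[name] = f
		}
		ix.segments = append(ix.segments, seg)
	}
	seg := ix.segments[len(ix.segments)-1]
	for name, f := range seg.fields {
		f.add(doc[name])
	}
	seg.docs++
	if seg.docs >= ix.maxDocs {
		seg.flushed = true
	}
}

// Lookup finds documents whose field equals value and returns their show
// field, read back from the forward index.
func (ix *Index) Lookup(field, value, show string) []string {
	var out []string
	for _, seg := range ix.segments {
		f, s := seg.fields[field], seg.fields[show]
		if f == nil || s == nil || f.inverted == nil {
			continue // unknown or display-only field: nothing to search
		}
		for _, doc := range f.inverted[value] {
			out = append(out, s.forward[doc])
		}
	}
	return out
}

func main() {
	ix := NewIndex(map[string]bool{"author": true, "summary": false}, 2)
	ix.Add(map[string]string{"author": "alice", "summary": "part 0"})
	ix.Add(map[string]string{"author": "alice", "summary": "part 1"})
	ix.Add(map[string]string{"author": "bob", "summary": "part 2"})
	fmt.Println(ix.Lookup("author", "alice", "summary")) // [part 0 part 1]
	fmt.Println(ix.Lookup("summary", "part 0", "author")) // []
	fmt.Println(len(ix.segments))                          // 2
}
```

The second lookup returns nothing because summary is a display-only field with no inverted index, and the third line shows that the third document landed in a new segment after the first one was marked as flushed.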

In the implementation, to keep things as simple as possible, each module is basically a single file. With Golang the code can be written quite freely as long as the structure is clear, so the engine as a whole is not a lot of code.

The following diagram shows the overall framework of the code; we will go through its parts one by one.

I hope this article gives you a preliminary understanding of search engines. I will take this search engine apart one module and one technical point at a time.

The next article will begin to introduce the underlying technology of the search engine: the inverted index (I will spend more effort on the inverted index; after all, it is the core of the core) and the forward index (this one is simple, it is just an array). Several articles will then introduce the algorithms and data structures needed to implement the inverted index.

The code is hosted on GitHub at https://github.com/wyh267/FalconEngine and is updated constantly. The current code, not counting the unit test files, is about 6000 lines. Because I wrote it as ideas came to me, some functions are unused, so there is room for optimization. You are welcome to submit patches, and you can read the code alongside the later articles.

The code structure is as follows. Again, there is not much code and the structure is very simple. I did not write many comments, but the code uses no advanced features, not even closures, so it should not be hard to read directly.

New articles will come out at a rate of about 3 to 5 a week. You are welcome to scan the code below to subscribe to the public account, where articles are published first :)
