Writing a Search Engine in Go (0x00): Starting from Scratch

Source: Internet
Author: User

I have wanted to write a series like this for a long time. I used to work on search at an e-commerce company and picked up some knowledge of search engines there. After moving to a start-up, I was happy to get the chance to keep working on this. The domain has changed and the job is not search itself, but the technology carries over. I also got the chance to work with Go. For a programmer who has written C/C++ for nearly 10 years, some of Go's qualities feel very comfortable; see my previous article on that.

This search engine project grew out of one of the company's projects, and that is how this article came about.

Let's talk about goals first. I want to implement a search engine with the following properties:

    • Implemented in Go and easy to deploy, ideally as a single binary with no dependencies to install.

    • Similar to an e-commerce search engine: it supports multi-field search, covering not only full-text indexing but also filtering (for example, filtering by price range), aggregation (for example, counting the brands in the result set), and basic statistics.

    • The indexer and the searcher live together, mainly for simplicity, so there is no need to start multiple instances.

    • Multiple indexes are supported, and indexes related by a primary key can be queried jointly (though the speed will be nothing to brag about).

    • For 10 million documents, the average query time for a single term is under 10 ms.

    • On an 8-core machine with 8 GB of memory, QPS reaches 2,000.

    • Use as little memory as possible, so that even a machine with 2 GB of memory can search more than 10 million documents.

    • Highly extensible, so that users can plug in their own strategies.

    • Can be deployed as a distributed cluster to increase the number of searchable documents and improve the system's query throughput.

    • Supports Chinese word segmentation, although segmentation is not our focus.

    • Supports relevance ranking, although relevance ranking is not our focus.

    • One important point: because this is meant to be a complete implementation of a search engine, I use as little open-source code as possible and implement all algorithms and data structures myself. Of course, any of them can easily be swapped for an open-source alternative.
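
To make the filtering and aggregation goals above concrete, here is a minimal sketch in Go. It is not taken from the author's engine: the `Doc` type, its fields, and the function names `FilterByPrice` and `CountByBrand` are assumptions made for illustration, and a real engine would run these operations against its own index structures rather than a plain slice.

```go
package main

import "fmt"

// Doc is a hypothetical record for one product; the field names are
// illustrative only and are not part of the original project.
type Doc struct {
	ID    int
	Brand string
	Price float64
}

// FilterByPrice keeps the documents whose price lies in [lo, hi].
func FilterByPrice(docs []Doc, lo, hi float64) []Doc {
	var out []Doc
	for _, d := range docs {
		if d.Price >= lo && d.Price <= hi {
			out = append(out, d)
		}
	}
	return out
}

// CountByBrand summarizes a result set as brand -> number of documents.
func CountByBrand(docs []Doc) map[string]int {
	counts := make(map[string]int)
	for _, d := range docs {
		counts[d.Brand]++
	}
	return counts
}

func main() {
	results := []Doc{
		{ID: 1, Brand: "acme", Price: 19.9},
		{ID: 2, Brand: "acme", Price: 120},
		{ID: 3, Brand: "globex", Price: 45},
		{ID: 4, Brand: "initech", Price: 60},
	}
	// Filter first, then aggregate over what is left.
	hits := FilterByPrice(results, 20, 100)
	fmt.Println(len(hits), CountByBrand(hits))
}
```

Filtering first and aggregating afterwards means the brand counts describe exactly the documents the user would see.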

Of course, a search engine involves too many parts. The following are not our focus and will not be implemented in depth:

    • No crawler. A search engine crawler is a separate topic that can itself grow into a very complex system, so we do not cover it.

    • No ranking algorithms. By "algorithms" I mean ranking algorithms and the various relevance computations. This is also a separate topic, and I may come back to it after this series is finished. For now there is only a basic relevance sort based on TF*IDF.

    • No in-depth word segmentation. Segmentation is also a separate topic. I simply implement a very, very, very simple Chinese word segmenter (important things get said three times), which is a single function and good enough to use.
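
As a rough illustration of the last two points, the sketch below pairs a deliberately naive segmenter with a textbook TF*IDF score. Both are assumptions made for illustration: the article does not show the author's segmentation function or the exact TF*IDF variant used, and the names `Segment` and `TFIDF` are hypothetical. Here each Han character becomes its own term, and the score is (term count / document length) * ln(N / document frequency).

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode"
)

// Segment is a naive tokenizer: every Han character is its own term,
// runs of other letters and digits form one lowercased term, and
// anything else acts as a separator.
func Segment(text string) []string {
	var terms []string
	var word strings.Builder
	flush := func() {
		if word.Len() > 0 {
			terms = append(terms, word.String())
			word.Reset()
		}
	}
	for _, r := range text {
		switch {
		case unicode.Is(unicode.Han, r):
			flush()
			terms = append(terms, string(r))
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			word.WriteRune(unicode.ToLower(r))
		default:
			flush()
		}
	}
	flush()
	return terms
}

// TFIDF scores term for doc within corpus using
// tf = count(term in doc) / len(doc) and idf = ln(N / df).
// It returns 0 when the term is absent from the doc or the corpus.
func TFIDF(term string, doc []string, corpus [][]string) float64 {
	tf := 0
	for _, t := range doc {
		if t == term {
			tf++
		}
	}
	df := 0
	for _, d := range corpus {
		for _, t := range d {
			if t == term {
				df++
				break
			}
		}
	}
	if tf == 0 || df == 0 {
		return 0
	}
	return float64(tf) / float64(len(doc)) * math.Log(float64(len(corpus))/float64(df))
}

func main() {
	corpus := [][]string{
		Segment("Go 搜索引擎"),
		Segment("Go 语言"),
		Segment("搜索 技术"),
	}
	fmt.Println(corpus[0])
	fmt.Printf("%.3f\n", TFIDF("go", corpus[0], corpus))
}
```

Single-character terms are crude but need no dictionary, which fits the "a single function" spirit. A term that appears in every document gets an IDF of zero under this formula.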

More than half of the code is written, but it has not been optimized yet, and the final distributed engine is not finished. The core part, the search engine itself, is complete and already hosted on GitHub, so this series is unlikely to stall; after all, the code is basically done.

OK, let's get started. The series will be divided into the following parts:

    • Single-machine search engine architecture: the modules that make up the search engine, what each module does, and how they relate to each other. This part gives an overall picture of the search engine and makes the detailed articles easier to follow. It may be relatively short; the details come back in the third part.

    • The underlying technology of a search engine. There is more content here, so it will be split across several articles covering inverted indexes, forward indexes, word segmentation algorithms, and mmap. These are the underlying techniques needed to build a search engine.

    • Step-by-step implementation of a single-machine search engine, module by module, from the inverted and forward indexes at the bottom up to the engine layer at the top. Wherever a data structure or algorithm is involved, it gets its own article, for example hash tables, B+ trees, and bitmaps. Some structures that I have not used in this engine will also be covered, such as skip lists, prefix trees (tries), and Bloom filters.

    • Distributed part (TODO: waiting for my code to be finished), including how to distribute the engine, how to synchronize between machines, and how to shard the index.
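
For readers new to the terms in this outline, here is a minimal sketch of an inverted index alongside a forward index. It is an illustration under simple assumptions, not the author's implementation: everything lives in in-memory maps, terms are split on whitespace, and the names `Index`, `Add`, and `Search` are hypothetical. The engine described in this series aims at mmap-backed structures instead.

```go
package main

import (
	"fmt"
	"strings"
)

// Index pairs an inverted index (term -> sorted doc IDs) with a
// forward index (doc ID -> original text).
type Index struct {
	postings map[string][]int
	forward  map[int]string
}

func NewIndex() *Index {
	return &Index{postings: map[string][]int{}, forward: map[int]string{}}
}

// Add indexes text under docID. Doc IDs are assumed to arrive in
// increasing order, which keeps every posting list sorted.
func (ix *Index) Add(docID int, text string) {
	ix.forward[docID] = text
	seen := make(map[string]bool)
	for _, term := range strings.Fields(strings.ToLower(text)) {
		if !seen[term] {
			seen[term] = true
			ix.postings[term] = append(ix.postings[term], docID)
		}
	}
}

// intersect merges two sorted posting lists and keeps the common IDs.
func intersect(a, b []int) []int {
	var out []int
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

// Search returns the IDs of documents that contain every query term.
func (ix *Index) Search(query string) []int {
	terms := strings.Fields(strings.ToLower(query))
	if len(terms) == 0 {
		return nil
	}
	// Copy the first list so callers cannot modify the index.
	result := append([]int(nil), ix.postings[terms[0]]...)
	for _, t := range terms[1:] {
		result = intersect(result, ix.postings[t])
	}
	return result
}

func main() {
	ix := NewIndex()
	ix.Add(1, "red running shoes")
	ix.Add(2, "blue running jacket")
	ix.Add(3, "red jacket")
	for _, id := range ix.Search("red jacket") {
		fmt.Println(id, ix.forward[id])
	}
}
```

The inverted index answers "which documents contain this term", while the forward index answers "what is in this document". Keeping posting lists sorted lets a multi-term query be resolved with a linear merge.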

The code is already open source on GitHub. I plan to tidy it up and publish it again this week, since a raw pile of code is really hard to read.

Well, that is the opening. I expect to publish about 3 to 5 articles a week. You are welcome to scan the code below to subscribe to the public account, where new articles will be published first :)
