Build your own underlying non-Lucene distributed search engine architecture


As you know, search engine technology is not only behind things like the Baidu homepage; it can also be turned into data analysis tools, business intelligence tools, many other sellable applications, and even the discovery of social link channels. These non-search-engine products may matter even more, because you do not need to do everything Baidu does.
Therefore, you have to understand the principles of search engine technology before you can extend it, which is why building a search engine without Lucene is worthwhile. With this building block, we can then build houses and cars.

The purpose of the search engine is to search for keywords in the articles collected by the crawler in O(1) time, which "instantly kills" a database LIKE query. Its advantage over LIKE is speed; its disadvantage is that a word cannot be found if no index has been built for it.
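To make that difference concrete, here is a minimal sketch (the data and names are illustrative, not from the article) contrasting a LIKE-style linear scan with an indexed hash lookup:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class LikeVersusIndex
{
    static void Main()
    {
        var docs = new Dictionary<int, string>
        {
            [1] = "distributed search engine",
            [2] = "database like query",
            [3] = "search without lucene"
        };

        // Database LIKE style: scan every article's text; the cost grows with the data size.
        var byScan = docs.Where(d => d.Value.Contains("search")).Select(d => d.Key);

        // Search engine style: one hash lookup, O(1), but only for words that were indexed.
        var invertedIndex = new Dictionary<string, List<int>>
        {
            ["search"] = new List<int> { 1, 3 }
        };
        var byIndex = invertedIndex.TryGetValue("search", out var hits) ? hits : new List<int>();

        Console.WriteLine(string.Join(",", byScan));   // 1,3
        Console.WriteLine(string.Join(",", byIndex));  // 1,3
    }
}
```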

This article aims to build the bottom-layer module of a search engine server, broaden your thinking and analytical knowledge, and lay the foundation for my next steps. I also want to make sure this basic idea is solid and to share some data structure knowledge with you.


To build your own search engine (not Lucene), you need knowledge in two major areas:

1. Distributed systems, kept simple enough;

2. Basic data structures and algorithms, and the mapping between complexity and data structures.

The focus here is on point 2.

Therefore, this article will take the time to explain how to do this with the collections that come with .NET.

 

First, a quick overview of the required parts:

1. The file + database server group. First you need a large number of articles, collected by crawlers or obtained however you like. These articles cannot all sit on a single machine, and there is no need to put them together, because a single hard disk has poor concurrent performance. You need a group of machines that store and process these files, so the work is handled concurrently across machines instead of queuing up on one machine. If you use .NET, look into a WCF-based architecture on your own.

2. The web server group. A search engine only makes sense if it can handle a large number of search requests, so the web servers need load balancing and a reverse proxy. Look up reverse proxies, nginx, and squid on your own; this article does not cover them.

3. The memcached group. This is a distributed hash table, because a single machine cannot hold the amount of memory we expect to need. Look up memcached on your own.

4. The indexing server group. These are background servers that build the index periodically or continuously; a few independent servers are enough.

In short, you need four groups of servers. The number of servers in each group depends on how much money you have; if you have none, you can simulate everything on one machine as an experiment.
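As a rough illustration of how articles might be spread across the storage group instead of queuing on one machine, here is a minimal sketch; the server names and the simple modulo scheme are my assumptions, not something the article prescribes:

```csharp
using System;

class ArticleSharding
{
    // Hypothetical list of machines in the file + database server group.
    static readonly string[] FileServers = { "files-01", "files-02", "files-03", "files-04" };

    // Pick a machine for an article so storage and processing spread across the group
    // instead of queuing on a single disk.
    static string ServerForArticle(int docId)
    {
        return FileServers[Math.Abs(docId) % FileServers.Length];
    }

    static void Main()
    {
        for (int docId = 100; docId < 105; docId++)
        {
            Console.WriteLine($"article {docId} -> {ServerForArticle(docId)}");
        }
    }
}
```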

The following is the key part of this article.
First, let's talk about data structure principles. (The basic problems that data structures and algorithms solve are searching and sorting.)

1. Searching: to find whether something exists with O(1) complexity, use a hash table. The search engine relies on exactly this property of hash tables; O(N) or O(log N) searches are not suitable here.
2. Sorting: the frequency with which a word appears in an article determines the order in which results are displayed. The professional practice is an O(N log N) sort; if all you have ever learned is bubble sort, put the mouse down and read a book to see why O(N^2) bubble sort is amateur work.
3. Dynamic sorting: a newly inserted element reaches its proper position with O(log N) complexity. A binary heap is generally used; in .NET the SortedList<TKey, TValue> generic container (or the tree-based SortedDictionary<TKey, TValue>) plays this role. Whatever the internal implementation, it satisfies the complexity we need.

Well, the above three concepts are the starting point for solving all of our problems. The correspondence between complexity and data structures must become a conditioned reflex in your brain. You do not need to score 85 on a data structures exam, but this awareness is essential to an excellent programmer.
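A small sketch of how these three points map onto the built-in .NET collections (the data is made up for illustration):

```csharp
using System;
using System.Collections.Generic;

class ComplexityDemo
{
    static void Main()
    {
        // 1. Searching: Dictionary<TKey, TValue> is a hash table; lookup is O(1) on average.
        var wordIds = new Dictionary<string, int> { ["search"] = 1, ["engine"] = 2 };
        Console.WriteLine(wordIds.ContainsKey("engine"));          // True, one hash lookup

        // 2. Sorting: the built-in sort is O(N log N), good enough for ranking by frequency.
        var freqs = new List<int> { 5, 1, 9, 3 };
        freqs.Sort((a, b) => b.CompareTo(a));                       // descending
        Console.WriteLine(string.Join(",", freqs));                 // 9,5,3,1

        // 3. Dynamic sorting: SortedDictionary<TKey, TValue> is tree-based, so each
        //    insert lands in its sorted position in O(log N).
        var ranked = new SortedDictionary<int, string>();
        ranked[42] = "doc A";
        ranked[7] = "doc B";
        Console.WriteLine(string.Join(",", ranked.Keys));           // 7,42
    }
}
```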

Now let's analyze the search process:
0. High-frequency dictionary: we prepare a high-frequency word dictionary in advance, a hash dictionary container Dict<wordId (int), word (string)>. It can be placed in the memcached cluster, or stored in segments across however many machines you have.

1. Input check: the user enters a word. First, look up the word's number in the dictionary; if it has no number, add one. Of course, the newly added word cannot be found this time; once it has been indexed, it can be found on the next search.

2. Hash search: to reach O(1), we must know which articles correspond to a word. For this we need a hash dictionary container Dict<wordId (int), SortedList<docId (int), freq (int)>>; looking up a word returns a batch of sorted articles. For now, assume the articles are sorted by word frequency (see the sketch after this list).

3. Initial word segmentation: to populate these SortedList<TKey, TValue> structures, we run a first round of word segmentation, which means finding the two-, three-, and four-character words in each article and sorting them by frequency of occurrence. Each document should also get its own index: Dict<docId (int), SortedList<wordId (int), freq (int)>>.

4. Word segmentation job: to ensure that everything users enter is indexed, a background program continuously maintains the search engine's word segmentation index. You also need to serialize the index to a file regularly, so the computation does not have to be redone after the next restart.

5. A newly added article requires step 3; a newly added user word, such as "awesome", requires step 4.
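Putting steps 0 through 3 together, here is a minimal single-machine sketch of the two index structures and a query against them. All names are illustrative; in the real architecture the dictionaries would live in the memcached group and be maintained by the indexing server group, and the crude space-based splitting here stands in for the 2-4 character segmentation the article describes:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class TinySearchEngine
{
    // Step 0: the word dictionary, here keyed by word so that numbering a word is one lookup.
    static readonly Dictionary<string, int> WordIds = new Dictionary<string, int>();

    // Step 2: inverted index, wordId -> (docId -> frequency of that word in the doc).
    static readonly Dictionary<int, SortedList<int, int>> Inverted =
        new Dictionary<int, SortedList<int, int>>();

    // Step 3: per-document index, docId -> (wordId -> frequency), useful for similarity checks.
    static readonly Dictionary<int, SortedList<int, int>> Forward =
        new Dictionary<int, SortedList<int, int>>();

    // Step 1: look up the word's number; assign one if it does not exist yet.
    static int IdOf(string word)
    {
        if (!WordIds.TryGetValue(word, out int id))
        {
            id = WordIds.Count + 1;
            WordIds[word] = id;
        }
        return id;
    }

    static void IndexDocument(int docId, string text)
    {
        // Crude segmentation: split on spaces; the article proposes 2-4 character words instead.
        var wordFreq = new SortedList<int, int>();
        foreach (var word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int wid = IdOf(word.ToLowerInvariant());
            wordFreq[wid] = wordFreq.TryGetValue(wid, out int f) ? f + 1 : 1;

            if (!Inverted.TryGetValue(wid, out var postings))
                Inverted[wid] = postings = new SortedList<int, int>();
            postings[docId] = postings.TryGetValue(docId, out int df) ? df + 1 : 1;
        }
        Forward[docId] = wordFreq;
    }

    // Step 2 at query time: one hash lookup, then order the hits by frequency, highest first.
    static IEnumerable<int> Search(string word)
    {
        if (!WordIds.TryGetValue(word.ToLowerInvariant(), out int wid)) return Enumerable.Empty<int>();
        if (!Inverted.TryGetValue(wid, out var postings)) return Enumerable.Empty<int>();
        return postings.OrderByDescending(p => p.Value).Select(p => p.Key);
    }

    static void Main()
    {
        IndexDocument(1, "search engine search index");
        IndexDocument(2, "database index");
        Console.WriteLine(string.Join(",", Search("search")));  // 1
        Console.WriteLine(string.Join(",", Search("index")));   // 1,2
    }
}
```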

Other auxiliary steps
1. Crawlers collect data at the granularity of articles and news items, which must be stored as large text. HTML tags should be stripped, keeping both the rich-text field and the plain text in the database; values such as user IDs, article IDs, forwarding IDs, fan IDs, and other computable values should be broken out into separate fields.
2. Semantic analysis, such as deciding whether a comment is positive or negative, requires data preparation and AI.
3. Duplicate article or correlation detection: use the distribution of keyword frequencies and the cosine similarity algorithm (look it up yourself) to calculate similarity, as sketched below.
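A minimal sketch of that cosine similarity, computed over per-document word-frequency vectors like the ones built in step 3 of the search process (the document vectors here are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DuplicateDetection
{
    // Cosine similarity between two word-frequency vectors (wordId -> frequency).
    // A value near 1.0 means the articles use almost the same words in the same
    // proportions, a strong hint of duplication.
    static double Cosine(IDictionary<int, int> a, IDictionary<int, int> b)
    {
        double dot = a.Keys.Intersect(b.Keys).Sum(w => (double)a[w] * b[w]);
        double normA = Math.Sqrt(a.Values.Sum(v => (double)v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => (double)v * v));
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    static void Main()
    {
        var doc1 = new Dictionary<int, int> { [1] = 3, [2] = 1 };   // e.g. "search" x3, "engine" x1
        var doc2 = new Dictionary<int, int> { [1] = 3, [2] = 1 };   // an exact duplicate
        var doc3 = new Dictionary<int, int> { [3] = 2, [4] = 5 };   // unrelated words
        Console.WriteLine(Cosine(doc1, doc2));   // 1
        Console.WriteLine(Cosine(doc1, doc3));   // 0
    }
}
```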
 
 
I will stop here for now and continue when I have thought it through further; I was interrupted by a string of meetings.

 

 

 

 

 
