160602: How to quickly implement high-concurrency short-text search

Source: Internet
Author: User

I. Origin of the requirement

A business line with high concurrency and a moderate data volume needs to implement a "title search" feature:

(1) High concurrency: 200,000 requests per second (20w, where w = 10,000)

(2) Moderate data volume: approximately 2 million records (200w)

(3) Word segmentation required: yes

(4) Real-time data updates required: no

II. Common candidate solutions and their pros and cons

(1) Database LIKE search

How to: Store the title data in a database and retrieve it with LIKE

Advantages: Simple solution

Disadvantages: Cannot support word segmentation; cannot handle the concurrency
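To make the word-segmentation limitation concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `titles`, the helper `like_search`, and the sample rows are assumptions made for illustration; the article does not name a database or schema.

```python
import sqlite3

# In-memory database with a hypothetical titles table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (doc_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO titles VALUES (?, ?)",
    [(1, "I love Beijing"), (2, "I love home"), (3, "beautiful home")],
)

def like_search(query):
    # LIKE with a leading wildcard is a plain substring match. It cannot use
    # an ordinary B-tree index, so every row is scanned on every query.
    rows = conn.execute(
        "SELECT doc_id FROM titles WHERE title LIKE ? ORDER BY doc_id",
        (f"%{query}%",),
    )
    return [doc_id for (doc_id,) in rows]

contiguous = like_search("love home")  # the exact substring occurs in doc 2
reordered = like_search("home love")   # same words in another order: no match
```

The second query finds nothing even though doc 2 contains both words, because LIKE matches the query as one substring rather than as separate words.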

(2) Database full-text search

How to: Store the title data in a database and create a full-text index for retrieval

Advantages: Simple solution

Disadvantages: Cannot handle the concurrency

(3) External index built on an open-source solution

How to: Deploy an open-source external index such as Lucene, Solr, or ES

Advantages: Better performance than the two approaches above

Disadvantages: May still struggle at this concurrency; the system is heavyweight, and building one for a simple business is relatively expensive

III. Suggestions from Longge of 58.com

Q 1: Longge, the problem in the first 58.com programming contest was "pornographic and reactionary word filtering", and you won it. Did you implement it with DAT?

Longge: Yes

Aside: What is DAT?

Background: DAT is short for double-array trie, an optimized variant of the trie data structure. It greatly reduces memory use while preserving the trie's retrieval efficiency, and is often used for problems such as retrieval and information filtering. (For details, search Baidu for "DAT".)

Q 2: Can the above business scenario be implemented using DAT?

Longge: Updating the data in a DAT is cumbersome, and it cannot be updated incrementally.

Q 3: Can a trie be used directly?

Longge: A trie consumes a lot of memory.

Aside: What is a trie?

Background: A trie, also known as a word search tree, is a tree structure and a variant of the hash tree. It is typically used for counting and storing large numbers of strings (though not only strings), so search engine systems often use it for word-frequency statistics. Its advantage is that it uses the common prefixes of strings to reduce query time and minimize unnecessary string comparisons, giving higher query efficiency than a hash tree. (Source: Baidu Encyclopedia)


For example, a single trie can represent the set of 5 titles {and, as, at, cn, com}: the titles that start with "a" share one node for that prefix, and so do the titles that start with "c".
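The prefix sharing in this example can be sketched in a few lines of Python. The class names `TrieNode` and `Trie` and the `node_count` field are illustrative choices, not something the article specifies.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a stored string ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1  # counts the root node

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]
        node.is_end = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end

trie = Trie()
for title in ["and", "as", "at", "cn", "com"]:
    trie.insert(title)
```

The five titles contain 12 characters in total, but the trie needs only 9 nodes besides the root, because "a" and "c" are each stored once.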

Q 4: If word segmentation has to be supported, each word traverses the trie separately and the results need to be merged, right?

Longge: Yes. Each word's traversal of the trie yields a list of doc_ids; merging the lists from all the words gives the final result.
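The merge step can be sketched as follows. The article does not say whether "merge" means intersection or union; this sketch assumes intersection (a document must contain every query word) and assumes each doc_id list is kept sorted. The function names are hypothetical.

```python
from functools import reduce

def intersect_sorted(a, b):
    """Return the doc_ids present in both sorted lists a and b."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def merge_postings(lists):
    """Intersect any number of sorted doc_id lists; no lists means no hits."""
    if not lists:
        return []
    # Starting from the shortest list keeps intermediate results small.
    ordered = sorted(lists, key=len)
    return reduce(intersect_sorted, ordered)

merged = merge_postings([[1, 2, 5, 9], [2, 3, 5], [0, 2, 5, 7]])  # [2, 5]
```

If OR semantics were wanted instead, the same structure would work with a union in place of `intersect_sorted`.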

Q 5: Longge, is there a better, more lightweight solution?

Longge: With a trie, the data size grows roughly as number of documents × title length: the longer the titles and the more documents there are, the larger the memory footprint. There is an approach whose memory footprint is very small and independent of title length. It is quite elegant.

Q 6: Is there a related article you can recommend?

Longge: Probably nothing online. Put simply, the core idea is "in-memory hash + ID list".

The index initialization steps are: segment every title into words, then store each word in a hash table, with the hash of the word as the key and the set of doc_ids as the value.

The query steps are: segment the query into words, hash each word, look each hash up directly in the hash table to get its doc_id list, then merge the lists from all the words.

===== Example =====

For example:

doc1: I love Beijing

doc2: I love home

doc3: Beautiful home

First, segment the titles into words:

doc1: I love Beijing -> I, love, Beijing

doc2: I love home -> I, love, home

doc3: Beautiful home -> beautiful, home

Then hash each word and build the hash + ID list:

hash(I) -> {doc1, doc2}

hash(love) -> {doc1, doc2}

hash(Beijing) -> {doc1}

hash(home) -> {doc2, doc3}

hash(beautiful) -> {doc3}

This completes the initialization for all titles. Note that the amount of data does not depend on title length.

When a user enters "I love", the query is segmented into {I, love}, and the hash of each word is looked up in memory:

hash(I) -> {doc1, doc2}

hash(love) -> {doc1, doc2}

Merging the two lists gives the final search result: doc1 + doc2.

===== Example end=====
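The example can be reproduced with a short Python sketch. It makes several assumptions that are not in the article: the whitespace `segment` function stands in for a real word segmenter, Python's built-in `hash` stands in for whatever hash function is used, and "merge" is taken to mean intersection.

```python
from collections import defaultdict

def segment(text):
    # Placeholder segmenter: lowercases and splits on whitespace.
    # A real deployment would plug in a proper word segmenter here.
    return text.lower().split()

def build_index(docs):
    """Map hash(word) -> set of doc_ids whose title contains the word."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for word in segment(title):
            # Python's str hash is stable within one process, which is all
            # an in-memory index needs. Two different words with the same
            # hash would share one ID list; this sketch ignores that case.
            index[hash(word)].add(doc_id)
    return index

def search(index, query):
    """Return the sorted doc_ids whose titles contain every query word."""
    words = segment(query)
    if not words:
        return []
    postings = [index.get(hash(w), set()) for w in words]
    return sorted(set.intersection(*postings))

docs = {
    "doc1": "I love Beijing",
    "doc2": "I love home",
    "doc3": "Beautiful home",
}
index = build_index(docs)
result = search(index, "I love")  # ['doc1', 'doc2']
```

Each distinct word contributes one key no matter how many titles contain it or how long those titles are; only the ID sets grow as documents are added.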

Q 7: What are the advantages of this method?

Longge: It operates entirely in memory, so it can sustain high concurrency with very low latency; the memory footprint is small; and it is simple and quick to implement.

Q 8: What are the drawbacks? How does it differ from traditional search?

Longge: This is a quick transitional solution. The index itself is not persisted, so the title data still has to be stored durably in a database, and without high availability, recovery after a failure is slow because the index must be rebuilt. High availability is easy to add, though: build two identical hash indexes. In addition, there is no horizontal sharding; if the data volume became very, very large, horizontal sharding would have to be added.
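The two improvements mentioned here can be sketched together. This is only an illustration under stated assumptions: documents are assigned to shards by `doc_id % num_shards`, the `titles` dict stands in for the database table, words are used directly as keys for brevity, and the function names are hypothetical.

```python
def build_shards(titles, num_shards):
    """Split documents across shards by doc_id; each shard maps word -> {doc_id}."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, title in titles.items():
        shard = shards[doc_id % num_shards]
        for word in title.lower().split():
            shard.setdefault(word, set()).add(doc_id)
    return shards

def search_shards(shards, query):
    """Query every shard and union the per-shard intersections."""
    words = query.lower().split()
    if not words:
        return []
    hits = set()
    for shard in shards:
        postings = [shard.get(w, set()) for w in words]
        hits |= set.intersection(*postings)
    return sorted(hits)

# Stand-in for the titles table in the database (the durable copy).
titles = {1: "I love Beijing", 2: "I love home", 3: "beautiful home"}

# Two identical replicas built from the same source: if one process is lost,
# the other keeps serving while the lost one is rebuilt from the database.
primary = build_shards(titles, num_shards=2)
replica = build_shards(titles, num_shards=2)

hits = search_shards(primary, "I love")  # [1, 2]
```

Because each document lives entirely in one shard, intersecting inside each shard and then taking the union across shards gives the same answer as a single unsharded index.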
