160602, how to quickly implement high-concurrency short-text search

Last Update:2016-06-07 Source: Internet

Author: User


I. Origin of the requirement

A business line with very high concurrency and a moderate amount of data needs to implement a "title search" function:

(1) High concurrency: 200,000 queries per second

(2) Moderate data volume: approximately 2 million records

(3) Is word segmentation required: yes

(4) Is the data updated in real time: no

II. Common candidate solutions and their pros and cons

(1) Database search

How to: store the title data in a database and retrieve it with LIKE

Advantages: simple solution

Disadvantages: cannot support word segmentation, and cannot handle the concurrency

(2) Database full-text search

How to: store the title data in a database and create a full-text index for retrieval

Advantages: simple solution

Disadvantages: cannot handle the concurrency

(3) External index built with an open source solution

How to: set up an external index with an open source solution such as Lucene, Solr, or ES

Advantages: better performance than the two approaches above

Disadvantages: the concurrency may still be a risk, the system is heavier, and building such a system for a simple business is relatively costly
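The limitation of approach (1) can be seen in a small sketch. It uses Python's built-in sqlite3 module with an in-memory database; the table name, column names, and sample titles are assumptions made for illustration, not part of the original scenario.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (doc_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO titles (doc_id, title) VALUES (?, ?)",
    [(1, "I love Beijing"), (2, "I love home"), (3, "Beautiful home")],
)

def like_search(query):
    """Return doc_ids whose title contains `query` as one contiguous substring."""
    rows = conn.execute(
        "SELECT doc_id FROM titles WHERE title LIKE ? ORDER BY doc_id",
        (f"%{query}%",),
    )
    return [doc_id for (doc_id,) in rows]

print(like_search("love"))       # contiguous substring: matches docs 1 and 2
print(like_search("I Beijing"))  # two words that are not adjacent: no match
```

Because LIKE only matches a contiguous substring, a query made of several words finds nothing unless the words appear side by side in the title, which is why this approach cannot support word segmentation.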

III. Advice from Longge of 58.com

Q 1: Longge, the topic of the first 58.com programming contest was "sensitive-word filtering", and you were the champion. Did you implement it with DAT?

Longge: Yes

Aside: What is DAT?

Primer: DAT is short for double-array trie, an optimized variant of the trie data structure. It keeps the retrieval efficiency of a trie while greatly reducing memory usage, and is often used for problems such as retrieval and information filtering. (For details, search Baidu for "DAT".)

Q 2: Can the above business scenario be implemented using DAT?

Longge: Updating data in a DAT is cumbersome, and it cannot be updated incrementally.

Q 3: Can a trie be used directly?

Longge: A trie uses a lot of memory.

Aside: What is a trie?

Primer: A trie, also known as a word search tree, is a tree structure and a variant of the hash tree. It is typically used for counting and for storing large numbers of strings (though not only strings), so search engine systems often use it for word frequency statistics. Its advantage is that it uses the common prefixes of strings to reduce query time and to minimize unnecessary string comparisons, so its query efficiency is higher than that of a hash table. (Source: Baidu Encyclopedia)

For example, a single trie can represent the set of 5 titles {and, as, at, cn, com}.
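A minimal sketch of such a trie, storing the five example strings, may make the prefix sharing concrete. The class and method names (`TrieNode`, `Trie`, `insert`, `contains`, `node_count`) are assumptions chosen for this illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child TrieNode
        self.is_end = False   # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1   # counts the root node

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]
        node.is_end = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end


trie = Trie()
for title in ["and", "as", "at", "cn", "com"]:
    trie.insert(title)

print(trie.contains("com"))   # a stored string
print(trie.contains("co"))    # only a prefix, not a stored string
print(trie.node_count)        # nodes created, including the root
```

The five strings contain 12 characters in total, but the shared prefixes "a" and "c" are stored once, so the trie needs only 9 nodes besides the root.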

Q 4: If word segmentation has to be supported, the trie must be traversed once per word and the results merged, right?

Longge: Yes. Each word's traversal of the trie yields a list of doc_ids; merging the lists obtained for the individual words gives the final result.
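The per-word traversal and merge described above can be sketched as follows. This is an illustration under stated assumptions: the trie is built from nested dictionaries, the sample titles are already segmented, and "merge" is taken to mean intersection (a document must contain every query word); a union would be the alternative reading.

```python
DOC_IDS = object()  # sentinel key under which a node stores its doc_id set


def trie_insert(root, word, doc_id):
    """Walk or create the path for `word`, then record doc_id at its end."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node.setdefault(DOC_IDS, set()).add(doc_id)


def trie_lookup(root, word):
    """Return the doc_id set stored for `word`, or an empty set."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return set()
    return node.get(DOC_IDS, set())


def search(root, words):
    """One traversal per word, then merge (assumed: intersection)."""
    if not words:
        return set()
    result = set(trie_lookup(root, words[0]))
    for word in words[1:]:
        result &= trie_lookup(root, word)
    return result


# Hypothetical, already segmented titles.
titles = {1: ["red", "car"], 2: ["red", "cat"], 3: ["blue", "car"]}
root = {}
for doc_id, words in titles.items():
    for word in words:
        trie_insert(root, word, doc_id)

print(search(root, ["red", "car"]))
```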

Q 5: Longge, is there a better, more lightweight solution?

Longge: With a trie, the data grows to roughly (number of documents × title length): the longer the titles and the more documents there are, the larger the memory footprint. There is another scheme whose memory usage is very small and independent of title length. It is quite elegant.

Q 6: Is there a relevant article you can recommend?

Longge: Probably not online. Put simply, the core idea is "in-memory hash + ID list".

The index initialization steps are: segment all titles into words, then build a hash table with the hash of each word as the key and the set of doc_ids as the value.

The query steps are: segment the query into words, hash each word, look it up directly in the hash table to get its doc_id list, and then merge the lists of the individual words.

===== Example =====

For example:

doc1: I love Beijing

doc2: I love home

doc3: Beautiful home

First, segment the titles into words:

doc1: I love Beijing -> I, love, Beijing

doc2: I love home -> I, love, home

doc3: Beautiful home -> beautiful, home

Then hash each word and build the hash + ID list:

hash(I) -> {doc1, doc2}

hash(love) -> {doc1, doc2}

hash(Beijing) -> {doc1}

hash(home) -> {doc2, doc3}

hash(beautiful) -> {doc3}

This completes the initialization for all titles, and you can see that the amount of data does not depend on the title length.

The user enters "I love". After segmentation it becomes {I, love}, and the hash of each word is looked up in memory:

hash(I) -> {doc1, doc2}

hash(love) -> {doc1, doc2}

Merging the two lists gives the final search result: doc1 + doc2.

===== Example end =====
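The example can be turned into a short sketch of the "in-memory hash + ID list" idea. Several details are assumptions rather than part of the original description: whitespace splitting stands in for a real word segmenter, CRC32 is an arbitrary choice of hash function, and "merge" is implemented as intersection. Keying on the hash alone also means two different words could collide; a production index would need to handle that.

```python
import zlib
from collections import defaultdict


def segment(text):
    # Stand-in tokenizer: assumes whitespace-separated words. Real titles
    # (especially Chinese ones) would need a proper word segmenter here.
    return text.lower().split()


def word_hash(word):
    # Any stable hash works; CRC32 is an arbitrary choice for this sketch.
    return zlib.crc32(word.encode("utf-8"))


def build_index(docs):
    """Initialization: hash(word) -> set of doc_ids whose title has the word."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for word in segment(title):
            index[word_hash(word)].add(doc_id)
    return index


def search(index, query):
    """Query: segment, hash each word, look up, then merge (intersection)."""
    id_lists = [index.get(word_hash(w), set()) for w in segment(query)]
    if not id_lists:
        return set()
    return set.intersection(*id_lists)


docs = {
    "doc1": "I love Beijing",
    "doc2": "I love home",
    "doc3": "Beautiful home",
}
index = build_index(docs)
print(sorted(search(index, "I love")))
```

Each key is a fixed-size integer regardless of how long the word is, and each document contributes one doc_id per distinct word it contains.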

Q 7: What are the advantages of this method?

Longge: Everything runs in memory, so it can handle very high concurrency with very low latency; the memory footprint is small, and it is simple and quick to implement.

Q 8: What are the problems? How does it differ from traditional search?

Longge: This is a quick transitional solution. The index itself is not persisted to disk, so the title data still has to be stored durably in a database, and without high availability, recovering the data after a failure is slow. Making it highly available is easy, though: build two identical hash indexes. In addition, there is no horizontal sharding; if the data volume becomes very large, horizontal sharding will have to be added as well.
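The two improvements mentioned in this answer can be sketched together. Everything here is an assumption made for illustration: the `ShardedIndex` class, sharding by `doc_id % num_shards`, numeric doc_ids, and rows that are already segmented. For brevity the sketch keys each shard's dictionary on the word itself, since a Python dict already hashes its keys internally.

```python
from collections import defaultdict


class ShardedIndex:
    """In-memory word -> doc_id index, split across shards by doc_id."""

    def __init__(self, num_shards):
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def add(self, doc_id, words):
        # Assumed sharding rule: each document lives on exactly one shard.
        shard = self.shards[doc_id % len(self.shards)]
        for word in words:
            shard[word].add(doc_id)

    def search(self, words):
        hits = set()
        if not words:
            return hits
        # Fan out to every shard, intersect per shard, union across shards.
        for shard in self.shards:
            id_lists = [shard.get(word, set()) for word in words]
            hits |= set.intersection(*id_lists)
        return hits


def rebuild(rows, num_shards):
    """Rebuild an index from (doc_id, words) rows, e.g. reloaded from the database."""
    index = ShardedIndex(num_shards)
    for doc_id, words in rows:
        index.add(doc_id, words)
    return index


rows = [
    (1, ["i", "love", "beijing"]),
    (2, ["i", "love", "home"]),
    (3, ["beautiful", "home"]),
]
primary = rebuild(rows, num_shards=2)
replica = rebuild(rows, num_shards=2)   # identical copy for high availability
print(sorted(primary.search(["i", "love"])))
```

Because each document sits on a single shard, intersecting within a shard and then taking the union across shards gives the same result as one unsharded index. The replica is simply a second index built from the same rows, so either copy can answer queries while the other is being rebuilt.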


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
