Search Engine Cache Strategy Research

Last Update:2018-07-26 Source: Internet

Author: User

/* Copyright Notice: Can be reproduced arbitrarily, please be sure to indicate the original source of the article and author information. */

Research on search engine cache strategy

Zhang Junlin

Date: October 2005

One Conclusions about search engine user queries

(1) User queries are highly repetitive: 30% to 40% of user queries repeat an earlier query.

(2) Most repeated queries are reissued within a short interval of the previous occurrence.

(3) Most user queries are short, containing about 2 to 5 words.

(4) Users generally view only the first three pages of results (the top 30 results). 58% of users view only the first page (the top 10), 15% go on to view the second page, and no more than 12% view the third page.

(5) Query frequencies are highly skewed. Queries vary widely: in a log of 1 million user queries, about 63.7% appear only once. On the other hand, repeated queries are heavily concentrated: the 25 most frequent queries account for about 1.23% to 1.5% of all queries.

Two Basic cache strategies

(1) LRU: least recently used policy

Basic assumption: cache records that have not been accessed recently are unlikely to be accessed in the near future. This is the simplest cache policy. User queries are ordered by their most recent access time, and when space is needed the query that was accessed longest ago is evicted from the cache.
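The LRU policy described above can be sketched in a few lines. This is only an illustration, not code from the original article: the class name `LRUCache`, its `get`/`put` methods, and the sample queries are assumptions chosen for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Query-result cache that evicts the least recently used query."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest query first, newest last

    def get(self, query):
        if query not in self.entries:
            return None  # cache miss
        self.entries.move_to_end(query)  # mark as most recently used
        return self.entries[query]

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest query


cache = LRUCache(capacity=2)
cache.put("cheap flights", ["r1"])
cache.put("weather", ["r2"])
cache.get("cheap flights")   # refreshes "cheap flights"
cache.put("news", ["r3"])    # cache is full, so "weather" is evicted
```

After the last call, "weather" is gone while the refreshed "cheap flights" entry survives, even though it was inserted first.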

(2) FBR (frequency-based replacement): considers not only access time but also reference counts.

FBR builds on the LRU strategy by dividing the cache into three sections: new, middle, and old.

New: stores the records that have been accessed most recently;

Old: stores the batch of records that were used least recently;

Middle: stores the batch of records between new and old.

A hit on a record in the new section does not increase its reference count; only hits on records in the middle and old sections do. When a record must be replaced, the record with the lowest reference count in the old section is evicted.
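A minimal sketch of the FBR bookkeeping described above, under stated assumptions: the class name `FBRCache`, the `new_size` and `old_size` parameters, and the tie-breaking rule (among equal counts, evict the least recently used) are choices made for this example and are not taken from the article.

```python
class FBRCache:
    """Sketch of frequency-based replacement over an LRU-ordered list."""

    def __init__(self, capacity, new_size, old_size):
        assert new_size >= 0 and old_size >= 1
        assert new_size + old_size <= capacity
        self.capacity = capacity
        self.new_size = new_size   # number of slots in the new section
        self.old_size = old_size   # number of slots in the old section
        self.order = []            # cached queries, most recently used first
        self.counts = {}           # query -> reference count
        self.results = {}          # query -> cached results

    def get(self, query):
        if query not in self.results:
            return None
        # Hits inside the new section do not increase the count.
        if self.order.index(query) >= self.new_size:
            self.counts[query] += 1
        self.order.remove(query)
        self.order.insert(0, query)  # move to the most recently used end
        return self.results[query]

    def put(self, query, results):
        if query in self.results:
            self.results[query] = results
            return
        if len(self.order) >= self.capacity:
            old_section = self.order[-self.old_size:]
            # Lowest count in the old section; ties go to the least recent.
            victim = min(reversed(old_section), key=lambda q: self.counts[q])
            self.order.remove(victim)
            del self.counts[victim], self.results[victim]
        self.order.insert(0, query)
        self.counts[query] = 1
        self.results[query] = results


cache = FBRCache(capacity=3, new_size=1, old_size=2)
cache.put("a", ["ra"])
cache.put("x", ["rx"])
cache.get("a")           # "a" is outside the new section, so its count becomes 2
cache.put("b", ["rb"])
cache.put("c", ["rc"])   # evicts "x" (count 1, least recent)
cache.put("d", ["rd"])   # evicts "b" (count 1), although "a" is less recent
```

The last step is where FBR departs from plain LRU: "a" is the least recently used record, but its higher reference count keeps it in the cache.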

(3) LRU/2: an improvement on LRU. Records are ranked by the time of their second-to-last access rather than their last access, and the record whose second-to-last access is oldest is evicted.
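One way to picture LRU/2 is to store the last two access times per query and evict by the older of the two. The sketch below is an assumption-laden illustration: `LRU2Cache`, the logical clock, and the convention that a record accessed only once has a second-to-last access time of 0 (so it is evicted first) are all choices made for this example.

```python
import itertools


class LRU2Cache:
    """Evicts the query whose second-to-last access is the oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = itertools.count(1)  # logical access timestamps
        self.results = {}
        self.history = {}  # query -> (second-to-last access, last access)

    def _touch(self, query):
        _, last = self.history.get(query, (0, 0))
        self.history[query] = (last, next(self.clock))

    def get(self, query):
        if query not in self.results:
            return None
        self._touch(query)
        return self.results[query]

    def put(self, query, results):
        if query not in self.results and len(self.results) >= self.capacity:
            # Tuples compare by second-to-last access, then by last access.
            victim = min(self.results, key=lambda q: self.history[q])
            del self.results[victim]
            del self.history[victim]
        self.results[query] = results
        self._touch(query)


cache = LRU2Cache(capacity=2)
cache.put("a", ["ra"])
cache.get("a")          # "a" now has two recorded accesses
cache.put("b", ["rb"])  # "b" has only one access
cache.put("c", ["rc"])  # evicts "b", even though "b" is more recent than "a"
```

Plain LRU would have evicted "a" in the last step; LRU/2 keeps it because it has been referenced twice.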

(4) SLRU (segmented LRU):

The cache is divided into two parts: an unprotected area and a protected area. The records in each area are ordered from most to least recently accessed; the high end is called the MRU end and the low end the LRU end. If a query is not found in the cache, it is placed at the MRU end of the unprotected area. If a query hits the cache, its record is moved to the MRU end of the protected area; if the protected area is full, the record at the LRU end of the protected area is moved to the MRU end of the unprotected area. As a result, every record in the protected area has been accessed at least twice. Eviction removes the record at the LRU end of the unprotected area.
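The two-area bookkeeping can be sketched as follows. The class `SLRUCache` and its `protected_size` and `unprotected_size` parameters are hypothetical names for this example; the movement rules follow the description above.

```python
from collections import OrderedDict


class SLRUCache:
    """Sketch of segmented LRU with a protected and an unprotected area."""

    def __init__(self, protected_size, unprotected_size):
        self.protected_size = protected_size
        self.unprotected_size = unprotected_size
        # In both areas the LRU end is first and the MRU end is last.
        self.protected = OrderedDict()
        self.unprotected = OrderedDict()

    def _add_unprotected(self, query, results):
        self.unprotected[query] = results
        self.unprotected.move_to_end(query)
        if len(self.unprotected) > self.unprotected_size:
            self.unprotected.popitem(last=False)  # the only real eviction

    def get(self, query):
        if query in self.protected:
            self.protected.move_to_end(query)
            return self.protected[query]
        if query in self.unprotected:
            # Second access: promote to the MRU end of the protected area.
            results = self.unprotected.pop(query)
            self.protected[query] = results
            if len(self.protected) > self.protected_size:
                demoted, demoted_results = self.protected.popitem(last=False)
                self._add_unprotected(demoted, demoted_results)
            return results
        return None

    def put(self, query, results):
        if query in self.protected:
            self.protected[query] = results
            self.protected.move_to_end(query)
        else:
            self._add_unprotected(query, results)  # new queries start here


cache = SLRUCache(protected_size=1, unprotected_size=2)
cache.put("a", ["ra"])
cache.put("b", ["rb"])
cache.get("a")          # hit: "a" moves to the protected area
cache.put("c", ["rc"])
cache.put("d", ["rd"])  # unprotected area overflows, so "b" is evicted
```

A stream of one-off queries ("c", "d", and so on) can only push records out of the unprotected area, so the twice-accessed "a" stays cached.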

(5) Landlord strategy

When a record is added to the cache, it is assigned a value (its DEADLINE). When a record must be evicted, the record with the smallest DEADLINE in the cache is chosen, and the DEADLINE of the evicted record is subtracted from the DEADLINE of every other record in the cache. If a record is hit, its DEADLINE is raised to a certain value.
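A minimal sketch of the Landlord bookkeeping, assuming the caller supplies the DEADLINE on insertion and that a hit raises the DEADLINE to a fixed `hit_deadline`. The article does not say how either value is chosen, so both, along with the class name `LandlordCache`, are assumptions of this example.

```python
class LandlordCache:
    """Sketch of Landlord: evict the smallest deadline, charge it to the rest."""

    def __init__(self, capacity, hit_deadline=3.0):
        self.capacity = capacity
        self.hit_deadline = hit_deadline  # assumed value restored on a hit
        self.results = {}
        self.deadlines = {}

    def get(self, query):
        if query not in self.results:
            return None
        # On a hit, raise the deadline (never lower it).
        self.deadlines[query] = max(self.deadlines[query], self.hit_deadline)
        return self.results[query]

    def put(self, query, results, deadline):
        if query not in self.results and len(self.results) >= self.capacity:
            victim = min(self.deadlines, key=self.deadlines.get)
            charge = self.deadlines.pop(victim)
            del self.results[victim]
            # Every surviving record pays the evicted record's deadline.
            for other in self.deadlines:
                self.deadlines[other] -= charge
        self.results[query] = results
        self.deadlines[query] = deadline


cache = LandlordCache(capacity=2)
cache.put("a", ["ra"], deadline=3.0)
cache.put("b", ["rb"], deadline=1.0)
cache.put("c", ["rc"], deadline=2.0)  # evicts "b"; "a" drops from 3.0 to 2.0
```

Records that are never hit lose value at every eviction and eventually become the minimum, while records that keep getting hit have their DEADLINE topped up.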

(6) TSLRU (topic-based SLRU): the same as the SLRU policy, except that the replacement order is adjusted according to the topic a query belongs to rather than according to the individual query.

(7) TLRU (topic-based LRU)

The basic strategy is the same as LRU, except that the topic of each query is retained. When a query arrives, not only do its search results enter the cache, but the queries of the same topic already in the cache, together with their results, are also refreshed as if they had just entered the cache. TLRU can be seen as LRU over topics, whereas plain LRU is LRU over queries.
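The topic refresh in TLRU can be sketched on top of an ordinary LRU ordering. How a query is mapped to a topic is not specified in the article, so the `topic_of` callable and the small `topics` table below are hypothetical stand-ins for a real topic classifier.

```python
from collections import OrderedDict


class TLRUCache:
    """LRU cache in which an access refreshes every query of the same topic."""

    def __init__(self, capacity, topic_of):
        self.capacity = capacity
        self.topic_of = topic_of      # assumed query -> topic function
        self.entries = OrderedDict()  # least recently refreshed first

    def _refresh_topic(self, query):
        topic = self.topic_of(query)
        same_topic = [q for q in self.entries
                      if q != query and self.topic_of(q) == topic]
        for q in same_topic:
            self.entries.move_to_end(q)
        self.entries.move_to_end(query)  # the accessed query is the newest

    def get(self, query):
        if query not in self.entries:
            return None
        self._refresh_topic(query)
        return self.entries[query]

    def put(self, query, results):
        self.entries[query] = results
        self._refresh_topic(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)


# Hypothetical topic assignments for the example queries.
topics = {"nba scores": "sports", "world cup": "sports",
          "weather": "misc", "stock quotes": "finance"}

cache = TLRUCache(capacity=3, topic_of=topics.get)
cache.put("nba scores", ["r1"])
cache.put("weather", ["r2"])
cache.put("stock quotes", ["r3"])
cache.put("world cup", ["r4"])  # refreshes "nba scores"; evicts "weather"
```

Plain LRU would have evicted "nba scores", the oldest entry; here it is kept because a new query on the same topic arrived.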

(8) PDC (probability driven cache): builds a probabilistic model of the user's browsing behavior and uses it to adjust the priority of records in the cache. For a given query, the priority of the cached result documents that the user goes on to browse is increased.

(9) Prefetch strategy

Prefetching means that the system predicts what the user will do in the very near future and stores the data needed for that behavior in the cache in advance. There are different prefetching strategies. For example, because users who view the first page of results often go on to view the second page, the second page of results for the user's query can be retrieved into the cache ahead of time, which reduces access time.
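The next-page prefetch described above might look like the sketch below. The `search_backend(query, page)` callable, the `PrefetchingCache` class, and the fake backend are assumptions made for illustration; the prefetch is done synchronously here for simplicity, whereas a real system would more likely run it in the background.

```python
from collections import OrderedDict


class PrefetchingCache:
    """LRU cache of result pages that also loads the following page."""

    def __init__(self, capacity, search_backend):
        self.capacity = capacity
        self.search_backend = search_backend  # assumed (query, page) -> results
        self.pages = OrderedDict()            # (query, page) -> results

    def _load(self, key):
        if key in self.pages:
            self.pages.move_to_end(key)
            return self.pages[key]
        results = self.search_backend(*key)
        self.pages[key] = results
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)
        return results

    def get_page(self, query, page):
        results = self._load((query, page))
        self._load((query, page + 1))  # prefetch the page likely to be next
        return results


calls = []  # records every request that reaches the backend


def fake_backend(query, page):
    calls.append((query, page))
    first = (page - 1) * 10 + 1
    return [f"{query} result {i}" for i in range(first, first + 10)]


cache = PrefetchingCache(capacity=4, search_backend=fake_backend)
page_one = cache.get_page("jaguar", 1)  # backend serves pages 1 and 2
page_two = cache.get_page("jaguar", 2)  # page 2 is already cached
```

The second request is answered from the cache; the only backend work it triggers is prefetching page 3.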

(10) Two-level cache

The cache has two levels. The first level is the query result cache, which stores the original query and its result documents. The second level is the inverted list cache, which stores the inverted list from the index for each word in the query and primarily reduces disk I/O time. The replacement strategy is LRU, and the reported results show that this method improves performance by 30%.
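The lookup order of the two levels can be sketched as below. The names `LRU`, `TwoLevelCache`, and `read_postings_from_disk`, the tiny in-memory `index`, and the use of a plain AND over whitespace-separated terms are all assumptions for this example rather than details from the article.

```python
from collections import OrderedDict


class LRU:
    """Small LRU map used for both cache levels."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)


class TwoLevelCache:
    def __init__(self, read_postings_from_disk, result_capacity=100,
                 list_capacity=1000):
        self.read_postings_from_disk = read_postings_from_disk
        self.result_cache = LRU(result_capacity)  # level 1: query -> results
        self.list_cache = LRU(list_capacity)      # level 2: term -> inverted list

    def _postings(self, term):
        postings = self.list_cache.get(term)
        if postings is None:
            postings = self.read_postings_from_disk(term)  # disk I/O happens here
            self.list_cache.put(term, postings)
        return postings

    def search(self, query):
        cached = self.result_cache.get(query)
        if cached is not None:
            return cached
        doc_sets = [set(self._postings(term)) for term in query.split()]
        results = sorted(set.intersection(*doc_sets)) if doc_sets else []
        self.result_cache.put(query, results)
        return results


index = {"cache": [1, 2, 3], "search": [2, 3, 4], "engine": [3, 4]}
disk_reads = []  # records which terms had to be read from "disk"


def read_postings_from_disk(term):
    disk_reads.append(term)
    return index.get(term, [])


engine = TwoLevelCache(read_postings_from_disk)
first = engine.search("search cache")    # reads both inverted lists
second = engine.search("search engine")  # reuses the cached "search" list
third = engine.search("search cache")    # answered by the result cache
```

Only three inverted lists are read from disk across the three searches: the second query reuses the list for "search", and the third never reaches the second level at all.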

(11) Three-level cache

This is an improvement on the two-level cache. In addition to keeping the two caches of the two-level scheme, it adds a third cache that stores the intersection of the inverted lists for two-word queries. This both saves disk I/O time and avoids recomputing the intersection, which effectively reduces the amount of computation.
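A rough sketch of how the third cache fits between the other two, under stated assumptions: `ThreeLevelCache`, `read_postings_from_disk`, the order-independent pair key, and the rule of intersecting the first two query terms through the pair cache are all choices made for this example, since the article gives no such details.

```python
from collections import OrderedDict


class LRU:
    """Small LRU map shared by all three cache levels."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)


class ThreeLevelCache:
    def __init__(self, read_postings_from_disk, capacity=100):
        self.read_postings_from_disk = read_postings_from_disk
        self.result_cache = LRU(capacity)  # query -> results
        self.pair_cache = LRU(capacity)    # (term, term) -> intersection
        self.list_cache = LRU(capacity)    # term -> inverted list

    def _postings(self, term):
        postings = self.list_cache.get(term)
        if postings is None:
            postings = self.read_postings_from_disk(term)
            self.list_cache.put(term, postings)
        return postings

    def _pair(self, first, second):
        key = tuple(sorted((first, second)))  # same key for either word order
        docs = self.pair_cache.get(key)
        if docs is None:
            docs = sorted(set(self._postings(first)) & set(self._postings(second)))
            self.pair_cache.put(key, docs)
        return docs

    def search(self, query):
        cached = self.result_cache.get(query)
        if cached is not None:
            return cached
        terms = query.split()
        if not terms:
            results = []
        elif len(terms) == 1:
            results = sorted(set(self._postings(terms[0])))
        else:
            docs = set(self._pair(terms[0], terms[1]))
            for term in terms[2:]:
                docs &= set(self._postings(term))
            results = sorted(docs)
        self.result_cache.put(query, results)
        return results


index = {"cache": [1, 2, 3], "search": [2, 3, 4], "engine": [3, 4]}
disk_reads = []


def read_postings_from_disk(term):
    disk_reads.append(term)
    return index.get(term, [])


engine = ThreeLevelCache(read_postings_from_disk)
engine.search("search cache")                  # fills the pair cache
results = engine.search("cache search engine")  # reuses the cached pair
```

The second query misses the result cache but finds the ("cache", "search") intersection already computed, so it only needs the inverted list for "engine".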

Three Performance analysis and comparison of cache methods

(1) LRU works well when the cache holds a relatively small number of records.

(2) A medium-sized cache can satisfy a large share of repeated user queries (approximately 20% of queries can be answered from a medium-sized cache).

(3) Strategies that combine recency with hit counts outperform strategies that consider recency alone. Experiments show that FBR, LRU/2, and SLRU always perform better than LRU.

(4) For a small cache, a static cache strategy achieves a higher hit rate than a dynamic cache strategy.

(5) For LRU, the hit rate of a large cache is about 30%.

(6) For a large cache, TLRU is slightly better than LRU, but the difference is small. For a small cache, the conclusion is the opposite.

(7) As the cache grows, the hit rate gradually increases. For SLRU, performance is independent of the relative sizes of its two areas.

(8) PDC achieves a higher hit ratio than the LRU variants, reaching 53%, but its computational complexity is high.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
