/* Copyright notice: This article may be freely reproduced; please credit the original source and the author. */
Research on search engine cache strategy
Zhang Junlin
Date: October 2005
One Findings about search engine user queries
(1) User queries are highly repetitive: 30% to 40% of user queries are repeats of earlier queries.
(2) Most repeated queries are issued again after a short interval.
(3) Most user queries are short, containing about 2-5 words.
(4) Users generally view only the first three pages of returned results (the first 30 results). 58% of users view only the first page (top 10), 15% view the second page, and no more than 12% go on to view the third page.
(5) Query diversity. Queries vary widely: in a sample of 1 million user queries, about 63.7% appear only once. On the other hand, the repeated queries are highly concentrated: the 25 most frequent queries account for about 1.23%-1.5% of all queries.
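Statistics of this kind can be measured on any query log. The following is a minimal sketch, assuming the log is simply a list of query strings; the function name `query_log_stats` and the exact definitions of the three ratios are illustrative assumptions, not the method used in the studies summarized above.

```python
from collections import Counter

def query_log_stats(queries, top_k=25):
    """Return (repeat_share, singleton_share, top_k_share) for a list of query strings."""
    counts = Counter(queries)
    total = len(queries)
    # An occurrence counts as a repeat if the same query was seen earlier in the log.
    repeat_share = (total - len(counts)) / total
    # Assumption: measured over distinct queries, i.e. the share that occur exactly once.
    singleton_share = sum(1 for c in counts.values() if c == 1) / len(counts)
    # Share of all occurrences covered by the top_k most frequent queries.
    top_k_share = sum(c for _, c in counts.most_common(top_k)) / total
    return repeat_share, singleton_share, top_k_share
```

The repeat share is an upper bound on the hit rate that a result cache of unlimited size could reach on the same log.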
Two Basic cache strategies
(1) LRU: Least Recently Used policy
Basic assumption: records that have not been accessed recently are unlikely to be accessed in the near future. This is the simplest cache policy. Cached queries are ordered by their most recent access time, and when space is needed the least recently accessed query is evicted from the cache.
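The LRU policy can be sketched in a few lines. This is a minimal illustration, assuming the cache maps a query string to its results; the class name `LRUCache` and its methods are hypothetical names chosen for this example.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used first, most recent last

    def get(self, query):
        if query not in self.entries:
            return None                      # cache miss
        self.entries.move_to_end(query)      # a hit makes the query the most recent
        return self.entries[query]

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used query
```

For example, with a capacity of 2, inserting "a" and "b", reading "a", and then inserting "c" evicts "b", because "b" is the query that has gone longest without an access.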
(2) FBR (Frequency-Based Replacement): considers not only recency but also reference counts.
Building on the LRU ordering, FBR divides the cache into three sections: new, middle, and old.
New: stores the records that have been accessed most recently;
Old: stores the batch of records that are least recently used;
Middle: stores the batch of records between new and old.
A hit on a record in the new section does not increase its reference count; only hits on records in the middle and old sections do. When a record must be replaced, the record with the lowest reference count in the old section is evicted.
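The FBR rules above can be sketched as follows. This is a simplified illustration under stated assumptions: the sections are defined by position in a single LRU-ordered list, a new record starts with a reference count of 1, and ties in the old section are broken in favor of evicting the least recently used record. The class name `FBRCache` and its parameters are hypothetical.

```python
from collections import OrderedDict

class FBRCache:
    def __init__(self, capacity, new_size, old_size):
        assert old_size >= 1 and new_size + old_size <= capacity
        self.capacity, self.new_size, self.old_size = capacity, new_size, old_size
        self.entries = OrderedDict()  # query -> [results, count]; LRU first, MRU last

    def get(self, query):
        if query not in self.entries:
            return None
        keys = list(self.entries)
        # The new section is the last new_size positions of the LRU order.
        in_new = keys.index(query) >= len(keys) - self.new_size
        if not in_new:
            self.entries[query][1] += 1   # only middle/old hits raise the count
        self.entries.move_to_end(query)
        return self.entries[query][0]

    def put(self, query, results):
        if query in self.entries:
            self.entries[query][0] = results
            return
        if len(self.entries) >= self.capacity:
            # The old section is the first old_size positions of the LRU order.
            old_keys = list(self.entries)[:self.old_size]
            victim = min(old_keys, key=lambda k: self.entries[k][1])
            del self.entries[victim]
        self.entries[query] = [results, 1]
```

Looking up a record's position with `list.index` takes linear time; a production implementation would track section boundaries directly, but the linear scan keeps the sketch short.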
(3) LRU/2: an improvement on LRU. Records are ordered by the time of their second-to-last access rather than their last access, and the record whose second-to-last access is oldest is evicted.
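A minimal sketch of the LRU/2 idea, assuming a logical clock that ticks on every access and treating a record accessed only once as having an infinitely old second-to-last access. The class name `LRU2Cache` is hypothetical, and unlike fuller LRU-K designs this sketch discards the access history of evicted records.

```python
import math

class LRU2Cache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = 0
        self.values = {}
        self.history = {}  # query -> (second_to_last_access, last_access)

    def _touch(self, query):
        self.clock += 1
        _, last = self.history.get(query, (-math.inf, -math.inf))
        self.history[query] = (last, self.clock)  # shift the last access back one slot

    def get(self, query):
        if query not in self.values:
            return None
        self._touch(query)
        return self.values[query]

    def put(self, query, results):
        if query not in self.values and len(self.values) >= self.capacity:
            # Evict the record with the oldest second-to-last access;
            # ties are broken by the oldest last access.
            victim = min(self.values, key=lambda q: self.history[q])
            del self.values[victim]
            del self.history[victim]
        self.values[query] = results
        self._touch(query)
```

With a capacity of 2, inserting "a", reading "a", inserting "b", and then inserting "c" evicts "b" even though "b" was touched more recently than "a", because "b" has been accessed only once. Plain LRU would evict "a" in the same situation.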
(4) SLRU (Segmented LRU):
The cache is divided into two segments: an unprotected segment and a protected segment. The records in each segment are ordered from most to least recently accessed; the most recent end is called the MRU end and the other the LRU end. If a query is not found in the cache, it is placed at the MRU end of the unprotected segment. If a query hits the cache, its record is moved to the MRU end of the protected segment; if the protected segment is full, the record at its LRU end is moved to the MRU end of the unprotected segment. The protected segment therefore holds only records that have been accessed at least twice. Eviction removes the record at the LRU end of the unprotected segment.
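The SLRU movement rules can be sketched with two ordered maps. This is a minimal illustration; the class name `SLRUCache`, the attribute names, and the choice to give each segment a fixed size are assumptions made for the example.

```python
from collections import OrderedDict

class SLRUCache:
    def __init__(self, unprotected_size, protected_size):
        self.unprotected_size = unprotected_size
        self.protected_size = protected_size
        self.unprotected = OrderedDict()  # LRU end first, MRU end last
        self.protected = OrderedDict()

    def _insert_unprotected(self, query, results):
        self.unprotected[query] = results
        self.unprotected.move_to_end(query)
        if len(self.unprotected) > self.unprotected_size:
            self.unprotected.popitem(last=False)  # evict from the unprotected LRU end

    def get(self, query):
        if query in self.protected:
            self.protected.move_to_end(query)
            return self.protected[query]
        if query in self.unprotected:
            results = self.unprotected.pop(query)
            self.protected[query] = results       # a hit promotes the record
            if len(self.protected) > self.protected_size:
                # Demote the protected LRU record to the unprotected MRU end.
                demoted, demoted_results = self.protected.popitem(last=False)
                self._insert_unprotected(demoted, demoted_results)
            return results
        return None

    def put(self, query, results):
        if query in self.protected:
            self.protected[query] = results
            self.protected.move_to_end(query)
        else:
            self._insert_unprotected(query, results)  # misses enter the unprotected segment
```

A record leaves the cache only from the LRU end of the unprotected segment, so a record that has been hit at least once gets a second chance after it is demoted.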
(5) Landlord strategy
When a record is added to the cache, it is assigned a value (its DEADLINE). When a record must be evicted, the record with the smallest DEADLINE in the cache is chosen, and the DEADLINE of the evicted record is subtracted from the DEADLINE of every other record in the cache. When a record is hit, its DEADLINE is raised to a certain value.
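A minimal sketch of the Landlord bookkeeping. The text does not say which value a DEADLINE is raised to on a hit, so this example assumes it is restored to the record's initial value, supplied here as a hypothetical `cost` parameter; the class name `LandlordCache` is also an assumption.

```python
class LandlordCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.values = {}
        self.deadline = {}  # query -> current DEADLINE
        self.cost = {}      # query -> initial DEADLINE (assumed to be a retrieval cost)

    def get(self, query):
        if query not in self.values:
            return None
        # Assumption: a hit restores the DEADLINE to its initial value.
        self.deadline[query] = self.cost[query]
        return self.values[query]

    def put(self, query, results, cost=1.0):
        if query not in self.values and len(self.values) >= self.capacity:
            victim = min(self.deadline, key=self.deadline.get)  # smallest DEADLINE
            rent = self.deadline[victim]
            for table in (self.values, self.deadline, self.cost):
                del table[victim]
            for other in self.deadline:
                self.deadline[other] -= rent  # charge every remaining record
        self.values[query] = results
        self.cost[query] = cost
        self.deadline[query] = cost
```

Because hits push a DEADLINE back up while each eviction pulls all of them down, records that are expensive or frequently hit tend to outlive cheap, idle ones.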
(6) TSLRU: topic-based SLRU. The same as the SLRU policy, except that the replacement order is adjusted according to the topic a query belongs to rather than the query itself.
(7) TLRU: topic-based LRU
The basic strategy is the same as LRU, except that the topic of each query is retained. When a query arrives, not only do its own results enter the cache, but the other cached queries on the same topic, together with their results, are also moved to the most recently used position. TLRU can be seen as LRU over topics, whereas plain LRU is LRU over queries.
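A minimal sketch of TLRU, assuming the caller already knows the topic of each query (how topics are assigned is outside the scope of this example). The class name `TLRUCache` and the `topic` argument are hypothetical.

```python
from collections import OrderedDict

class TLRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # query -> (topic, results); LRU first, MRU last

    def _refresh_topic(self, topic, query):
        # Move every cached query on the same topic to the MRU end,
        # keeping their relative order, then put the accessed query last.
        same_topic = [q for q, (t, _) in self.entries.items() if t == topic and q != query]
        for q in same_topic:
            self.entries.move_to_end(q)
        if query in self.entries:
            self.entries.move_to_end(query)

    def get(self, query):
        if query not in self.entries:
            return None
        topic, results = self.entries[query]
        self._refresh_topic(topic, query)
        return results

    def put(self, query, topic, results):
        self.entries[query] = (topic, results)
        self._refresh_topic(topic, query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the LRU record
```

With a capacity of 3, caching one travel query between two tech queries and then adding a third tech query evicts the travel query, whereas plain LRU would have evicted the oldest tech query.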
(8) PDC (Probability Driven Cache): builds a probabilistic model of users' browsing behavior and uses it to adjust the priority of the records in the cache. For a given query, the priority of the cached result documents that the user is likely to browse to is raised.
(9) Prefetch strategy
Prefetching means that the system predicts what the user will do in the very near future and loads the data that action will need into the cache in advance. There are various prefetching strategies. For example, because users typically look at the second page of results after viewing the first, the second page of results for the user's query is retrieved into the cache ahead of time, which reduces access time.
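The next-page example can be sketched as follows. This is a minimal illustration, assuming a hypothetical callable `search_backend(query, page)` that computes one page of results; the class name `PrefetchingResultCache` is also an assumption. To keep the sketch short, the prefetch runs synchronously and the cache is unbounded, whereas a real system would prefetch in the background and combine this with an eviction policy.

```python
class PrefetchingResultCache:
    def __init__(self, search_backend):
        self.search_backend = search_backend  # hypothetical: (query, page) -> results
        self.pages = {}                       # (query, page) -> cached results

    def _load(self, query, page):
        key = (query, page)
        if key not in self.pages:
            self.pages[key] = self.search_backend(query, page)
        return self.pages[key]

    def get_page(self, query, page=1):
        results = self._load(query, page)
        self._load(query, page + 1)  # prefetch the page the user is likely to ask for next
        return results
```

When the user does move on to the next page, the request is answered from the cache and only the page after it has to be computed.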
(10) Two-level cache
There are two cache levels. The first level is a query result cache, which stores the original query and its result documents. The second level is an inverted list cache, which stores the inverted lists of individual query terms from the index and mainly reduces disk I/O time. The replacement policy is LRU, and the reported results show that this method improves performance by 30%.
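The two levels can be sketched as follows. This is a minimal illustration under stated assumptions: queries are whitespace-separated terms combined with AND semantics, and `read_postings(term)` is a hypothetical callable standing in for reading an inverted list from disk. The class names `LRU` and `TwoLevelCache` are also assumptions.

```python
from collections import OrderedDict

class LRU:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)

class TwoLevelCache:
    def __init__(self, read_postings, result_capacity, list_capacity):
        self.read_postings = read_postings   # hypothetical: term -> iterable of doc ids
        self.results = LRU(result_capacity)  # level 1: query -> result doc ids
        self.lists = LRU(list_capacity)      # level 2: term -> inverted list

    def postings(self, term):
        cached = self.lists.get(term)
        if cached is None:
            cached = frozenset(self.read_postings(term))  # the "disk read"
            self.lists.put(term, cached)
        return cached

    def search(self, query):
        cached = self.results.get(query)
        if cached is not None:
            return cached                    # level-1 hit: no index work at all
        terms = query.split()
        if not terms:
            return []
        docs = sorted(frozenset.intersection(*(self.postings(t) for t in terms)))
        self.results.put(query, docs)
        return docs
```

A repeated query is answered by the first level, while a new query that shares terms with earlier ones still avoids disk reads through the second level.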
(11) Three-level cache
This is an improvement on the two-level cache. It keeps both caches of the two-level design and adds a third cache that stores the intersection of the inverted lists of two query terms. This not only eliminates disk I/O time but also avoids recomputing the intersection, which effectively reduces the amount of computation.
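The added intersection cache can be sketched as follows. This is a minimal illustration under stated assumptions: AND semantics, a hypothetical `read_postings(term)` callable in place of the disk index, and plain unbounded dictionaries instead of an eviction policy so that the three levels stay easy to see. The class name `ThreeLevelCache` and the counter `intersections_computed` are illustrative.

```python
class ThreeLevelCache:
    def __init__(self, read_postings):
        self.read_postings = read_postings  # hypothetical: term -> iterable of doc ids
        self.results = {}         # result cache: query -> sorted doc ids
        self.intersections = {}   # intersection cache: frozenset({t1, t2}) -> doc ids
        self.lists = {}           # inverted list cache: term -> doc ids
        self.intersections_computed = 0

    def postings(self, term):
        if term not in self.lists:
            self.lists[term] = frozenset(self.read_postings(term))
        return self.lists[term]

    def pair(self, t1, t2):
        key = frozenset((t1, t2))  # the order of the two terms does not matter
        if key not in self.intersections:
            self.intersections[key] = self.postings(t1) & self.postings(t2)
            self.intersections_computed += 1
        return self.intersections[key]

    def search(self, query):
        if query in self.results:
            return self.results[query]
        terms = query.split()
        if not terms:
            return []
        if len(terms) == 1:
            docs = self.postings(terms[0])
        else:
            docs = self.pair(terms[0], terms[1])  # reuse a cached pair intersection
            for term in terms[2:]:
                docs = docs & self.postings(term)
        self.results[query] = sorted(docs)
        return self.results[query]
```

A later query that contains the same pair of terms, in either order, starts from the cached intersection instead of intersecting the two full inverted lists again.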
Three Performance analysis and comparison of cache methods
(1) LRU works well when the cache holds a relatively small number of records.
(2) A medium-sized cache can satisfy a large share of repeated user queries (approximately 20% of queries can be found in a medium-sized cache).
(3) Strategies that combine recency with hit counts outperform strategies that consider recency alone. Experiments show that FBR, LRU/2, and SLRU always outperform LRU.
(4) For a small cache, a static cache strategy achieves a higher hit rate than a dynamic cache strategy.
(5) For LRU, the hit rate of a large cache on repeated queries is about 30%.
(6) For a large cache, TLRU is slightly better than LRU, but the difference is small. For a small cache, the conclusion is reversed.
(7) The hit rate rises gradually as the cache grows. For SLRU, performance is independent of the relative sizes of its two segments.
(8) PDC achieves a higher hit ratio than the LRU variants, reaching 53%, but its computational complexity is high.