1. Static and Dynamic Models
Caching is a great help to search engines in reducing query response time and improving system throughput. Search-engine cache models can be divided into two kinds: static and dynamic. The static model uses historical data stored in the query log to load the most frequently accessed items into the cache, and is typically used for cache prefilling. The literature suggests that the static strategy is more effective at raising the hit rate of small caches, while the dynamic strategy performs better for large caches. The dynamic model, given a cache of limited capacity, adds the most frequently accessed items to memory and evicts the items that are accessed infrequently.
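A minimal sketch of how the static model might be seeded from a query log. The log format, cache capacity, and membership test are illustrative assumptions, not details from the cited papers:

```python
from collections import Counter

def build_static_cache(query_log, capacity):
    """Seed a static cache with the most frequently logged queries.

    query_log: iterable of historical query strings.
    capacity:  maximum number of entries the cache may hold.
    """
    freq = Counter(query_log)
    # Keep only the top-`capacity` queries; their results would be
    # fetched once and then served unchanged until the next rebuild.
    return {q for q, _ in freq.most_common(capacity)}

# Usage: a query is a static-cache hit if it appears in the seeded set.
cache = build_static_cache(["foo", "bar", "foo", "baz", "foo", "bar"], capacity=2)
print("foo" in cache, "baz" in cache)   # True False
```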
2. Cache Storage Modes
There are two ways to store cached data: save the query result pages directly, or store only the IDs of the documents matching the query. A cache that directly stores the result pages is called a result cache; it holds everything that must be returned for a query, including the HTML page, titles, URLs, and document summaries, so a hit can be answered immediately. A cache that stores only document IDs holds much smaller entries, so more items fit in the same space and the hit rate of the cache rises, but on every hit the result page must be recomputed from the documents, which increases response time. Weighing the pros and cons of the two, schemes that combine both storage modes have been proposed.
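A rough sketch of the difference between the two entry types. The field names are illustrative, not taken from any particular system:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ResultCacheEntry:
    # Stores everything needed to answer the query immediately.
    html_page: str        # rendered HTML result page
    titles: List[str]     # result titles
    urls: List[str]       # result URLs
    snippets: List[str]   # document summaries

@dataclass
class DocIdCacheEntry:
    # Stores only the matching document IDs. Entries are much smaller,
    # so more queries fit in the same cache space, but on a hit the
    # engine must still fetch the documents and rebuild titles and
    # snippets, which adds to the response time.
    doc_ids: List[int]
```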
A five-level static cache architecture has also been proposed, which maintains separate caches for HTML result pages, DocID results, posting lists (inverted lists), intersections of posting lists, and documents.
Article [2] proposes a hybrid strategy that uses the DocID cache to give a second chance to items about to be evicted from the cache. The cache is divided into two parts, an HTML-page cache and a DocID cache, and a query may have a record in either or both. When a query q arrives and one part of the cache has no record for q, the corresponding item, the HTML page or the DocID list, is filled into that part. When an item is about to be evicted, its record in the HTML cache is dropped but its record in the DocID cache is retained, which effectively gives the query a second chance. If the same query arrives again, it misses in the HTML cache but hits in the DocID cache, which raises the hit rate while keeping response time acceptable. On a hit in the DocID cache, the query's record in the HTML cache is regenerated and completed.
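A rough sketch of this two-part second-chance idea. The backend object and its render_html / compute_doc_ids methods are placeholders for the real query processor, and the LRU ordering inside each part is an assumption:

```python
from collections import OrderedDict

class SecondChanceCache:
    """Sketch of the two-part cache of [2]: an HTML result cache backed
    by a cheaper DocID cache that gives evicted queries a second chance."""

    def __init__(self, html_cap, docid_cap):
        self.html = OrderedDict()     # query -> rendered result page
        self.docids = OrderedDict()   # query -> list of document IDs
        self.html_cap = html_cap
        self.docid_cap = docid_cap

    def get(self, q, backend):
        if q in self.html:                      # fast path: full page cached
            self.html.move_to_end(q)
            return self.html[q]
        if q in self.docids:                    # second chance: rebuild the page
            self.docids.move_to_end(q)
            page = backend.render_html(self.docids[q])
            self._put_html(q, page)             # re-complete the HTML record
            return page
        doc_ids = backend.compute_doc_ids(q)    # full miss: hit the index
        page = backend.render_html(doc_ids)
        self._put_docids(q, doc_ids)
        self._put_html(q, page)
        return page

    def _put_html(self, q, page):
        self.html[q] = page
        if len(self.html) > self.html_cap:
            # Evict from the HTML cache only; the victim's DocID record
            # (if any) is kept, so a repeat query still hits in DocID.
            self.html.popitem(last=False)

    def _put_docids(self, q, doc_ids):
        self.docids[q] = doc_ids
        if len(self.docids) > self.docid_cap:
            self.docids.popitem(last=False)
```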
The article further notes that nearly half of the queries in the query log are singletons, that is, a large number of queries will not be repeated in the near future. Adding these queries to both parts of the cache wastes cache space. To reduce this problem, the paper proposes using feature extraction over the query terms to predict whether a query will be a singleton; if it is predicted to be one, only a DocID record is added. Although the prediction cannot be fully accurate, it still improves efficiency for part of the query stream.
3. Cache Policy
To increase the cache hit rate, a dynamic cache policy is usually composed of an admission policy and an eviction policy. The admission policy moves the most frequently accessed items, or the items most likely to be re-accessed in the near future, into the cache. The eviction policy identifies the items least likely to be re-accessed and moves them out of the cache, so that the cache keeps the items most likely to be requested again. The algorithms commonly used are described below [1].
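A minimal sketch of this composition. The admit and choose_victim callables are hypothetical interfaces, not taken from the cited survey:

```python
class DynamicCache:
    """Dynamic cache = admission policy (what may enter) +
    eviction policy (what leaves when the cache is full)."""

    def __init__(self, capacity, admit, choose_victim):
        self.capacity = capacity
        self.admit = admit                    # admit(query) -> bool
        self.choose_victim = choose_victim    # choose_victim(store) -> key to evict
        self.store = {}

    def get(self, query):
        return self.store.get(query)

    def put(self, query, result):
        if not self.admit(query):             # admission: skip unlikely repeats
            return
        if query not in self.store and len(self.store) >= self.capacity:
            victim = self.choose_victim(self.store)
            del self.store[victim]            # eviction: drop the least promising item
        self.store[query] = result
```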
3.1. Common Algorithms
(1) LRU (Least Recently Used)
Basic assumption: records that have not been accessed recently will not be accessed in the near future. This is the simplest cache policy: queries are ordered by the time of their most recent access, and eviction removes the oldest query from the cache. (Python sketches of this and several of the policies below are given after this list.)
(2) FBR (Frequency-Based Replacement): considers not only recency but also reference counts.
FBR divides the cache into three sections on top of the LRU ordering: new, old, and middle. New stores the most recently accessed records; old stores the least recently used records; middle stores the records in between. Hits on records in the new section do not increment the reference count; only hits in the middle and old sections do. When a record must be replaced, the record with the lowest reference count in the old section is evicted.
(3) LRU/2: an improvement on LRU that ranks records by the time of their second-to-last access; the record whose second-to-last access is oldest is evicted.
(4) SLRU: the cache is divided into two segments, an unprotected (probationary) segment and a protected segment. Within each segment, records are ordered by recency; the most recently used end is called the MRU end and the least recently used end the LRU end. If a query is not found in the cache, it is placed at the MRU end of the unprotected segment; if a query hits in the cache, its record is moved to the MRU end of the protected segment; and if the protected segment is full, its LRU record is demoted to the MRU end of the unprotected segment. In this way, records that have been accessed at least twice are retained the longest. Eviction removes the record at the LRU end of the unprotected segment.
(5) Landlord: when a record is added to the cache, it is assigned a value (a deadline). When a record must be evicted, the one with the smallest deadline in the cache is chosen; at the same time, the deadline of the evicted record is subtracted from the deadlines of all other records in the cache. If a record is hit, its deadline is raised back up to a certain value.
(6) TSLRU (topic-based SLRU): the same as SLRU, except that the replacement decisions are made according to the topic a query belongs to rather than the individual query.
(7) TLRU (topic-based LRU)
The basic strategy is the same as LRU, except that the topic information of each query is retained. For an incoming query, not only do its own results enter the cache, but the cached queries of the same topic, together with their results, are also refreshed as if they had just entered the cache.
(8) PDC (probability-driven cache): builds a probabilistic model of the user's browsing behavior and uses it to adjust the priority of records in the cache; for a given query, the priority of the cached documents the user is predicted to browse is increased.
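Minimal Python sketches of several of the policies above. All capacities, segment sizes, and parameter values are illustrative assumptions, not taken from the cited papers.

LRU, using an ordered dictionary as the recency list:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()          # ordered oldest -> newest access

    def get(self, query):
        if query not in self.entries:
            return None                       # miss
        self.entries.move_to_end(query)       # refresh recency on a hit
        return self.entries[query]

    def put(self, query, result):
        if query in self.entries:
            self.entries.move_to_end(query)
        self.entries[query] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
```

FBR, with the new / middle / old split; the section sizes here are simple fractions of the capacity, which is an assumption:

```python
class FBRCache:
    def __init__(self, capacity, new_frac=0.25, old_frac=0.25):
        self.capacity = capacity
        self.new_size = max(1, int(capacity * new_frac))
        self.old_size = max(1, int(capacity * old_frac))
        self.order = []    # MRU at index 0, LRU at the end
        self.count = {}    # query -> reference count
        self.value = {}    # query -> cached result

    def _section(self, idx):
        if idx < self.new_size:
            return "new"
        if idx >= len(self.order) - self.old_size:
            return "old"
        return "middle"

    def get(self, query):
        if query not in self.value:
            return None
        if self._section(self.order.index(query)) != "new":
            self.count[query] += 1            # counts frozen in the new section
        self.order.remove(query)
        self.order.insert(0, query)           # move to the MRU position
        return self.value[query]

    def put(self, query, result):
        if query in self.value:               # refresh an existing record
            self.get(query)
            self.value[query] = result
            return
        if len(self.order) >= self.capacity:
            old_section = self.order[len(self.order) - self.old_size:]
            # Victim: lowest reference count in the old section,
            # ties broken toward the LRU end.
            victim = min(reversed(old_section), key=lambda q: self.count[q])
            self.order.remove(victim)
            del self.count[victim]
            del self.value[victim]
        self.order.insert(0, query)
        self.count[query] = 1
        self.value[query] = result
```

SLRU, with an unprotected (probationary) segment and a protected segment:

```python
from collections import OrderedDict

class SLRUCache:
    def __init__(self, unprotected_cap, protected_cap):
        self.unprotected = OrderedDict()      # LRU end first, MRU end last
        self.protected = OrderedDict()
        self.unprotected_cap = unprotected_cap
        self.protected_cap = protected_cap

    def get(self, query):
        if query in self.protected:
            self.protected.move_to_end(query)      # refresh inside protected
            return self.protected[query]
        if query in self.unprotected:
            result = self.unprotected.pop(query)   # promote on a repeat access
            self._promote(query, result)
            return result
        return None

    def put(self, query, result):
        if query in self.protected:                # already promoted
            self.protected[query] = result
            self.protected.move_to_end(query)
            return
        # A query not found in the cache enters at the MRU end of the
        # unprotected segment; its LRU end is the eviction point.
        self.unprotected[query] = result
        self.unprotected.move_to_end(query)
        if len(self.unprotected) > self.unprotected_cap:
            self.unprotected.popitem(last=False)

    def _promote(self, query, result):
        self.protected[query] = result
        self.protected.move_to_end(query)
        if len(self.protected) > self.protected_cap:
            # Demote the protected LRU record to the unprotected MRU side.
            demoted, val = self.protected.popitem(last=False)
            self.put(demoted, val)
```

Landlord, here simplified so that every record gets the same initial deadline and a hit restores it to that value:

```python
class LandlordCache:
    def __init__(self, capacity, initial_deadline=1.0):
        self.capacity = capacity
        self.initial_deadline = initial_deadline
        self.deadline = {}   # query -> remaining credit
        self.value = {}

    def get(self, query):
        if query not in self.value:
            return None
        self.deadline[query] = self.initial_deadline   # hit: raise the deadline back up
        return self.value[query]

    def put(self, query, result):
        if query not in self.value and len(self.value) >= self.capacity:
            victim = min(self.deadline, key=self.deadline.get)
            charge = self.deadline.pop(victim)
            del self.value[victim]
            # Subtract the evicted record's deadline from all survivors.
            for q in self.deadline:
                self.deadline[q] -= charge
        self.value[query] = result
        self.deadline[query] = self.initial_deadline
```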
3.2. Prefetch Strategy
Prefetching means that the system predicts the user's behavior over the near term and stores the data involved in that behavior in the cache in advance: the items the user is expected to request are loaded into the cache before the query is actually issued, so that when the user does retrieve, the results can be returned relatively quickly. Different prefetching strategies exist. For example, because the average user looks at the second page of results after viewing the first page, the second page of results for a query can be prefetched into the cache, which reduces access time.
For example, when a search engine receives a request <Q, i>, meaning "return page i of the results for query Q", it also proactively moves the next F result pages for Q, <Q, i+1>, <Q, i+2>, ..., <Q, i+F>, into the cache. It has been reported that this increases the cache hit rate, but it also increases the load on the server back end, and to make room for the prefetched items, entries that were already in the cache may have to be evicted.
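A minimal sketch of this behavior. The cache is assumed to expose get/put (for instance the LRU sketch above), and backend.fetch_page stands in for the real query processor:

```python
def answer_with_prefetch(cache, backend, query, page, prefetch_factor):
    """Answer <query, page> and prefetch the next `prefetch_factor` pages."""
    result = cache.get((query, page))
    if result is None:                        # miss: ask the back end
        result = backend.fetch_page(query, page)
        cache.put((query, page), result)
    # Move pages page+1 .. page+F into the cache before the user asks.
    for i in range(page + 1, page + prefetch_factor + 1):
        if cache.get((query, i)) is None:
            cache.put((query, i), backend.fetch_page(query, i))
    return result
```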
Some interesting observations have been made for improving the prefetching strategy. For example, if the cache misses when a query requests the first page of results, the probability that the user will go on to request the second page is only about 10%; if the first page hits in the cache, that probability rises to about 50%. The number of prefetched pages F should therefore be chosen according to the query context rather than being fixed.
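A toy illustration of that adaptive choice; the concrete values of F are illustrative, not taken from the cited work:

```python
def choose_prefetch_factor(first_page_hit):
    # Prefetching further pages pays off mainly when page 1 was a cache
    # hit (roughly a 50% chance of a follow-up request, versus about 10%
    # on a miss), so prefetch more aggressively in that case.
    return 3 if first_page_hit else 0
```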
Feature-based caching: by extracting features of the query terms (such as query length and frequency), decide which queries may enter the cache and which should be evicted.
4. Power-Law Distribution
On the Internet, 80% of the links point to 15% of the pages; a small number of nodes hold a large share of the connections, much as 20% of the people hold 80% of the wealth. Almost all of the networks scientists have studied follow power-law distributions: the rich get richer, and nodes that already have many connections attract even more. This is an important observation for search engines when computing page rank and improving retrieval efficiency. In one sample of one million user queries, about 63.7% of the queries appeared only once, while the 25 highest-frequency queries accounted for roughly 1.23%-1.5% of all queries. Under a power-law distribution, the higher a query term ranks in popularity, the more users retrieve it. Designing the cache strategy around this distribution can effectively improve the cache hit rate.
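A small experiment, sketched under the assumption that query popularity roughly follows a Zipf distribution (the exponent, vocabulary size, and sample size below are illustrative), showing how much of the request stream the most popular queries cover:

```python
import numpy as np

def top_k_coverage(num_queries, vocab_size, k, zipf_s=1.1, seed=0):
    """Fraction of a synthetic Zipf-distributed query stream covered
    by its k most popular queries."""
    rng = np.random.default_rng(seed)
    ranks = rng.zipf(zipf_s, size=num_queries)
    ranks = ranks[ranks <= vocab_size]            # truncate the heavy tail
    _, counts = np.unique(ranks, return_counts=True)
    counts = np.sort(counts)[::-1]
    return counts[:k].sum() / counts.sum()

# With skew this strong, caching only a small set of hot queries already
# covers a large share of all requests.
print(top_k_coverage(num_queries=1_000_000, vocab_size=100_000, k=1_000))
```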
5. Search Engine System Structure
To support high query throughput, commercial search engines use a parallel architecture [3]. A user's query is received by a broker, which assigns it to back-end servers for processing. These parallel servers also adopt a multi-replica layout, in which the document collection and the inverted index are replicated across servers; at Google and Baidu, for example, the full inverted index is copied to each server. Each server can then process its tasks locally, without communication cost, which increases the query throughput of the system. Experiments have shown that this parallel, multi-replica architecture outperforms a distributed architecture that does not use replication. Figure: structure of a parallel, multi-replica search engine (broker plus identical replicas).
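A toy sketch of that broker/replica layout; the Replica and Broker classes and the in-memory index are hypothetical simplifications:

```python
import itertools

class Replica:
    """One search server holding a full copy of the inverted index."""
    def __init__(self, index):
        self.index = index                    # e.g. {term: [doc ids]}

    def search(self, query):
        # Processed entirely locally: no communication with other servers.
        return self.index.get(query, [])

class Broker:
    """Receives user queries and round-robins them across identical
    replicas; real brokers also balance on load and handle failures."""
    def __init__(self, replicas):
        self._next = itertools.cycle(replicas)

    def handle(self, query):
        return next(self._next).search(query)

# Usage: three replicas, each with the same (tiny) index.
index = {"cache": [1, 5, 9], "search": [2, 3]}
broker = Broker([Replica(dict(index)) for _ in range(3)])
print(broker.handle("cache"))                 # [1, 5, 9]
```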
References:
[1] Search Engine Cache Strategy Research
[2] Second Chance: A Hybrid Approach for Dynamic Result Caching and Prefetching in Search Engines
[3] Diversified Caching for Replicated Web Search Engines