10.7 Imagine a web server for a simplified search engine. This system has 100 machines to respond to search queries, which may then call out using processSearch(string query) to another cluster of machines to actually get the result. The machine that responds to a given query is chosen at random, so you cannot guarantee that the same machine will always respond to the same request. The method processSearch is very expensive. Design a caching mechanism for the most recent queries. Be sure to explain how you would update the cache when data changes.
This problem gives us a simplified search engine web server: the system has 100 machines to respond to queries, each of which can call processSearch(string query) to get results from another cluster of machines. The machine that responds to a query is chosen at random, so there is no guarantee that the same machine will always respond to the same request. processSearch is very expensive, and we need to design a caching mechanism for recent searches. As described in the book, let's start by making some assumptions:
1. Other than calling out to processSearch as necessary, all query processing happens on the machine that was initially called.
2. The number of queries we wish to cache is very large.
3. Calls between machines are fast.
4. The result of a search is an ordered list of URLs, each of which has a 50-character title and a 200-character summary (a rough size estimate follows this list).
5. The most popular queries are so popular that they will always appear in the cache.
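As a rough sanity check on assumptions 2 and 4 (the numbers here are illustrative, not given in the problem): if each URL entry carries roughly 250 characters (a 50-character title plus a 200-character summary plus the URL itself) and a result list holds, say, 20 URLs, then one cached query costs on the order of 5 KB. Caching a few million queries would therefore need roughly 10 GB or more, which is part of why step two below spreads the cache across many machines.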
System Requirements:
The main requirements are to implement the following two functions:
1. Efficient lookup when a query is given
2. Expiration of old data so that it can be replaced by new data
We also need to update or clear the cache when search results change. Because some very popular queries stay in the cache permanently, we cannot simply wait for those entries to expire naturally.
Step one: Design a cache for a single machine
We can use a combination of a linked list and a hash table. We build a linked list in which a node is moved to the front whenever it is accessed, so the tail of the list always holds the oldest data. We use a hash table to map each query to its node in the list, which lets us both return cached results efficiently and move the corresponding node to the front of the list. See the code below:
#include <string>
#include <vector>
#include <unordered_map>
using namespace std;

class Node {
public:
    Node* pre = nullptr;
    Node* next = nullptr;
    vector<string> results;
    string query;
    Node(string q, vector<string> res) : results(res), query(q) {}
};

class Cache {
public:
    const static int MAX_SIZE = 10;
    Node* head = nullptr;
    Node* tail = nullptr;
    unordered_map<string, Node*> m;
    int size = 0;

    Cache() {}

    // Move a node to the front of the list (the most recently used position).
    void moveToFront(Node* node) {
        if (node == head) return;
        removeFromLinkedList(node);
        node->next = head;
        if (head != nullptr) head->pre = node;
        head = node;
        ++size;
        if (tail == nullptr) tail = node;
    }

    void moveToFront(string query) {
        if (m.find(query) == m.end()) return;
        moveToFront(m[query]);
    }

    // Unlink a node from the list, if it is currently linked.
    void removeFromLinkedList(Node* node) {
        if (node == nullptr) return;
        if (node->next != nullptr || node->pre != nullptr || node == head) --size;
        Node* pre = node->pre;
        if (pre != nullptr) pre->next = node->next;
        Node* next = node->next;
        if (next != nullptr) next->pre = pre;
        if (node == head) head = next;
        if (node == tail) tail = pre;
        node->next = nullptr;
        node->pre = nullptr;
    }

    // Return the cached results for a query, or an empty vector on a miss.
    vector<string> getResults(string query) {
        if (m.find(query) == m.end()) return vector<string>();
        Node* node = m[query];
        moveToFront(node);   // accessing an entry makes it the most recently used
        return node->results;
    }

    // Insert (or refresh) results, evicting the least recently used entry when full.
    void insertResults(string query, vector<string> results) {
        if (m.find(query) != m.end()) {   // already cached: update and refresh its position
            Node* node = m[query];
            node->results = results;
            moveToFront(node);
            return;
        }
        Node* node = new Node(query, results);
        moveToFront(node);
        m[query] = node;
        if (size > MAX_SIZE) {            // evict the least recently used entry (the tail)
            Node* old = tail;
            m.erase(old->query);
            removeFromLinkedList(old);
            delete old;
        }
    }
};
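As a quick illustration of how this single-machine cache would be used on a miss versus a hit (the example query string is made up, and processSearch stands for the expensive call from the problem statement):

// Cache miss: getResults returns an empty vector, so we pay for processSearch once and cache it.
Cache cache;
vector<string> results = cache.getResults("weather today");
if (results.empty()) {
    results = processSearch("weather today");       // the expensive call we want to avoid
    cache.insertResults("weather today", results);
}
// Cache hit: the entry is returned directly and moved to the front of the LRU list.
vector<string> cached = cache.getResults("weather today");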
Step two: Extend to multiple machines
For multiple machines, we have a number of options:
Option 1: Each machine has its own independent cache. The advantage is speed, since there are no inter-machine calls; the disadvantage is that the cache is less effective, because the same query handled by different machines will be recomputed and cached many times.
Option 2: Each machine holds a complete copy of the cache. When a new item is added, it is sent to all machines, so common queries exist on every machine. The disadvantage is that the cache is limited to what fits on a single machine, so we cannot store a large amount of data.
Option 3: Each machine stores one segment of the cache. When machine i needs the results for a query, it has to figure out which machine holds them and fetch the results from that machine. The question is how machine i knows which machine that is: we can assign queries to machines with hash(query) % N, which immediately tells us which machine is responsible for a given query. A small sketch of this routing follows.
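This sketch assumes N machines numbered 0 to N-1, reuses the Cache class from step one, and takes the inter-machine call as a caller-supplied forward function (a hypothetical RPC helper, not defined in the text above):

#include <functional>
#include <string>
#include <vector>
using namespace std;

// All machines must use the same hash function so they agree on who owns each query.
int machineFor(const string& query, int numMachines) {
    return static_cast<int>(std::hash<string>{}(query) % numMachines);
}

// On machine myId: serve the query from the local cache segment if we own it,
// otherwise forward it to the owning machine.
vector<string> getResultsDistributed(const string& query, int myId, int numMachines,
                                     Cache& localCache,
                                     vector<string> (*forward)(int machine, const string& query)) {
    int owner = machineFor(query, numMachines);
    if (owner == myId) return localCache.getResults(query);
    return forward(owner, query);
}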
Step three: Update the results when the content changes
Some queries are so popular that they will stay in the cache forever, so we need a mechanism to update their results: when the underlying content changes, the cached result pages should change with it. There are three main cases:
1. The content at a URL changes
2. A page's ranking changes, so the order of the results changes
3. A new page appears that is relevant to a particular query
For cases 1 and 2, we can keep a separate hash table that tells us which cached queries are tied to which URL. This can be maintained separately on each machine, but it may require a lot of data. Alternatively, if the data does not need to be refreshed immediately, we can periodically crawl through the cache and update it. For case 3, we can implement an automatic timeout: we pick a time limit so that no entry, no matter how popular, can sit in the cache longer than that limit, which guarantees that all data is refreshed periodically.
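As a rough sketch of the hash table for cases 1 and 2 (the urlToQueries name and the eager purge are assumptions for illustration, not part of the book's solution), reusing the Cache class from step one:

#include <string>
#include <unordered_map>
#include <unordered_set>
using namespace std;

// Inverse index: for each URL, the set of cached queries whose results contain it.
unordered_map<string, unordered_set<string>> urlToQueries;

// When a URL's content or ranking changes, purge the affected queries so they are
// recomputed by processSearch on their next request.
void invalidateUrl(Cache& cache, const string& url) {
    auto hit = urlToQueries.find(url);
    if (hit == urlToQueries.end()) return;
    for (const string& query : hit->second) {
        auto entry = cache.m.find(query);
        if (entry == cache.m.end()) continue;
        Node* node = entry->second;
        cache.removeFromLinkedList(node);   // unlink from the LRU list
        cache.m.erase(entry);               // drop the lookup entry
        delete node;
    }
    urlToQueries.erase(hit);
}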
Step four: Further enhancements
One optimization: when a query is especially popular, say making up 1% of all queries, then instead of machine i forwarding the request to machine j every time, machine i should keep a copy of the result in its own cache.
Another change is that we could assign queries to machines based on their hash value (and therefore the location of the cache entry) instead of randomly.
Another optimization concerns the automatic timeout mechanism mentioned earlier, which erases data after X minutes. Sometimes we want different values of X for different data, so that each URL gets a timeout based on how frequently its page is updated.
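A small sketch of per-entry timeouts under that idea (the expiresAt map and ttlSeconds parameter are illustrative names, not from the book):

#include <ctime>
#include <string>
#include <unordered_map>
using namespace std;

// Each cached query gets its own deadline; pages that change often get a smaller ttlSeconds.
unordered_map<string, time_t> expiresAt;

void setTimeout(const string& query, long ttlSeconds) {
    expiresAt[query] = time(nullptr) + ttlSeconds;
}

// getResults would treat an expired query as a miss and fall back to processSearch.
bool isExpired(const string& query) {
    auto it = expiresAt.find(query);
    return it != expiresAt.end() && time(nullptr) > it->second;
}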