[Careercup] 10.7 Simplified Search Engine

Source: Internet
Author: User

10.7 Imagine a web server for a simplified search engine. This system has 100 machines to respond to search queries, which may then call out using processSearch(string query) to another cluster of machines to actually get the result. The machine which responds to a given query is chosen at random, so you cannot guarantee that the same machine will always respond to the same request. The method processSearch is very expensive. Design a caching mechanism for the most recent queries. Be sure to explain how you would update the cache when data changes.

This problem describes a web server for a simplified search engine. The system has 100 machines responding to search queries, and each of them can call processSearch(string query) on another cluster of machines to actually get the results. The machine responding to a query is chosen at random, so there is no guarantee that the same machine will always respond to the same request. The processSearch method is very expensive, and we are asked to design a caching mechanism for recent searches. Following the book, let's start by making some assumptions:

1. Other than calling processSearch as necessary, all query processing happens on the machine that was initially called.

2. The number of queries we want to cache is very large (millions).

3. Calls between machines are relatively fast.

4. The result of a query is an ordered list of URLs, each of which has a 50-character title and a 200-character summary.

5. The most popular queries are extremely popular, so they will essentially always appear in the cache.

System Requirements:

The cache needs to support two main operations:

1. Efficient lookups given a query (the key)

2. Expiration of old data so that it can be replaced with new data

We also need to be able to update or clear the cache when search results change. Because some very popular queries effectively stay in the cache permanently, we cannot just wait for those entries to age out naturally.

Step one: Design a cache for a single machine

We can combine a doubly linked list with a hash table. We maintain a linked list in which a node is moved to the front whenever it is accessed, so the node at the tail of the list is always the least recently used one. We use a hash table to map each query to its node in the list, which lets us both return cached results efficiently and move the corresponding node to the front of the list. See the code below:

#include <string>
#include <vector>
#include <unordered_map>
using namespace std;

// One cached query and its results; a node in the doubly linked list.
class Node {
public:
    Node *pre = nullptr;
    Node *next = nullptr;
    vector<string> results;
    string query;
    Node(string q, vector<string> res) : results(res), query(q) {}
};

// LRU cache: the hash table gives O(1) lookup by query, the linked list
// keeps entries ordered from most recently used (head) to least (tail).
class Cache {
public:
    const static int MAX_SIZE = 10;
    Node *head = nullptr, *tail = nullptr;
    unordered_map<string, Node*> m;
    int size = 0;
    Cache() {}

    // Move a node to the front of the list, marking it most recently used.
    void moveToFront(Node *node) {
        if (node == head) return;
        removeFromLinkedList(node);
        node->next = head;
        if (head != nullptr) head->pre = node;
        head = node;
        ++size;
        if (tail == nullptr) tail = node;
    }

    void moveToFront(string query) {
        if (m.find(query) == m.end()) return;
        moveToFront(m[query]);
    }

    // Unlink a node from the list; a node that was never linked
    // (pre and next both null) does not change the size.
    void removeFromLinkedList(Node *node) {
        if (node == nullptr) return;
        if (node->next != nullptr || node->pre != nullptr) --size;
        Node *pre = node->pre;
        if (pre != nullptr) pre->next = node->next;
        Node *next = node->next;
        if (next != nullptr) next->pre = pre;
        if (node == head) head = next;
        if (node == tail) tail = pre;
        node->next = nullptr;
        node->pre = nullptr;
    }

    // Return cached results for a query, or an empty vector on a miss.
    vector<string> getResults(string query) {
        if (m.find(query) == m.end()) return vector<string>();
        Node *node = m[query];
        moveToFront(node);
        return node->results;
    }

    // Insert (or refresh) results for a query; evict the LRU entry when full.
    void insertResults(string query, vector<string> results) {
        if (m.find(query) != m.end()) {
            Node *node = m[query];
            node->results = results;
            moveToFront(node);
            return;
        }
        Node *node = new Node(query, results);
        moveToFront(node);
        m[query] = node;
        if (size > MAX_SIZE) {
            Node *old = tail;
            m.erase(old->query);      // erase by key instead of iterating the map
            removeFromLinkedList(old);
            delete old;               // avoid leaking the evicted node
        }
    }
};
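As a quick sanity check of the class above, here is a minimal usage sketch (it assumes the Cache class is defined in the same file; the query strings and result lists are made up for illustration, since in the real system they would come back from processSearch):

#include <iostream>

int main() {
    Cache cache;
    // Pretend these result lists came back from processSearch(query).
    cache.insertResults("lru cache", {"url1 | title | summary", "url2 | title | summary"});
    cache.insertResults("careercup 10.7", {"url3 | title | summary"});

    vector<string> hit  = cache.getResults("lru cache");   // hit: the node moves to the front
    vector<string> miss = cache.getResults("page rank");   // miss: returns an empty vector

    cout << "hit returned "  << hit.size()  << " results" << endl;
    cout << "miss returned " << miss.size() << " results" << endl;
    return 0;
}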

Step two: Extend to multiple machines

For multiple machines, we have a number of options:

Option 1: Each machine has its own cache. The advantage of this method is speed, since there are no inter-machine calls; the disadvantage is that it is less effective, because a repeated query will often land on a different machine and be treated as a fresh one.

Option 2: Each machine keeps a complete copy of the cache; when a new item is added, it is sent to all machines, so common queries exist on every machine. The disadvantage is that the cache is limited to what a single machine can hold, so we cannot store a large amount of data.

Option 3: Each machine stores a segment of the cache. When machine i needs the results for a query, it must figure out which machine holds them and fetch the results from that machine. The question is how machine i knows which machine has the results: we can assign queries to machines with a formula such as hash(query) % N, so any machine can quickly compute which machine owns the cached results for a given query, as in the sketch below.
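To make option 3 concrete, here is a minimal routing sketch, assuming N = 100 machines as in the problem statement; the machine id and the use of std::hash to partition queries are illustrative choices, not something the book prescribes:

#include <functional>
#include <iostream>
#include <string>
using namespace std;

const int N = 100;  // number of machines responding to queries

// Which machine's cache segment owns this query: hash(query) % N.
int ownerOf(const string &query) {
    return static_cast<int>(hash<string>{}(query) % N);
}

int main() {
    int myId = 7;    // hypothetical id of the machine handling this request
    string query = "simplified search engine";
    int owner = ownerOf(query);
    if (owner == myId)
        cout << "serve the query from the local cache segment" << endl;
    else
        cout << "forward the query to machine " << owner << endl;
    return 0;
}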

Step three: Update the results when the content changes

Some queries are so popular that they will stay in the cache essentially forever, so we need a mechanism for updating cached results when the underlying content changes. The cached result pages should change accordingly, mainly in the following three cases:

1. The contents of a URL change.

2. The ranking of pages changes, so the ordering of the results changes.

3. A new page appears that is relevant to a particular query.

For cases 1 and 2, we can build a separate hash table that tells us which cached queries a given URL is mapped to (a sketch follows below). This can be maintained separately on different machines, though it may require a lot of data. Alternatively, if the data does not need to be refreshed immediately, we can simply update the cache periodically. For case 3, we can implement an automatic time-out mechanism: we impose a time limit so that no entry, no matter how popular, stays in the cache longer than that limit; this way all data gets refreshed periodically.
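A minimal sketch of the URL-to-query table for cases 1 and 2 (the class and method names are hypothetical; the book only describes the idea of a reverse mapping):

#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
using namespace std;

// Reverse index from a URL to the queries whose cached results contain it.
class InvalidationIndex {
public:
    unordered_map<string, unordered_set<string>> urlToQueries;

    // Record, at cache-insertion time, which URLs a query's results contain.
    void registerResults(const string &query, const vector<string> &urls) {
        for (const string &url : urls) urlToQueries[url].insert(query);
    }

    // When a URL's content or rank changes, return the queries to purge or refresh.
    vector<string> queriesToInvalidate(const string &url) {
        vector<string> out;
        auto it = urlToQueries.find(url);
        if (it != urlToQueries.end())
            out.assign(it->second.begin(), it->second.end());
        return out;
    }
};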

Step four: Further enhancements

One optimization: when a query is especially frequent on a given machine, say it accounts for 1% of all requests, it is better for machine i to keep a copy of the results in its own cache rather than forward the request to machine j every time; a sketch follows below.
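One way to sketch this, assuming a per-machine hit counter and an arbitrary hotness threshold (both are illustrative assumptions, not part of the original solution):

#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

// Once a query accounts for a large share of traffic on this machine, keep a
// local copy instead of forwarding to its owning machine every time.
class HotQueryCache {
public:
    static const int HOT_THRESHOLD = 1000;   // hypothetical cutoff
    unordered_map<string, int> hitCount;
    unordered_map<string, vector<string>> localCopy;

    // Count the hit and serve locally if we already keep a copy here.
    bool tryGetLocal(const string &query, vector<string> &results) {
        ++hitCount[query];
        auto it = localCopy.find(query);
        if (it == localCopy.end()) return false;
        results = it->second;
        return true;
    }

    // Called with results fetched from the owning machine; keep them locally
    // only once the query has proven to be hot on this machine.
    void maybeKeepLocal(const string &query, const vector<string> &results) {
        if (hitCount[query] >= HOT_THRESHOLD) localCopy[query] = results;
    }
};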

Another improvement is to assign queries to machines based on their hash value (and therefore the location of their cache entry) instead of randomly, so the machine that responds to a query is also the one holding its cached results.

Another optimization concerns the automatic time-out mechanism mentioned earlier, which erases data after X minutes. Sometimes we want different X values for different data, so each URL can be given a time-out value based on how frequently that page has been updated in the past; see the sketch below.
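A minimal sketch of such a per-entry time-out, using std::chrono and a TTL chosen per query; the structure and names are assumptions for illustration:

#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;
using namespace std::chrono;

// Each cached query carries its own expiry time, derived (elsewhere) from
// how frequently its pages tend to change.
struct TimedEntry {
    vector<string> results;
    steady_clock::time_point expiresAt;
};

class TimedCache {
public:
    unordered_map<string, TimedEntry> entries;

    void put(const string &query, const vector<string> &results, minutes ttl) {
        entries[query] = {results, steady_clock::now() + ttl};
    }

    // Returns false (a miss) if the entry is absent or has timed out.
    bool get(const string &query, vector<string> &results) {
        auto it = entries.find(query);
        if (it == entries.end() || steady_clock::now() >= it->second.expiresAt) {
            if (it != entries.end()) entries.erase(it);  // drop the stale entry
            return false;
        }
        results = it->second.results;
        return true;
    }
};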

