IV. Update Strategy
The Internet changes in real time and is highly dynamic. A web page update strategy mainly decides when to re-crawl pages that have already been downloaded. Three update strategies are common:
1. Historical reference strategy
As the name implies, this strategy uses a page's past update history to predict when the page will change in the future. Page changes are generally modeled as a Poisson process.
2. User experience strategy
Although a search engine can return a huge number of results for a query, users tend to look only at the first few pages of results. The crawl system can therefore first update the pages that appear in the first few pages of query results, and update the remaining pages later. This strategy also requires historical information: it keeps multiple historical versions of each web page and, from the effect that each past content change had on search quality, computes an average value that serves as the basis for deciding when to re-crawl.
3. Cluster sampling strategy
Both strategies above share a prerequisite: they need historical information about the web page. This raises two problems. First, saving multiple historical versions of every page adds a heavy burden to the system. Second, a new web page has no historical information at all, so no update strategy can be determined for it. The cluster sampling strategy assumes that web pages have many properties and that pages with similar properties also have similar update frequencies. To compute the update frequency for a category of pages, it is enough to sample pages of that category and use their update cycle as the update cycle of the entire category.
[Figure: basic idea of the cluster sampling strategy]
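The historical reference and cluster sampling ideas above can be illustrated together in a small Python sketch. It assumes page changes follow a Poisson process with rate λ (changes per day), so the probability that a page has changed t days after its last crawl is 1 - e^(-λt). The rate is estimated naively from a sample of pages in one category, and every page in that category, including new pages with no history, inherits the resulting re-crawl interval. The function names, the sample data, and the 50% change-probability threshold are illustrative assumptions and do not come from the original text.

```python
import math


def category_change_rate(samples):
    """Naive Poisson rate estimate (changes per day) for one page category.

    `samples` is a list of (changes_observed, days_observed) pairs, one pair
    per sampled page. Pooling the counts is a simplification: it undercounts
    pages that change more than once between two visits.
    """
    total_changes = sum(changes for changes, _ in samples)
    total_days = sum(days for _, days in samples)
    if total_days <= 0:
        raise ValueError("need a positive observation period")
    return total_changes / total_days


def change_probability(rate, days_since_crawl):
    """P(page changed at least once) under a Poisson process: 1 - e^(-rate * t)."""
    return 1.0 - math.exp(-rate * days_since_crawl)


def recrawl_interval(rate, threshold=0.5):
    """Days after which the change probability reaches `threshold`."""
    if rate <= 0:
        return math.inf  # no observed changes: never due under this model
    return -math.log(1.0 - threshold) / rate


# Hypothetical samples: (changes observed, days observed) per sampled page.
news_sample = [(30, 30), (25, 30), (35, 30)]
archive_sample = [(1, 30), (0, 30), (2, 30)]

news_rate = category_change_rate(news_sample)
archive_rate = category_change_rate(archive_sample)

print("news re-crawl interval (days):", recrawl_interval(news_rate))
print("archive re-crawl interval (days):", recrawl_interval(archive_rate))
```

Under these assumptions, a category whose sampled pages change often gets a short re-crawl interval, while a rarely changing category is revisited much less frequently, and neither requires per-page history.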
V. Structure of a Distributed Crawl System
In general, a crawl system has to deal with hundreds of millions of pages across the Internet. A single crawler cannot complete such a task, so multiple crawlers usually have to work together. A crawl system is therefore typically distributed and organized in three layers.
The bottom layer consists of geographically distributed data centers. Each data center has several crawl servers, and each crawl server may run several crawler programs. Together these form a basic distributed crawl system. The servers within a data center can cooperate in several ways:
1. Master-slave
[Figure: master-slave basic structure]
In the master-slave mode, a dedicated master server maintains the queue of URLs to be crawled and distributes URLs to the slave servers, while the slave servers perform the actual page downloads. Besides maintaining the URL queue and distributing URLs, the master server also balances the load across the slave servers, so that no slave is left too idle or overworked. In this mode, the master tends to become the system bottleneck.
2. Peer-to-peer
[Figure: peer-to-peer basic structure]
In this mode, all crawl servers share the same role. Each crawl server takes a URL from the queue of URLs to be crawled, computes the hash value H of the URL's primary domain name, and then computes H mod m, where m is the number of servers (for example, m = 3). The result is the number of the server that handles the URL. For example, suppose that for the URL www.baidu.com the computed hash is H = 8 and m = 3. Then H mod m = 2, so the link is crawled by server 2. If server 0 was the one that picked up this URL, it forwards the URL to server 2, which crawls it. This mode has a problem: when a server goes down or a new server is added, m changes, and the value of H mod m changes for most URLs. In other words, this approach scales poorly. To address this, an improved scheme has been proposed: use consistent hashing to determine the division of labor among servers.
[Figure: consistent hashing basic structure]
Consistent hashing hashes the primary domain name of the URL and maps it to a number in the range 0 to 2^32. This range is divided evenly among the m servers, and the hash of a URL's primary domain name determines which server crawls it. If a server fails, the pages that server was responsible for are passed clockwise to the next server, which crawls them instead. This way, even if a server fails, the work of the other servers is not affected.
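The difference between the two peer-to-peer assignment schemes can be made concrete with a short Python sketch. It implements the H mod m rule and a simple consistent-hash ring whose range [0, 2^32) is split evenly among the servers, then counts how many domains change owner when server 1 fails. The choice of MD5 as the hash function, the class and function names, the example.com domains, and the modeling of a failure under the modulo scheme as a change from m = 3 to m = 2 are all assumptions made for illustration.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32


def domain_hash(domain):
    """Stable hash of a primary domain name in [0, 2^32); MD5 is an arbitrary choice."""
    digest = hashlib.md5(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def modulo_server(h, m):
    """Peer-to-peer rule from the text: server number = H mod m."""
    return h % m


class ConsistentHashRing:
    """Ring [0, 2^32) split evenly; each server owns the arc ending at its position."""

    def __init__(self, num_servers):
        step = RING_SIZE // num_servers
        self.servers = list(range(num_servers))
        self.positions = [(i + 1) * step - 1 for i in range(num_servers)]

    def remove(self, server_id):
        """Simulate a failure: the failed server's arc falls to the next one clockwise."""
        idx = self.servers.index(server_id)
        del self.servers[idx]
        del self.positions[idx]

    def server_for(self, h):
        """Return the first server at or clockwise after hash h, wrapping at the end."""
        idx = bisect.bisect_left(self.positions, h)
        if idx == len(self.positions):
            idx = 0
        return self.servers[idx]


# Hypothetical workload: 1000 made-up domains.
domains = [f"site{i}.example.com" for i in range(1000)]
hashes = {d: domain_hash(d) for d in domains}

ring = ConsistentHashRing(3)
before = {d: ring.server_for(h) for d, h in hashes.items()}
ring.remove(1)  # server 1 fails
after = {d: ring.server_for(h) for d, h in hashes.items()}
moved_ring = [d for d in domains if before[d] != after[d]]

# Modulo scheme: losing a server is modeled as m going from 3 to 2.
moved_mod = [d for d in domains
             if modulo_server(hashes[d], 3) != modulo_server(hashes[d], 2)]

print("worked example, 8 mod 3 =", modulo_server(8, 3))
print("domains reassigned with consistent hashing:", len(moved_ring))
print("domains reassigned with H mod m:", len(moved_mod))
```

In this sketch, only the domains that belonged to the failed server move on the ring, and they all move to the next server clockwise, whereas under H mod m a change in m reassigns domains that belonged to healthy servers as well.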