[Repost]: On the problems in the design of a web crawler

Source: Internet
Author: User

There are now several open-source web spiders: Larbin, Nutch, and Heritrix each have their own user base. Building your own crawler means solving a lot of problems, such as the scheduling algorithm, the update strategy, distributed storage, and so on. Let's take a look.

The main things a crawler has to do are the following:
Start from one entry page (or a set of entry pages, or a list of RSS feeds), extract the links, and traverse them layer by layer; store the source of every fetched page on disk or in a database; iterate over the captured pages and process them, for example extracting the text and removing duplicates; then index, classify, or cluster the processed text according to your purpose. That is just my personal understanding, heh. Within these steps, roughly the following issues come up.

How to obtain source web pages or RSS feeds.
For a general-purpose crawler, you start from a few entry pages and follow hyperlinks, traversing and fetching page after page. In this case very few seed pages are needed; you can start crawling from a site directory such as hao123. For vertical search, manually collect some websites in the target industry, build a list, and crawl starting from that list. If you are crawling RSS, you first need to collect RSS feeds. The news channels of the big portals and the mainstream blog systems all provide RSS: you can crawl the site, find the links that point to RSS, fetch the content of each link, and check whether it is in RSS format; if it is, save the link into an RSS feed database and from then on crawl that feed specifically. Feeds can also be collected by hand. Blog RSS URLs are usually regular, the main domain plus a user name followed by a fixed RSS page, such as http://www.abc.com/user1/rss.xml, so you can take a dictionary of user names, stitch together the candidate RSS addresses, and use a program to probe whether each page exists, collecting the RSS feeds for every site (see the sketch below). After the feeds are collected, manually assign each one a weight and a refresh interval.
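A minimal sketch of that probing step, assuming the hypothetical URL pattern http://www.abc.com/{user}/rss.xml mentioned above; the base domain and user dictionary are placeholders, not anything prescribed by the original text.

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

class RssProbe
{
    static readonly HttpClient http = new HttpClient();

    // For each user name in the dictionary, build the candidate RSS URL and
    // issue a HEAD request; keep the URLs that respond successfully.
    public static async Task<List<string>> FindFeedsAsync(IEnumerable<string> userNames)
    {
        var feeds = new List<string>();
        foreach (var user in userNames)
        {
            var url = $"http://www.abc.com/{user}/rss.xml";   // hypothetical pattern
            var request = new HttpRequestMessage(HttpMethod.Head, url);
            try
            {
                var response = await http.SendAsync(request);
                if (response.IsSuccessStatusCode)
                    feeds.Add(url);                            // the page exists, record the feed
            }
            catch (HttpRequestException)
            {
                // DNS failure or unreachable host: skip this candidate
            }
        }
        return feeds;
    }
}
```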

If there are many source pages, how to schedule them effectively across multiple threads, without the threads waiting for each other or processing the same page twice.
Suppose you have 5 million pages to crawl; you definitely need multiple threads, or multiple processes distributed across machines. You can partition the pages horizontally so that each thread handles its own slice and no synchronization is needed. For example, assign each of the 5 million pages an ID; with 2 threads, let the first thread crawl pages 1, 3, 5, ... and the second thread crawl pages 2, 4, 6, .... The load stays roughly balanced, the threads never wait on each other, no page is processed twice, and no page is missed. Each thread takes 10,000 pages at a time, records the highest source-page ID it has seen, and then pulls the next 10,000 pages with IDs larger than that from the database, batch after batch, until the thread has crawled everything assigned to it. The batch size of 10,000 can be tuned to the machine's memory. To survive crashes, support resuming from a checkpoint: save each thread's state by recording the largest page ID of every batch to the database, so that after a restart the process can read this ID and continue with the pages that follow. A sketch of this partitioning and checkpointing follows.
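A minimal sketch of the ID-based partitioning and checkpointing described above. The SourcePage type and the LoadBatch, SaveCheckpoint, and Crawl stubs are placeholders for the real database and fetch code, not part of the original.

```csharp
using System.Collections.Generic;

// Placeholder for a row from the source-page table.
class SourcePage
{
    public long Id;
    public string Url;
}

class CrawlWorker
{
    readonly int workerIndex;     // 0-based index of this worker
    readonly int workerCount;     // total number of workers
    const int BatchSize = 10000;  // pages per batch, tune to available memory

    public CrawlWorker(int workerIndex, int workerCount)
    {
        this.workerIndex = workerIndex;
        this.workerCount = workerCount;
    }

    public void Run(long lastProcessedId)
    {
        while (true)
        {
            // Pull the next batch of pages whose ID is above the checkpoint and
            // whose ID modulo workerCount belongs to this worker, so workers
            // never overlap and never need to synchronize.
            List<SourcePage> batch = LoadBatch(lastProcessedId);
            if (batch.Count == 0) break;              // this worker's share is done

            foreach (var page in batch)
                Crawl(page);

            lastProcessedId = batch[batch.Count - 1].Id;
            SaveCheckpoint(lastProcessedId);           // lets a restart resume here
        }
    }

    // Stub: in a real crawler, a query like
    // SELECT TOP (@BatchSize) ... WHERE Id > @last AND Id % @workerCount = @workerIndex ORDER BY Id
    List<SourcePage> LoadBatch(long lastProcessedId) => new List<SourcePage>();

    // Stub: persist this worker's checkpoint to the database.
    void SaveCheckpoint(long lastProcessedId) { }

    // Stub: fetch and store one page.
    void Crawl(SourcePage page) { }
}
```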

How to keep the CPU busy, so that threads are not sitting idle waiting, sleeping, or blocking, while also using as few threads as possible to reduce context switching.
A crawler has two places that need IO: it goes through the network card to fetch pages, and it writes the fetched content to disk or to a database. Both should use asynchronous IO so that no thread blocks while waiting for a page to download or a file to be written; the network card and the disk support direct memory access, and the bulk of the IO sits queued in the hardware drivers without consuming CPU. .NET's asynchronous operations use the thread pool instead of frequently creating and destroying threads, which keeps the overhead low, so you do not need to design your own threading model. The IO model is also taken care of: .NET's asynchronous IO is built directly on IO completion ports, which is very efficient. The memory model hardly matters either, because apart from the source-page table in the database the threads in the crawl pipeline do not share resources, and since each thread works on its own partition you can get away with lock-free programming. A sketch of the async fetch-and-save pattern follows.
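A minimal sketch of that pattern: download a page and write it to disk with async IO, so the thread-pool thread is released while the network card and the disk do the work. The URL and file path are placeholders.

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class AsyncFetcher
{
    static readonly HttpClient http = new HttpClient();

    // Both the download and the file write are awaited; no thread is blocked
    // during either IO operation.
    public static async Task FetchAndSaveAsync(string url, string filePath)
    {
        byte[] body = await http.GetByteArrayAsync(url);

        // useAsync: true opens the file for overlapped (completion-port) IO.
        using (var file = new FileStream(filePath, FileMode.Create, FileAccess.Write,
                                         FileShare.None, 4096, useAsync: true))
        {
            await file.WriteAsync(body, 0, body.Length);
        }
    }
}
```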

How to avoid collecting duplicate pages.
For de-duplication you can use a Bloom-filter-style bit array: each thread keeps a BitArray holding the hashes of the pages crawled for the current batch of source pages. When a source page is fetched and its links extracted, check the BitArray to see whether each link has already been crawled; if not, fetch it, and if it has, ignore it. Suppose each source page has about 30 links and a batch is 100,000 source pages; that is roughly 3 million links, and a BitArray for them should not take much memory, so running five or six threads at the same time is no problem. A sketch of this check follows.
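A minimal sketch of the per-batch duplicate check, assuming a single hash per URL; the bit-array size is an illustrative choice. Like any Bloom-filter-style structure, different URLs can land on the same bit, so it may occasionally skip a page it has not actually seen.

```csharp
using System.Collections;

class LinkDeduper
{
    // 32M bits is about 4 MB, comfortably enough for ~3M links per batch.
    const int Size = 32 * 1024 * 1024;
    readonly BitArray seen = new BitArray(Size);

    // Returns true the first time a URL is offered, false if its bit is already set.
    public bool TryAdd(string url)
    {
        int index = (url.GetHashCode() & int.MaxValue) % Size;  // non-negative slot
        if (seen[index]) return false;   // probably crawled already in this batch
        seen[index] = true;
        return true;
    }
}
```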

How to save the fetched pages quickly, whether to a distributed file system or to a database.
If you save to disk, you can create a folder per domain name and put that site's pages in its folder; as long as the file names differ there is no conflict. If you save the pages to a database instead, the database has its own lock-management mechanism, so you can simply push them in with bulk copy. Frequent small writes to disk can drive the CPU too high, while frequent writes to the database are somewhat easier on the CPU. SQL Server 2008 also supports FILESTREAM columns, which perform well when saving large text fields and can still be accessed through the database's APIs. So unless you have an efficient, mature distributed file system such as GFS, I think saving to SQL Server is the better choice. A sketch of the bulk-copy approach follows.
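A minimal sketch of bulk-inserting fetched pages with SqlBulkCopy, as suggested above. The connection string, the dbo.CrawledPages table name, and the column layout of the DataTable are assumptions for illustration.

```csharp
using System.Data;
using System.Data.SqlClient;

class PageStore
{
    // The DataTable's columns are expected to match the destination table's
    // columns (for example Url and Html) in the caller's schema.
    public static void BulkSave(DataTable pages, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.CrawledPages";  // hypothetical table
                bulk.WriteToServer(pages);   // one round trip per batch instead of per row
            }
        }
    }
}
```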

How to adjust the crawler's collection interval effectively according to how often each page is updated.
A crawler should understand some of the HTTP protocol. If the pages you want to crawl support the Last-Modified or ETag headers, you can first send a HEAD request to test whether the page has changed and decide whether to fetch it at all. Many sites do not support this, though, which makes life harder for the crawler and costs the target site more performance as well. In that case you have to record an update interval and a weight for each source page, and then use some algorithm over these two values to drive the spider's update strategy. A sketch of the conditional check follows.
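A minimal sketch of that freshness check: send a HEAD request with If-Modified-Since / If-None-Match and only re-crawl when the server does not answer 304 Not Modified. The stored lastModified and etag validators are assumed to come from a previous fetch.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class FreshnessChecker
{
    static readonly HttpClient http = new HttpClient();

    // etag should be passed exactly as the server returned it, quotes included.
    public static async Task<bool> HasChangedAsync(string url,
                                                   DateTimeOffset? lastModified,
                                                   string etag)
    {
        var request = new HttpRequestMessage(HttpMethod.Head, url);
        if (lastModified.HasValue)
            request.Headers.IfModifiedSince = lastModified;
        if (!string.IsNullOrEmpty(etag))
            request.Headers.IfNoneMatch.Add(new EntityTagHeaderValue(etag));

        var response = await http.SendAsync(request);

        // 304 means our stored copy is still current; anything else, re-fetch.
        return response.StatusCode != HttpStatusCode.NotModified;
    }
}
```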

What to do with the collected data.
You can crawl an industry's websites, segment the text and index it locally, and build a vertical search engine. With a suitable training algorithm you can automatically classify the crawled pages and build a news portal. Text-similarity algorithms can then be used to cluster the collected articles.

How not to affect the performance of other sites.
Many websites are now wary of crawlers, because some spiders will hammer a site as hard as they can until ordinary users can no longer access it. Webmasters therefore come up with all kinds of countermeasures, so when we write a crawler we should follow the robots protocol and limit how many requests we send to each site per unit of time. A sketch of a simple per-host rate limit follows.
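A minimal sketch of per-host politeness: remember when each host was last requested and wait until a minimum interval has passed before hitting it again. Parsing robots.txt is out of scope here, and the interval is a tunable assumption; the check is not strictly race-free under heavy concurrency on the same host, which is acceptable for a sketch.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PolitenessGate
{
    readonly TimeSpan minInterval;
    readonly ConcurrentDictionary<string, DateTime> lastAccess =
        new ConcurrentDictionary<string, DateTime>();

    public PolitenessGate(TimeSpan minInterval) => this.minInterval = minInterval;

    // Await this before fetching a URL; it only delays when the same host was
    // requested too recently, so requests to other hosts are not held up.
    public async Task WaitTurnAsync(Uri url)
    {
        string host = url.Host;
        DateTime last = lastAccess.GetOrAdd(host, DateTime.MinValue);
        TimeSpan elapsed = DateTime.UtcNow - last;
        if (elapsed < minInterval)
            await Task.Delay(minInterval - elapsed);
        lastAccess[host] = DateTime.UtcNow;
    }
}
```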

Other questions:
http://notes.zhourenjian.com/whizznotes/xhtml/4308.html
