Crawling Paginated Data with a Vertical Crawler

Source: Internet
Author: User

To crawl every detail page, a crawler typically starts from the list page and fetches pages concurrently with multiple threads. The number of threads that can usefully run at once is limited by the network environment (problems typically show up as timeouts) and by the performance of the target server (problems typically show up as HTTP 500 responses).
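As a minimal sketch of such a bounded concurrent crawl: the code below runs a fixed number of worker threads and retries a URL when the fetch fails with a timeout or a server error. The names `RetryableError`, `fetch_with_retry` and `crawl_all` are hypothetical, and the `fetch` callable is assumed to be supplied by the caller and to raise `RetryableError` on a timeout or an HTTP 5xx response; none of this comes from the original article.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class RetryableError(Exception):
    """Hypothetical marker for a timeout or an HTTP 5xx response."""


def fetch_with_retry(fetch, url, max_retries=3):
    """Call fetch(url), retrying while it raises RetryableError."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except RetryableError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry


def crawl_all(urls, fetch, max_workers=4, max_retries=3):
    """Fetch every URL using at most max_workers threads.

    Returns (results, failed): a dict mapping url -> body, and a list of
    URLs that still failed after all retries.
    """
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(fetch_with_retry, fetch, url, max_retries): url
            for url in urls
        }
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except RetryableError:
                failed.append(url)
    return results, failed
```

If many URLs end up in `failed`, that is a hint that `max_workers` is too high for the current network or for the target server, and lowering it is the first thing to try.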

1. Use the first list page as the entry URL and parse out both the detail page URLs and the other paging URLs. Crawl detail pages first so that the queue of pending URLs does not grow too large;

2. Read the total number of pages from the list page (if the page does not show it, calculate it as the total number of records divided by the number of records per page, rounded up). The crawl then does not need to parse paging URLs at all: add all the paging URLs at once, for example while crawling the first page, and parse only the detail page URLs from each list page;

3. Some sites provide "previous" and "next" links on the detail page. In that case the first or the last detail page can serve as the entry point, and following these links reaches every detail page. To use several threads, add a few detail pages from the middle as extra entry points.
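The first two strategies can be sketched as follows. This is only an illustration under stated assumptions: the URL template with a `{page}` placeholder, the helper names `total_pages` and `paging_urls`, and the `Frontier` class are all hypothetical, and real sites may number or encode their pages differently.

```python
from collections import deque


def total_pages(total_records, page_size):
    """Page count when the site shows only a record count (rounds up)."""
    return (total_records + page_size - 1) // page_size


def paging_urls(url_template, pages):
    """Build every list page URL up front from a template with {page}."""
    return [url_template.format(page=p) for p in range(1, pages + 1)]


class Frontier:
    """Pending-URL queue that hands out detail pages before list pages."""

    def __init__(self):
        self._details = deque()
        self._lists = deque()
        self._seen = set()

    def add(self, url, is_detail):
        """Queue a URL once; duplicates are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        (self._details if is_detail else self._lists).append(url)

    def pop(self):
        """Return the next URL to crawl, or None when nothing is pending."""
        if self._details:
            return self._details.popleft()
        if self._lists:
            return self._lists.popleft()
        return None
```

For example, a site reporting 95 records at 10 per page gives `total_pages(95, 10) == 10`, and `paging_urls("https://example.com/list?page={page}", 10)` yields the ten list URLs to add in one go. Serving detail URLs first keeps the pending queue short, because each list page is expanded only after the detail pages found so far have been consumed.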

Crawling is divided into full crawls and incremental crawls. When the volume of data is too large to crawl in a short time, you may get through only part of it in a day and find your IP blocked the next day, in which case proxies are also needed. There is also a time lag between development and release: the full data set is crawled while the program is being developed, the system then goes into integration testing, and it may only be released a month later. So even if incremental crawls normally advance one day at a time, they also need to handle a large time span (a month, for example).
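One way to cover both the daily case and the one-month gap is to derive the dates to crawl from the date of the last successful crawl instead of assuming "yesterday". The sketch below assumes the crawler records that date somewhere; the function name `days_to_crawl` and its parameters are hypothetical.

```python
from datetime import date, timedelta


def days_to_crawl(last_crawled, today):
    """Return every date after last_crawled up to and including today."""
    gap = (today - last_crawled).days
    return [last_crawled + timedelta(days=i) for i in range(1, gap + 1)]
```

With a daily schedule this returns a single date, while after a month without crawling `days_to_crawl(date(2024, 1, 1), date(2024, 2, 1))` returns 31 dates, so the same incremental code path catches up on the whole gap.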

