To crawl every detail page, you typically start a multi-threaded concurrent crawl from the list page. The number of concurrent threads is constrained by the network environment (which shows up as timeouts) and by server capacity (which shows up as HTTP 500 responses).
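A minimal sketch of such a bounded thread pool, using Python's `concurrent.futures`; the `fetch` function here is a placeholder that only echoes the URL, standing in for a real HTTP request with a timeout:

```python
import concurrent.futures

def fetch(url, timeout=10):
    # Placeholder for a real HTTP request (e.g. urllib with a timeout);
    # it just echoes the URL so the sketch is runnable as-is.
    return f"content-of:{url}"

def crawl_concurrently(urls, max_workers=8):
    """Fetch detail-page URLs with a bounded thread pool.

    max_workers is the tuning knob the text refers to: lower it when
    timeouts (slow network) or HTTP 500s (overloaded server) appear.
    """
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_url = {pool.submit(fetch, u): u for u in urls}
        for fut in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[fut]
            try:
                results[url] = fut.result()
            except Exception:
                # A timeout or HTTP error for one URL should not
                # abort the whole crawl; record the failure and go on.
                results[url] = None
    return results
```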
1. Use the first page as the crawl entry URL. Parse out both the detail-page URLs and the other paging URLs, and crawl the detail pages first so that too many URLs do not pile up in the queue.
2. Read the total number of pages (if the page does not show it, compute it from total records divided by records per page). You can then add all the paging URLs at once and skip resolving paging links during the crawl; alternatively, add them all while crawling the first page, and resolve only the detail-page URLs from each page crawled afterwards.
3. Some sites provide "previous" and "next" links on the detail page. You can then use the first and last detail pages as entry points and reach all the rest by following those links; to parallelize, start additional threads from several intermediate detail pages.
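Strategy 2 above hinges on one small calculation: total pages is the ceiling of total records over page size. A sketch, where `base_url` and the `page` query parameter are assumptions about the target site's URL scheme:

```python
import math

def build_paging_urls(base_url, total_records, page_size):
    """Derive the full list of paging URLs up front (strategy 2).

    When the list page shows only a record count, the page count is
    ceil(total_records / page_size); all paging URLs can then be
    queued in one shot instead of being re-parsed from each page.
    """
    total_pages = math.ceil(total_records / page_size)
    return [f"{base_url}?page={n}" for n in range(1, total_pages + 1)]
```

For example, 95 records at 10 per page yields 10 paging URLs, the last one covering the final partial page.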
Crawls divide into full crawls and incremental crawls. When the data volume is too large to fetch in a short time, you may only collect part of it in a day, and the next day your IP may be blocked, so proxies become necessary. There is also a time lag between development and release: the full data set is crawled while the program is being developed, the system then enters integration testing, and release may follow a month later. If incremental crawls advance day by day, you therefore also need to handle crawls spanning a large time range (e.g. a month).
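One way to bridge that development-to-release gap is to replay the missed range as a sequence of one-day incremental windows, so a single failure loses at most one day. A sketch, assuming the crawl can be parameterized by date:

```python
from datetime import date, timedelta

def daily_windows(start, end):
    """Split the inclusive range [start, end] into one-day crawl windows.

    A month-long gap between the development-time full crawl and the
    release is then just ~30 small incremental runs, each retryable
    on its own, rather than one oversized catch-up crawl.
    """
    windows = []
    day = start
    while day <= end:
        windows.append(day)
        day += timedelta(days=1)
    return windows
```

Each returned date would then drive one incremental crawl run for that day's new records.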
Vertical crawlers: crawling paginated data