Chapter 2 Crawler

Source: Internet
Author: User
A crawler (spider) fetches the content of web pages starting from a list of URLs. In practice, a crawler has to be designed, programmed, and optimized for the different protocols it must handle. Writing a good crawler is not easy. The following are some of the issues that must be taken into account in its design.
1. Strictly follow robots.txt when crawling, and use the sitemap to decide which content to fetch.
2. Control the crawl depth and work within your capacity, much as you would eat only as much as your appetite allows.
3. The number of dynamic web pages on the network is huge, and crawlers are generally multithreaded. If a crawler requests pages from one domain too frequently, the server may fail to respond normally. How can this be avoided, and how can a crawler intelligently balance network load, computing load, and storage load?
4. Do not try to parse and run script code found on the page. It may be malicious or buggy, and running it locally can push computing cost to an unexpected level.
5. To save bandwidth, detect whether a web page has been updated by requesting only its header information (a HEAD request) and checking the update status.
6. How can the calculation of web page weights be built into the crawler? A considerable share of CPU resources sits idle during crawling, because I/O and the network are the biggest bottlenecks in crawler efficiency. Calculating some of the weights while the link structure is still held in memory can improve the overall efficiency of the crawler.
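Items 1 and 2 can be illustrated with a minimal sketch, assuming a single host and a breadth-first strategy. The names crawl, get_links, and ExampleBot, and the example.com URLs, are hypothetical and not from the original text. Page fetching is left to a caller-supplied function, and sitemap handling is not covered.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def crawl(seeds, get_links, robots, user_agent="ExampleBot", max_depth=2):
    """Breadth-first crawl that obeys robots.txt and stops at max_depth.

    get_links(url) is a hypothetical caller-supplied function that
    returns the links found on the page at url.
    """
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited


# Offline demonstration: rules and link structure are made up.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
site = {
    "https://example.com/": ["/a", "/private/x"],
    "https://example.com/a": ["/b"],
    "https://example.com/b": ["/c"],
}
pages = crawl(["https://example.com/"], lambda url: site.get(url, []), robots)
```

In a real crawler the rules would be loaded per host from that host's own /robots.txt, for example with RobotFileParser.set_url() followed by read(), rather than shared across all URLs as in this sketch.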
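One possible answer to the question in item 3 is to enforce a minimum interval between requests to the same host, shared by all worker threads. The sketch below is an assumption about how that could look, not a method described in the original text. HostThrottle and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import threading
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Spaces out requests to the same host by at least min_delay seconds."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until url's host may be contacted; return the seconds waited."""
        host = urlsplit(url).netloc
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed.get(host, now))
            # Reserve the slot before sleeping so other threads queue behind it.
            self._next_allowed[host] = start + self.min_delay
        pause = start - now
        if pause > 0:
            self._sleep(pause)  # sleep outside the lock so other hosts proceed
        return pause
```

Each worker thread would call throttle.wait(url) immediately before fetching url. Requests to different hosts are never delayed by one another, so total throughput stays high while any single server sees a bounded request rate.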
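The update check in item 5 can be sketched as follows, assuming the server returns an ETag or Last-Modified header in response to HEAD requests (not every server does). The function names fetch_head and has_changed are hypothetical. When no comparable header is available, the sketch conservatively treats the page as changed.

```python
import urllib.request


def fetch_head(url, timeout=10):
    """Send a HEAD request and return the headers used as change validators."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {
            "ETag": response.headers.get("ETag"),
            "Last-Modified": response.headers.get("Last-Modified"),
        }


def has_changed(cached, current):
    """Compare stored validators with fresh ones; True means re-download."""
    for key in ("ETag", "Last-Modified"):
        old, new = cached.get(key), current.get(key)
        if old and new:
            return old != new
    return True  # nothing to compare, so assume the page changed
```

The crawler would store the dictionary returned by fetch_head alongside each downloaded page, and on the next visit download the body only if has_changed(stored, fetch_head(url)) is true.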
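Item 6 does not say which weighting scheme is meant. As a stand-in, the sketch below counts inbound links from the link graph the crawler already holds in memory, which is one of the simplest possible page weights. The function name inlink_counts and the sample graph are hypothetical.

```python
from collections import Counter


def inlink_counts(link_graph):
    """Count distinct pages linking to each page.

    link_graph maps a page URL to the list of URLs it links to.
    Self-links and repeated links from the same page are ignored.
    """
    counts = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


# Made-up link structure for illustration.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "a"]}
weights = inlink_counts(graph)
```

Because this work is pure computation, it can run on otherwise idle CPU time while worker threads are blocked on network I/O, and the resulting weights could be used to decide which queued URLs to fetch first.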
