Python static web crawler basics

Source: Internet
Author: User

If you want to write a simple Python crawler and run it on Python 3 or later, what do you need to know?

Crawler architecture

A crawler consists of a scheduler, a URL manager, a downloader, a parser, and an output device. The scheduler can be understood as the entry point, like a main function, and acts as the head of the whole crawler. The manager checks whether a URL has been seen before and records the URLs that have already been crawled, so that no page is crawled twice. The downloader downloads the pages behind the URLs, including the new URLs that the parser extracts. The parser parses the content of each web page and extracts both new URLs and the page data. The output device, as its name suggests, outputs the collected data.

1.1 Scheduler

The scheduler can be understood as the entry point, like a main function: it starts the crawler, stops the crawler, and monitors how the crawler is running.
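The scheduler's loop can be sketched as follows. This is only a minimal illustration, not the article's own implementation: the function name `crawl`, the `max_pages` stop condition, and the injected `download` and `parse` callables are assumptions made for the example.

```python
def crawl(root_url, download, parse, max_pages=10):
    """Hypothetical scheduler: start at root_url, stop after max_pages pages.

    download(url) is assumed to return the page text, or None on failure.
    parse(url, text) is assumed to return (new_urls, data).
    """
    to_crawl = {root_url}   # URLs waiting to be crawled
    crawled = set()         # URLs already handed to the downloader
    results = []            # data collected for the output device

    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.pop()
        crawled.add(url)
        # Simple monitoring: report progress for every page.
        print(f"crawling page {len(crawled)}: {url}")

        text = download(url)
        if text is None:
            continue  # skip pages that could not be downloaded

        new_urls, data = parse(url, text)
        # Queue only URLs that have not been crawled yet.
        to_crawl.update(u for u in new_urls if u not in crawled)
        results.append(data)

    return results
```

Passing the downloader and parser in as arguments keeps the scheduler independent of how pages are fetched and parsed, which also makes it easy to try out with in-memory stand-ins.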

1.2 Manager

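The manager manages URLs, both the URLs that have already been crawled and the URLs still to be crawled, and keeps each group in its own set. Why a set? A set stores each element only once, so adding the same URL twice has no effect, and checking whether a URL is already present is fast.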
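A minimal sketch of such a manager, assuming a class named `UrlManager` with the two sets described above. The class and method names are illustrative and do not come from the original article.

```python
class UrlManager:
    """Hypothetical URL manager built on two sets."""

    def __init__(self):
        self.new_urls = set()  # URLs still to be crawled
        self.old_urls = set()  # URLs that have already been crawled

    def add_new_url(self, url):
        # Ignore empty values and URLs that are already known.
        if not url:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the "to crawl" set to the "crawled" set.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Note that `set.pop()` removes an arbitrary element, so this sketch does not guarantee any particular crawl order.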

1.3 Downloader

The downloader accepts a URL passed in from the URL manager, downloads the page, and converts the content to a string. That is all the downloader has to do.
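One way to sketch the downloader uses `urllib.request` from the standard library. The function name `download`, the 10-second timeout, and the UTF-8 default are assumptions for this example; a real page may use a different encoding.

```python
from urllib.error import URLError
from urllib.request import urlopen


def download(url, timeout=10, encoding="utf-8"):
    """Hypothetical downloader: return the page as a string, or None on failure."""
    if url is None:
        return None
    try:
        with urlopen(url, timeout=timeout) as response:
            raw = response.read()  # bytes as received
    except (URLError, ValueError):
        # URLError also covers HTTPError; ValueError is raised for malformed URLs.
        return None
    # Convert the bytes to a string; undecodable bytes are replaced.
    return raw.decode(encoding, errors="replace")
```

Returning `None` instead of raising lets the caller simply skip pages that could not be fetched.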

1.4 Parser

The parser extracts the valuable data from a page; you need basic HTML knowledge to pick out the data you want. A web page also contains many URLs, which are parsed out and added to the manager for the next loop.
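A minimal parser sketch using only the standard library's `html.parser`. It treats the page title as the "valuable data", which is purely an assumption for illustration; the names `LinkAndTitleParser` and `parse` are also hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Hypothetical parser: collects link URLs and the page title."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.new_urls = set()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the current page.
                self.new_urls.add(urljoin(self.page_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(page_url, html_text):
    """Return (new_urls, data) for one downloaded page."""
    parser = LinkAndTitleParser(page_url)
    parser.feed(html_text)
    parser.close()
    return parser.new_urls, {"url": page_url, "title": parser.title.strip()}
```

The returned set of URLs can be handed to the manager, and the data dictionary to the output device.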

1.5 Output device

Omitted.

Further updates will follow to help you learn Python web development.
