Suppose you want to develop a simple Python web crawler that runs on Python 3 or later. What do you need to know to build one?
1 Crawler architecture
A crawler consists of a scheduler, a URL manager, a downloader, a parser, and an outputter. The scheduler can be understood as the entry point (the main function) and the head of the whole crawler. The URL manager keeps track of which URLs have already been crawled and which are still pending, and records crawled URLs so that nothing is fetched twice. The downloader fetches the page content for a URL handed to it by the manager. The parser extracts new URLs and the valuable content from the downloaded page. The outputter writes the collected results out.
1.1 Scheduler
It can be understood as the main entry point: it starts the crawler, stops it, and monitors its operation.
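The loop the scheduler drives can be sketched as below. The class and method names (`SpiderMain`, `craw`, `collect`, `write`, and so on) are my own choices, not fixed by the text; any manager, downloader, parser, and outputter objects with these methods would plug in.

```python
class SpiderMain:
    """Scheduler: drives the crawl loop until no new URLs remain
    or a page limit is reached. The four components are passed in,
    so the scheduler only coordinates, it does no work itself."""

    def __init__(self, manager, downloader, parser, outputter):
        self.manager = manager
        self.downloader = downloader
        self.parser = parser
        self.outputter = outputter

    def craw(self, root_url, max_pages=10):
        self.manager.add_new_url(root_url)
        count = 0
        while self.manager.has_new_url() and count < max_pages:
            url = self.manager.get_new_url()          # next pending URL
            html = self.downloader.download(url)      # fetch the page
            new_urls, data = self.parser.parse(url, html)
            self.manager.add_new_urls(new_urls)       # feed the next loop
            self.outputter.collect(data)              # keep the result
            count += 1
        self.outputter.write()                        # flush everything
```

Passing the components in as constructor arguments also makes the scheduler easy to test with stub objects, without any network access.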
1.2 Manager
The manager's job is to manage URLs: the URLs already crawled and the URLs still waiting to be crawled. It sorts them into two sets. Why a set rather than a list? A set discards duplicates automatically and offers constant-time membership tests, which is exactly what deduplication needs.
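A minimal sketch of such a manager, assuming the two-set design described above (the class and method names are my own, not fixed by the text):

```python
class UrlManager:
    """Tracks URLs to crawl and URLs already crawled.

    Two sets are used so that membership tests are O(1) and
    duplicates are discarded automatically -- the reason a set,
    rather than a list, is the natural data structure here."""

    def __init__(self):
        self.new_urls = set()   # discovered but not yet crawled
        self.old_urls = set()   # already crawled

    def add_new_url(self, url):
        # Ignore empty URLs and anything we have seen before.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from "pending" to "crawled" and hand it out.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```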
1.3 Downloader
The downloader accepts a URL passed in from the URL manager, fetches the page, and converts the response into a string. That is the whole of its functionality.
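One way to sketch that with the standard library's `urllib.request` (the function name and timeout default are my own choices):

```python
import urllib.request

def download(url, timeout=10):
    """Fetch the page at `url` and return its body as a string.

    The charset declared by the server is used for decoding when
    present; UTF-8 is assumed otherwise."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

In a real crawler you would also want error handling (e.g. catching `urllib.error.URLError`) and a `User-Agent` header, which are omitted here for brevity.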
1.4 Parser
Its functions include parsing out the valuable data, for which you need basic HTML knowledge so you can target the elements you want to scrape. A web page also contains many URLs; the parser extracts these and hands them back to the manager for the next loop iteration.
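Extracting the URLs can be sketched with the standard library's `html.parser`, collecting every `<a href>` and resolving relative links against the page's own URL (the class and function names are my own; many tutorials use BeautifulSoup for this step instead):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag, resolved against the
    page's own URL so that relative links become absolute."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.page_url, value))

def parse_links(page_url, html):
    parser = LinkParser(page_url)
    parser.feed(html)
    return parser.links
```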
1.5 Outputter
Details are omitted here; it simply writes the collected data out.
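Since the text leaves this part out, here is one minimal possibility, writing the collected records into an HTML table (the class name, method names, and output format are my own assumptions):

```python
class HtmlOutputter:
    """Collects scraped records (dicts) and writes them out as a
    simple HTML table -- one row per record."""

    def __init__(self):
        self.datas = []

    def collect(self, data):
        if data:
            self.datas.append(data)

    def write(self, path="output.html"):
        with open(path, "w", encoding="utf-8") as f:
            f.write("<html><body><table>\n")
            for data in self.datas:
                cells = "".join(f"<td>{v}</td>" for v in data.values())
                f.write(f"<tr>{cells}</tr>\n")
            f.write("</table></body></html>\n")
```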
Further articles will follow to help you learn Python web development together. Next up: knowledge related to crawling static web pages with Python.