Python Development Lightweight Crawler 03

Last Update: 2016-09-05 Source: Internet

Author: User

Python Development Lightweight Crawler (IMOOC summary 03: simple crawler architecture)

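Let's now look at a simple crawler architecture. What has to be considered in order to implement a simple crawler?

First, a crawler scheduler is needed to start the crawler, stop it, and monitor how it is running.

The crawler program itself has three modules. The URL manager keeps track of two sets of data: the URLs waiting to be crawled and the URLs that have already been crawled. A URL waiting to be crawled is taken from the URL manager and passed to the web page downloader, which downloads the specified page and stores it as a string. That string is passed to the web page parser. The parser extracts the valuable data from the page, and because every page contains many URLs pointing to other pages, it also extracts those URLs so that they can be added to the URL manager.

The URL manager, the web page downloader, and the web page parser therefore form a loop, which keeps running as long as there are related URLs left. Together these modules make up the simple crawler architecture, and running it can crawl all of the interlinked pages on the Internet.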
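As a minimal sketch of the URL manager described above: the class name `UrlManager` and its method names are assumptions made for illustration, not something the original text specifies. It keeps the two sets of URLs (waiting and already crawled) and refuses to queue a URL it has seen before.

```python
class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled.

    Hypothetical names; the article only describes the two sets of URLs.
    """

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already handed out for crawling

    def add_new_url(self, url):
        # Skip empty values and anything seen before,
        # so that no page is crawled twice.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the waiting set to the crawled set and return it.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Using sets makes the "have we seen this URL?" check cheap, which matters because the parser will keep reporting links to pages that were already visited.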

Dynamic run flow of the simple crawler architecture

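How does this architecture actually run? Let's walk through the dynamic run flow in the form of a sequence diagram.

The sequence diagram involves the following roles: the scheduler, the URL manager, the downloader, the parser, and the application. First, the scheduler asks the URL manager whether there is a URL waiting to be crawled, and the URL manager answers yes or no. If the answer is yes, the scheduler requests one URL, and the URL manager returns it. The scheduler passes this URL to the downloader, which downloads the content of the page and returns it to the scheduler. The scheduler then passes the content to the parser, which parses the page and returns the valuable data together with a list of new URLs.

The scheduler sends the valuable data to the application, which collects it, and adds the list of new URLs to the URL manager. If the URL manager now holds URLs that are still waiting to be crawled, the loop continues until every URL that has not yet been crawled has been processed. Finally, the scheduler calls the application's output method to write the valuable data in the format we need. That is the overall dynamic run flow of the simple crawler architecture.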
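The flow above can be sketched as a single loop. This is only an illustration under stated assumptions: the names `download`, `parse`, `Application`, and `crawl` are hypothetical, the "web" is a small in-memory dictionary of made-up pages so that the sketch runs without network access, and two plain sets stand in for the URL manager.

```python
import re

# A tiny in-memory "web" (made-up pages) so the sketch needs no network access.
FAKE_WEB = {
    "http://example.com/a": '<title>A</title><a href="http://example.com/b">b</a>',
    "http://example.com/b": '<title>B</title><a href="http://example.com/a">a</a>'
                            '<a href="http://example.com/c">c</a>',
    "http://example.com/c": '<title>C</title>',
}


def download(url):
    """Downloader: return the page content as a string, or None if unavailable."""
    return FAKE_WEB.get(url)


def parse(url, content):
    """Parser: return (valuable data, list of new URLs) for one page."""
    title = re.search(r"<title>(.*?)</title>", content)
    data = {"url": url, "title": title.group(1) if title else None}
    new_urls = re.findall(r'href="([^"]+)"', content)
    return data, new_urls


class Application:
    """Application: collects the valuable data and outputs it at the end."""

    def __init__(self):
        self.items = []

    def collect(self, data):
        self.items.append(data)

    def output(self):
        # One tab-separated line per crawled page.
        return "\n".join(f'{d["url"]}\t{d["title"]}' for d in self.items)


def crawl(root_url, app):
    """Scheduler: drives the loop between URL manager, downloader and parser."""
    new_urls, old_urls = {root_url}, set()  # stands in for the URL manager
    while new_urls:                         # is a URL waiting to be crawled?
        url = new_urls.pop()                # take one URL to crawl
        old_urls.add(url)
        content = download(url)
        if content is None:                 # download failed: skip this page
            continue
        data, found = parse(url, content)
        app.collect(data)                   # valuable data goes to the application
        # New URLs go back to the URL manager, minus those already crawled.
        new_urls.update(u for u in found if u not in old_urls)
    return app.output()                     # final output step
```

Calling `crawl("http://example.com/a", Application())` visits each of the three fake pages once, even though page B links back to page A. A real crawler would replace `download` with an HTTP request and `parse` with a proper HTML parser, and would normally also cap the number of pages crawled.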

This article is an English version of an article that was originally written in Chinese on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness, ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
