Python Crawler ---- (2. The Scrapy Framework)


Scrapy is a fast, high-level screen scraping and web crawling framework developed in Python, used to crawl web sites and extract structured data from their pages. It has a wide range of applications, including data mining, monitoring, and automated testing.

I have only just started learning this framework, so I won't attempt a proper review. My impression is that it has a somewhat Java-like feel and leans rather heavily on other modules for support.

(i) Creating a Scrapy project

# Create a Scrapy project
scrapy startproject scrapy_test

scrapy_test
├── scrapy.cfg
└── scrapy_test
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

(ii) Description

scrapy.cfg: project configuration file
items.py: defines the data structures to be extracted
pipelines.py: pipeline definitions, used to further process the data extracted into items, e.g. to save it
settings.py: crawler configuration file
spiders: the directory where the spiders are placed
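
To illustrate the role of pipelines.py, here is a minimal sketch of a pipeline that saves each item as a line of JSON (the JsonWriterPipeline name and the items.json file are my own choices for this sketch, not part of the project above):

import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # called for every item the spider returns
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

For the pipeline to take effect, its class also has to be registered under ITEM_PIPELINES in settings.py.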

(iii) Dependency packages

Installing the dependency packages is the troublesome part.

# Install the python-dev package
apt-get install python-dev

# twisted, w3lib, six, queuelib, cssselect, libxslt
pip install w3lib
pip install twisted
pip install lxml
apt-get install libxml2-dev libxslt-dev
apt-get install python-lxml
pip install cssselect
pip install pyopenssl
sudo pip install service_identity

# After everything is installed, you can create a project with: scrapy startproject test
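
A quick way to confirm the installs worked is to import everything from Python. A small sketch (the script name check_deps.py is my own):

# check_deps.py -- verify that the dependencies above import cleanly
import twisted
import w3lib
import lxml
import cssselect
import OpenSSL            # pyopenssl installs under the module name OpenSSL
import service_identity

print('all dependencies import ok')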

(iv) A crawling example. (Original source: http://blog.csdn.net/HanTangSongMing/article/details/24454453)

Git: https://github.com/maxliaops/scrapy-itzhaopin

(1) Create a Scrapy project

$ scrapy startproject itzhaopin
New Scrapy project 'itzhaopin' created in:
    /home/dizzy/python/spit/itzhaopin

You can start your first spider with:
    cd itzhaopin
    scrapy genspider example example.com

$ cd itzhaopin
$ tree
.
├── itzhaopin
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

# scrapy.cfg: project configuration file
# items.py: the data structures to be extracted
# pipelines.py: pipeline definitions, for further processing of the extracted items, e.g. saving them
# settings.py: crawler configuration file
# spiders: the directory where the spiders are placed

(2) Define the data structure to be crawled in items.py

from scrapy.item import Item, Field

# Define the data we want to crawl
class TencentItem(Item):
    name = Field()            # job name
    catalog = Field()         # job category
    workLocation = Field()    # work location
    recruitNumber = Field()   # number of recruits
    detailLink = Field()      # link to the job details
    publishTime = Field()     # publish time
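
An Item behaves much like a dictionary, so a spider fills it in field by field. A small sketch (the sample values are invented):

item = TencentItem()
item['name'] = 'Software Engineer'   # invented sample value
item['catalog'] = 'Technology'       # invented sample value
item['workLocation'] = 'Shenzhen'    # invented sample value
print(item['name'])                  # fields read back like dict keys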

(3) Implementing the Spider class

A spider is a Python class that inherits from scrapy.contrib.spiders.CrawlSpider and must define three members:

name: the identifier of the spider.

start_urls: the list of URLs the spider starts crawling from.

parse(): a method. When a page from start_urls has been fetched, this method is called to parse the page content; it returns the next URLs to crawl and/or a list of items.

Create a new spider, tencent_spider.py, under the spiders directory:

#coding=utf-8
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/'
    ]

    def parse(self, response):
        filename = response.url.split('/')[-2]
        open(filename, 'wb').write(response.body)

This one is a little simpler. To run the spider, use:

scrapy crawl dmoz
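
As described above, parse() may hand back follow-up requests as well as items. Here is a minimal sketch of both at once, written against the old Scrapy API this article uses (the spider name, item class, and XPath expressions are invented for illustration):

#coding=utf-8
import urlparse
from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class LinkItem(Item):
    title = Field()   # invented field for this sketch

class FollowSpider(BaseSpider):
    name = 'follow_demo'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # return an item for the page itself
        titles = hxs.select('//title/text()').extract()
        if titles:
            item = LinkItem()
            item['title'] = titles[0]
            yield item
        # and return requests for the next pages to crawl
        for href in hxs.select('//a/@href').extract():
            yield Request(urlparse.urljoin(response.url, href), callback=self.parse)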


Browsing Qzone in a free moment, I happened upon a college classmate's posts. Ah, that feeling of admiration.

I had long been talking about going to Huangshan, and then to Tibet, but never actually went.

A few days ago I saw this classmate's Qzone status mention some place in Jiangnan. I didn't pay it much attention; then I watched the trail drift southward, all the way to Yunnan. From the comments it turned out he really was headed for Tibet. I couldn't help feeling ashamed. So let me do this, then, to make my own resolve plain. With some shame, I salute the traveler.

--August 20, 2014 01:58:27









