Python Crawler ---- (2. The Scrapy Framework)


Scrapy is a fast, high-level screen scraping and web crawling framework developed in Python, used to crawl web sites and extract structured data from their pages. It has a wide range of applications, including data mining, monitoring, and automated testing.

I have only just started learning this framework, so I won't attempt a proper review. My impression is that it has a somewhat Java-like feel and leans rather heavily on other modules for support.

(i) Creating a Scrapy project

# Create a Scrapy project
scrapy startproject scrapy_test

scrapy_test
├── scrapy.cfg
└── scrapy_test
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

(ii) Description

scrapy.cfg: project configuration file
items.py: defines the data structures to be extracted
pipelines.py: pipeline definitions, used to further process the data extracted into items, e.g. to save it
settings.py: crawler configuration file
spiders: the directory where the spiders are placed
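
To illustrate the role of pipelines.py, here is a minimal sketch of a pipeline that saves each item as a line of JSON (the JsonWriterPipeline name and the items.json file are my own choices for this sketch, not part of the project above):

import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # called for every item the spider returns
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

For the pipeline to take effect, its class also has to be registered under ITEM_PIPELINES in settings.py.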

(iii) Dependency packages

Installing the dependency packages is the troublesome part.

# Install the python-dev package
apt-get install python-dev

# twisted, w3lib, six, queuelib, cssselect, libxslt
pip install w3lib
pip install twisted
pip install lxml
apt-get install libxml2-dev libxslt-dev
apt-get install python-lxml
pip install cssselect
pip install pyopenssl
sudo pip install service_identity

# After everything is installed, you can create a project with: scrapy startproject test
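
A quick way to confirm the installs worked is to import everything from Python. A small sketch (the script name check_deps.py is my own):

# check_deps.py -- verify that the dependencies above import cleanly
import twisted
import w3lib
import lxml
import cssselect
import OpenSSL            # pyopenssl installs under the module name OpenSSL
import service_identity

print('all dependencies import ok')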

(iv) A crawling example. (Original source: http://blog.csdn.net/HanTangSongMing/article/details/24454453)

Git: https://github.com/maxliaops/scrapy-itzhaopin

(1) Create a Scrapy project

$ scrapy startproject itzhaopin
New Scrapy project 'itzhaopin' created in:
    /home/dizzy/python/spit/itzhaopin

You can start your first spider with:
    cd itzhaopin
    scrapy genspider example example.com

$ cd itzhaopin
$ tree
.
├── itzhaopin
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

# scrapy.cfg: project configuration file
# items.py: the data structures to be extracted
# pipelines.py: pipeline definitions, for further processing of the extracted items, e.g. saving them
# settings.py: crawler configuration file
# spiders: the directory where the spiders are placed

(2) Define the data structure to be crawled in items.py

from scrapy.item import Item, Field

# Define the data we want to crawl
class TencentItem(Item):
    name = Field()            # job name
    catalog = Field()         # job category
    workLocation = Field()    # work location
    recruitNumber = Field()   # number of recruits
    detailLink = Field()      # link to the job details
    publishTime = Field()     # publish time
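
An Item behaves much like a dictionary, so a spider fills it in field by field. A small sketch (the sample values are invented):

item = TencentItem()
item['name'] = 'Software Engineer'   # invented sample value
item['catalog'] = 'Technology'       # invented sample value
item['workLocation'] = 'Shenzhen'    # invented sample value
print(item['name'])                  # fields read back like dict keys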

(3) Implementing the Spider class

A spider is a Python class that inherits from scrapy.contrib.spiders.CrawlSpider and must define three members:

name: the identifier of the spider.

start_urls: the list of URLs the spider starts crawling from.

parse(): a method. When a page from start_urls has been fetched, this method is called to parse the page content; it returns the next URLs to crawl and/or a list of items.

Create a new spider, tencent_spider.py, under the spiders directory:

#coding=utf-8
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/'
    ]

    def parse(self, response):
        filename = response.url.split('/')[-2]
        open(filename, 'wb').write(response.body)

This one is a little simpler. To run the spider, use:

scrapy crawl dmoz
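
As described above, parse() may hand back follow-up requests as well as items. Here is a minimal sketch of both at once, written against the old Scrapy API this article uses (the spider name, item class, and XPath expressions are invented for illustration):

#coding=utf-8
import urlparse
from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class LinkItem(Item):
    title = Field()   # invented field for this sketch

class FollowSpider(BaseSpider):
    name = 'follow_demo'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # return an item for the page itself
        titles = hxs.select('//title/text()').extract()
        if titles:
            item = LinkItem()
            item['title'] = titles[0]
            yield item
        # and return requests for the next pages to crawl
        for href in hxs.select('//a/@href').extract():
            yield Request(urlparse.urljoin(response.url, href), callback=self.parse)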


Browsing Qzone in a free moment, I happened upon a college classmate's posts. Ah, that feeling of admiration.

I had long been talking about going to Huangshan, and then to Tibet, but never actually went.

A few days ago I saw this classmate's Qzone status mention some place in Jiangnan. I didn't pay it much attention; then I watched the trail drift southward, all the way to Yunnan. From the comments it turned out he really was headed for Tibet. I couldn't help feeling ashamed. So let me do this, then, to make my own resolve plain. With some shame, I salute the traveler.

--August 20, 2014 01:58:27









