Preface: I had long heard of Python crawler frameworks. In recent days I have been learning the Scrapy crawler framework, and here I share what I understand with you. If anything is expressed improperly, I hope experienced readers will correct me.
First, a glimpse of Scrapy
Scrapy is an application framework written to crawl website data and extract structured data. It can be used in a range of applications, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web crawling), but it can also be used to retrieve data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler.
This document introduces the concepts behind Scrapy to give you an idea of how it works and to help you decide whether Scrapy is what you need.
When you are ready to start your project, you can refer to the Getting Started tutorial.
Second, Scrapy installation introduction
The Scrapy framework's runtime platform and related auxiliary tools:
- Python 2.7 (the latest Python version is 3.5; version 2.7 is used here)
- Python packages: pip and setuptools. pip now depends on setuptools; if setuptools is not installed, it will be installed automatically.
- lxml. Most Linux distributions ship with lxml. If it is missing, see http://lxml.de/installation.html
- OpenSSL. Already available on all systems except Windows (see the platform installation guide).
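Before installing Scrapy, you can do a quick sanity check on these prerequisites. This is only a rough sketch, assuming python and pip are already on your PATH:
python --version
pip --version
python -c "import lxml"
python -c "import OpenSSL"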
You can install Scrapy with pip (pip is the recommended way to install Python packages):
pip install scrapy
Installation process under Windows:
1. After installing Python 2.7, you need to modify the PATH environment variable to add the Python executable and its additional scripts to the system path. Add the following paths to PATH:
C:\Python27\;C:\Python27\Scripts\;
Alternatively, you can set the path with the following cmd command:
C:\python27\python.exe c:\python27\tools\scripts\win_add2path.py
After the configuration is complete, you can run python --version to check the installed Python version.
2. Install pywin32 from http://sourceforge.net/projects/pywin32/
Make sure to download the version that matches your system (win32 or amd64).
Install pip from https://pip.pypa.io/en/latest/installing.html
3. Open a command-line window and confirm that pip is installed correctly:
pip --version
4. At this point, Python 2.7 and pip are working correctly. Next, install Scrapy:
pip install scrapy
The Scrapy installation on Windows is now complete.
Third, Scrapy introductory tutorial
1. Create a Scrapy project in cmd:
scrapy startproject tutorial
H:\python\scrapyDemo>scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory 'f:\\python27\\lib\\site-packages\\scrapy\\templates\\project', created in:
    H:\python\scrapyDemo\tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
2. The generated file directory structure is as follows:
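A typical layout produced by the command above (exact contents may vary slightly by Scrapy version):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py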
The Scrapy project structure, explained:
- scrapy.cfg: the project's configuration file.
- tutorial/: the project's Python module. You will add your code here later.
- tutorial/items.py: the item definitions for the project.
- tutorial/pipelines.py: the pipelines file for the project.
- tutorial/settings.py: the settings file for the project.
- tutorial/spiders/: the directory where the spider code is placed.
3. Write a simple crawler
1. In items.py, define the fields to be collected from the pages.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.item import Item, Field


class TutorialItem(Item):
    title = Field()
    author = Field()
    releasedate = Field()
2. In tutorial/spiders/spider.py, specify the website to crawl and extract each field.
# -*- coding: utf-8 -*-
import sys
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import TutorialItem

reload(sys)
sys.setdefaultencoding("utf-8")


class ListSpider(CrawlSpider):
    # Crawler name
    name = "tutorial"
    # Download delay
    download_delay = 1
    # Allowed domains
    allowed_domains = ["news.cnblogs.com"]
    # Start URLs
    start_urls = [
        "https://news.cnblogs.com"
    ]
    # Crawl rules; a rule without a callback means matching URLs are followed recursively
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/page/\d',))),
        Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/\d+',)), callback='parse_content'),
    )

    # Parse the content of a news page
    def parse_content(self, response):
        item = TutorialItem()

        # Title of the current page
        title = response.selector.xpath('//div[@id="news_title"]')[0].extract().decode('utf-8')
        item['title'] = title

        author = response.selector.xpath('//div[@id="news_info"]/span/a/text()')[0].extract().decode('utf-8')
        item['author'] = author

        releasedate = response.selector.xpath('//div[@id="news_info"]/span[@class="time"]/text()')[0].extract().decode('utf-8')
        item['releasedate'] = releasedate

        yield item
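Note: SgmlLinkExtractor has been deprecated and was removed in later Scrapy releases. If the import above fails on your version, the rules can be written with the generic LinkExtractor instead. This is only a sketch, not taken from the original article:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Same crawl rules as above, using the non-SGML link extractor
rules = (
    # Follow paging links recursively (no callback)
    Rule(LinkExtractor(allow=(r'https://news.cnblogs.com/n/page/\d',))),
    # Hand individual news pages to parse_content
    Rule(LinkExtractor(allow=(r'https://news.cnblogs.com/n/\d+',)), callback='parse_content'),
)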
3. Save the data through the pipeline in tutorial/pipelines.py.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs


class TutorialPipeline(object):
    def __init__(self):
        # Data is stored in data.json
        self.file = codecs.open('data.json', mode='wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line.decode("unicode_escape"))
        return item
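The pipeline above opens data.json but never closes it explicitly. If you want the file handle released when the crawl finishes, Scrapy also calls an optional close_spider method on pipelines. A minimal sketch of what could be added to the class above:
class TutorialPipeline(object):
    # ... __init__ and process_item as above ...

    def close_spider(self, spider):
        # Called once when the spider finishes; close the output file
        self.file.close()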
4. Configure the execution environment in tutorial/settings.py.
# -*- coding: utf-8 -*-

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Disable cookies to reduce the chance of being banned
COOKIES_ENABLED = False
COOKIES_ENABLES = False

# Register the pipeline; here it writes the data to a file
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300
}

# Set the maximum crawl depth
DEPTH_LIMIT = 100
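The download delay in this example is set on the spider class itself; if you prefer, the same throttle can be configured globally in settings.py instead. An optional addition, with an illustrative value:
# Wait 1 second between requests (equivalent to download_delay on the spider)
DOWNLOAD_DELAY = 1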
5. Create a new main.py file to run the crawler:
from scrapy import cmdline
cmdline.execute("scrapy crawl tutorial".split())
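Running this file with python main.py from the project root is equivalent to executing scrapy crawl tutorial on the command line; cmdline.execute simply forwards the arguments to Scrapy.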
Finally, after executing main.py, the collected results are saved as JSON data in the data.json file.
Original link: http://www.cnblogs.com/liruihua/p/5957393.html