Python's Scrapy crawler framework: installation and simple use

Source: Internet
Author: User

Preface: I had long heard of Scrapy, the Python crawler framework. Over the past few days I have been learning it, and here I share what I understand. If anything is expressed incorrectly, I hope more experienced readers will point it out.

First, a glimpse of Scrapy

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.

It was originally designed for page scraping (more precisely, web scraping), and it can also be used to fetch data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler.

This document introduces the concepts behind Scrapy to give you an idea of how it works and to help you decide whether Scrapy is what you need.

When you are ready to start your project, you can refer to the Getting Started tutorial.

Second, installing Scrapy

Platform and supporting tools required by the Scrapy framework

    • Python 2.7 (the latest Python version is 3.5; version 2.7 is used here)
    • Python packages: pip and setuptools. pip now depends on setuptools, and setuptools is installed automatically if it is missing.
    • lxml. Most Linux distributions ship with lxml. If it is missing, see http://lxml.de/installation.html
    • OpenSSL. It is already available on systems other than Windows (see the platform installation guide).

You can install Scrapy with pip (pip is the recommended way to install Python packages).

pip install scrapy

Installation process under Windows:

1. After installing Python 2.7, modify the PATH environment variable so that the Python executable and its additional scripts are on the system path. Add the following paths to PATH:

C:\Python27\;C:\Python27\Scripts\;

Alternatively, you can set the path from the cmd prompt:

C:\python27\python.exe c:\python27\tools\scripts\win_add2path.py

After the installation and configuration are complete, you can run python --version to check the installed Python version.
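
The PATH change can also be checked from Python. The following is a minimal sketch, not part of the original tutorial: the helper name missing_path_entries is hypothetical, and it is written for Python 3. It only reports which of the required directories are absent from a PATH string.

```python
import os


def missing_path_entries(path_value, required_dirs, sep=os.pathsep):
    """Return the required directories that do not appear in a PATH string."""

    def normalize(entry):
        # Strip whitespace and trailing slashes, and ignore case,
        # because Windows paths are case-insensitive.
        return entry.strip().rstrip("\\/").lower()

    present = {normalize(e) for e in path_value.split(sep) if e.strip()}
    return [d for d in required_dirs if normalize(d) not in present]


# The two directories named in the step above.
required = ["C:\\Python27\\", "C:\\Python27\\Scripts\\"]
print(missing_path_entries(os.environ.get("PATH", ""), required))
```

An empty list means both directories were found; anything else lists what still has to be added.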

2. Install pywin32 from http://sourceforge.net/projects/pywin32/

Make sure to download the version that matches your system (win32 or amd64).

Install pip from https://pip.pypa.io/en/latest/installing.html

3. Open a command-line window and confirm that pip is installed correctly:

pip --version

4. Python 2.7 and pip now run correctly. Next, install Scrapy:

pip install scrapy

This completes the Scrapy installation on Windows.
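
To confirm that the package is visible to the interpreter, something like the following sketch can be used. It is an illustration rather than part of the original steps: the helper name is_installed is hypothetical, and it relies on importlib.util.find_spec, which needs Python 3.4 or later (on Python 2.7, running import scrapy in the interpreter serves the same purpose).

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


print("scrapy installed:", is_installed("scrapy"))
```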

Third, Scrapy introductory tutorial

1. Create a Scrapy project from cmd.

scrapy startproject tutorial

H:\python\scrapyDemo>scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory 'f:\\python27\\lib\\site-packages\\scrapy\\templates\\project', created in:
    H:\python\scrapyDemo\tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com

2. The generated project contains the following files and directories:

    • scrapy.cfg: The configuration file for the project.
    • tutorial/: The Python module for the project. You will add your code here.
    • tutorial/items.py: The items file of the project.
    • tutorial/pipelines.py: The pipelines file of the project.
    • tutorial/settings.py: The settings file of the project.
    • tutorial/spiders/: The directory where the spider code is placed.
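
The layout listed above can be checked with a short script. This is a sketch under stated assumptions: the helper missing_project_files and the EXPECTED list are hypothetical names, the relative paths are copied from the list, and the code targets Python 3.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the list above.
EXPECTED = [
    "scrapy.cfg",
    "tutorial/items.py",
    "tutorial/pipelines.py",
    "tutorial/settings.py",
    "tutorial/spiders",
]


def missing_project_files(project_root):
    """Return the expected entries that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


# Demo on a throwaway directory: an empty directory is missing everything.
with tempfile.TemporaryDirectory() as tmp:
    print(missing_project_files(tmp))
```

Pointing the function at the directory created by scrapy startproject should return an empty list if the project was generated as described.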

3. Write a simple crawler

1. In tutorial/items.py, define the fields to be collected from the pages.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.item import Item, Field


class TutorialItem(Item):
    title = Field()
    author = Field()
    releasedate = Field()
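
A Scrapy Item accepts only the fields it declares and raises a KeyError for any other key, so the keys used later in the spider must match the declared names exactly, including case. The sketch below imitates that behavior with a plain dict subclass so it runs without Scrapy. FieldCheckedItem and TutorialItemSketch are hypothetical stand-ins, not Scrapy classes, and the lowercase field names are an assumption.

```python
class FieldCheckedItem(dict):
    """Plain-Python stand-in for an item class with a fixed set of fields."""

    fields = ()  # subclasses list their allowed field names here

    def __setitem__(self, key, value):
        # Reject keys that were not declared, including case mismatches.
        if key not in self.fields:
            raise KeyError("%s does not support field: %s" % (type(self).__name__, key))
        super().__setitem__(key, value)


class TutorialItemSketch(FieldCheckedItem):
    fields = ("title", "author", "releasedate")


item = TutorialItemSketch()
item["title"] = "Example headline"
try:
    item["ReleaseDate"] = "2017-01-01"  # wrong case, so this is rejected
except KeyError as exc:
    print("rejected:", exc)
```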

2. In tutorial/spiders/spider.py, specify the website to be collected and extract each field separately.

# -*- coding: utf-8 -*-
import sys
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import TutorialItem

# Python 2 only: make utf-8 the default string encoding
reload(sys)
sys.setdefaultencoding("utf-8")


class ListSpider(CrawlSpider):
    # Spider name
    name = "tutorial"
    # Download delay in seconds
    download_delay = 1
    # Allowed domains
    allowed_domains = ["news.cnblogs.com"]
    # Start URL
    start_urls = [
        "https://news.cnblogs.com"
    ]
    # Crawl rules; a rule without a callback recursively follows matching URLs
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/page/\d',))),
        Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/\d+',)), callback='parse_content'),
    )

    # Content-parsing function
    def parse_content(self, response):
        item = TutorialItem()

        # Title
        title = response.selector.xpath('//div[@id="news_title"]')[0].extract().decode('utf-8')
        item['title'] = title

        # Author
        author = response.selector.xpath('//div[@id="news_info"]/span/a/text()')[0].extract().decode('utf-8')
        item['author'] = author

        # Release date
        releasedate = response.selector.xpath('//div[@id="news_info"]/span[@class="time"]/text()')[0].extract().decode(
            'utf-8')
        item['releasedate'] = releasedate

        yield item
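
To see how the two crawl rules divide the site, the patterns can be tried against sample URLs with the standard re module. This is a sketch, assuming the allow patterns are applied as regular-expression searches against the absolute URL. The function classify and the sample URLs are hypothetical.

```python
import re

# The two `allow` patterns from the crawl rules.
LISTING_PAGE = re.compile(r'https://news.cnblogs.com/n/page/\d')
ARTICLE_PAGE = re.compile(r'https://news.cnblogs.com/n/\d+')


def classify(url):
    """Return 'listing', 'article', or None for a URL."""
    if LISTING_PAGE.search(url):
        return "listing"   # followed for more links, no callback
    if ARTICLE_PAGE.search(url):
        return "article"   # would be handed to parse_content
    return None            # matched by neither rule


for url in (
    "https://news.cnblogs.com/n/page/2/",
    "https://news.cnblogs.com/n/561234/",
    "https://news.cnblogs.com/about",
):
    print(url, "->", classify(url))
```

Listing pages only feed more links into the crawl, while article pages are the ones whose fields get extracted.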

3. Save the data in the tutorial/pipelines.py pipeline.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs


class TutorialPipeline(object):
    def __init__(self):
        # Store the data in data.json
        self.file = codecs.open('data.json', mode='wb', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line
        line = json.dumps(dict(item)) + "\n"
        # Turn \uXXXX escapes back into readable characters before writing
        self.file.write(line.decode("unicode_escape"))

        return item
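
The pipeline writes one JSON object per line and keeps non-ASCII text readable. The sketch below shows the same output format in plain Python 3 without Scrapy. It swaps in a different technique, json.dumps with ensure_ascii=False, instead of the decode step used in the pipeline. The function write_items and the sample record are hypothetical.

```python
import io
import json


def write_items(items, stream):
    """Write each item as one JSON object per line, keeping non-ASCII text readable."""
    for item in items:
        # ensure_ascii=False writes characters such as Chinese text directly
        # instead of \uXXXX escapes.
        stream.write(json.dumps(dict(item), ensure_ascii=False) + "\n")


# Hypothetical sample item; the field names follow the item definition.
sample = [
    {"title": "示例标题", "author": "someone", "releasedate": "2017-01-01"},
]
buffer = io.StringIO()
write_items(sample, buffer)
print(buffer.getvalue(), end="")
```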

4. Configure the execution environment in tutorial/settings.py.

# -*- coding: utf-8 -*-

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Disable cookies to reduce the chance of being banned
COOKIES_ENABLED = False

# Set the pipeline, where data is written to the file
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300
}

# Set the maximum depth of the crawl
DEPTH_LIMIT = 100
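
The depth limit caps how many links away from the start URL the crawl may go. The following sketch illustrates the idea on a tiny hypothetical link graph with a breadth-first walk. It is a simplification: Scrapy's real scheduling order differs, and in Scrapy a value of 0 means no limit, a special case this sketch does not model. The function crawl_order is a hypothetical name.

```python
from collections import deque


def crawl_order(links, start, depth_limit):
    """Breadth-first walk over a link graph, ignoring pages deeper than depth_limit."""
    seen = {start}
    queue = deque([(start, 0)])  # the start page has depth 0
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth == depth_limit:
            continue  # do not follow links from pages at the limit
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Hypothetical link graph: a -> b -> c -> d
links = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(crawl_order(links, "a", depth_limit=2))
```

With a limit of 2, page d is never requested because it sits three links away from the start page.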

5. Create a new main file, main.py, to run the crawler.

from scrapy import cmdline

cmdline.execute("scrapy crawl tutorial".split())

Finally, after running main.py, the collected results are available as JSON data in the data.json file.
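
The result file can then be read back for inspection. The following is a minimal Python 3 sketch, assuming each line of the file is one valid JSON object. The function load_items and the demo records are hypothetical, and the demo writes its own throwaway file rather than reading real crawl output.

```python
import json
import os
import tempfile


def load_items(path):
    """Read a file with one JSON object per line into a list of dicts."""
    items = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items


# Demo with a throwaway file holding two hypothetical records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "data.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('{"title": "first", "author": "a"}\n')
        handle.write('{"title": "second", "author": "b"}\n')
    demo_items = load_items(demo_path)

print([item["title"] for item in demo_items])
```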
