Python's Scrapy crawler framework: installation and simple use


Preface: I had long heard of Python's crawler frameworks by name. Over the past few days I have been learning the Scrapy crawler framework, and here I share what I have understood. If anything is expressed poorly, I hope more experienced readers will point it out.

First, a glimpse of Scrapy

Scrapy is an application framework written to crawl websites and extract structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

It was originally designed for page scraping (more precisely, web crawling), but it can also be used to retrieve data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler.

This document introduces the concepts behind Scrapy to give you an idea of how it works, and to help you decide whether Scrapy is what you need.

When you are ready to start your project, you can refer to the Getting Started tutorial.

Second, Scrapy installation introduction

Platforms and auxiliary tools required by the Scrapy framework:

    • Python 2.7 (the latest Python release is 3.5; version 2.7 is used here)
    • Python packages: pip and setuptools. pip now depends on setuptools and will install it automatically if it is missing.
    • lxml. Most Linux distributions already ship lxml. If it is missing, see http://lxml.de/installation.html
    • OpenSSL. It is already provided on all systems other than Windows (see the platform installation guide).
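
Once Python is installed, a quick way to confirm that the auxiliary packages above are importable is a short check script. This is a minimal sketch; the module names used here are the conventional ones for these packages, so adjust them if your setup differs:

# check_prereqs.py -- sanity check for the prerequisites listed above
import sys
print(sys.version)   # expect a 2.7.x version string

import setuptools    # installed automatically by pip when missing
import lxml.etree    # XML/HTML parsing library used by Scrapy's selectors
import OpenSSL       # provided by pyOpenSSL; needed for HTTPS support

print("prerequisites look OK")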

You can install Scrapy with pip (using pip to install Python packages is recommended):

pip install scrapy

Installation process under Windows:

1. After installing Python 2.7, you need to modify the PATH environment variable so that the Python executable and its scripts are on the system path. Add the following paths to PATH:

C:\Python27\; C:\Python27\Scripts\;

Alternatively, you can set the path from cmd with the following command:

C:\python27\python.exe c:\python27\tools\scripts\win_add2path.py

After the installation and configuration are complete, you can run python --version to check which version of Python is installed.

2. Install Pywin32 from http://sourceforge.net/projects/pywin32/

Please make sure to download the version that matches your system (Win32 or AMD64)

Then install pip from https://pip.pypa.io/en/latest/installing.html

3. Open a command-line window and confirm that pip is installed correctly:

pip --version

4. At this point Python 2.7 and pip are working correctly. Next, install Scrapy:

pip install scrapy

This completes the Scrapy installation on Windows.

Third, Scrapy introductory tutorial

1. Create a Scrapy project from cmd:

scrapy startproject tutorial

H:\python\scrapyDemo>scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory 'f:\\Python27\\lib\\site-packages\\scrapy\\templates\\project', created in:
    H:\python\scrapyDemo\tutorial
You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com

2. The generated file directory structure is as follows.
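
With the default project template, the directory created by scrapy startproject tutorial looks roughly like this (the exact files can vary slightly between Scrapy versions):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py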

Breaking down the Scrapy project structure:

    • scrapy.cfg: the project's configuration file.
    • tutorial/: the project's Python module. You will add your code here later.
    • tutorial/items.py: the project's items file.
    • tutorial/pipelines.py: the project's pipelines file.
    • tutorial/settings.py: the project's settings file.
    • tutorial/spiders/: the directory where spider code is placed.

3. Write a simple crawler

1. In items.py, define the fields to be collected from the pages.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.item import Item, Field


class TutorialItem(Item):
    title = Field()
    author = Field()
    releasedate = Field()
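
A Scrapy Item behaves much like a dictionary, so the fields declared above are written and read with normal key syntax. A small illustrative snippet (not part of the original post):

item = TutorialItem()
item['title'] = u'some page title'   # assign a declared field
print(item['title'])                 # read it back like a dict entry
# item['url'] = '...'                # would raise KeyError: 'url' is not a declared field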

2. In tutorial/spiders/spider.py, specify the website to be crawled and extract each field.

# -*- coding: utf-8 -*-
import sys
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import TutorialItem

reload(sys)
sys.setdefaultencoding("utf-8")


class ListSpider(CrawlSpider):
    # crawler name
    name = "tutorial"
    # download delay
    download_delay = 1
    # allowed domains
    allowed_domains = ["news.cnblogs.com"]
    # start URLs
    start_urls = [
        "https://news.cnblogs.com"
    ]
    # crawl rules; a Rule without a callback means matching URLs are followed recursively
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/page/\d',))),
        Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/\d+',)), callback='parse_content'),
    )

    # parse the content of a single news page
    def parse_content(self, response):
        item = TutorialItem()

        # extract the fields of the current page
        title = response.selector.xpath('//div[@id="news_title"]')[0].extract().decode('utf-8')
        item['title'] = title

        author = response.selector.xpath('//div[@id="news_info"]/span/a/text()')[0].extract().decode('utf-8')
        item['author'] = author

        releasedate = response.selector.xpath('//div[@id="news_info"]/span[@class="time"]/text()')[0].extract().decode('utf-8')
        item['releasedate'] = releasedate

        yield item
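
Note that SgmlLinkExtractor belongs to older Scrapy releases and has since been removed. If you run a newer Scrapy version, a rough equivalent (untested against the original site) is to swap in LinkExtractor and drop the Python 2 setdefaultencoding workaround:

# approximate modern replacement for the imports and rules above
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

rules = (
    # follow paging links recursively (no callback)
    Rule(LinkExtractor(allow=(r'https://news.cnblogs.com/n/page/\d',))),
    # hand individual news pages to parse_content
    Rule(LinkExtractor(allow=(r'https://news.cnblogs.com/n/\d+',)), callback='parse_content'),
)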

3. Save the data through the pipeline in tutorial/pipelines.py.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs


class TutorialPipeline(object):
    def __init__(self):
        # scraped data is stored in data.json
        self.file = codecs.open('data.json', mode='wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line.decode("unicode_escape"))
        return item
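
The pipeline above never closes the file handle. If you want it closed cleanly when the crawl finishes, Scrapy pipelines can also implement the close_spider hook; a minimal sketch to add to TutorialPipeline:

    def close_spider(self, spider):
        # called once when the spider finishes; flushes and closes data.json
        self.file.close()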

4. Configure the execution environment in tutorial/settings.py.

# -*- coding: utf-8 -*-

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# disable cookies to reduce the risk of being banned
COOKIES_ENABLED = False

# register the pipeline that writes the data to a file
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300
}

# maximum crawl depth
DEPTH_LIMIT = 100
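
The spider above sets its own download_delay; the same throttling can also be configured globally here via the DOWNLOAD_DELAY setting, for example (an optional addition, not in the original settings file):

# wait 1 second between consecutive requests
DOWNLOAD_DELAY = 1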

5. Create a new main.py file to execute the crawler:

from scrapy import cmdline
cmdline.execute("scrapy crawl tutorial".split())
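
This wrapper simply invokes the normal command line, so running the following from the project root (the directory containing scrapy.cfg) should be equivalent:

scrapy crawl tutorial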

Finally, after running main.py, the collected results are available as JSON data in the data.json file.

Original link: http://www.cnblogs.com/liruihua/p/5957393.html
