Preface: I had long heard of Python crawler frameworks. In recent days I have been learning the Scrapy crawler framework, and here I share what I understand with you. If anything is expressed improperly, I hope experienced readers will correct me.
First, a glimpse of Scrapy
Scrapy is an application framework written to crawl website data and extract structured data. It can be used in a range of applications, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web crawling), but it can also be used to retrieve data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler.
This document introduces the concepts behind Scrapy to give you an idea of how it works and to help you decide whether Scrapy is what you need.
When you are ready to start your project, you can refer to the Getting Started tutorial.
Second, Scrapy installation introduction
The Scrapy framework's runtime platform and related auxiliary tools:
- Python 2.7 (the latest Python version is 3.5; version 2.7 is used here)
- Python packages: pip and setuptools. pip now depends on setuptools; if setuptools is not installed, it will be installed automatically.
- lxml. Most Linux distributions ship with lxml. If it is missing, see http://lxml.de/installation.html
- OpenSSL. Already available on all systems except Windows (see the platform installation guide).
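Before installing Scrapy, you can do a quick sanity check on these prerequisites. This is only a rough sketch, assuming python and pip are already on your PATH:
python --version
pip --version
python -c "import lxml"
python -c "import OpenSSL"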
You can install Scrapy with pip (pip is the recommended way to install Python packages):
pip install scrapy
Installation process under Windows:
1. After installing Python 2.7, you need to modify the PATH environment variable to add the Python executable and its additional scripts to the system path. Add the following paths to PATH:
C:\Python27\;C:\Python27\Scripts\;
Alternatively, you can set the path with the following cmd command:
C:\python27\python.exe c:\python27\tools\scripts\win_add2path.py
After the configuration is complete, you can run python --version to check the installed Python version.
2. Install pywin32 from http://sourceforge.net/projects/pywin32/
Make sure to download the version that matches your system (win32 or amd64).
Install pip from https://pip.pypa.io/en/latest/installing.html
3. Open a command-line window and confirm that pip is installed correctly:
pip --version
4. At this point, Python 2.7 and pip are working correctly. Next, install Scrapy:
pip install scrapy
The Scrapy installation on Windows is now complete.
Third, Scrapy introductory tutorial
1. Create a Scrapy project in cmd:
scrapy startproject tutorial
H:\python\scrapyDemo>scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory 'f:\\python27\\lib\\site-packages\\scrapy\\templates\\project', created in:
    H:\python\scrapyDemo\tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
2. The generated file directory structure is as follows:
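A typical layout produced by the command above (exact contents may vary slightly by Scrapy version):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py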
The Scrapy project structure, explained:
- scrapy.cfg: the project's configuration file.
- tutorial/: the project's Python module. You will add your code here later.
- tutorial/items.py: the item definitions for the project.
- tutorial/pipelines.py: the pipelines file for the project.
- tutorial/settings.py: the settings file for the project.
- tutorial/spiders/: the directory where the spider code is placed.
3. Write a simple crawler
1. In items.py, define the fields to be collected from the pages.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.item import Item, Field


class TutorialItem(Item):
    title = Field()
    author = Field()
    releasedate = Field()
2. In tutorial/spiders/spider.py, specify the website to crawl and extract each field.
# -*- coding: utf-8 -*-
import sys
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import TutorialItem

reload(sys)
sys.setdefaultencoding("utf-8")


class ListSpider(CrawlSpider):
    # Crawler name
    name = "tutorial"
    # Download delay
    download_delay = 1
    # Allowed domains
    allowed_domains = ["news.cnblogs.com"]
    # Start URLs
    start_urls = [
        "https://news.cnblogs.com"
    ]
    # Crawl rules; a rule without a callback means matching URLs are followed recursively
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/page/\d',))),
        Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/\d+',)), callback='parse_content'),
    )

    # Parse the content of a news page
    def parse_content(self, response):
        item = TutorialItem()

        # Title of the current page
        title = response.selector.xpath('//div[@id="news_title"]')[0].extract().decode('utf-8')
        item['title'] = title

        author = response.selector.xpath('//div[@id="news_info"]/span/a/text()')[0].extract().decode('utf-8')
        item['author'] = author

        releasedate = response.selector.xpath('//div[@id="news_info"]/span[@class="time"]/text()')[0].extract().decode('utf-8')
        item['releasedate'] = releasedate

        yield item
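Note: SgmlLinkExtractor has been deprecated and was removed in later Scrapy releases. If the import above fails on your version, the rules can be written with the generic LinkExtractor instead. This is only a sketch, not taken from the original article:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Same crawl rules as above, using the non-SGML link extractor
rules = (
    # Follow paging links recursively (no callback)
    Rule(LinkExtractor(allow=(r'https://news.cnblogs.com/n/page/\d',))),
    # Hand individual news pages to parse_content
    Rule(LinkExtractor(allow=(r'https://news.cnblogs.com/n/\d+',)), callback='parse_content'),
)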
3. Save the data through the pipeline in tutorial/pipelines.py.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs


class TutorialPipeline(object):
    def __init__(self):
        # Data is stored in data.json
        self.file = codecs.open('data.json', mode='wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line.decode("unicode_escape"))
        return item
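The pipeline above opens data.json but never closes it explicitly. If you want the file handle released when the crawl finishes, Scrapy also calls an optional close_spider method on pipelines. A minimal sketch of what could be added to the class above:
class TutorialPipeline(object):
    # ... __init__ and process_item as above ...

    def close_spider(self, spider):
        # Called once when the spider finishes; close the output file
        self.file.close()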
4. Configure the execution environment in tutorial/settings.py.
# -*- coding: utf-8 -*-

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Disable cookies to reduce the chance of being banned
COOKIES_ENABLED = False
COOKIES_ENABLES = False

# Register the pipeline; here it writes the data to a file
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300
}

# Set the maximum crawl depth
DEPTH_LIMIT = 100
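The download delay in this example is set on the spider class itself; if you prefer, the same throttle can be configured globally in settings.py instead. An optional addition, with an illustrative value:
# Wait 1 second between requests (equivalent to download_delay on the spider)
DOWNLOAD_DELAY = 1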
5. Create a new main.py file to run the crawler:
from scrapy import cmdline
cmdline.execute("scrapy crawl tutorial".split())
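Running this file with python main.py from the project root is equivalent to executing scrapy crawl tutorial on the command line; cmdline.execute simply forwards the arguments to Scrapy.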
Finally, after executing main.py, the collected results are saved as JSON data in the data.json file.
Original link: http://www.cnblogs.com/liruihua/p/5957393.html