Preface: I had long heard of Scrapy, the well-known Python crawler framework. Over the past few days I have been learning it, and here I share what I understand. If anything is expressed poorly, I hope more experienced readers will correct me.
First, a glimpse of Scrapy
Scrapy is an application framework written to crawl websites and extract structured data. It can be used in a range of applications including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web crawling), but it can also be used to fetch data returned by APIs (for example, Amazon Associates Web Services) or to build a general-purpose web crawler.
This document introduces the concepts behind Scrapy to give you an idea of how it works and to help you determine whether Scrapy is what you need.
When you are ready to start your project, you can refer to the Getting Started tutorial.
Second, installing Scrapy
Platforms and auxiliary tools required to run the Scrapy framework:
- Python 2.7 (the latest Python is 3.5; version 2.7 is used here)
- Python packages: pip and setuptools. pip now depends on setuptools and installs it automatically if it is missing.
- lxml. Most Linux distributions ship with lxml. If it is missing, see http://lxml.de/installation.html
- OpenSSL. Available out of the box on all systems except Windows (see the platform installation guide). A quick import sanity check is sketched below.
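Before installing Scrapy it can save time to confirm that the two binary dependencies import cleanly. This is a minimal, illustrative check under Python 2.7 (not part of the original steps); it assumes lxml and pyOpenSSL are installed:

import lxml.etree
import OpenSSL

# both imports succeeding means the compiled dependencies are usable;
# print the lxml version tuple as a visible confirmation
print lxml.etree.LXML_VERSION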
You can install Scrapy with pip (installing Python packages with pip is recommended):
pip install scrapy
Installation process under Windows:
1. After installing Python 2.7, you need to modify the PATH environment variable to add the Python executable and its scripts directory to the system path. Add the following to PATH:
C:\Python27\;C:\Python27\Scripts\;
Alternatively, you can set the path from the cmd command line:
C:\python27\python.exe c:\python27\tools\scripts\win_add2path.py
After this configuration is complete, you can run python --version to check the installed Python version.
2. Install pywin32 from http://sourceforge.net/projects/pywin32/
Make sure to download the version that matches your system (win32 or amd64).
Then install pip from https://pip.pypa.io/en/latest/installing.html
3. Open a command-line window and confirm that pip is installed correctly:
pip --version
4. At this point Python 2.7 and pip are running correctly. Next, install Scrapy:
pip install scrapy
This completes the Scrapy installation on Windows.
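As a final optional check (not part of the original steps), the scrapy command-line tool can report its own version, which confirms the install succeeded:

scrapy version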
Third, Scrapy introductory tutorial
1. Create a Scrapy project from the command line:
scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory 'f:\python27\lib\site-packages\scrapy\templates\project', created in:
    H:\python\scrapyDemo\tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
2. The project's file directory structure is as follows. A breakdown of the Scrapy project structure:
scrapy.cfg: the project's configuration file.
tutorial/: the project's Python module; you will add your code here later.
tutorial/items.py: the item definitions for the project.
tutorial/pipelines.py: the pipelines file for the project.
tutorial/settings.py: the settings file for the project.
tutorial/spiders/: the directory where spider code is placed.
3. Write a simple crawler
1. In items.py, define the fields to be collected from the pages.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.item import Item, Field


class TutorialItem(Item):
    title = Field()
    author = Field()
    releasedate = Field()
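An Item behaves much like a Python dict, which makes it easy to build and inspect by hand. A minimal, illustrative sketch (not one of the tutorial's files; the field values are made up):

from tutorial.items import TutorialItem

# items are populated like dictionaries; only declared fields are allowed
item = TutorialItem()
item['title'] = u'Example headline'
item['author'] = u'Example author'
print item['title']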
2. In tutorial/spiders/spider.py, specify the website to be crawled and collect each field.
# -*- coding: utf-8 -*-
import sys

from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import TutorialItem

reload(sys)
sys.setdefaultencoding("utf-8")


class ListSpider(CrawlSpider):
    # spider name
    name = "tutorial"
    # set the download delay
    download_delay = 1
    # allowed domain names
    allowed_domains = ["news.cnblogs.com"]
    # start URLs
    start_urls = [
        "https://news.cnblogs.com"
    ]
    # crawl rules; a rule without a callback is followed recursively
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'https://news.cnblogs.com/n/page/\d',))),
        Rule(SgmlLinkextractor(allow=(r'https://news.cnblogs.com/n/\d+',)), callback='parse_content'),
    )

    # content-parsing function
    def parse_content(self, response):
        item = TutorialItem()

        # title of the current page
        title = response.selector.xpath('//div[@id="news_title"]')[0].extract().decode('utf-8')
        item['title'] = title

        author = response.selector.xpath('//div[@id="news_info"]/span/a/text()')[0].extract().decode('utf-8')
        item['author'] = author

        releasedate = response.selector.xpath('//div[@id="news_info"]/span[@class="time"]/text()')[0].extract().decode('utf-8')
        item['releasedate'] = releasedate

        yield item
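One caveat: SgmlLinkExtractor is deprecated in newer Scrapy releases, and the import above may fail there. Assuming Scrapy 1.0 or later, the generic LinkExtractor is a drop-in replacement for these rules (a hedged alternative, not what the original uses):

from scrapy.linkextractors import LinkExtractor

# equivalent crawl rules with the non-deprecated extractor
rules = (
    Rule(LinkExtractor(allow=(r'https://news.cnblogs.com/n/page/\d',))),
    Rule(LinkExtractor(allow=(r'https://news.cnblogs.com/n/\d+',)), callback='parse_content'),
)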
3. Save the data through the pipeline in tutorial/pipelines.py.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs


class TutorialPipeline(object):
    def __init__(self):
        # data is stored in data.json
        self.file = codecs.open('data.json', mode='wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        # decode \uXXXX escapes so non-ASCII text is stored readably
        self.file.write(line.decode("unicode_escape"))

        return item
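The pipeline above never closes its file handle explicitly. Scrapy calls a close_spider method on pipelines, if one is defined, when the spider finishes; adding one is a small improvement (my suggestion, not in the original):

    # goes inside TutorialPipeline: called once when the spider closes
    def close_spider(self, spider):
        self.file.close()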
4. Configure the execution environment in tutorial/settings.py.
# -*- coding: utf-8 -*-

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# disable cookies to reduce the risk of being banned
COOKIES_ENABLED = False

# register the item pipeline; this is where scraped data is written to the file
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300
}

# maximum crawl depth
DEPTH_LIMIT = 100
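The download delay that was set as a spider attribute earlier could instead be configured project-wide here. For example (an equivalent alternative, not in the original settings):

# wait 1 second between requests for every spider in the project,
# instead of the per-spider download_delay attribute
DOWNLOAD_DELAY = 1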
5. Create a new main.py file to execute the crawler.
from scrapy import cmdline

cmdline.execute("scrapy crawl tutorial".split())
Finally, after running main.py, the collected results are available as JSON data in the data.json file.
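Because the pipeline writes one JSON object per line, the results are easy to load back for inspection. A minimal, illustrative reader (assuming the data.json produced above):

import json
import codecs

# read the crawl results back, one JSON object per line
with codecs.open('data.json', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        print record['title'], record['releasedate']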