I. Scrapy Introduction and Deployment Environment
Scrapy is a third-party crawler framework for crawling web site data and extracting structured data. It can be used in a wide range of programs for data mining, information processing, or storing historical data.
Although originally designed for page fetching (more precisely, web crawling), it can also be used to retrieve data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is widely used for data mining, monitoring, and automated testing.
This environment is built on an Ubuntu 16.04 LTS x64 system with Python 2.7.12.
II. Installation of the Scrapy Framework
Installation is based mainly on pip. If pip is not yet installed, refer to http://dyc2005.blog.51cto.com/270872/1940870 for instructions; that is not covered in detail here.
Installing Scrapy
sudo apt-get install libssl-dev                # dependency
pip install pyopenssl --upgrade
pip install scrapy -i http://pypi.douban.com   # install from a domestic mirror
After the installation completes, check the version:
This shows that Scrapy 1.4.0 is installed.
III. A Scrapy Project Application
Use Scrapy to crawl the source files of Qiushibaike's image pages and parse out the image addresses.
1. Switch to the home directory and create the testspider project:
scrapy startproject testspider
Open the testspider project with PyCharm.
2. Directory structure and description:

testspider/                the project shell
    scrapy.cfg             Scrapy's configuration file
    testspider/            the project directory
        __init__.py        package file
        items.py           data model file
        middlewares.py     middleware file (e.g. proxy IPs)
        pipelines.py       data output pipeline file
        settings.py        the project's settings file
        spiders/           directory for spider code
            __init__.py    package file
3. Crawl images from Qiushibaike
Determine what to crawl:
https://www.qiushibaike.com/
https://www.qiushibaike.com/pic/
4. Modify the settings.py file: remove the comments and add the following content.
USER_AGENT: this is the request header your browser sends when you visit Qiushibaike; copy it here.
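The screenshot of the settings is not legible here; the fragment below is an illustrative sketch of the kind of changes described (the User-Agent string is an assumed placeholder, not the author's exact value — copy your own browser's header instead):

```python
# settings.py -- illustrative fragment, not the author's exact settings
# Replace the value with the User-Agent header copied from your own browser.
USER_AGENT = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0 Safari/537.36")

# Commonly disabled in tutorial crawls so pages are not skipped by robots.txt rules.
ROBOTSTXT_OBEY = False
```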
5. Determine the data model
Configure the items.py file.
Here img defines the field for the crawled image, and lable is the field for the image's title.
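The items.py screenshot is not recoverable here, but based on the field names mentioned in the text (img and lable, keeping the original's spelling), it presumably contains something like this sketch:

```python
# items.py -- sketch reconstructed from the field names in the text,
# not the author's exact file
import scrapy


class TestspiderItem(scrapy.Item):
    img = scrapy.Field()    # the crawled image URL
    lable = scrapy.Field()  # the image's title (spelling as in the original)
```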
6. Crawler code
In the spiders directory, create qiushispider.py:
# coding: utf-8
import scrapy


class QiushiScrapy(scrapy.spiders.Spider):
    name = "qiushi_img"                    # the spider's identifier
    allowed_domains = ["qiushibaike.com"]  # domain only, no scheme
    start_urls = ["https://www.qiushibaike.com/pic/"]  # link to start crawling from

    # 1. Simply save the crawled source file:
    # def parse(self, response):
    #     # When the server responds, Scrapy calls this method and passes
    #     # the response content as the first parameter, response.
    #     with open("1.txt", "w") as f:
    #         f.write(response.body)  # write the crawled result to 1.txt

    # 2. Get and extract the image URLs:
    def parse(self, response):
        img_list = response.xpath("//img")
        with open("1.txt", "w") as f:
            for img in img_list:
                # f.write(img.extract().encode("utf-8") + "\n")
                src = img.xpath("@src")
                content = src.extract()
                if content:
                    f.write(content[0].encode("utf-8") + "\n")
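The extraction step inside parse() can be illustrated without Scrapy at all. The following is a standalone sketch (written in Python 3 for portability, with made-up sample HTML) that uses the standard library's html.parser in place of Scrapy's XPath selectors:

```python
# Standalone sketch of the "collect every <img> src" step, standard library only.
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def extract_img_srcs(html):
    parser = ImgSrcParser()
    parser.feed(html)
    return parser.srcs


# Made-up sample markup, not real Qiushibaike HTML:
sample = '<div><img src="//pic.example.com/a.jpg"><img alt="no src"></div>'
print(extract_img_srcs(sample))  # ['//pic.example.com/a.jpg']
```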
Switch to the command line:
~/testspider$ scrapy list
qiushi_img
You will see the spider name qiushi_img.
Run the spider from the command line:
~/testspider$ scrapy crawl qiushi_img
The results are similar to:
The resulting 1.txt looks similar to this:
pic.qiushibaike.com/system/avtnew/1240/12405295/medium/20170630013041.jpeg
//pic.qiushibaike.com/system/pictures/11968/119682883/medium/app119682883.jpg
//pic.qiushibaike.com/system/avtnew/3114/31145101/medium/20170916180203.jpeg
//pic.qiushibaike.com/system/pictures/11968/119682882/medium/app119682882.jpg
//pic.qiushibaike.com/system/avtnew/3114/31145101/medium/20170916180203.jpeg
//pic.qiushibaike.com/system/pictures/11968/119682878/medium/app119682878.jpg
//pic.qiushibaike.com/system/avtnew/3440/34405509/medium/nopic.jpg
//pic.qiushibaike.com/system/pictures/11968/119682886/medium/app119682886.jpg
..... (part omitted)
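Most of the extracted addresses are protocol-relative (they start with //). A small helper, sketched here with https assumed as the scheme, turns them into fetchable URLs:

```python
def normalize(url, scheme="https"):
    """Prefix a protocol-relative URL (//host/path) with a scheme."""
    if url.startswith("//"):
        return scheme + ":" + url
    if url.startswith("http://") or url.startswith("https://"):
        return url
    # bare host/path, like the first line of 1.txt above
    return scheme + "://" + url


print(normalize("//pic.qiushibaike.com/system/avtnew/3440/34405509/medium/nopic.jpg"))
# https://pic.qiushibaike.com/system/avtnew/3440/34405509/medium/nopic.jpg
```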
This article only introduces a basic crawling workflow. Everything so far has been command-line based, which is not very convenient. Can we run Scrapy from PyCharm?
IV. Running Scrapy from PyCharm
Create run.py in the outer testspider project directory with the following content:
# coding: utf-8
from scrapy import cmdline

cmdline.execute("scrapy crawl qiushi_img".split())
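cmdline.execute() expects the command as an argv-style list rather than a single string, which is why .split() is applied:

```python
# .split() turns the command string into the argv list cmdline.execute() expects
argv = "scrapy crawl qiushi_img".split()
print(argv)  # ['scrapy', 'crawl', 'qiushi_img']
```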
Click "Run" in the menu bar, then "Edit Configurations":
Click "+" and then "Python".
Name: leave the default "Unnamed" unmodified; for Script, select the path to run.py, and set the working directory:
Now the spider can be run in PyCharm just as easily as any other project.
A Scrapy framework environment has now been built, and a basic crawler and the PyCharm setup have been deployed. Storing the crawled content is not covered here; a follow-up article will explain it further.
This article is from the "Learning, learning" blog; please keep this source: http://dyc2005.blog.51cto.com/270872/1975479