I. Scrapy Introduction and Deployment Environment
Scrapy is a third-party crawler framework for crawling web site data and extracting structured data. It can be used in a wide range of programs for data mining, information processing, or storing historical data.
Although originally designed for page fetching (more precisely, web crawling), it can also be used to retrieve data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is widely used for data mining, monitoring, and automated testing.
This environment is built on an Ubuntu 16.04 LTS x64 system with Python 2.7.12.
II. Installation of the Scrapy Framework
Installation is based mainly on pip. If pip is not yet installed, refer to http://dyc2005.blog.51cto.com/270872/1940870 for instructions; that is not covered in detail here.
Installing Scrapy
sudo apt-get install libssl-dev                # dependency
pip install pyopenssl --upgrade
pip install scrapy -i http://pypi.douban.com   # install from a domestic mirror
After the installation completes, check the version:
This shows that Scrapy 1.4.0 is installed.
III. A Scrapy Project Application
Use Scrapy to crawl the source files of Qiushibaike's image pages and parse out the image addresses.
1. Switch to the home directory and create the testspider project:
scrapy startproject testspider
Open the testspider project with PyCharm.
2. Directory structure and description:

testspider/                the project shell
    scrapy.cfg             Scrapy's configuration file
    testspider/            the project directory
        __init__.py        package file
        items.py           data model file
        middlewares.py     middleware file (e.g. proxy IPs)
        pipelines.py       data output pipeline file
        settings.py        the project's settings file
        spiders/           directory for spider code
            __init__.py    package file
3. Crawl images from Qiushibaike
Determine what to crawl:
https://www.qiushibaike.com/
https://www.qiushibaike.com/pic/
4. Modify the settings.py file: remove the comments and add the following content.
USER_AGENT: this is the request header your browser sends when you visit Qiushibaike; copy it here.
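The screenshot of the settings is not legible here; the fragment below is an illustrative sketch of the kind of changes described (the User-Agent string is an assumed placeholder, not the author's exact value — copy your own browser's header instead):

```python
# settings.py -- illustrative fragment, not the author's exact settings
# Replace the value with the User-Agent header copied from your own browser.
USER_AGENT = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0 Safari/537.36")

# Commonly disabled in tutorial crawls so pages are not skipped by robots.txt rules.
ROBOTSTXT_OBEY = False
```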
5. Determine the data model
Configure the items.py file.
Here img defines the field for the crawled image, and lable is the field for the image's title.
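The items.py screenshot is not recoverable here, but based on the field names mentioned in the text (img and lable, keeping the original's spelling), it presumably contains something like this sketch:

```python
# items.py -- sketch reconstructed from the field names in the text,
# not the author's exact file
import scrapy


class TestspiderItem(scrapy.Item):
    img = scrapy.Field()    # the crawled image URL
    lable = scrapy.Field()  # the image's title (spelling as in the original)
```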
6. Crawler code
In the spiders directory, create qiushispider.py:
# coding: utf-8
import scrapy


class QiushiScrapy(scrapy.spiders.Spider):
    name = "qiushi_img"                    # the spider's identifier
    allowed_domains = ["qiushibaike.com"]  # domain only, no scheme
    start_urls = ["https://www.qiushibaike.com/pic/"]  # link to start crawling from

    # 1. Simply save the crawled source file:
    # def parse(self, response):
    #     # When the server responds, Scrapy calls this method and passes
    #     # the response content as the first parameter, response.
    #     with open("1.txt", "w") as f:
    #         f.write(response.body)  # write the crawled result to 1.txt

    # 2. Get and extract the image URLs:
    def parse(self, response):
        img_list = response.xpath("//img")
        with open("1.txt", "w") as f:
            for img in img_list:
                # f.write(img.extract().encode("utf-8") + "\n")
                src = img.xpath("@src")
                content = src.extract()
                if content:
                    f.write(content[0].encode("utf-8") + "\n")
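The extraction step inside parse() can be illustrated without Scrapy at all. The following is a standalone sketch (written in Python 3 for portability, with made-up sample HTML) that uses the standard library's html.parser in place of Scrapy's XPath selectors:

```python
# Standalone sketch of the "collect every <img> src" step, standard library only.
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def extract_img_srcs(html):
    parser = ImgSrcParser()
    parser.feed(html)
    return parser.srcs


# Made-up sample markup, not real Qiushibaike HTML:
sample = '<div><img src="//pic.example.com/a.jpg"><img alt="no src"></div>'
print(extract_img_srcs(sample))  # ['//pic.example.com/a.jpg']
```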
Switch to the command line:
~/testspider$ scrapy list
qiushi_img
You will see the spider name qiushi_img.
Run the spider from the command line:
~/testspider$ scrapy crawl qiushi_img
The results are similar to:
The resulting 1.txt looks similar to this:
pic.qiushibaike.com/system/avtnew/1240/12405295/medium/20170630013041.jpeg
//pic.qiushibaike.com/system/pictures/11968/119682883/medium/app119682883.jpg
//pic.qiushibaike.com/system/avtnew/3114/31145101/medium/20170916180203.jpeg
//pic.qiushibaike.com/system/pictures/11968/119682882/medium/app119682882.jpg
//pic.qiushibaike.com/system/avtnew/3114/31145101/medium/20170916180203.jpeg
//pic.qiushibaike.com/system/pictures/11968/119682878/medium/app119682878.jpg
//pic.qiushibaike.com/system/avtnew/3440/34405509/medium/nopic.jpg
//pic.qiushibaike.com/system/pictures/11968/119682886/medium/app119682886.jpg
..... (part omitted)
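Most of the extracted addresses are protocol-relative (they start with //). A small helper, sketched here with https assumed as the scheme, turns them into fetchable URLs:

```python
def normalize(url, scheme="https"):
    """Prefix a protocol-relative URL (//host/path) with a scheme."""
    if url.startswith("//"):
        return scheme + ":" + url
    if url.startswith("http://") or url.startswith("https://"):
        return url
    # bare host/path, like the first line of 1.txt above
    return scheme + "://" + url


print(normalize("//pic.qiushibaike.com/system/avtnew/3440/34405509/medium/nopic.jpg"))
# https://pic.qiushibaike.com/system/avtnew/3440/34405509/medium/nopic.jpg
```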
This article only introduces a basic crawling workflow. Everything so far has been command-line based, which is not very convenient. Can we run Scrapy from PyCharm?
IV. Running Scrapy from PyCharm
Create run.py in the outer testspider project directory with the following content:
# coding: utf-8
from scrapy import cmdline

cmdline.execute("scrapy crawl qiushi_img".split())
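cmdline.execute() expects the command as an argv-style list rather than a single string, which is why .split() is applied:

```python
# .split() turns the command string into the argv list cmdline.execute() expects
argv = "scrapy crawl qiushi_img".split()
print(argv)  # ['scrapy', 'crawl', 'qiushi_img']
```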
Click "Run" in the menu bar, then "Edit Configurations":
Click "+" and then "Python".
Name: leave the default "Unnamed" unmodified; for Script, select the path to run.py, and set the working directory:
Now the spider can be run in PyCharm just as easily as any other project.
A Scrapy framework environment has now been built, and a basic crawler and the PyCharm setup have been deployed. Storing the crawled content is not covered here; a follow-up article will explain it further.
This article is from the "Learning, learning" blog; please keep this source: http://dyc2005.blog.51cto.com/270872/1975479