Setting up a Scrapy Environment on Ubuntu 16.04


I. Scrapy Introduction and Deployment Environment

Scrapy is a third-party crawler framework for crawling web sites and extracting structured data. It can be used in a range of applications, including data mining, information processing, and archiving historical data.
Although it was originally designed for page fetching (more precisely, web crawling), it can also be used to fetch data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is widely used for data mining, monitoring, and automated testing.

This environment is built on an Ubuntu 16.04 LTS x64 system with Python 2.7.12.


II. Installing the Scrapy Framework

The installation is done mainly with pip. If pip is not installed yet, refer to http://dyc2005.blog.51cto.com/270872/1940870 for how to install it; that is not covered in detail here.

Installing Scrapy

sudo apt-get install libssl-dev                 # dependency component
pip install pyopenssl --upgrade
pip install scrapy -i http://pypi.douban.com    # install from a domestic (Chinese) mirror
After the installation completes, check the version:

[Screenshot: the installed Scrapy version]

This shows that Scrapy 1.4.0 is installed.
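As a quick sanity check (not from the original article, just a minimal sketch), the installed version can also be confirmed from Python itself:

# confirm that Scrapy is importable and print its version
import scrapy
print(scrapy.__version__)   # should print something like 1.4.0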


III. A Scrapy Project Application: Crawling Qiushibaike

Use Scrapy to crawl the source of the Qiushibaike (糗事百科) picture pages and extract the image addresses.

1. Switch to your home directory and create the testspider project:

scrapy startproject testspider

[Screenshot: output of scrapy startproject]

Use PyCharm to open the testspider project.

2. Directory Structure and Description:

Directory structure

[Screenshot: project directory tree]

Structure description:

testspider/              outer shell of the project
    scrapy.cfg           Scrapy's own configuration file
    testspider/          the project directory
        __init__.py      package file
        items.py         data model file
        middlewares.py   middleware file (e.g. for proxy IPs)
        pipelines.py     data output pipeline file
        settings.py      the project's configuration file
        spiders/         directory where spiders are written
            __init__.py  package file

3. Crawl the Qiushibaike picture pages

Determine what to crawl:

https://www.qiushibaike.com/

https://www.qiushibaike.com/pic/

4. Modify settings.py

Uncomment the relevant lines in settings.py and add the following content:

[Screenshot: the modified section of settings.py]

USER_AGENT: this is the request header your browser sends when you visit Qiushibaike; copy it into the settings here.
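For reference, a minimal sketch of what the settings.py change might look like. The exact User-Agent string is whatever your own browser sends; the value below is only a placeholder, and ROBOTSTXT_OBEY = False is an assumption commonly paired with this change rather than something shown in the article's screenshot.

# settings.py (excerpt) -- illustrative values only
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0"  # placeholder; copy your own
ROBOTSTXT_OBEY = False  # assumption: often set so the spider is not blocked by robots.txt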


5. Define the data model

Configure the items.py file:

[Screenshot: the items.py data model]


Here img is the field for the crawled image, and label is the field for the picture's title.
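A minimal sketch of the corresponding items.py, assuming the two field names mentioned above (the original screenshot is not reproduced here, so the exact spelling of "label" is an assumption):

# items.py -- data model with the two fields described in the text
import scrapy

class TestspiderItem(scrapy.Item):
    img = scrapy.Field()    # the crawled image URL
    label = scrapy.Field()  # the title of the picture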


6. Spider code

In the spiders directory, create qiushispider.py:

# coding: utf-8
import scrapy


class QiushiScrapy(scrapy.spiders.Spider):
    name = "qiushi_img"                                # the spider's identifier
    allowed_domains = ["www.qiushibaike.com"]
    start_urls = ["https://www.qiushibaike.com/pic/"]  # the link to start crawling from

    """
    # 1. Simply save the crawled source file
    def parse(self, response):
        # When the server responds, Scrapy calls this method first and passes
        # the response content as the first parameter, i.e. response
        with open("1.txt", "w") as f:
            f.write(response.body)   # write the crawled result to 1.txt
    """

    # 2. Get and extract the picture URLs
    def parse(self, response):
        img_list = response.xpath("//img")
        with open("1.txt", "w") as f:
            for img in img_list:
                # f.write(img.extract().encode("utf-8") + "\n")
                src = img.xpath("@src")
                content = src.extract()
                if content:
                    f.write(content[0].encode("utf-8") + "\n")


Switch to the command line

~/testspider$ scrapy list
qiushi_img


You'll see the crawler name qiushi_img.

Run the crawler from the command line:

~/testspider$ scrapy crawl qiushi_img

The results are similar to:

[Screenshot: scrapy crawl log output]

The resulting 1.txt content is similar to this:

//pic.qiushibaike.com/system/avtnew/1240/12405295/medium/20170630013041.jpeg
//pic.qiushibaike.com/system/pictures/11968/119682883/medium/app119682883.jpg
//pic.qiushibaike.com/system/avtnew/3114/31145101/medium/20170916180203.jpeg
//pic.qiushibaike.com/system/pictures/11968/119682882/medium/app119682882.jpg
//pic.qiushibaike.com/system/avtnew/3114/31145101/medium/20170916180203.jpeg
//pic.qiushibaike.com/system/pictures/11968/119682878/medium/app119682878.jpg
//pic.qiushibaike.com/system/avtnew/3440/34405509/medium/nopic.jpg
//pic.qiushibaike.com/system/pictures/11968/119682886/medium/app119682886.jpg
... (part omitted)
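The extracted addresses are protocol-relative (they start with "//"). As a small illustration, not part of the original article, they could be turned into full URLs like this:

# prepend a scheme to the protocol-relative URLs saved in 1.txt
with open("1.txt") as f:
    for line in f:
        url = line.strip()
        if url.startswith("//"):
            url = "https:" + url
        print(url)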

This article is just a simple introduction to a basic crawler workflow.

So far everything has run from the command line, which is not very convenient. Can we run the spider from PyCharm instead?


IV. Running Scrapy from PyCharm

Create run.py in the outer testspider project directory with the following content:

# coding: utf-8
from scrapy import cmdline

cmdline.execute("scrapy crawl qiushi_img".split())
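cmdline.execute() calls the same entry point that the scrapy command uses, so running run.py from PyCharm is equivalent to typing "scrapy crawl qiushi_img" in the project directory.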

Click "Run" in the menu bar and "Edit configurations":

[Screenshot: the Run menu in PyCharm]


[Screenshot: the Edit Configurations dialog]

Point "+"--"python"

[Screenshot: adding a new Python run configuration]

Name: "Unamed" This is not modified by default, Script: Select run.py directory and working directory:

Now running the spider in PyCharm is just as easy as running any other project.

[Screenshot: the completed run configuration for run.py]


At this point the Scrapy framework is set up, and a basic crawler and the PyCharm environment have been deployed. Storing the crawled content has not been covered; that will be explained in a follow-up.

This article is from the "Learning, learning" blog, please be sure to keep this source http://dyc2005.blog.51cto.com/270872/1975479

