Introduction to the Python-based Scrapy crawler

Source: Internet
Author: User
Tags: virtual environment

(i) Content analysis

Next, we create a crawler project that captures the images from the Tuchong photo site. In the top menu, "Find" → "Tags" lists the categories of pictures. Clicking a tag, such as "Python Video Course", opens a page whose URL is Http://www.codingke.com/Python Video Course/. We use this page as the crawler's entry point and analyze it:

The page shows a set of photo galleries. Clicking a gallery lets you browse its pictures full screen, and scrolling down a couple of pages loads more galleries; there is no page-number pagination. In Chrome, right-click and choose "Inspect Element" to open the developer tools and check the page source. The content is as follows:

<divclass= "Content" >

<divclass= "Widget-gallery" >

<ulclass= "Pagelist-wrapper" >

<liclass= "Gallery-item ...

It can be seen that each li.gallery-item is the entry to one gallery, stored under ul.pagelist-wrapper, and div.widget-gallery is the container. The XPath selector would be //div[@class="widget-gallery"]/ul/li. Following normal page logic, we would find the link address under each li.gallery-item and then drill down one more level to grab the pictures.
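If the gallery list were rendered directly in the page, it could be selected with exactly that XPath. Below is a quick sketch using Scrapy's Selector on a fragment shaped like the markup above (the href values are hypothetical, and this approach is not used in the final crawler, for the reason described next):

# Sketch: select gallery entries with the XPath worked out above.
from scrapy.selector import Selector

html = '''
<div class="content">
  <div class="widget-gallery">
    <ul class="pagelist-wrapper">
      <li class="gallery-item"><a href="/post/1">...</a></li>
      <li class="gallery-item"><a href="/post/2">...</a></li>
    </ul>
  </div>
</div>
'''
sel = Selector(text=html)
galleries = sel.xpath('//div[@class="widget-gallery"]/ul/li')
print(len(galleries))  # 2 gallery entries in this fragment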

But if the page is requested using an HTTP debugging tool like Postman, the content is:

<divclass= "Content" >

<divclass= "Widget-gallery" ></div>

</div>

That is, the actual gallery content is missing, so we can conclude that the page uses AJAX requests: the gallery content is only requested and inserted into div.widget-gallery when the browser runs the page. Looking at the XHR requests in the developer tools, the request address is:

Http://www.codingke.com/Python Video Course/posts?page=1&count=20&order=weekly&before_timestamp=

The parameters are very simple: page is the page number, count is the number of galleries per page, order is the sort order, and before_timestamp is empty. Since Tuchong is a site that pushes content, before_timestamp should be a time value, and different times return different content. Here we simply discard it and crawl forward from the newest page, without considering time.
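To double-check this analysis outside the browser, the same API can be requested directly from Python. Below is a minimal sketch using the requests library (not part of the Scrapy project itself); the tuchong.com endpoint matches the one the spider uses later, and the tag value is only an example:

# Minimal sketch: fetch one page of the gallery-list API and inspect the JSON.
import requests

tag = 'Belle'  # example tag; replace with the tag you actually want to crawl
api_url = 'https://tuchong.com/rest/tags/%s/posts' % tag
params = {'page': 1, 'count': 20, 'order': 'weekly', 'before_timestamp': ''}
resp = requests.get(api_url, params=params)
data = resp.json()
print(data.get('result'), len(data.get('postList', [])))  # status and number of galleries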

The request returns content in JSON format, which reduces the difficulty of crawling. The result is as follows:

{
  "postList": [
    {
      "post_id": "15624611",
      "type": "multi-photo",
      "url": "http://www.codingke.com/",
      "site_id": "443122",
      "author_id": "443122",
      "published_at": "2017-10-28 18:01:03",
      "excerpt": "October 18",
      "favorites": 4052,
      "comments": 353,
      "rewardable": true,
      "parent_comments": "165",
      "rewards": "2",
      "views": 52709,
      "title": "The breeze is not dry autumn just right",
      "image_count": 15,
      "images": [
        {
          "img_id": 11585752,
          "user_id": 443122,
          "title": "",
          "excerpt": "",
          "width": 5016,
          "height": 3840
        },
        {
          "img_id": 11585737,
          "user_id": 443122,
          "title": "",
          "excerpt": "",
          "width": 3840,
          "height": 5760
        },
        ...
      ],
      "title_image": null,
      "tags": [
        {
          "tag_id": 131,
          "type": "subject",
          "tag_name": "Portrait",
          "event_type": "",
          "vote": ""
        },
        {
          "tag_id": 564,
          "type": "subject",
          "tag_name": "Beauty",
          "event_type": "",
          "vote": ""
        }
      ],
      "favorite_list_prefix": [],
      "reward_list_prefix": [],
      "comment_list_prefix": [],
      "cover_image_src": "Http://www.codingke.com/Python Video Course/",
      "is_favorite": false
    }
  ],
  "siteList": {...},
  "following": false,
  "coverUrl": "Http://www.codingke.com/Python Video Course/",
  "tag_name": "Beauty",
  "tag_id": "564",
  "url": "https://tuchong.com/tags/%E7%BE%8E%E5%A5%B3/",
  "more": true,
  "result": "SUCCESS"
}

The property names make the meaning of the corresponding content easy to guess. Here we only care about the postList property; each element of this array is a gallery, and we need the following properties of each gallery element:

url: the page address for viewing a single gallery

post_id: the gallery number, which should be unique within the site and can be used to determine whether content has already been crawled

site_id: the author's site number, used to build the image source link

title: the title

excerpt: the summary text

type: the gallery type. Two types have been found so far: "multi-photo" is a pure photo gallery, and "text" is an article page mixing words and pictures. The two have different content structures and need different crawling methods; this example only captures the pure photo type and simply discards the text type.

tags: the gallery tags; there can be several

image_count: the number of pictures

images: the picture list, an array of objects, each containing an img_id attribute that we need

Based on analysis of the image viewing page, the image address has this format: https://photo.tuchong.com/{site_id}/f/{img_id}.jpg, which is easy to compose from the information above.
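For example, a short sketch of composing the image address from the two IDs (using the values from the JSON above):

# Compose the full-size image URL from site_id and img_id, following the
# https://photo.tuchong.com/{site_id}/f/{img_id}.jpg format noted above.
def image_url(site_id, img_id):
    return 'https://photo.tuchong.com/%s/f/%s.jpg' % (site_id, img_id)

print(image_url('443122', 11585752))
# https://photo.tuchong.com/443122/f/11585752.jpg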

(ii) Creation of the project

Open the Cmder command-line tool and enter workon scrapy to activate the previously created virtual environment. A (scrapy) marker appears in front of the command-line prompt, and the relevant paths are added to the PATH environment variable, which makes development and use inside the virtual environment convenient.

Enter scrapy startproject tuchong to create the project tuchong.

Enter the project's home directory and run scrapy genspider photo tuchong.com to create a crawler named photo (it cannot have the same name as the project) that crawls the tuchong.com domain (this will be modified later; for now just enter a general address). A project can contain multiple crawlers.

After the above steps, the project has automatically created some files and settings. The directory structure is as follows:

(PROJECT)
│  scrapy.cfg
│
└─tuchong
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  photo.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          __init__.cpython-36.pyc
    │
    └─__pycache__
            settings.cpython-36.pyc
            __init__.cpython-36.pyc

scrapy.cfg: basic settings

items.py: structure definitions for the crawled items

middlewares.py: middleware definitions, not modified in this example

pipelines.py: pipeline definitions, used to process data after it has been crawled

settings.py: global settings

spiders/photo.py: the crawler body, which defines how to crawl the required data

(iii) Main code

Create a TuchongItem class in items.py and define the required attributes there. An attribute can hold characters, numbers, lists, dictionaries, and so on; each one is declared as a scrapy.Field() value:

import scrapy

class TuchongItem(scrapy.Item):
    post_id = scrapy.Field()
    site_id = scrapy.Field()
    title = scrapy.Field()
    type = scrapy.Field()
    url = scrapy.Field()
    image_count = scrapy.Field()
    images = scrapy.Field()
    tags = scrapy.Field()
    excerpt = scrapy.Field()
    ...

The values of these attributes will be assigned in the crawler body.
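A TuchongItem instance behaves like a dictionary, and only declared fields can be assigned. A quick illustration (the values are just placeholders, and the import path assumes the project layout created above):

from tuchong.items import TuchongItem  # assumes the project layout created above

item = TuchongItem()
item['title'] = 'example title'   # a declared field is assigned like a dict key
item['post_id'] = '15624611'
print(item['title'], list(item.keys()))
# assigning an undeclared field, e.g. item['foo'] = 1, would raise KeyError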

The file spiders/photo.py was created automatically by the command scrapy genspider photo tuchong.com; its initial content is as follows:

import scrapy

class PhotoSpider(scrapy.Spider):
    name = 'photo'
    allowed_domains = ['tuchong.com']
    start_urls = ['http://tuchong.com/']

    def parse(self, response):
        pass

Here name is the crawler name, allowed_domains lists the allowed domain names (links outside these domains are discarded; multiple entries are allowed), and start_urls are the starting addresses the crawler fetches first (multiple entries are allowed).

The function parse is the default callback that handles the response of a request; the parameter response is the response, and the page text is available in response.body. We need to modify the default code slightly so that it sends requests for multiple pages in a loop. This requires overriding the start_requests function, which constructs the multi-page requests with a loop statement. The modified code is as follows:

import scrapy, json
from ..items import TuchongItem

class PhotoSpider(scrapy.Spider):
    name = 'photo'
    # allowed_domains = ['tuchong.com']
    # start_urls = ['http://tuchong.com/']

    def start_requests(self):
        url = 'https://tuchong.com/rest/tags/%s/posts?page=%d&count=20&order=weekly'
        # Crawl 10 pages, 20 galleries per page
        # Specify parse as the callback function and yield Request objects
        for page in range(1, 11):
            yield scrapy.Request(url=url % ('Belle', page), callback=self.parse)

    # Callback function: process the crawled content and fill the TuchongItem attributes
    def parse(self, response):
        body = json.loads(response.body_as_unicode())
        items = []
        for post in body['postList']:
            item = TuchongItem()
            item['type'] = post['type']
            item['post_id'] = post['post_id']
            item['site_id'] = post['site_id']
            item['title'] = post['title']
            item['url'] = post['url']
            item['excerpt'] = post['excerpt']
            item['image_count'] = int(post['image_count'])
            item['images'] = {}
            # Turn images into an {img_id: img_url} dictionary
            for img in post.get('images', []):
                img_id = img['img_id']
                url = 'https://photo.tuchong.com/%s/f/%s.jpg' % (item['site_id'], img_id)
                item['images'][img_id] = url
            item['tags'] = []
            # Turn tags into an array of tag_name values
            for tag in post.get('tags', []):
                item['tags'].append(tag['tag_name'])
            items.append(item)
        return items

After these steps, the captured data is stored as structured data in TuchongItem objects, which makes it easy to process and save.

As mentioned earlier, not all of the crawled items are needed. For example, in this case we only need galleries with type = "multi-photo", and galleries with too few pictures are not needed either. Filtering the crawled items, and deciding how to save them, is handled in pipelines.py, where the class TuchongPipeline has already been created by default with a process_item function. By modifying this function, only items that meet the conditions are returned. The code is as follows:

...
    # requires: from scrapy.exceptions import DropItem at the top of pipelines.py
    def process_item(self, item, spider):
        # Items that do not meet the conditions raise scrapy.exceptions.DropItem;
        # items that do have their address printed
        if int(item['image_count']) < 3:
            raise DropItem("Too few beauties: " + item['url'])
        elif item['type'] != 'multi-photo':
            raise DropItem("Wrong format: " + item['url'])
        else:
            print(item['url'])
        return item
...

Of course, you could also do the filtering directly in parse without using a pipeline, but the structure is clearer this way, and Scrapy also provides the more capable FilesPipeline and ImagesPipeline. process_item is triggered after each item is crawled, and there are also open_spider and close_spider functions that can be overridden to handle actions when the crawler opens and closes.
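For illustration, a small sketch of those two hooks; the MyBlogDB class here is a hypothetical stand-in for whatever resource you need to manage and is not part of this project:

# Sketch of the open_spider / close_spider lifecycle hooks in a pipeline.
# MyBlogDB and its close() method are hypothetical placeholders.
class MyBlogDB(object):
    def close(self):
        pass  # placeholder for releasing a real resource

class TuchongPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts, e.g. open a database connection
        self.myblog = MyBlogDB()

    def close_spider(self, spider):
        # called once when the spider finishes, e.g. release the connection
        self.myblog.close()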

Note: pipelines must be registered in the project before they take effect, by adding them in settings.py:

ITEM_PIPELINES = {
    'tuchong.pipelines.TuchongPipeline': 300,  # pipeline name: priority (smaller numbers run first)
}

In addition, most sites have an anti-crawler robots.txt exclusion protocol. Setting ROBOTSTXT_OBEY = False makes Scrapy ignore these protocols; yes, it seems to be just a gentlemen's agreement. If the site checks the browser User-Agent or the IP address to block crawlers, more advanced Scrapy features are needed, which this article does not cover.
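For reference, the relevant lines in settings.py might look like this sketch (the User-Agent string is only an example value):

# settings.py (excerpt): ignore robots.txt for this crawl
ROBOTSTXT_OBEY = False
# Optionally present a browser-like User-Agent if the site checks it
# (the string below is only an example):
# USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'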

(iv) Operation

Return to the Cmder command line, go to the project directory, and enter the command:

scrapy crawl photo

The terminal outputs all crawl results and debug information, and at the end lists the crawler's run statistics, for example:

[scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 491,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 10224,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 11, 27, 7, 20, 24, 414201),
 'item_dropped_count': 5,
 'item_dropped_reasons_count/DropItem': 5,
 'item_scraped_count': 15,
 'log_count/DEBUG': 18,
 'log_count/INFO': 8,
 'log_count/WARNING': 5,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 11, 27, 7, 20, 23, 867300)}

The main things to watch are ERROR and WARNING; here the WARNING entries are actually the items that did not meet the conditions and triggered the DropItem exception.

(v) Save results

In most cases you will need to save the crawled results. By default, the attributes defined in items.py can be saved to a file; all that is needed is the command-line parameter -o {filename}:

scrapy crawl photo -o output.json   # output as a JSON file

scrapy crawl photo -o output.csv    # output as a CSV file

Note: items written to the output file first pass through the item pipelines, so items dropped by TuchongPipeline do not appear in the file; you can also filter in parse and return only the items you want.
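As an alternative, a sketch based on Scrapy's feed export settings: the output file can also be configured in settings.py so that the -o parameter is not needed every time:

# settings.py (excerpt): equivalent to passing "-o output.json" on the command line
FEED_FORMAT = 'json'
FEED_URI = 'output.json'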

If you need to save to a database, extra code is required; for example, you can add the following inside process_item in pipelines.py:

...
    def process_item(self, item, spider):
        ...
        else:
            print(item['url'])
            self.myblog.add_post(item)  # myblog is a database helper class used for the database operations
        return item
...

To exclude duplicates when inserting into the database, you can use item['post_id'] to check whether the record already exists and skip it if it does.
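A minimal sketch of such a check, assuming the hypothetical myblog helper also exposes an exists() lookup (myblog would be created in open_spider, as sketched earlier):

# Sketch: skip items whose post_id already exists in the database.
# self.myblog and its exists()/add_post() methods are hypothetical placeholders.
from scrapy.exceptions import DropItem

class TuchongPipeline(object):
    def process_item(self, item, spider):
        if self.myblog.exists(item['post_id']):
            raise DropItem("Duplicate post: " + item['url'])
        self.myblog.add_post(item)
        return item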
