In this section, we will crawl web data by analyzing Ajax requests, using Toutiao (今日头条) as an example. The goal is to fetch Toutiao's street-photo (街拍) images and, after the crawl is complete, download each group of images into a local folder.
1. Preparatory work
Before you begin this section, make sure that you have installed the requests library.
2. Practical Walkthrough
First, implement the method get_page() to load the result of a single Ajax request. The only part of the request that changes is the parameter offset, so we pass it in as an argument. The implementation is as follows:
import requests
from urllib.parse import urlencode


def get_page(offset):
    # Query parameters of the Ajax request; only offset changes between requests
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
    }
    url = 'http://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # Convert the result to JSON format and return it
            return response.json()
    except requests.ConnectionError:
        return None
Here we use the urlencode() method to construct the GET parameters of the request, and then request the link with requests. If the returned status code is 200, we call the response's json() method to convert the result into JSON format and return it.
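As a quick illustration of what urlencode() produces, here is a minimal sketch; the parameter values below are a made-up subset for the example, not the full request:

from urllib.parse import urlencode

# Hypothetical subset of the parameters, just to show the constructed query string
params = {'offset': 0, 'format': 'json', 'keyword': '街拍'}
print('http://www.toutiao.com/search_content/?' + urlencode(params))
# http://www.toutiao.com/search_content/?offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D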
Next, implement a parsing method that extracts the image links from the image_detail field of each entry in data and yields each image link together with the image title, constructing a generator. The implementation code is as follows:
def get_images(json):
    # Traverse every entry in the data field and yield each image link with its title
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_detail')
            for image in images:
                yield {
                    'image': image.get('url'),
                    'title': title
                }
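To make the generator's behavior concrete, here is a minimal sketch with an assumed response structure; the sample data below is invented purely for illustration:

sample_json = {
    'data': [
        {
            'title': 'Street Snap Group',
            'image_detail': [{'url': 'http://example.com/1.jpg'},
                             {'url': 'http://example.com/2.jpg'}]
        }
    ]
}
for img in get_images(sample_json):
    print(img)
# {'image': 'http://example.com/1.jpg', 'title': 'Street Snap Group'}
# {'image': 'http://example.com/2.jpg', 'title': 'Street Snap Group'}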
Next, implement a method save_image() to save the images, where item is a dictionary returned by the get_images() method above. In this method, we first create a folder named after item's title, then request the image link, obtain the binary data of the image, and write it to a file in binary mode. The image name can be the MD5 hash of its content, which removes duplicates. The relevant code is as follows:
import os
from hashlib import md5


def save_image(item):
    # Create a folder named after the title if it does not exist yet
    if not os.path.exists(item.get('title')):
        os.mkdir(item.get('title'))
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            # Name the file with the MD5 hash of its content to remove duplicates
            file_path = '{0}/{1}.{2}'.format(item.get('title'),
                                             md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image')
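The deduplication works because identical image content always produces the same MD5 hash, so a repeated download maps to an existing file path and the write is skipped. A minimal sketch, with byte strings made up for the example:

from hashlib import md5

content_a = b'\xff\xd8\xff\xe0 fake jpeg bytes'
content_b = b'\xff\xd8\xff\xe0 fake jpeg bytes'
# Identical content -> identical hexdigest -> identical file path,
# so the os.path.exists() check above skips the duplicate write
print(md5(content_a).hexdigest() == md5(content_b).hexdigest())  # True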
Finally, we only need to construct an array of offset values, traverse the offsets, extract the image links, and download them:
from multiprocessing.pool import Pool


def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)


GROUP_START = 1
GROUP_END = 20

if __name__ == '__main__':
    pool = Pool()
    # Build the list of offsets, one per page, and download the pages in parallel
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()
Here GROUP_START and GROUP_END define the start and end pages of the pagination, and a multiprocessing pool with its map() method is used to download the pages in parallel.
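As a quick check of the pagination, here is what the offset list evaluates to with the values used above:

offsets = [x * 20 for x in range(1, 20 + 1)]   # GROUP_START = 1, GROUP_END = 20
print(offsets[0], offsets[-1], len(offsets))   # 20 400 20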
With that, the whole program is complete. After running it, you will find the street-photo images downloaded and saved folder by folder.
Finally, we give the code address for this section: https://github.com/Python3WebSpider/Jiepai.