Sesame HTTP: Analyzing Ajax to Crawl Toutiao Street Photos


In this section, we take Toutiao (Today's Headlines) as an example and try to crawl web page data by analyzing Ajax requests. The goal of this crawl is to fetch the street photos on Toutiao and, after the crawl completes, download each group of images to a local folder and save it.

1. Preparatory work

Before you begin this section, make sure that you have installed the requests library.
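
If you want a quick sanity check that requests is importable, a minimal snippet such as the following will do (any printed version string means the install succeeded):

import requests

# Importing without error and printing a version string confirms the install
print(requests.__version__)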

2. Practical Walkthrough

First, implement the method get_page() to load the result of a single Ajax request. The only changing parameter is offset, so we pass it in as an argument. The implementation is as follows:

import requests
from urllib.parse import urlencode

def get_page(offset):
    # Query parameters of Toutiao's search API; only offset changes between requests
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',  # "street photo"
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
    }
    url = 'http://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None

Here we use the urlencode() method to construct the GET parameters of the request, and then request the link with requests. If the returned status code is 200, we call the response's json() method to convert the result to JSON format and return it.
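
As a standalone illustration (the parameter values below are just a sample), urlencode() serializes the parameter dictionary into a percent-encoded query string:

from urllib.parse import urlencode

params = {'offset': 0, 'format': 'json', 'keyword': '街拍'}
# Non-ASCII values are percent-encoded automatically
print('http://www.toutiao.com/search_content/?' + urlencode(params))
# http://www.toutiao.com/search_content/?offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D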

Next, implement a parsing method: extract the image link from the image_detail field of each entry in data, return the image link together with the image title, and construct a generator. The implementation code is as follows:

def get_images(json):
    # Walk every entry in the "data" field and yield one dict per image
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_detail')
            if not images:  # skip entries that carry no image_detail field
                continue
            for image in images:
                yield {
                    'image': image.get('url'),
                    'title': title
                }
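
To see what the generator yields, here is a run against a hand-made response (the sample data below is invented for illustration and is not real Toutiao output):

sample = {
    'data': [
        {
            'title': 'Street photo group',
            'image_detail': [{'url': 'http://p1.example.com/a.jpg'}],
        }
    ]
}
for item in get_images(sample):
    print(item)
# {'image': 'http://p1.example.com/a.jpg', 'title': 'Street photo group'}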

Next, implement a method save_image() to save the image, where item is one of the dictionaries returned by the get_images() method above. In this method, first create a folder based on item's title, then request the image link, get the image's binary data, and write it to a file in binary mode. The file name can be the MD5 hash of the image content, which removes duplicates. The relevant code is as follows:

import os
from hashlib import md5

def save_image(item):
    # Create a folder named after the group's title
    if not os.path.exists(item.get('title')):
        os.mkdir(item.get('title'))
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            # Name the file after the MD5 hash of its content to skip duplicates
            file_path = '{0}/{1}.{2}'.format(item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to save image')
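
The MD5 naming works as deduplication because identical bytes always hash to the same digest, so a re-downloaded image maps to the same file name (a standalone illustration):

from hashlib import md5

data = b'the same image bytes'
# Two downloads of identical content produce identical file names
print(md5(data).hexdigest() == md5(b'the same image bytes').hexdigest())  # True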

Finally, you only need to construct an array of offsets, traverse it, extract the image links, and download them:

from multiprocessing.pool import Pool

def main(offset):
    json = get_page(offset)
    if json:  # guard against a failed request returning None
        for item in get_images(json):
            print(item)
            save_image(item)

GROUP_START = 1
GROUP_END = 20

if __name__ == '__main__':
    pool = Pool()
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()

This defines the start and end pages of the pagination, GROUP_START and GROUP_END respectively, and also uses a process pool from multiprocessing, calling its map() method to download the pages in parallel across processes.
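
For reference, with GROUP_START = 1 and GROUP_END = 20 the list comprehension produces the offsets that map() passes to main() (a standalone check):

GROUP_START, GROUP_END = 1, 20
groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
print(groups[0], groups[1], '...', groups[-1])  # 20 40 ... 400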

With that, the whole program is finished. After running it, you will find the street photos saved in folders named after each group.

Finally, here is the code address for this section: https://github.com/Python3WebSpider/Jiepai.
