In this section, we will crawl web data by analyzing Ajax requests, using Toutiao (今日头条) as an example. The goal is to fetch Toutiao's street-photo (街拍) images and, after the crawl is complete, download each group of images into a local folder.
1. Preparatory work
Before you begin this section, make sure that you have installed the requests library.
2. Practical Walkthrough
First, implement the method get_page() to load the result of a single Ajax request. The only part of the request that changes is the parameter offset, so we pass it in as an argument. The implementation is as follows:
import requests
from urllib.parse import urlencode


def get_page(offset):
    # Query parameters of the Ajax request; only offset changes between requests
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
    }
    url = 'http://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # Convert the result to JSON format and return it
            return response.json()
    except requests.ConnectionError:
        return None
Here we use the urlencode() method to construct the GET parameters of the request, and then request the link with requests. If the returned status code is 200, we call the response's json() method to convert the result into JSON format and return it.
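As a quick illustration of what urlencode() produces, here is a minimal sketch; the parameter values below are a made-up subset for the example, not the full request:

from urllib.parse import urlencode

# Hypothetical subset of the parameters, just to show the constructed query string
params = {'offset': 0, 'format': 'json', 'keyword': '街拍'}
print('http://www.toutiao.com/search_content/?' + urlencode(params))
# http://www.toutiao.com/search_content/?offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D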
Next, implement a parsing method that extracts the image links from the image_detail field of each entry in data and yields each image link together with the image title, constructing a generator. The implementation code is as follows:
def get_images(json):
    # Traverse every entry in the data field and yield each image link with its title
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_detail')
            for image in images:
                yield {
                    'image': image.get('url'),
                    'title': title
                }
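To make the generator's behavior concrete, here is a minimal sketch with an assumed response structure; the sample data below is invented purely for illustration:

sample_json = {
    'data': [
        {
            'title': 'Street Snap Group',
            'image_detail': [{'url': 'http://example.com/1.jpg'},
                             {'url': 'http://example.com/2.jpg'}]
        }
    ]
}
for img in get_images(sample_json):
    print(img)
# {'image': 'http://example.com/1.jpg', 'title': 'Street Snap Group'}
# {'image': 'http://example.com/2.jpg', 'title': 'Street Snap Group'}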
Next, implement a method save_image() to save the images, where item is a dictionary returned by the get_images() method above. In this method, we first create a folder named after item's title, then request the image link, obtain the binary data of the image, and write it to a file in binary mode. The image name can be the MD5 hash of its content, which removes duplicates. The relevant code is as follows:
import os
from hashlib import md5


def save_image(item):
    # Create a folder named after the title if it does not exist yet
    if not os.path.exists(item.get('title')):
        os.mkdir(item.get('title'))
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            # Name the file with the MD5 hash of its content to remove duplicates
            file_path = '{0}/{1}.{2}'.format(item.get('title'),
                                             md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image')
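The deduplication works because identical image content always produces the same MD5 hash, so a repeated download maps to an existing file path and the write is skipped. A minimal sketch, with byte strings made up for the example:

from hashlib import md5

content_a = b'\xff\xd8\xff\xe0 fake jpeg bytes'
content_b = b'\xff\xd8\xff\xe0 fake jpeg bytes'
# Identical content -> identical hexdigest -> identical file path,
# so the os.path.exists() check above skips the duplicate write
print(md5(content_a).hexdigest() == md5(content_b).hexdigest())  # True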
Finally, we only need to construct an array of offset values, traverse the offsets, extract the image links, and download them:
from multiprocessing.pool import Pool


def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)


GROUP_START = 1
GROUP_END = 20

if __name__ == '__main__':
    pool = Pool()
    # Build the list of offsets, one per page, and download the pages in parallel
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()
Here GROUP_START and GROUP_END define the start and end pages of the pagination, and a multiprocessing pool with its map() method is used to download the pages in parallel.
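As a quick check of the pagination, here is what the offset list evaluates to with the values used above:

offsets = [x * 20 for x in range(1, 20 + 1)]   # GROUP_START = 1, GROUP_END = 20
print(offsets[0], offsets[-1], len(offsets))   # 20 400 20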
With that, the whole program is complete. After running it, you will find the street-photo images downloaded and saved folder by folder.
Finally, we give the code address for this section: https://github.com/Python3WebSpider/Jiepai.