Sesame HTTP: Analyzing Ajax Requests to Crawl Toutiao Street Photos
In this section, we take Toutiao (toutiao.com) as an example and capture web data by analyzing Ajax requests. The target is Toutiao's street photo search results. After the capture completes, each group of images is downloaded and saved to a local folder.
1. Preparations
Before starting this section, make sure that you have installed the requests library.
2. Practical drills
First, implement a method `get_page()` that loads the result of a single Ajax request. The only parameter that changes between requests is `offset`, so we pass it as an argument. The implementation is as follows:
```python
import requests
from urllib.parse import urlencode

def get_page(offset):
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',  # "street snap"
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
    }
    url = 'http://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None
```
Here we use the `urlencode()` method to construct the GET parameters of the request, then fetch the resulting URL with requests. If the returned status code is 200, we call the `json()` method of `response` to parse the result as JSON and return it.
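To see what `urlencode()` actually produces, here is a small standalone illustration; the parameter values are trimmed down from the ones above purely for readability:

```python
from urllib.parse import urlencode

# urlencode() turns a dict into a percent-encoded query string,
# preserving the dict's insertion order (Python 3.7+).
params = {'offset': 0, 'format': 'json', 'count': '20'}
query = urlencode(params)
print(query)  # offset=0&format=json&count=20

url = 'http://www.toutiao.com/search_content/?' + query
```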
Next, we implement a parsing method: it extracts each image link from the `image_detail` field and returns the image link together with the image's title. A generator is a natural fit here. The implementation code is as follows:
```python
def get_images(json):
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_detail')
            for image in images:
                yield {
                    'image': image.get('url'),
                    'title': title
                }
```
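To make the generator's behavior concrete, here is a self-contained check that feeds `get_images()` (repeated below so the snippet runs on its own) a hand-built dict mimicking the shape of the Ajax response; the title and URLs are invented for illustration:

```python
def get_images(json):
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_detail')
            for image in images:
                yield {'image': image.get('url'), 'title': title}

# A fake response: one result group with two images.
sample = {
    'data': [
        {
            'title': 'Street Snap 001',
            'image_detail': [
                {'url': 'http://example.com/a.jpg'},
                {'url': 'http://example.com/b.jpg'},
            ],
        }
    ]
}

for entry in get_images(sample):
    print(entry)
# {'image': 'http://example.com/a.jpg', 'title': 'Street Snap 001'}
# {'image': 'http://example.com/b.jpg', 'title': 'Street Snap 001'}
```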
Next, we implement a method `save_image()` for saving images, where `item` is one of the dictionaries returned by `get_images()`. The method creates a folder from the `title` of `item`, requests the image link to obtain the image's binary data, and writes it to a file in binary mode. The MD5 hash of the image content is used as the file name, which deduplicates identical images. The related code is as follows:
```python
import os
from hashlib import md5

def save_image(item):
    if not os.path.exists(item.get('title')):
        os.mkdir(item.get('title'))
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            file_path = '{0}/{1}.{2}'.format(
                item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image')
```
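Why does hashing the content deduplicate? Identical bytes always produce the same MD5 digest, so a re-downloaded copy of an image maps to the same file path and is skipped. A quick offline illustration (the byte strings are made up):

```python
from hashlib import md5

img_a = b'\x89PNG fake image bytes'
img_b = b'\x89PNG fake image bytes'   # same content, e.g. a re-download
img_c = b'\x89PNG different bytes'

# Same bytes -> same digest -> same file name -> skipped as a duplicate.
print(md5(img_a).hexdigest() == md5(img_b).hexdigest())  # True
# Different bytes -> different digest -> saved as a new file.
print(md5(img_a).hexdigest() == md5(img_c).hexdigest())  # False
```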
Finally, we only need to construct an array of `offset` values, traverse the offsets, and extract and download the image links:
```python
from multiprocessing.pool import Pool

def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)

GROUP_START = 1
GROUP_END = 20

if __name__ == '__main__':
    pool = Pool()
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool.map(main, groups)
    pool.close()
    pool.join()
```
Here `GROUP_START` and `GROUP_END` define the start and end pages. A `Pool` from `multiprocessing` is created, and its `map()` method spreads the calls to `main()` across a pool of worker processes, downloading the pages in parallel.
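As a sanity check on what `map()` receives, the offset list for the bounds above can be printed on its own, with no network access needed:

```python
GROUP_START = 1
GROUP_END = 20

# Each page of results holds 20 items, so offsets step by 20.
groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
print(groups[:3], '...', groups[-1])  # [20, 40, 60] ... 400
```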
With that, the entire program is complete. After running it, you will find the street photos saved in folders named after their titles.
Finally, the code for this section is available at: https://github.com/python3webspider/jie.