Sesame HTTP: Analyzing Ajax Requests to Crawl Street Photos from Toutiao.com



In this section, we take toutiao.com as an example of capturing webpage data by analyzing Ajax requests. The target of the capture is the street-photo search on Toutiao. After the capture is complete, each group of images is downloaded and saved to a local folder.

1. Preparations

Before starting this section, make sure that you have installed the requests library.

2. Practical drills

First, implement a method get_page() to load the result of a single Ajax request. The only parameter that changes between requests is offset, so we pass it in as an argument. The implementation is as follows:

```python
import requests
from urllib.parse import urlencode


def get_page(offset):
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',  # "street photos"
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
    }
    url = 'http://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None
```

Here we use the urlencode() method to construct the GET parameters of the request, and then request the link with requests. If the returned status code is 200, we call the response's json() method to convert the result to JSON format and return it.
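As a quick check of the first step, urlencode() turns the parameter dict into a query string that can be appended after the `?` (the parameter values here are shortened for illustration):

```python
from urllib.parse import urlencode

# A trimmed-down params dict, just to show how the query string is built.
params = {'offset': 0, 'format': 'json', 'count': '20'}
url = 'http://www.toutiao.com/search_content/?' + urlencode(params)
print(url)
# http://www.toutiao.com/search_content/?offset=0&format=json&count=20
```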

Next, we implement a parsing method that extracts each image link from the image_detail field and returns the image link together with the image's title. A generator is a good fit here. The implementation code is as follows:

```python
def get_images(json):
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_detail')
            for image in images:
                yield {
                    'image': image.get('url'),
                    'title': title
                }
```
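To see the generator in action without hitting the real API, we can feed it a hand-made JSON structure (the titles and URLs below are invented for illustration; real responses carry many more fields):

```python
def get_images(json):
    # Same generator as above, repeated here so the example is self-contained.
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_detail')
            for image in images:
                yield {'image': image.get('url'), 'title': title}


# A fabricated response in the shape the parser expects.
sample = {
    'data': [
        {
            'title': 'Street Photo Set',
            'image_detail': [
                {'url': 'http://example.com/1.jpg'},
                {'url': 'http://example.com/2.jpg'},
            ],
        }
    ]
}

for entry in get_images(sample):
    print(entry)
# {'image': 'http://example.com/1.jpg', 'title': 'Street Photo Set'}
# {'image': 'http://example.com/2.jpg', 'title': 'Street Photo Set'}
```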

Next, we implement a method save_image() to save the images, where item is one of the dictionaries returned by the get_images() method above. The method first creates a folder named after the item's title, then requests the image link, obtains the image's binary data, and writes it to a file in binary mode. The MD5 hash of the image content is used as the file name, which removes duplicate images. The related code is as follows:

```python
import os
from hashlib import md5


def save_image(item):
    if not os.path.exists(item.get('title')):
        os.mkdir(item.get('title'))
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            file_path = '{0}/{1}.{2}'.format(
                item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image')
```
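The MD5-based deduplication works because identical image bytes always hash to the same hex digest, so a repeated download of the same picture maps to a file path that already exists and is skipped. A minimal standalone sketch (the byte strings are stand-ins for real image data):

```python
from hashlib import md5

data_a = b'fake image bytes'
data_b = b'fake image bytes'       # identical content
data_c = b'different image bytes'  # different content

# Identical content -> identical digest -> same file name, so it is skipped.
print(md5(data_a).hexdigest() == md5(data_b).hexdigest())  # True
print(md5(data_a).hexdigest() == md5(data_c).hexdigest())  # False
```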

Finally, we only need to construct the offset values, traverse them, extract the image links, and download the images:

```python
from multiprocessing.pool import Pool


def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)


GROUP_START = 1
GROUP_END = 20

if __name__ == '__main__':
    pool = Pool()
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()
```

GROUP_START and GROUP_END define the first and last page groups to fetch, and a multiprocessing pool's map() method is used to download the pages in parallel across processes.
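With GROUP_START = 1 and GROUP_END = 20, the list comprehension produces the offsets 20, 40, ..., 400, one per Ajax page, which pool.map() then distributes across worker processes. A quick check of the offset list alone:

```python
GROUP_START = 1
GROUP_END = 20

# The same expression used in main(): one offset per page of 20 results.
groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
print(groups[0], groups[-1], len(groups))  # 20 400 20
```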

With that, the entire program is complete. After running it, you will find the street photos saved into folders.

Finally, the code for this section is available at: https://github.com/python3webspider/jie.
