Sesame HTTP: Analyzing Ajax Requests to Crawl Toutiao Street Photos
In this section, we take Toutiao (toutiao.com) as an example and capture web data by analyzing Ajax requests. The target is Toutiao's street photo search results. After the capture completes, each group of images is downloaded and saved to a local folder.
1. Preparations
Before starting this section, make sure that you have installed the requests library.
2. Practical drills
First, implement a method `get_page()` that loads the result of a single Ajax request. The only parameter that changes between requests is `offset`, so we pass it as an argument. The implementation is as follows:
```python
import requests
from urllib.parse import urlencode

def get_page(offset):
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',  # "street snap"
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
    }
    url = 'http://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None
```
Here we use the `urlencode()` method to construct the GET parameters of the request, then fetch the resulting URL with requests. If the returned status code is 200, we call the `json()` method of `response` to parse the result as JSON and return it.
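To see what `urlencode()` actually produces, here is a small standalone illustration; the parameter values are trimmed down from the ones above purely for readability:

```python
from urllib.parse import urlencode

# urlencode() turns a dict into a percent-encoded query string,
# preserving the dict's insertion order (Python 3.7+).
params = {'offset': 0, 'format': 'json', 'count': '20'}
query = urlencode(params)
print(query)  # offset=0&format=json&count=20

url = 'http://www.toutiao.com/search_content/?' + query
```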
Next, we implement a parsing method: it extracts each image link from the `image_detail` field and returns the image link together with the image's title. A generator is a natural fit here. The implementation code is as follows:
```python
def get_images(json):
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_detail')
            for image in images:
                yield {
                    'image': image.get('url'),
                    'title': title
                }
```
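To make the generator's behavior concrete, here is a self-contained check that feeds `get_images()` (repeated below so the snippet runs on its own) a hand-built dict mimicking the shape of the Ajax response; the title and URLs are invented for illustration:

```python
def get_images(json):
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_detail')
            for image in images:
                yield {'image': image.get('url'), 'title': title}

# A fake response: one result group with two images.
sample = {
    'data': [
        {
            'title': 'Street Snap 001',
            'image_detail': [
                {'url': 'http://example.com/a.jpg'},
                {'url': 'http://example.com/b.jpg'},
            ],
        }
    ]
}

for entry in get_images(sample):
    print(entry)
# {'image': 'http://example.com/a.jpg', 'title': 'Street Snap 001'}
# {'image': 'http://example.com/b.jpg', 'title': 'Street Snap 001'}
```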
Next, we implement a method `save_image()` for saving images, where `item` is one of the dictionaries returned by `get_images()`. The method creates a folder from the `title` of `item`, requests the image link to obtain the image's binary data, and writes it to a file in binary mode. The MD5 hash of the image content is used as the file name, which deduplicates identical images. The related code is as follows:
```python
import os
from hashlib import md5

def save_image(item):
    if not os.path.exists(item.get('title')):
        os.mkdir(item.get('title'))
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            file_path = '{0}/{1}.{2}'.format(
                item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image')
```
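Why does hashing the content deduplicate? Identical bytes always produce the same MD5 digest, so a re-downloaded copy of an image maps to the same file path and is skipped. A quick offline illustration (the byte strings are made up):

```python
from hashlib import md5

img_a = b'\x89PNG fake image bytes'
img_b = b'\x89PNG fake image bytes'   # same content, e.g. a re-download
img_c = b'\x89PNG different bytes'

# Same bytes -> same digest -> same file name -> skipped as a duplicate.
print(md5(img_a).hexdigest() == md5(img_b).hexdigest())  # True
# Different bytes -> different digest -> saved as a new file.
print(md5(img_a).hexdigest() == md5(img_c).hexdigest())  # False
```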
Finally, we only need to construct an array of `offset` values, traverse the offsets, and extract and download the image links:
```python
from multiprocessing.pool import Pool

def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)

GROUP_START = 1
GROUP_END = 20

if __name__ == '__main__':
    pool = Pool()
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool.map(main, groups)
    pool.close()
    pool.join()
```
Here `GROUP_START` and `GROUP_END` define the start and end pages. A `Pool` from `multiprocessing` is created, and its `map()` method spreads the calls to `main()` across a pool of worker processes, downloading the pages in parallel.
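As a sanity check on what `map()` receives, the offset list for the bounds above can be printed on its own, with no network access needed:

```python
GROUP_START = 1
GROUP_END = 20

# Each page of results holds 20 items, so offsets step by 20.
groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
print(groups[:3], '...', groups[-1])  # [20, 40, 60] ... 400
```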
With that, the entire program is complete. After running it, you will find the street photos saved in folders named after their titles.
Finally, the code for this section is available at: https://github.com/python3webspider/jie.