Crawling app photos with Python

Source: Internet
Author: User
Tags: urlencode

First, install the Douyu app (installing it is optional; the API URL is given below anyway).

Capture the app's traffic with a packet sniffer, find the JSON response, and note its request address.

Testing shows that changing the value of the offset parameter corresponds to paging through the app.
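The paging behavior can be sketched by generating the request URL for successive offsets. This is a minimal sketch, assuming each page holds `limit` entries and the offset advances by that amount; `page_url` is a hypothetical helper, not part of the Douyu interface:

```python
# Hypothetical helper: build the request URL for a given page number,
# assuming limit=20 entries per page and offset advancing in steps of 20.
BASE = ('http://capi.douyucdn.cn/api/v1/getVerticalRoom'
        '?aid=ios&client_sys=ios&limit={limit}&offset={offset}')

def page_url(page, limit=20):
    return BASE.format(limit=limit, offset=page * limit)

print(page_url(0))  # first page: offset=0
print(page_url(1))  # second page: offset=20
```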

Requesting this URL returns a large dictionary with two keys: error and data. data is an array of length 20, and each element is itself a dictionary. Inside each of those dictionaries is the key we are after: vertical_src.
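The shape of that response can be illustrated with a made-up payload (the two entries below are fabricated for the example; only the key names error, data, nickname, and vertical_src come from the description above):

```python
import json

# A made-up payload mirroring the structure described above:
# top level has "error" and "data"; each entry in "data" carries
# the streamer's "nickname" and the photo URL "vertical_src".
sample = json.loads('''
{
  "error": 0,
  "data": [
    {"nickname": "host1", "vertical_src": "http://example.com/1.jpg"},
    {"nickname": "host2", "vertical_src": "http://example.com/2.jpg"}
  ]
}
''')

if sample["error"] == 0:
    urls = [room["vertical_src"] for room in sample["data"]]
    print(urls)
```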

It's our goal!

```python
import json
import urllib.parse
import urllib.request

# POST parameters sent with the request
data_info = {}
data_info['type'] = 'AUTO'
data_info['doctype'] = 'json'
data_info['xmlVersion'] = '1.6'
data_info['ue'] = 'UTF-8'
data_info['typoResult'] = 'true'

url = 'http://capi.douyucdn.cn/api/v1/getVerticalRoom?aid=ios&client_sys=ios&limit=20&offset=20'
data_info = urllib.parse.urlencode(data_info).encode('utf-8')
print(data_info)

# Spoof the Referer and User-Agent so the request looks like the iOS client
requ = urllib.request.Request(url, data_info)
requ.add_header('Referer', 'http://capi.douyucdn.cn')
requ.add_header('User-Agent', 'DYZB/2.271 (iPhone; iOS 9.3.2; Scale/3.00)')
response = urllib.request.urlopen(requ)
print(response)
html = response.read().decode('utf-8')
```

This just-over-20 lines of code returns the JSON data. Then, by parsing the JSON, the photo URL of each streamer can be extracted.

Next, download the pictures on this page.

```python
import json
import urllib.parse
import urllib.request

data_info = {}
data_info['type'] = 'AUTO'
data_info['doctype'] = 'json'
data_info['xmlVersion'] = '1.6'
data_info['ue'] = 'UTF-8'
data_info['typoResult'] = 'true'

url = 'http://capi.douyucdn.cn/api/v1/getVerticalRoom?aid=ios&client_sys=ios&limit=20&offset=20'
data_info = urllib.parse.urlencode(data_info).encode('utf-8')
print(data_info)

requ = urllib.request.Request(url, data_info)
requ.add_header('Referer', 'http://capi.douyucdn.cn')
requ.add_header('User-Agent', 'DYZB/2.271 (iPhone; iOS 9.3.2; Scale/3.00)')
response = urllib.request.urlopen(requ)
html = response.read().decode('utf-8')

dictionary = json.loads(html)
data_arr = dictionary["data"]
# debug: print(type(dictionary)); print(type(dictionary["data"]))

for i in range(0, 20):  # the original range(0, 19) skipped the last entry
    name = data_arr[i]["nickname"]
    img_url = data_arr[i]["vertical_src"]
    respon_tem = urllib.request.urlopen(img_url)
    anchor_img = respon_tem.read()
    with open('../photos/' + name + '.jpg', 'wb') as f:
        f.write(anchor_img)
```

Then modify it so that it can page through the results.

```python
import json
import urllib.parse
import urllib.request

data_info = {}
data_info['type'] = 'AUTO'
data_info['doctype'] = 'json'
data_info['xmlVersion'] = '1.6'
data_info['ue'] = 'UTF-8'
data_info['typoResult'] = 'true'
data_info = urllib.parse.urlencode(data_info).encode('utf-8')

# Page through the interface: limit=20 per page, so advance offset by 20.
# (The original stepped the offset by 1, which re-fetched overlapping pages.)
for x in range(0, 3900, 20):
    url = 'http://capi.douyucdn.cn/api/v1/getVerticalRoom?aid=ios&client_sys=ios&limit=20&offset=' + str(x)
    requ = urllib.request.Request(url, data_info)
    requ.add_header('Referer', 'http://capi.douyucdn.cn')
    requ.add_header('User-Agent', 'DYZB/2.271 (iPhone; iOS 9.3.2; Scale/3.00)')
    response = urllib.request.urlopen(requ)
    html = response.read().decode('utf-8')

    dictionary = json.loads(html)
    data_arr = dictionary["data"]
    for i in range(0, 20):
        name = data_arr[i]["nickname"]
        img_url = data_arr[i]["vertical_src"]
        respon_tem = urllib.request.urlopen(img_url)
        anchor_img = respon_tem.read()
        with open('../photos/' + name + '.jpg', 'wb') as f:
            f.write(anchor_img)
```

And then just wait.

It's best to throttle the crawl: sleep for a while between requests, or rotate the IP address periodically. Then everything should be fine.
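The sleep-between-requests idea can be sketched like this. `polite_sleep` is a hypothetical helper (not from the original code), and the tiny delays are only for demonstration; in a real crawl you would sleep for seconds, not milliseconds:

```python
import random
import time

# Minimal throttling sketch: sleep a randomized interval between requests
# so the crawl does not hit the server at full speed in a fixed rhythm.
def polite_sleep(base=2.0, jitter=1.0):
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

for page in range(3):
    # ... fetch and save one page here ...
    waited = polite_sleep(base=0.01, jitter=0.01)  # tiny values for demo only
    print(f'page {page} done, waited {waited:.3f}s')
```

Randomizing the delay (the jitter) makes the request pattern look less mechanical than a fixed interval.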
