Python crawler: multi-threaded Kuaishou video download



Let's get straight to it!

 

Environment: Python 2.7 + Windows 10

Tools: Fiddler, Postman, an Android emulator

 

First, open Fiddler. Fiddler is a well-known HTTP/HTTPS packet-capture tool, so I won't introduce it here.

Allow HTTPS capture (decrypt HTTPS traffic)

 

 

Configure Fiddler to allow remote connections, i.e. enable it as an HTTP proxy

 

 

 

Computer IP address: 192.168.1.110

Then make sure the phone and the computer are on the same LAN and can reach each other. Since I don't have an Android phone at hand, I use an Android emulator instead; the effect is the same.

Open the mobile browser and go to 192.168.1.110:8888, i.e. the proxy address configured above, and install the Fiddler certificate so that HTTPS traffic can be captured.

 

After the certificate is installed, manually specify the HTTP proxy in the Wi-Fi settings (long-press the network and choose Modify Network)

 

 

After saving, Fiddler can capture the app's traffic. Refresh the app and you can see many HTTP requests come in. The API endpoints are usually easy to spot, and the data type is JSON.

 

 

One of the HTTP POST requests returns data in JSON format. Expanding it shows 20 videos in total. Checking the fields confirms the information is correct, including a direct video link.
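To make the response structure concrete, here is a small Python 3 sketch of how such a feed entry can be parsed. The sample JSON is hypothetical, trimmed down to the three fields the crawler later uses ('caption', 'photo_id', 'main_mv_urls'); the regex keeps only Chinese characters so the title is safe to use as a filename.

```python
import json
import re

# Hypothetical response, reduced to the fields we care about.
sample = json.loads("""
{
  "feeds": [
    {
      "caption": "测试视频 demo\\n",
      "photo_id": "12345",
      "main_mv_urls": [{"url": "http://example.com/12345.mp4"}]
    }
  ]
}
""")

# Keep only Chinese characters so the title works as a filename.
chinese = re.compile(u"[\u4e00-\u9fa5]+")

videos = []
for feed in sample["feeds"]:
    title = feed["caption"].replace("\n", "")
    name = "".join(chinese.findall(title))
    videos.append((name, feed["photo_id"], feed["main_mv_urls"][0]["url"]))

print(videos[0][0])  # -> 测试视频
```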

 

Opening the link confirms the video plays cleanly, with no watermark.

Open Postman to test. Submitting the parameters as form-data returns an error.

 

With a raw body:

 

 

The error message is different this time, so add the headers.

 

 

 

Nice, the request successfully returned data. Trying it several times, each response was different but always contained 20 videos. In the POST parameters, page=1 always means the first page, as if you had just opened the app on the phone and pulled to refresh without scrolling down. That doesn't matter, as long as no duplicate data comes back.
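What made the request work in Postman was replaying the captured headers and form body exactly. The same request can be built in Python; below is a Python 3 sketch using urllib.request (the article's script uses the Python 2 urllib2 equivalent). The form fields shown are only a subset of the capture, since the real 'sig' parameter signs the full captured parameter set.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The two headers that made the request succeed in Postman.
headers = {
    "User-Agent": "kwai-android",
    "Content-Type": "application/x-www-form-urlencoded",
}

# A few of the captured form fields (subset, for illustration only --
# the real request must replay all captured fields so 'sig' validates).
form = {"type": 7, "page": 1, "count": 20, "client_key": "3c2cd3f3"}

body = urlencode(form).encode("utf-8")
req = Request("http://101.251.217.210/rest/n/feed/hot",
              data=body, headers=headers)

# Inspect what would go on the wire (urllib capitalizes header keys).
print(req.get_method(), req.get_header("User-agent"))
```

To actually send it, pass `req` to `urllib.request.urlopen`; here we only build and inspect the request so no network is needed.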

Here is the code:

 

# -*- coding: utf-8 -*-
# author: Corleone
import urllib2, urllib
import json, os, re, socket, time, sys
import Queue
import threading
import logging


# logging setup
logger = logging.getLogger("AppName")
formatter = logging.Formatter('%(asctime)s %(levelname)-5s: %(message)s')
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)
logger.setLevel(logging.INFO)


video_q = Queue.Queue()  # video queue


def get_video():
    url = ("http://101.251.217.210/rest/n/feed/hot?app=0&lon=121.372027"
           "&c=BOYA_BAIDU_PINZHUAN&sys=ANDROID_4.1.2&mod=HUAWEI(HUAWEI%20C8813Q)"
           "&did=login&ver=5.4&net=WIFI&country_code=cn&iuid=&appver=5.4.7.5559"
           "&max_memory=128&oc=BOYA_BAIDU_PINZHUAN&ftt=&ud=0&language=zh-cn"
           "&lat=31.319303")
    data = {
        'type': 7,
        'page': 2,
        'coldStart': 'false',
        'count': 20,
        'pv': 'false',
        'id': 5,
        'refreshTimes': 4,
        'pcursor': 1,
        'os': 'android',
        'client_key': '3c2cd3f3',
        'sig': '22769f2f5c0045381203fda-d1b5ad9b'
    }
    req = urllib2.Request(url)
    req.add_header("User-Agent", "kwai-android")
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    params = urllib.urlencode(data)
    try:
        html = urllib2.urlopen(req, params).read()
    except urllib2.URLError:
        logger.warning(u"network unstable, retrying")
        html = urllib2.urlopen(req, params).read()
    result = json.loads(html)
    reg = re.compile(u"[\u4e00-\u9fa5]+")  # match only Chinese characters
    for x in result['feeds']:
        try:
            title = x['caption'].replace("\n", "")
            name = "".join(reg.findall(title))
            video_q.put([name, x['photo_id'], x['main_mv_urls'][0]['url']])
        except KeyError:
            pass


def download(video_q):
    path = u"D:\\kuaishou"
    while True:
        data = video_q.get()
        name = data[0].replace("\n", "")
        id = data[1]
        url = data[2]
        file = os.path.join(path, name + ".mp4")
        logger.info(u"downloading: %s" % name)
        try:
            urllib.urlretrieve(url, file)
        except IOError:
            # title is not a valid filename -- fall back to the photo id
            file = os.path.join(path, u"%s.mp4" % id)
            try:
                urllib.urlretrieve(url, file)
            except (socket.error, urllib.ContentTooShortError):
                logger.warning(u"connection dropped, sleeping 2 seconds")
                time.sleep(2)
                urllib.urlretrieve(url, file)
        logger.info(u"finished: %s" % name)
        video_q.task_done()


def main():
    # usage help
    try:
        threads = int(sys.argv[1])
    except (IndexError, ValueError):
        print u"\nUsage: " + sys.argv[0] + u" [number of threads, e.g. 10]\n"
        print u"Example: " + sys.argv[0] + u" 10    (10 threads fetch about 2000 videos per run)"
        return False
    # make sure the target directory exists
    if os.path.exists(u"D:\\kuaishou") == False:
        os.makedirs(u"D:\\kuaishou")
    # fetch the feed
    logger.info(u"crawling the feed")
    for x in range(1, 100):
        logger.info(u"request %s" % x)
        get_video()
    num = video_q.qsize()
    logger.info(u"%s videos queued" % num)
    # multi-threaded download
    for y in range(threads):
        t = threading.Thread(target=download, args=(video_q,))
        t.setDaemon(True)
        t.start()

    video_q.join()
    logger.info(u"----------- all done ---------------")


main()
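The download scheduling in the script is the standard Queue + daemon-worker pattern. Here is a minimal, self-contained Python 3 sketch of just that pattern (module names `queue`/`threading` differ from Python 2's `Queue`), with a dummy "download" that records the name instead of calling urlretrieve:

```python
import queue
import threading

video_q = queue.Queue()
done = []
lock = threading.Lock()

def download(q):
    # Each worker loops forever; the main thread's q.join() is what
    # waits for completion, so the workers can be daemon threads.
    while True:
        name, url = q.get()
        with lock:
            done.append(name)   # stand-in for urlretrieve(url, file)
        q.task_done()

for i in range(5):
    video_q.put(("video%d" % i, "http://example.com/%d.mp4" % i))

for _ in range(3):              # 3 worker threads
    t = threading.Thread(target=download, args=(video_q,))
    t.daemon = True             # don't block interpreter exit
    t.start()

video_q.join()                  # returns once every task_done() has fired
print(sorted(done))
```

Because the workers are daemons, they die with the main thread; `join()` on the queue, not on the threads, is what guarantees all queued items were processed.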

 

Test

 

Multi-threaded download: by default it saves about 2000 videos to D:\kuaishou.

 

 

Conclusion: honestly, this Kuaishou crawl is a bit opportunistic, because the sig parameter in the POST body really is an encrypted signature. The request only returns data because I always replay the same captured link, where page=1 means the first page; as soon as I change page to 2, the signature check fails. Still, replaying it returns different data each time, so it achieves the effect even though I can't crack the signing algorithm. A site I crawled two days ago was the same: encrypted. Ah, my skills are limited; I can't reverse the app yet. I'll share more later.

Finally, my GitHub: https://github.com/binglansky/spider. I only just registered and hadn't committed code before; there are a few other small crawlers there, and I'll commit more when I find something interesting. Everyone is welcome to learn, exchange ideas, and have fun :) ~

 
