Python crawler: multi-threaded Kuaishou video downloader
Let's get straight into it!
Environment: Python 2.7 + Win10
Tools: Fiddler, Postman, an Android emulator
First, open Fiddler. Fiddler is a well-known HTTP/HTTPS packet-capture tool, so I won't introduce it in detail here.
Allow HTTPS traffic to be decrypted.
Configure Fiddler to allow remote connections, i.e. enable its HTTP proxy (it listens on port 8888).
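Before touching the phone, you can sanity-check that the proxy is reachable by routing one request through it from Python. This is only a quick sketch: it assumes the Fiddler proxy at 192.168.1.110:8888 configured above, and the target URL is an arbitrary example.

# -*- coding: utf-8 -*-
# Route a single request through the Fiddler proxy to confirm it is reachable.
import urllib2

proxy = urllib2.ProxyHandler({'http': 'http://192.168.1.110:8888'})  # the proxy configured above
opener = urllib2.build_opener(proxy)
# Expect a 200 here, and the request should show up in Fiddler's session list.
print opener.open('http://www.baidu.com').getcode()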
The computer's IP address: 192.168.1.110
Then make sure the phone and the computer are on the same LAN and can reach each other. Since I don't have an Android phone at hand, I'm using an Android emulator instead; the effect is the same.
Open the mobile browser and go to 192.168.1.110:8888, i.e. the proxy address configured above, and install the Fiddler certificate. Once the certificate is installed, HTTPS packets can be captured.
After installing the certificate, manually set the HTTP proxy in the phone's Wi-Fi settings ("Modify network").
After saving, Fiddler can capture the app's traffic. Refresh the app and many HTTP requests come in; the interface addresses are generally obvious, and the data type is JSON.
In one of the HTTP POST requests the returned data is JSON. Expanding it shows a total of 20 videos. Checking the fields confirms the information is correct, including a video link for each entry.
Opening a link confirms the video plays cleanly, with no watermark.
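For reference, here is roughly how those fields can be pulled out of a response body saved from Fiddler. This is only a sketch: 'response.json' is a filename I made up, and the keys (feeds, caption, main_mv_urls) are the ones the full script below relies on.

# -*- coding: utf-8 -*-
# Extract the video links from a captured response body.
import json

with open('response.json') as f:    # hypothetical file saved out of Fiddler
    result = json.load(f)

for item in result['feeds']:
    # Some feed entries lack these keys (the full script catches KeyError for the same reason).
    urls = item.get('main_mv_urls')
    if urls:
        print item.get('caption'), urls[0]['url']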
Open Postman to replay the request. Submitting the body as form-data returns an error.
Switching the body to raw gives a different error message, so add the headers.
Nice, the data comes back successfully. Trying it several times shows that each request returns different results, always 20 videos; in the POST parameters, page = 1 always gives the first page, just like staying on the phone and pulling to refresh without ever scrolling down. It doesn't matter anyway, as long as no duplicate data is returned.
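Translated to Python, the working Postman setup looks like this: a raw urlencoded body plus the two headers, rather than multipart form-data. A minimal sketch with just two of the captured parameters; the full URL query string and parameter set appear in the script below.

# -*- coding: utf-8 -*-
# Build the request the way Postman's "raw" mode does.
import urllib, urllib2

data = {'type': 7, 'count': 20}    # just two of the captured parameters, for illustration
req = urllib2.Request("http://101.251.217.210/rest/n/feed/hot", urllib.urlencode(data))
req.add_header("User-Agent", "kwai-android")
req.add_header("Content-Type", "application/x-www-form-urlencoded")
print req.get_data()    # the raw urlencoded body, e.g. "type=7&count=20"
print req.headers       # the two headers that made the difference in Postman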
Now, on to the code:
# -*- coding: utf-8 -*-
# author: Corleone
import urllib2, urllib
import json, os, re, socket, time, sys
import Queue
import threading
import logging

# log module
logger = logging.getLogger("AppName")
formatter = logging.Formatter('%(asctime)s %(levelname)-5s: %(message)s')
console_handler = logging.StreamHandler(sys.stdout)
console_handler.formatter = formatter
logger.addHandler(console_handler)
logger.setLevel(logging.INFO)

video_q = Queue.Queue()    # video queue

def get_video():
    url = "http://101.251.217.210/rest/n/feed/hot?app=0&lon=121.372027&c=BOYA_BAIDU_PINZHUAN&sys=ANDROID_4.1.2&mod=HUAWEI(HUAWEI%20C8813Q)&did=login&ver=5.4&net=WIFI&country_code=cn&iuid=&appver=5.4.7.5559&max_memory=128&oc=BOYA_BAIDU_PINZHUAN&ftt=&ud=0&language=zh-cn&lat=31.319303"
    data = {
        'type': 7,
        'page': 2,
        'coldStart': 'false',
        'count': 20,
        'pv': 'false',
        'id': 5,
        'refreshTimes': 4,
        'pcursor': 1,
        'os': 'android',
        'client_key': '3c2cd3f3',
        'sig': '22769f2f5c0045381203fda-d1b5ad9b'
    }
    req = urllib2.Request(url)
    req.add_header("User-Agent", "kwai-android")
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    params = urllib.urlencode(data)
    try:
        html = urllib2.urlopen(req, params).read()
    except urllib2.URLError:
        logger.warning(u"Unstable network, retrying")
        html = urllib2.urlopen(req, params).read()
    result = json.loads(html)
    reg = re.compile(u"[\u4e00-\u9fa5]+")    # match only Chinese characters
    for x in result['feeds']:
        try:
            title = x['caption'].replace("\n", "")
            name = "".join(reg.findall(title))
            video_q.put([name, x['photo_id'], x['main_mv_urls'][0]['url']])
        except KeyError:
            pass

def download(video_q):
    path = u"D:\\kuaishou"
    while True:
        data = video_q.get()
        name = data[0].replace("\n", "")
        id = data[1]
        url = data[2]
        file = os.path.join(path, name + ".mp4")
        logger.info(u"Downloading: %s" % name)
        try:
            urllib.urlretrieve(url, file)
        except IOError:
            # the title makes an invalid filename, so fall back to the photo id
            file = os.path.join(path, u"video_%s.mp4" % id)
            try:
                urllib.urlretrieve(url, file)
            except (socket.error, urllib.ContentTooShortError):
                logger.warning(u"Request dropped, sleeping 2 seconds")
                time.sleep(2)
                urllib.urlretrieve(url, file)

        logger.info(u"Download finished: %s" % name)
        video_q.task_done()

def main():
    # help
    try:
        threads = int(sys.argv[1])
    except (IndexError, ValueError):
        print u"\nUsage: " + sys.argv[0] + u" [number of threads, e.g. 10]\n"
        print u"Example: " + sys.argv[0] + " 10" + u"  -- crawl with 10 threads, about 2000 videos per run (separate with a space)"
        return False
    # make sure the directory exists
    if not os.path.exists(u"D:\\kuaishou"):
        os.makedirs(u"D:\\kuaishou")
    # fetch the feed
    logger.info(u"Fetching the feed")
    for x in range(1, 100):
        logger.info(u"Request %s" % x)
        get_video()
    num = video_q.qsize()
    logger.info(u"%s videos queued" % num)
    # multi-threaded download
    for y in range(threads):
        t = threading.Thread(target=download, args=(video_q,))
        t.setDaemon(True)
        t.start()

    video_q.join()
    logger.info(u"----------- all done ---------------")

main()
Test
Multi-threaded download: by default it saves about 2000 videos to D:\kuaishou.
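For reference, the script takes the thread count as its only command-line argument (per the usage message in main()). Assuming it is saved as kuaishou.py, a name I've chosen here, a run looks like:

python kuaishou.py 10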
Conclusion: honestly, this Kuaishou crawl is a bit opportunistic. The sig POST parameter really is an encrypted signature; the request only returns data because I replay the exact captured request, which always behaves like the first page. Change the page parameter and the signature check fails. Still, each replay returns different data, so it achieves the effect even though I couldn't crack the signing algorithm. The app I crawled a couple of days ago was the same story: encrypted. Ah, my skills are limited and I can't reverse-engineer the app yet... maybe something to share later.
Finally, my GitHub address: https://github.com/binglansky/spider . The account is newly registered and I hadn't committed code before; there are a few other small crawlers in there, and I'll commit more when I come across something interesting. Everyone is welcome to learn, exchange ideas, and have fun :) ~