Python crawler: multi-threaded Kuaishou video download



Let's get straight to it!

 

Environment: Python 2.7 + Windows 10

Tools: Fiddler, Postman, an Android emulator

 

First, open Fiddler. Fiddler is a well-known HTTP/HTTPS packet-capture tool, so I won't introduce it here.

Allow HTTPS capture (decrypt HTTPS traffic)

 

 

Configure Fiddler to allow remote connections, i.e. enable it as an HTTP proxy

 

 

 

Computer IP address: 192.168.1.110

Then make sure the phone and the computer are on the same LAN and can reach each other. Since I don't have an Android phone at hand, I use an Android emulator instead; the effect is the same.

Open the mobile browser and go to 192.168.1.110:8888, i.e. the proxy address configured above, and install the Fiddler certificate so that HTTPS traffic can be captured.

 

After the certificate is installed, manually specify the HTTP proxy in the Wi-Fi settings (long-press the network and choose Modify Network)

 

 

After saving, Fiddler can capture the app's traffic. Refresh the app and you can see many HTTP requests come in. The API endpoints are usually easy to spot, and the data type is JSON.

 

 

One of the HTTP POST requests returns data in JSON format. Expanding it shows 20 videos in total. Checking the fields confirms the information is correct, including a direct video link.
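To make the response structure concrete, here is a small Python 3 sketch of how such a feed entry can be parsed. The sample JSON is hypothetical, trimmed down to the three fields the crawler later uses ('caption', 'photo_id', 'main_mv_urls'); the regex keeps only Chinese characters so the title is safe to use as a filename.

```python
import json
import re

# Hypothetical response, reduced to the fields we care about.
sample = json.loads("""
{
  "feeds": [
    {
      "caption": "测试视频 demo\\n",
      "photo_id": "12345",
      "main_mv_urls": [{"url": "http://example.com/12345.mp4"}]
    }
  ]
}
""")

# Keep only Chinese characters so the title works as a filename.
chinese = re.compile(u"[\u4e00-\u9fa5]+")

videos = []
for feed in sample["feeds"]:
    title = feed["caption"].replace("\n", "")
    name = "".join(chinese.findall(title))
    videos.append((name, feed["photo_id"], feed["main_mv_urls"][0]["url"]))

print(videos[0][0])  # -> 测试视频
```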

 

Opening the link confirms the video plays cleanly, with no watermark.

Open Postman to test. Submitting the parameters as form-data returns an error.

 

With a raw body:

 

 

The error message is different this time, so add the headers.

 

 

 

Nice, the request successfully returned data. Trying it several times, each response was different but always contained 20 videos. In the POST parameters, page=1 always means the first page, as if you had just opened the app on the phone and pulled to refresh without scrolling down. That doesn't matter, as long as no duplicate data comes back.
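What made the request work in Postman was replaying the captured headers and form body exactly. The same request can be built in Python; below is a Python 3 sketch using urllib.request (the article's script uses the Python 2 urllib2 equivalent). The form fields shown are only a subset of the capture, since the real 'sig' parameter signs the full captured parameter set.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The two headers that made the request succeed in Postman.
headers = {
    "User-Agent": "kwai-android",
    "Content-Type": "application/x-www-form-urlencoded",
}

# A few of the captured form fields (subset, for illustration only --
# the real request must replay all captured fields so 'sig' validates).
form = {"type": 7, "page": 1, "count": 20, "client_key": "3c2cd3f3"}

body = urlencode(form).encode("utf-8")
req = Request("http://101.251.217.210/rest/n/feed/hot",
              data=body, headers=headers)

# Inspect what would go on the wire (urllib capitalizes header keys).
print(req.get_method(), req.get_header("User-agent"))
```

To actually send it, pass `req` to `urllib.request.urlopen`; here we only build and inspect the request so no network is needed.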

Here is the code:

 

# -*- coding: utf-8 -*-
# author: Corleone
import urllib2, urllib
import json, os, re, socket, time, sys
import Queue
import threading
import logging


# logging setup
logger = logging.getLogger("AppName")
formatter = logging.Formatter('%(asctime)s %(levelname)-5s: %(message)s')
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)
logger.setLevel(logging.INFO)


video_q = Queue.Queue()  # video queue


def get_video():
    url = ("http://101.251.217.210/rest/n/feed/hot?app=0&lon=121.372027"
           "&c=BOYA_BAIDU_PINZHUAN&sys=ANDROID_4.1.2&mod=HUAWEI(HUAWEI%20C8813Q)"
           "&did=login&ver=5.4&net=WIFI&country_code=cn&iuid=&appver=5.4.7.5559"
           "&max_memory=128&oc=BOYA_BAIDU_PINZHUAN&ftt=&ud=0&language=zh-cn"
           "&lat=31.319303")
    data = {
        'type': 7,
        'page': 2,
        'coldStart': 'false',
        'count': 20,
        'pv': 'false',
        'id': 5,
        'refreshTimes': 4,
        'pcursor': 1,
        'os': 'android',
        'client_key': '3c2cd3f3',
        'sig': '22769f2f5c0045381203fda-d1b5ad9b'
    }
    req = urllib2.Request(url)
    req.add_header("User-Agent", "kwai-android")
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    params = urllib.urlencode(data)
    try:
        html = urllib2.urlopen(req, params).read()
    except urllib2.URLError:
        logger.warning(u"network unstable, retrying")
        html = urllib2.urlopen(req, params).read()
    result = json.loads(html)
    reg = re.compile(u"[\u4e00-\u9fa5]+")  # match only Chinese characters
    for x in result['feeds']:
        try:
            title = x['caption'].replace("\n", "")
            name = "".join(reg.findall(title))
            video_q.put([name, x['photo_id'], x['main_mv_urls'][0]['url']])
        except KeyError:
            pass


def download(video_q):
    path = u"D:\\kuaishou"
    while True:
        data = video_q.get()
        name = data[0].replace("\n", "")
        id = data[1]
        url = data[2]
        file = os.path.join(path, name + ".mp4")
        logger.info(u"downloading: %s" % name)
        try:
            urllib.urlretrieve(url, file)
        except IOError:
            # title is not a valid filename -- fall back to the photo id
            file = os.path.join(path, u"%s.mp4" % id)
            try:
                urllib.urlretrieve(url, file)
            except (socket.error, urllib.ContentTooShortError):
                logger.warning(u"connection dropped, sleeping 2 seconds")
                time.sleep(2)
                urllib.urlretrieve(url, file)
        logger.info(u"finished: %s" % name)
        video_q.task_done()


def main():
    # usage help
    try:
        threads = int(sys.argv[1])
    except (IndexError, ValueError):
        print u"\nUsage: " + sys.argv[0] + u" [number of threads, e.g. 10]\n"
        print u"Example: " + sys.argv[0] + u" 10    (10 threads fetch about 2000 videos per run)"
        return False
    # make sure the target directory exists
    if os.path.exists(u"D:\\kuaishou") == False:
        os.makedirs(u"D:\\kuaishou")
    # fetch the feed
    logger.info(u"crawling the feed")
    for x in range(1, 100):
        logger.info(u"request %s" % x)
        get_video()
    num = video_q.qsize()
    logger.info(u"%s videos queued" % num)
    # multi-threaded download
    for y in range(threads):
        t = threading.Thread(target=download, args=(video_q,))
        t.setDaemon(True)
        t.start()

    video_q.join()
    logger.info(u"----------- all done ---------------")


main()
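The download scheduling in the script is the standard Queue + daemon-worker pattern. Here is a minimal, self-contained Python 3 sketch of just that pattern (module names `queue`/`threading` differ from Python 2's `Queue`), with a dummy "download" that records the name instead of calling urlretrieve:

```python
import queue
import threading

video_q = queue.Queue()
done = []
lock = threading.Lock()

def download(q):
    # Each worker loops forever; the main thread's q.join() is what
    # waits for completion, so the workers can be daemon threads.
    while True:
        name, url = q.get()
        with lock:
            done.append(name)   # stand-in for urlretrieve(url, file)
        q.task_done()

for i in range(5):
    video_q.put(("video%d" % i, "http://example.com/%d.mp4" % i))

for _ in range(3):              # 3 worker threads
    t = threading.Thread(target=download, args=(video_q,))
    t.daemon = True             # don't block interpreter exit
    t.start()

video_q.join()                  # returns once every task_done() has fired
print(sorted(done))
```

Because the workers are daemons, they die with the main thread; `join()` on the queue, not on the threads, is what guarantees all queued items were processed.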

 

Test

 

Multi-threaded download: by default it saves about 2000 videos to D:\kuaishou.

 

 

Conclusion: honestly, this Kuaishou crawl is a bit opportunistic, because the sig parameter in the POST body really is an encrypted signature. The request only returns data because I always replay the same captured link, where page=1 means the first page; as soon as I change page to 2, the signature check fails. Still, replaying it returns different data each time, so it achieves the effect even though I can't crack the signing algorithm. A site I crawled two days ago was the same: encrypted. Ah, my skills are limited; I can't reverse the app yet. I'll share more later.

Finally, my GitHub: https://github.com/binglansky/spider. I only just registered and hadn't committed code before; there are a few other small crawlers there, and I'll commit more when I find something interesting. Everyone is welcome to learn, exchange ideas, and have fun :) ~

 
