Video capture principle: get the IDs of all courses in the target category, then follow each course's sub-links to work out how many lessons it contains, and finally loop over the lessons to extract the links to the videos.
Python library that needs to be installed: Requests.
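If it is not already present, Requests can be installed with pip:

pip install requests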
The HTML parsing part is based on code found on the web.
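For reference, that pattern boils down to subclassing html.parser.HTMLParser and collecting one attribute from one tag. A minimal standalone sketch (the example.com URL is made up here, not from the original script):

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the value of one attribute from every occurrence of one tag."""
    def __init__(self, tag, attr):
        HTMLParser.__init__(self)
        self.tag = tag
        self.attr = attr
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            for name, value in attrs:
                if name == self.attr:
                    self.links.append(value)

parser = LinkCollector("a", "href")
parser.feed('<a href="http://example.com/course/1.html">lesson 1</a>')
parser.close()
print(parser.links)   # ['http://example.com/course/1.html']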
It could be optimized further, but I was lazy!
# coding: utf-8
import os
import sys
import urllib.request, io
import requests
from html.parser import HTMLParser

# Global variables
id_list = set()   # set of course IDs already seen
id_dict = {}      # maps course ID -> number of lessons in that course
cookies = {}      # cookies of the logged-in account


# HTML parsing class: collects the value of one attribute (attr)
# from every occurrence of one tag (key)
class MyHTMLParser(HTMLParser):
    def __init__(self, key, attr):
        HTMLParser.__init__(self)
        self.links = []
        self.keys = key
        self.attr = attr

    def handle_starttag(self, tag, attrs):
        # print("Encountered the beginning of a %s tag" % tag)
        # if tag == "source":
        if tag == self.keys:
            if len(attrs) == 0:
                pass
            else:
                for (variable, value) in attrs:
                    # if variable == "src":
                    if variable == self.attr:
                        self.links.append(value)


# Parse the cookie string into the global cookies dictionary
def getcookies(cookies_str):
    global cookies
    for line in cookies_str.split(';'):
        # maxsplit=1 so each "name=value" pair is split into exactly 2 parts
        name, value = line.strip().split('=', 1)
        cookies[name] = value


# Fetch a lesson page, find its <source src="..."> link and download the .mp4
def gethtml(url, key, value):
    global cookies
    r = requests.get(url, cookies=cookies)
    content = r.content.decode('utf-8')
    hp = MyHTMLParser("source", "src")
    hp.feed(content)
    hp.close()
    print(hp.links)
    for link in hp.links:
        link_str = str(link)
        if link_str.find(".mp4") >= 0:
            downloadFile(link, key, value)
        else:
            print("no corresponding video found")


# Get the number of lessons in one course
def getcoursenum(url):
    global cookies
    url_list = set()
    r = requests.get(url, cookies=cookies)
    content = r.content.decode('utf-8')
    hp = MyHTMLParser("a", "href")
    hp.feed(content)
    hp.close()
    for link in hp.links:
        link_str = str(link)
        if link_str.find("http://www.jikexueyuan.com/course/") >= 0 and link_str.find(".html?ss=1") >= 0:
            url_list.add(link_str)
    return len(url_list)


# Get all course IDs from the catalog page
def getidlist(root):
    global cookies
    r = requests.get(root, cookies=cookies)
    content = r.content.decode('utf-8')
    hp = MyHTMLParser("a", "href")
    hp.feed(content)
    hp.close()
    # print(hp.links)

    # refer to the global id_list and id_dict defined at the top
    global id_list
    global id_dict

    for link in hp.links:
        link_str = str(link)
        if link_str.find("http://www.jikexueyuan.com/course/") >= 0 and link_str.find(".html") >= 0:
            # print(link)
            # strip the URL prefix and the ".html" suffix; this relies on the
            # course ID itself being numeric, so lstrip/rstrip cannot eat it
            c_id = link_str.lstrip("http://www.jikexueyuan.com/course/").rstrip(".html")
            if c_id not in id_list:
                id_dict[c_id] = getcoursenum(link_str)
                print(c_id, id_dict[c_id])
                id_list.add(c_id)
    print(id_dict)


def downloadFile(url, key, value):
    # url = 'http://cv4.jikexueyuan.com/10de45bbf83e450ff5e11ff4599d7166/201603202253/cocos2d-x/course_712/01/Video/c712b_01_h264_sd_960_540.mp4'
    r = requests.get(url)
    file_name = str(key) + "_" + str(value) + ".mp4"
    with open(file_name, "wb") as code:
        code.write(r.content)


if __name__ == "__main__":
    count = 0
    # Cookies of a logged-in account that is allowed to watch the videos;
    # they can be copied from Google Chrome's developer tools
    cookiesstr = "can be obtained via Google Chrome"
    getcookies(cookiesstr)

    root = "http://ke.jikexueyuan.com/xilie/331?huodong=shequn_0307"
    getidlist(root)

    head = "http://www.jikexueyuan.com/course/"

    for key in id_dict:
        if id_dict[key] <= 0:
            print(id_dict[key], "No Data")
            break
        for i in range(1, id_dict[key] + 1):
            url = head + key + "_" + str(i) + ".html?ss=1"
            print("Download:")
            print(url)
            count += 1
            gethtml(url, key, i)
    print("Total Videos:")
    print(count)
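For illustration only, the cookie string copied from Chrome has the form "name1=value1; name2=value2"; the names and values below are invented, a real string comes from your own logged-in session:

cookiesstr = "uid=123456; sessionid=abcdef0123456789; remember=1"   # hypothetical values
getcookies(cookiesstr)
print(cookies)   # {'uid': '123456', 'sessionid': 'abcdef0123456789', 'remember': '1'}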
Points that could be optimized: the result is awkward to browse because the script does not fetch the name of each video. You could scrape each video's name and create a folder per category, which would make the downloaded files much easier to find and watch. A rough sketch of that idea follows below.
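A minimal sketch of that idea, assuming the category name and video title have already been scraped from the page (the function downloadFile_named is made up here, not part of the original script):

import os
import re
import requests

def downloadFile_named(url, category, title, index):
    # One folder per category, file named after the lesson title.
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', title)   # drop characters invalid in file names
    os.makedirs(category, exist_ok=True)
    file_name = os.path.join(category, "%02d_%s.mp4" % (index, safe_title))
    r = requests.get(url)
    with open(file_name, "wb") as f:
        f.write(r.content)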
Cookies really can be reused directly. This means that if you intercept a user's browser login cookies, you can access the site as that user and read their information. That is essentially how attackers who steal cookies hijack accounts and harvest user data, isn't it? Interesting.
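A minimal sketch of why that works with Requests (the cookie value here is invented; any request carrying a valid session cookie is treated as the logged-in user):

import requests

stolen_cookies = {"sessionid": "abcdef0123456789"}   # invented value for illustration
r = requests.get("http://www.jikexueyuan.com/", cookies=stolen_cookies)
print(r.status_code)   # the server answers as if the cookie's owner made the request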
Python crawling Jikexueyuan course videos