Crawling an online academy's videos with Python

Source: Internet
Author: User


How the videos are captured: get all the knowledge-category IDs, then get the courses under each category, then analyze each course link to find out how many lessons the course has, and finally loop over the lessons to get the link to each video.
Required third-party Python library: Requests (pip install requests).
The HTML parsing code is adapted from code found on the web.
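The last link in that chain, going from a course link to the list of lesson-page URLs, can be sketched roughly as follows. The URL pattern is the one used by the script in this article; the helper names course_id_from_link and lesson_urls and the sample ID 712 are made up for illustration.

```python
COURSE_PREFIX = "http://www.jikexueyuan.com/course/"


def course_id_from_link(link):
    """Return the course ID from a course link, or None if it is not one."""
    end = link.find(".html")
    if not link.startswith(COURSE_PREFIX) or end < 0:
        return None
    # Everything between the prefix and ".html" is treated as the ID.
    return link[len(COURSE_PREFIX):end]


def lesson_urls(course_id, lesson_count):
    """Build one lesson-page URL per lesson, numbered from 1."""
    return [
        f"{COURSE_PREFIX}{course_id}_{i}.html?ss=1"
        for i in range(1, lesson_count + 1)
    ]


if __name__ == "__main__":
    cid = course_id_from_link(COURSE_PREFIX + "712.html")
    for url in lesson_urls(cid, 3):
        print(url)
```

Keeping the URL arithmetic in small pure functions like these makes it possible to check the pattern without touching the network.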

It could be optimized further, but I was too lazy to do it.



# coding: utf-8
import requests
from html.parser import HTMLParser

# Global variables
id_list = set()   # IDs of the courses already seen
id_dict = {}      # course ID -> number of lesson videos in that course
cookies = {}      # cookies sent with every request

COURSE_PREFIX = "http://www.jikexueyuan.com/course/"


# HTML parsing class: collects the value of attribute `attr` on every `key` tag
class MyHTMLParser(HTMLParser):
    def __init__(self, key, attr):
        HTMLParser.__init__(self)
        self.links = []
        self.keys = key
        self.attr = attr

    def handle_starttag(self, tag, attrs):
        # e.g. tag == "source" and variable == "src"
        if tag == self.keys:
            for (variable, value) in attrs:
                if variable == self.attr:
                    self.links.append(value)


# Parse a cookie string into the global cookies dictionary
def getcookies(cookies_str):
    global cookies
    for line in cookies_str.split(';'):
        if '=' not in line:
            continue  # ignore entries that are not name=value pairs
        # maxsplit=1 splits each entry into exactly two parts: name and value
        name, value = line.strip().split('=', 1)
        cookies[name] = value


# Fetch a lesson page and download every .mp4 referenced by a <source> tag
def gethtml(url, key, value):
    global cookies
    r = requests.get(url, cookies=cookies)
    content = r.content.decode('UTF-8')
    hp = MyHTMLParser("source", "src")
    hp.feed(content)
    hp.close()
    print(hp.links)
    for link in hp.links:
        link_str = str(link)
        if link_str.find(".mp4") >= 0:
            downloadFile(link, key, value)
        else:
            print("no corresponding video found")


# Get the number of lessons in a course
def getcoursenum(url):
    global cookies
    url_list = set()
    r = requests.get(url, cookies=cookies)
    content = r.content.decode('UTF-8')
    hp = MyHTMLParser("a", "href")
    hp.feed(content)
    hp.close()
    for link in hp.links:
        link_str = str(link)
        if link_str.find(COURSE_PREFIX) >= 0 and link_str.find(".html?ss=1") >= 0:
            url_list.add(link_str)
    return len(url_list)


# Get all course IDs from the catalog page
def getidlist(root):
    global cookies
    r = requests.get(root, cookies=cookies)
    content = r.content.decode('UTF-8')
    hp = MyHTMLParser("a", "href")
    hp.feed(content)
    hp.close()
    # The declarations refer to the global id_list and id_dict defined at the top
    global id_list
    global id_dict

    for link in hp.links:
        link_str = str(link)
        if link_str.startswith(COURSE_PREFIX) and link_str.find(".html") >= 0:
            # Slice the ID out between the prefix and ".html"
            # (str.lstrip/rstrip remove character sets, not substrings)
            c_id = link_str[len(COURSE_PREFIX):link_str.find(".html")]
            if c_id not in id_list:
                id_dict[c_id] = getcoursenum(link_str)
                print(c_id, id_dict[c_id])
                id_list.add(c_id)
    print(id_dict)


# Save one video as <course ID>_<lesson number>.mp4 in the current directory
def downloadFile(url, key, value):
    # url = 'http://cv4.jikexueyuan.com/10de45bbf83e450ff5e11ff4599d7166/201603202253/cocos2d-x/course_712/01/video/c712b_01_h264_sd_960_540.mp4'
    r = requests.get(url)
    file_name = str(key) + "_" + str(value) + ".mp4"
    with open(file_name, "wb") as code:
        code.write(r.content)


if __name__ == "__main__":
    count = 0
    # Parse cookies: the cookies of a logged-in account are required;
    # use the free period to download the videos you need
    cookiesstr = "can be obtained via Google Chrome"  # placeholder, paste the real cookie string here
    getcookies(cookiesstr)

    root = "http://ke.jikexueyuan.com/xilie/331?huodong=shequn_0307"
    getidlist(root)

    head = "http://www.jikexueyuan.com/course/"

    for key in id_dict:
        if id_dict[key] <= 0:
            print(id_dict[key], "No Data")
            continue  # skip courses with no lessons and carry on with the rest
        for i in range(1, id_dict[key] + 1):
            url = head + key + "_" + str(i) + ".html?ss=1"
            print("Download:")
            print(url)
            count += 1
            gethtml(url, key, i)
    print("Total Videos:")
    print(count)
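The parsing step can be tried without network access on a hand-written page. The sketch below follows the same idea as the script's parser class (collect one attribute from one kind of tag); the class name AttrCollector, the helper collect, and the sample HTML are invented for this example and do not come from the real site.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect the value of one attribute from every occurrence of one tag."""

    def __init__(self, wanted_tag, wanted_attr):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.wanted_attr = wanted_attr
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != self.wanted_tag:
            return
        for name, value in attrs:
            if name == self.wanted_attr:
                self.values.append(value)


def collect(html, tag, attr):
    """Parse html and return all values of attr found on tag elements."""
    parser = AttrCollector(tag, attr)
    parser.feed(html)
    parser.close()
    return parser.values


# Hypothetical page fragment, written by hand for the demonstration.
SAMPLE_PAGE = """
<html><body>
  <a href="http://www.jikexueyuan.com/course/712_1.html?ss=1">Lesson 1</a>
  <video><source src="http://example.com/lesson1.mp4" type="video/mp4"></video>
</body></html>
"""

if __name__ == "__main__":
    print(collect(SAMPLE_PAGE, "source", "src"))
    print(collect(SAMPLE_PAGE, "a", "href"))
```

Calling collect with ("source", "src") corresponds to the video lookup, and ("a", "href") to the course and lesson link lookup.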

Possible improvement: the result is hard to browse because the script does not fetch the name of each video, so the files are named only by course ID and lesson number. It could fetch each video's name and create a folder per category, which would make the downloads easier to find and view.
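One way the naming part of that improvement might look is sketched below. How the title and category would be read from the page is not covered, since the article does not show the page structure; safe_name, video_path, and the sample titles are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def safe_name(title, fallback="untitled"):
    """Replace characters that are awkward in file names and trim the result."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_. ")
    return cleaned or fallback


def video_path(base_dir, category, course_id, lesson_no, title):
    """Return base_dir/<category>/<course>_<lesson>_<title>.mp4, creating the folder."""
    folder = Path(base_dir) / safe_name(category)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{course_id}_{lesson_no:02d}_{safe_name(title)}.mp4"


if __name__ == "__main__":
    # A temporary directory stands in for the real download folder.
    with tempfile.TemporaryDirectory() as base:
        print(video_path(base, "Game Dev", "712", 1, "Intro/Setup"))
```

Keeping the course ID and lesson number at the front of the file name preserves the original ordering even when titles are missing or duplicated.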

The cookies really can be used directly. This means that anyone who intercepts a user's browser login information can also log in as that user and obtain their data. Is this the principle behind attackers stealing cookies and then stealing user information? Interesting.
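The reason this works is that the server only sees the Cookie header of each request, so any client that presents a valid session cookie is treated as the logged-in user. The sketch below builds such a request without sending it. It uses the standard library's urllib.request rather than Requests so that nothing leaves the machine, and the cookie names sessionid and uname are made-up examples.

```python
import urllib.request


def parse_cookie_header(raw):
    """Split 'a=1; b=2' into {'a': '1', 'b': '2'}; entries without '=' are skipped."""
    cookies = {}
    for part in raw.split(";"):
        if "=" not in part:
            continue
        # maxsplit=1 keeps any further '=' characters inside the value.
        name, value = part.strip().split("=", 1)
        cookies[name] = value
    return cookies


def request_with_cookies(url, cookies):
    """Build (but do not send) a request carrying the cookies in its Cookie header."""
    header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(url, headers={"Cookie": header})


if __name__ == "__main__":
    jar = parse_cookie_header("sessionid=abc123; uname=demo")
    req = request_with_cookies("http://www.jikexueyuan.com/course/", jar)
    print(req.get_header("Cookie"))
```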



