Geek College career path course video download crawler
I. Preface
I recently watched the video tutorials from Geek College. They are quite good, and I wanted to download them to my local computer. Downloading them manually is time-consuming, so I decided to study the site and write a program to download them automatically. See the figure below:
II. Technical difficulties
To download automatically, the program must crawl the Geek College pages to obtain the video addresses. A quick test shows that most videos can be viewed only by a logged-in member. The main points to solve are:
1. Obtain a membership
2. Simulate login
3. Parse the HTML
4. Resume downloading (in case of exceptions)
III. Implementation
(1) Getting a membership on Geek College is quite easy (skip this step if you are already a member): inviting a friend earns a 30-day trial membership, as in this example (30 days is enough to download all the videos):
(2) The most difficult problem is simulating the login. On closer inspection, Geek College turns out to provide an AJAX login interface:
http://passport.jikexueyuan.com/submit/login?is_ajax=1, with the verification code served at http://passport.jikexueyuan.com/sso/verify
With these addresses known, logging in is no longer a problem. The login implementation is as follows:
Obtain the verification code:

    // Namespaces used: System, System.IO, System.Windows.Media.Imaging (BitmapImage)

    /// <summary>
    /// Obtain the verification code
    /// </summary>
    void GetVerifyCode()
    {
        VerifyCode.Source = null;

        // 1. First, request the login address to obtain the current user's session cookie
        HttpResponseParameter responseParameter = HttpProvider.Excute(new HttpRequestParameter
        {
            IsPost = false,
            Url = "http://passport.jikexueyuan.com/sso/login"
        });
        SessionCookie = responseParameter.Cookie;

        // 2. Request the verification code image, sending the session cookie
        HttpProvider.Excute(new HttpRequestParameter
        {
            Url = string.Format("http://passport.jikexueyuan.com/sso/verify?t={0}", DateTime.Now.ToString("yyyyMMddHHmmsss")),
            Cookie = SessionCookie,
            ResponseEnum = HttpResponseEnum.Stream,
            StreamAction = x =>
            {
                // Copy the response stream into memory
                MemoryStream ms = new MemoryStream();
                byte[] buffer = new byte[1024];
                while (true)
                {
                    int sz = x.Read(buffer, 0, buffer.Length);
                    if (sz == 0) break;
                    ms.Write(buffer, 0, sz);
                }
                ms.Position = 0;

                // Build the image shown in the verification code control
                BitmapImage bmp = new BitmapImage();
                bmp.BeginInit();
                bmp.StreamSource = new MemoryStream(ms.ToArray());
                bmp.EndInit();

                VerifyCode.Source = bmp;

                ms.Close();
            }
        });
    }

Login code:

    // Namespaces used: System.Collections.Generic, System.Web (HttpUtility)

    /// <summary>
    /// Login method
    /// </summary>
    /// <param name="userName">Account</param>
    /// <param name="userPwd">Password</param>
    /// <param name="verifyCode">Verification code</param>
    void LoginDo(string userName, string userPwd, string verifyCode)
    {
        // 1. Send the login request
        IDictionary<string, string> postData = new Dictionary<string, string>();
        postData.Add("referer", HttpUtility.UrlEncode("http://www.jikexueyuan.com/"));
        postData.Add("uname", userName);
        postData.Add("password", userPwd);
        postData.Add("is_ajax", "1");
        postData.Add("verify", verifyCode);
        HttpResponseParameter responseParameter = HttpProvider.Excute(new HttpRequestParameter
        {
            Url = "http://passport.jikexueyuan.com/submit/login?is_ajax=1",
            IsPost = true,
            Parameters = postData,
            Cookie = SessionCookie
        });

        LoginResultEntity loginResult = responseParameter.Body.DeserializeObject<LoginResultEntity>();
        if (loginResult.status == 1)
        {
            // 2. Login successful; save the cookie
            CookieStoreInstance.CurrentCookie = responseParameter.Cookie;
        }

        // MessageBox.Show(string.Format("body = {0}, cookie = {1}", Unicode2String(responseParameter.Body),
        //     responseParameter.Cookie.CookieString));
    }
Process description: first request a page on Geek College (the code above requests the login page) to obtain the session cookie of the current user, who is not yet logged in. Then fetch the verification code, assemble the form data, and send the login request. After a successful login, save the cookie to a local file or a global variable (the approach used in this example) so that it can be sent when downloading the videos.
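The login request posts its fields as form data. How HttpProvider serializes the Parameters dictionary is not shown in this article, so the following is only a minimal sketch, assuming a standard percent-encoded key=value body. FormEncoder and BuildFormBody are hypothetical names and are not part of the original program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FormEncoder
{
    // Hypothetical helper (an assumption, not the author's code):
    // joins the fields as key=value pairs separated by '&',
    // percent-encoding every key and value.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value ?? string.Empty)));
    }
}
```

Because IDictionary<string, string> is also a sequence of key/value pairs, a dictionary such as postData could be passed to this helper directly; the result would be a string of the form uname=...&password=...&is_ajax=1.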
(3) After logging in successfully, you can fetch the videos. Take http://www.jikexueyuan.com/course/360.html as an example. Open this address, as shown in the figure:
Viewing the page source reveals a useful piece of markup: <source src="http://cv4.jikexueyuan.com/ae892b3b4a8c63fa579af4d2c6f6bb03/201512151558/csharp/course_360/01/video/c360b_01_h264_sd_960_540.mp4" type="video/mp4"/>. The src attribute is the video address we need (this markup is present only after login). So the approach is: first parse this page to obtain the links in the lesson list, then send an HTTP request (with the cookie) to each lesson page to retrieve its video address, and finally download the video and save it locally.
Obtain the lesson list:

    // Namespaces used: System.Collections.Generic, System.Text, HtmlAgilityPack

    public List<LessonEntity> GetLessonEntities(string url, HttpCookieType sessionCookie)
    {
        // Request the lesson list, parse the HTML source, and extract each lesson's title and link
        HttpResponseParameter responseParameter = HttpProvider.Excute(new HttpRequestParameter()
        {
            IsPost = false,
            Url = url,
            Encoding = Encoding.UTF8,
            Cookie = sessionCookie
        });

        List<LessonEntity> results = new List<LessonEntity>();

        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(responseParameter.Body);
        HtmlNode rootNode = htmlDocument.DocumentNode;
        HtmlNodeCollection lessonNodes = rootNode.SelectNodes("//div[@id=\"pager\"]/div[3]/div[2]/div[2]/ul/li");
        // SelectNodes returns null when nothing matches
        if (lessonNodes == null) return results;

        foreach (HtmlNode lessonNode in lessonNodes)
        {
            HtmlNode aNode = lessonNode.SelectSingleNode("div[1]/h2[1]/a[1]");
            if (aNode != null)
            {
                results.Add(new LessonEntity
                {
                    Title = aNode.InnerText.Trim(),
                    Href = aNode.GetAttributeValue("href", string.Empty)
                });
            }
        }

        return results;
    }

Obtain the video address:

    // Namespaces used: System.Text.RegularExpressions

    public string GetVideoUrl(string url, HttpCookieType sessionCookie)
    {
        HttpResponseParameter responseParameter = HttpProvider.Excute(new HttpRequestParameter
        {
            Url = url,
            Cookie = sessionCookie
        });

        // Extract the video file address with a regular expression;
        // the captured group includes the closing quote, which is then removed
        string result = Regex.Match(responseParameter.Body, "<source.*src=\"(.+?\").*/>").Groups[1].Value.Replace("\"", string.Empty);
        // Note: this only updates the local parameter, not the caller's cookie
        sessionCookie = responseParameter.Cookie;
        return result;
    }

Download the video:

    // Namespaces used: System, System.IO

    public void DownloadVideo(string filePath, string url, HttpCookieType sessionCookie, Action action = null)
    {
        string folder = Path.GetDirectoryName(filePath);
        if (!string.IsNullOrEmpty(folder) && !Directory.Exists(folder))
        {
            // Create the folder if it does not exist
            Directory.CreateDirectory(folder);
        }
        if (action != null)
        {
            action();
        }
        HttpProvider.Excute(new HttpRequestParameter
        {
            Url = url,
            ResponseEnum = HttpResponseEnum.Stream,
            Cookie = sessionCookie,
            StreamAction = x =>
            {
                // Write the response stream to the target file
                NetExtensions.WriteFile(filePath, x);
            }
        });
        // WebClient webClient = new WebClient();
        // webClient.DownloadFile(new Uri(url, UriKind.RelativeOrAbsolute), filePath);
    }

Download courseware and source code:

    // Namespaces used: System, System.IO, System.Net (WebClient)

    public void DownloadCode(string filePath, string courseId, HttpCookieType sessionCookie)
    {
        // Skip files that have already been downloaded
        if (File.Exists(filePath)) return;

        HttpResponseParameter responseParameter = HttpProvider.Excute(new HttpRequestParameter
        {
            Url = string.Format("http://www.jikexueyuan.com/course/downloadRes?course_id={0}", courseId),
            Cookie = sessionCookie
        });
        CodeDownloadResultEntity result = responseParameter.Body.DeserializeObject<CodeDownloadResultEntity>();
        if (result.code == 200)
        {
            string folder = Path.GetDirectoryName(filePath);
            if (!string.IsNullOrEmpty(folder) && !Directory.Exists(folder))
            {
                // Create the folder if it does not exist
                Directory.CreateDirectory(folder);
            }

            WebClient webClient = new WebClient();
            webClient.DownloadFile(new Uri(result.data.url, UriKind.RelativeOrAbsolute), filePath);
        }
    }
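To see the extraction step in isolation, here is a minimal sketch that pulls the src value out of a <source> tag like the one quoted earlier. VideoSourceParser and ExtractVideoUrl are hypothetical names, and the pattern is a somewhat stricter variant written for illustration, not the author's exact expression: it stays inside a single tag and does not capture the closing quote.

```csharp
using System.Text.RegularExpressions;

public static class VideoSourceParser
{
    // Matches the src attribute inside the first <source ...> tag.
    // [^>]*? keeps the match within one tag; the group excludes the quotes.
    private static readonly Regex SourceTag =
        new Regex("<source[^>]*?\\ssrc\\s*=\\s*\"([^\"]+)\"", RegexOptions.IgnoreCase);

    // Returns the video address, or an empty string when no <source> tag is found
    // (for example, when the page was requested without a logged-in cookie).
    public static string ExtractVideoUrl(string html)
    {
        Match match = SourceTag.Match(html ?? string.Empty);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

Returning an empty string for a missing tag mirrors what the Regex.Match call in GetVideoUrl produces when nothing matches, so a caller can treat an empty result as "not logged in or no video on this page".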
(4) With the video download function complete, batch downloading the Geek College career path courses (http://ke.jikexueyuan.com/zhiye/) is straightforward: analyze the pages step by step, get the address of each video playback page, and download the video (page parsing mainly uses regular expressions and the HtmlAgilityPack library).
(5) In this way, all the career path course videos of Geek College can be downloaded.
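Continuing after an exception, the fourth point in the list of difficulties, appears in the listings only as the File.Exists check in DownloadCode. One possible way to extend that idea to a batch run is sketched below, assuming it is acceptable to skip finished files and restart unfinished ones from the beginning (this is not a byte-level resume). ResumableDownload, ToSafeFileName, and DownloadIfMissing are hypothetical names, and the ".part" suffix is an arbitrary choice.

```csharp
using System;
using System.IO;
using System.Linq;

public static class ResumableDownload
{
    // Replaces characters that are not allowed in file names on the current
    // platform, so that a lesson title can be used as a file name.
    public static string ToSafeFileName(string title)
    {
        char[] invalid = Path.GetInvalidFileNameChars();
        return new string(title.Select(c => invalid.Contains(c) ? '_' : c).ToArray()).Trim();
    }

    // Runs the download only when the final file is missing. The data is written
    // to a ".part" file and renamed on success, so an interrupted run never leaves
    // a half-written file under the final name. Returns true if a download ran.
    public static bool DownloadIfMissing(string filePath, Action<string> download)
    {
        if (File.Exists(filePath)) return false;

        string folder = Path.GetDirectoryName(filePath);
        if (!string.IsNullOrEmpty(folder)) Directory.CreateDirectory(folder);

        string partPath = filePath + ".part";
        File.Delete(partPath);   // discard any interrupted attempt (no-op if absent)
        download(partPath);      // the caller supplies the actual download
        File.Move(partPath, filePath);
        return true;
    }
}
```

In the batch loop this could wrap the existing method, for example ResumableDownload.DownloadIfMissing(path, p => DownloadVideo(p, videoUrl, cookie)), so that rerunning the program after a failure only fetches the lessons that are still missing.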
IV. Download source code
Geek College video download program
If you found this article helpful, please give it a like, even if you usually just lurk.