Crawling Coursera Course Resources with Python: A Detailed Walkthrough

Source: Internet
Author: User
Tags: curl, time interval


Sometimes we want to keep a copy of classic material so we can revisit it later, and some of the courses on Coursera are undoubtedly classics. Most completed Coursera courses provide a full set of teaching resources, including slides, videos, and subtitles, which makes offline study very convenient. Obviously we are not going to download the files one by one by hand; only fools do that, and programmers are smart!



So what do we smart people do? Write a script to download everything in bulk, of course. First, let's analyze the manual download process: log in to our Coursera account (some courses require us to be logged in before we can see the corresponding resources), find the right file links on the course resource page, and then download them with our favorite tool.



Simple, isn't it? We can use a program to imitate these steps and free our hands. The whole process splits into three parts:



Log in to Coursera; find the resource links on the course resource page; and download the resources with a suitable tool based on those links.
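To make the plan concrete, here is a rough skeleton of the script we are about to build. The function names (login, find_resource_links, write_feed) are placeholders of my own, not part of the original code; the sections below flesh each step out.

def crawl_course(course_name, email, password):
    """Outline of the whole script; each step is detailed in the sections below."""
    session = login(email, password)                   # 1. log in to Coursera
    links = find_resource_links(session, course_name)  # 2. collect links from the resource page
    write_feed(links)                                  # 3. write curl commands into a seed file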



Now let's get to the implementation!



Login



At first I did not add a login module, hoping visitors could download the course resources directly, but while testing with the comnetworks-002 course I found that visitors who open the resource page are automatically redirected to the login page. The figure below shows Chrome in incognito mode accessing the course resource page.






To simulate a login, we first find the login page and then use Chrome's Developer Tools to analyze how the account and password are submitted to the server.



We enter an account and password into the login form and click Sign In, while keeping an eye on Developer Tools → Network to find the URL that the account information is submitted to. In general, information submitted to a server goes through a POST request, so we only need to find the request whose method is POST. The tragedy is that no matter how many times I logged in, Network never showed the address the account information was submitted to: after a successful login the browser jumps straight to the post-login page, and the request we want flashes by too quickly to catch.



So I entered a random account and password to deliberately fail the login, and sure enough, the POST address showed up, as in the figure below:






The address is https://accounts.coursera.org/api/v1/login. To see what is submitted to the server, look further at the form content of the POST request, shown below:






We see a total of three fields:



email: the registered email address; password: the account password; webrequest: an additional field whose value is "true".



Now to write the code. I chose Python's Requests library to simulate the login; the official site introduces Requests like this:


Requests is an elegant and simple HTTP library for Python, built for human beings.


Requests really is simple and convenient to use, living up to its claim of being an HTTP library designed for humans. Requests provides a Session object, which can carry the same data across different requests, for example sending the same cookies with every request.
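As a quick, self-contained illustration of that cookie behavior (httpbin.org is just a public test service, not part of the Coursera workflow):

import requests

s = requests.Session()
s.get("https://httpbin.org/cookies/set?name=value")  # the server sets a cookie on this session
print(s.get("https://httpbin.org/cookies").text)     # the cookie is sent back automatically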



The initial code is as follows:


import requests

signin_url = "https://accounts.coursera.org/api/v1/login"
logininfo = {"email": "...",
             "password": "...",
             "webrequest": "true"
             }
user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/36.0.1985.143 Safari/537.36")
post_headers = {"User-Agent": user_agent,
                "Referer": "https://accounts.coursera.org/signin"
                }

coursera_session = requests.Session()

login_res = coursera_session.post(signin_url,
                                  data=logininfo,
                                  headers=post_headers,
                                  )
if login_res.status_code == 200:
    print("Login successfully!")
else:
    print(login_res.text)


The content submitted in the form is stored in a dictionary and then passed as the data parameter to Session.post. In general it is best to add request headers such as User-Agent and Referer: User-Agent is used to simulate a browser request, and Referer tells the server which page we jumped from to reach the current one, since some servers check the Referer field to make sure requests come from a fixed address.



The fragment above gives a very strange result, displaying the message Invalid CSRF Token. Later I found a Coursera bulk-download script on GitHub and noticed that its login request headers carry four more fields: XCSRF2Cookie, XCSRF2Token, XCSRFToken, and cookie. Looking again at the headers of the POST request, those fields really are there, so presumably the server imposes some restrictions on them.



After logging in with the browser a few more times, I found that XCSRF2Token and XCSRFToken are random strings of length 24, and XCSRF2Cookie is "csrf2_token_" plus a random string of length 8. I never managed to work out how the cookie itself is generated, but in the GitHub code the cookie appears to be just "csrftoken" combined with the other three values, and trying that actually worked.



Adding the following to the original code is enough.


import random
import string

def randomstring(length):
    return ''.join(random.choice(string.ascii_letters + string.digits) for i in range(length))

XCSRF2Cookie = 'csrf2_token_%s' % randomstring(8)
XCSRF2Token = randomstring(24)
XCSRFToken = randomstring(24)
cookie = "csrftoken=%s; %s=%s" % (XCSRFToken, XCSRF2Cookie, XCSRF2Token)

post_headers = {"User-Agent": user_agent,
                "Referer": "https://accounts.coursera.org/signin",
                "X-Requested-With": "XMLHttpRequest",
                "X-CSRF2-Cookie": XCSRF2Cookie,
                "X-CSRF2-Token": XCSRF2Token,
                "X-CSRFToken": XCSRFToken,
                "Cookie": cookie
                }


With that, the login function works.
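Putting the pieces together, here is a minimal sketch of what the full login step might look like. Wrapping it in a login function that returns the session is my own choice, not the original script's:

import random
import string
import requests

USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/36.0.1985.143 Safari/537.36")

def randomstring(length):
    return ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(length))

def login(email, password):
    """Log in to Coursera and return the authenticated requests.Session."""
    signin_url = "https://accounts.coursera.org/api/v1/login"
    xcsrf2cookie = 'csrf2_token_%s' % randomstring(8)
    xcsrf2token = randomstring(24)
    xcsrftoken = randomstring(24)
    post_headers = {
        "User-Agent": USER_AGENT,
        "Referer": "https://accounts.coursera.org/signin",
        "X-Requested-With": "XMLHttpRequest",
        "X-CSRF2-Cookie": xcsrf2cookie,
        "X-CSRF2-Token": xcsrf2token,
        "X-CSRFToken": xcsrftoken,
        "Cookie": "csrftoken=%s; %s=%s" % (xcsrftoken, xcsrf2cookie, xcsrf2token),
    }
    session = requests.Session()
    res = session.post(signin_url,
                       data={"email": email, "password": password, "webrequest": "true"},
                       headers=post_headers)
    if res.status_code != 200:
        raise RuntimeError("Login failed: %s" % res.text)
    return session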



Analyze Resource Links



After logging in successfully, we only need to fetch the content of the resource page and then filter out the resource links we want. The address of the resource page is simple: https://class.coursera.org/name/lecture, where name is the course name. For example, for the course comnetworks-002 the resource page address is https://class.coursera.org/comnetworks-002/lecture.
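Assuming the logged-in session from the previous section (called coursera_session there), fetching the raw HTML of the resource page might look like this sketch:

course_name = "comnetworks-002"  # the short course name from the URL
lecture_url = "https://class.coursera.org/%s/lecture" % course_name

res = coursera_session.get(lecture_url)
content = res.text  # the HTML that gets handed to BeautifulSoup below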



After crawling the page, we need to parse the HTML, and here I chose BeautifulSoup. BeautifulSoup is a Python library for extracting data from HTML or XML files, and it is quite powerful; the official site has very detailed documentation, so I will not repeat it here. Before using BeautifulSoup we also have to work out the pattern of the resource links so we can filter them.



Each week's topic heading sits in a div tag with class=course-item-list-header; the week's lectures are in a ul tag with class=course-item-list-section-list; each lecture sits in an li tag; and a lecture's resources are in a div tag inside that li.



After inspecting several courses, a simple way to filter the resource links emerges:



PPT and PDF resources: match the links with regular expressions. Subtitle resources: find the tag with title="Subtitles (srt)" and take its href attribute. Video resources: find the tag with title="Video (MP4)" and take its href attribute.



Subtitles and videos could also be filtered with regular expressions, but matching on the title attribute with BeautifulSoup reads better. The PPT and PDF resources have no fixed title attribute, so regular expressions are the only option for them.



The specific code is as follows:


import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(content)
chapter_list = soup.find_all("div", class_="course-item-list-header")
lecture_resource_list = soup.find_all("ul", class_="course-item-list-section-list")

ppt_pattern = re.compile(r'https://[^"]*\.ppt[x]?')
pdf_pattern = re.compile(r'https://[^"]*\.pdf')

for lecture_item, chapter_item in zip(lecture_resource_list, chapter_list):
    # weekly title
    chapter = chapter_item.h3.text.lstrip()
    for lecture in lecture_item:
        lecture_name = lecture.a.string.lstrip()
        # get resource links
        ppt_tag = lecture.find(href=ppt_pattern)
        pdf_tag = lecture.find(href=pdf_pattern)
        srt_tag = lecture.find(title="Subtitles (srt)")
        mp4_tag = lecture.find(title="Video (MP4)")
        print(ppt_tag["href"], pdf_tag["href"])
        print(srt_tag["href"], mp4_tag["href"])


Download Resources



Now that we have the resource links, the download part is easy. Here I chose to download with curl. The idea is simple: write a line of the form curl resource_link -o file_name for each resource into a seed file, say feed.sh, then give the seed file execute permission and run it.



To keep the course resources organized, we can create a folder per week named after the week's title and download that week's resources into it. To quickly locate every resource of a given lecture, we can name each resource file "lecture name.file type". The implementation is fairly simple, so I will not spell it out here; instead, take a look at part of the feed.sh file from a test run:


mkdir 'Week 1: Introduction, Protocols, and Layering'
cd 'Week 1: Introduction, Protocols, and Layering'
curl https://d396qusza40orc.cloudfront.net/comnetworks/lect/1-readings.pdf -o '1-1 Goals and Motivation (15:46).pdf'
curl 'https://class.coursera.org/comnetworks-002/lecture/subtitles?q=25_en&format=srt' -o '1-1 Goals and Motivation (15:46).srt'
curl 'https://class.coursera.org/comnetworks-002/lecture/download.mp4?lecture_id=25' -o '1-1 Goals and Motivation (15:46).mp4'
curl https://d396qusza40orc.cloudfront.net/comnetworks/lect/1-readings.pdf -o '1-2 Uses of Networks (17:12).pdf'
curl 'https://class.coursera.org/comnetworks-002/lecture/subtitles?q=11_en&format=srt' -o '1-2 Uses of Networks (17:12).srt'
curl 'https://class.coursera.org/comnetworks-002/lecture/download.mp4?lecture_id=11' -o '1-2 Uses of Networks (17:12).mp4'
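For illustration only, here is a rough sketch of how lines like those could be generated. The helper name write_feed and the (lecture_name, url, extension) tuple layout are my own assumptions, not the original script:

import os
import stat

def write_feed(chapter, resources, feed_path="feed.sh"):
    """Append mkdir/cd/curl commands for one week's resources to the seed file.

    resources is assumed to be a list of (lecture_name, url, extension) tuples.
    """
    with open(feed_path, "a") as feed:
        feed.write("mkdir '%s'\n" % chapter)
        feed.write("cd '%s'\n" % chapter)
        for lecture_name, url, ext in resources:
            feed.write("curl '%s' -o '%s.%s'\n" % (url, lecture_name, ext))
        feed.write("cd ..\n")
    # make the seed file executable so it can be run directly
    os.chmod(feed_path, os.stat(feed_path).st_mode | stat.S_IEXEC)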


At this point we have achieved our goal of crawling Coursera course resources; the full code is on Gist. To use it, we just run the program and pass the course name as an argument (not the full course title, but the abbreviated name in the course page address; for example, for Computer Networks the course name is comnetworks-002).
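The article does not show how that course-name argument is wired in; a minimal sketch with sys.argv (the script name coursera_crawler.py is just a placeholder) could look like this:

import sys

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python coursera_crawler.py <course_name>, e.g. comnetworks-002")
        sys.exit(1)
    course_name = sys.argv[1]
    # ... log in, crawl https://class.coursera.org/<course_name>/lecture, write feed.sh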



In fact this program can be seen as a simple little crawler, so let me give a rough introduction to the concept of a crawler.



Not Even a Simple Crawler



As for what a crawler is, this is how Wikipedia puts it:


A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.


The overall architecture of a crawler looks like this (image from Wikipedia):






In simple terms, the crawler takes initial URLs from the scheduler, downloads the corresponding pages, stores the useful data, analyzes the links in those pages, and adds the links that have not been visited yet back to the scheduler, where they wait to be crawled.
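As a toy illustration of that loop (not a production crawler: it ignores politeness, robots.txt, error handling, and parallelism), the scheduler can be nothing more than a queue plus a visited set:

from collections import deque

import requests
from bs4 import BeautifulSoup

def tiny_crawler(start_urls, max_pages=50):
    """Breadth-first crawl: fetch a page, store it, queue its unvisited links."""
    queue = deque(start_urls)
    visited = set(start_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        pages[url] = html  # the "storage" step
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = a["href"]
            if link.startswith("http") and link not in visited:
                visited.add(link)
                queue.append(link)
    return pages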



Of course, there are conventions that constrain crawler behavior; for example, many sites have a robots.txt file that specifies which content may be crawled and which may not.
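Python's standard library can read such a file; here is a small sketch using urllib.robotparser (this is the Python 3 module path; in Python 2 the module is simply called robotparser, and the URL being checked is just a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.coursera.org/robots.txt")
rp.read()
# True if the rules allow a generic crawler ("*") to fetch this URL
print(rp.can_fetch("*", "https://www.coursera.org/some/page"))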



Behind every search engine there is a powerful crawler that stretches its tentacles into every corner of the network, collecting useful information and building indexes. Crawlers at search-engine scale are very complex, because the number of pages on the web is so large that it is hard even to traverse them all, let alone analyze the page content and build the index.



In practice we usually only need to crawl a specific site and grab a small amount of resources, which is much simpler to implement. Still, there are plenty of headaches; for example, many page elements are generated by JavaScript, in which case we need a JavaScript engine to render the whole page before filtering it.
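One common workaround is to drive a real browser. A hedged sketch with Selenium (it assumes ChromeDriver is installed, which this article does not cover, and the URL is a placeholder):

from selenium import webdriver

driver = webdriver.Chrome()                            # requires chromedriver on the PATH
driver.get("https://example.com/some-js-heavy-page")   # placeholder URL
rendered_html = driver.page_source                     # the HTML after JavaScript has run
driver.quit()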



To make matters worse, many sites take measures to keep crawlers away from their resources, such as limiting the number of requests from one IP within a period of time, restricting the interval between two operations, or adding CAPTCHAs. In most cases we do not know exactly how the server side guards against crawlers, so making the crawler work can be genuinely hard.
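The least we can do on our side is to be polite and space out requests; a trivial sketch (resource_links and download are hypothetical names, and the one-second delay is an arbitrary choice):

import time

for url in resource_links:   # hypothetical list of links collected earlier
    download(url)            # hypothetical download helper
    time.sleep(1)            # pause between requests to stay under rate limits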

