A Detailed Walkthrough of Crawling Coursera Course Resources with Python

Sometimes we want to keep classic things in our collection so we can revisit them, and some of the courses on Coursera are undoubtedly classics. Most courses that have finished on Coursera provide a complete set of teaching resources, including PPTs, videos, and subtitles, which makes offline study very convenient. Obviously, we are not going to download the files one by one; only fools do that, and programmers are smart!





So what do we smart people do? Write a script for batch downloading, of course. First, let's analyze the manual download process: log in to your Coursera account (some courses require us to log in and enroll in a session before we can see the corresponding resources), find the file links on the course resources page, and then download them with whatever tool you like.



Simple, isn't it? We can use a program to imitate the steps above and free our hands. The whole program is divided into three parts:



1. Log in to Coursera;
2. find the resource links on the course resources page;
3. choose an appropriate tool to download the resources based on those links.



Now for the concrete implementation!



Login



At first I did not add a login module, assuming that visitors could download the corresponding course resources without logging in. Later, while testing the comnetworks-002 course, I found that visitors are automatically redirected to the login page when they visit the resources page. The figure below shows what Chrome in Incognito mode displays when accessing the course resources page.






To simulate a login, we first find the login page and then use Chrome's Developer Tools to analyze how the account and password are uploaded to the server.



We fill in the account and password in the form on the login page, then click Sign In. At the same time, we keep an eye on the Network panel of Developer Tools to find the URL where the account information is submitted. In general, submitting information to a server uses the POST method, so we just need to find the request whose method is POST. Annoyingly, every time I logged in I could not find that address in the Network panel. My guess is that after a successful login the browser jumps straight to the post-login page, and the request we are looking for flashes by too quickly.



So I deliberately entered a wrong account and password to make the login fail, and that did reveal the address of the POST request, as shown below:






The address is https://accounts.coursera.org/api/v1/login. To see what is submitted to the server, we look more closely at the form contents of the POST request, as shown below:






We see a total of three fields:



email: the registered email address of the account
password: the account password
webrequest: an extra field whose value is "true"



Now let's write the code. I chose the Python Requests library to simulate the login. The Requests official website describes it like this:


Requests is an elegant and simple HTTP library for Python, built for human beings.


Requests really is simple and convenient to use, living up to its claim of being an HTTP library designed for humans. Requests provides a Session object that can carry the same data across different requests, for example sending the same cookie with every request.



The initial code is as follows:


import requests

signin_url = "https://accounts.coursera.org/api/v1/login"
# the three fields observed in the login form
logininfo = {"email": "...",
             "password": "...",
             "webrequest": "true"
             }
user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/36.0.1985.143 Safari/537.36")
post_headers = {"User-Agent": user_agent,
                "Referer": "https://accounts.coursera.org/signin"
                }
# a Session keeps cookies and other state across requests
coursera_session = requests.Session()
login_res = coursera_session.post(signin_url,
                                  data=logininfo,
                                  headers=post_headers,
                                  )
if login_res.status_code == 200:
    print "Login Successfully!"
else:
    print login_res.text


We place the fields submitted in the form into a dictionary and pass it to Session.post as the data parameter. In general, it is best to add request headers such as User-Agent and Referer: User-Agent simulates a browser request, and Referer tells the server which page we came to the current request from, since the server sometimes checks the Referer field to make sure the request was made from a particular page.



Running the snippet above produces a strange result: it shows the message Invalid CSRF Token. Later I searched GitHub for a Coursera bulk-download script and found that its page requests carried four extra header fields: XCSRF2Cookie, XCSRF2Token, XCSRFToken, and cookie. So I looked again at the request headers of the POST request and found that these fields were indeed there; presumably the server uses them to enforce some restrictions.



After logging in a few more times in the browser, I found that XCSRF2Token and XCSRFToken are random strings of length 24, and XCSRF2Cookie is "csrf2_token_" followed by a random string of length 8. I never figured out exactly how the cookie is generated, but judging from the code on GitHub, the cookie seems to be just "csrftoken" combined with the other three values, and trying that worked.



Adding the following to the original code is sufficient.


import random
import string

def randomString(length):
    # random string of letters and digits of the given length
    return ''.join(random.choice(string.letters + string.digits)
                   for i in xrange(length))

XCSRF2Cookie = 'csrf2_token_%s' % randomString(8)
XCSRF2Token = randomString(24)
XCSRFToken = randomString(24)
cookie = "csrftoken=%s; %s=%s" % (XCSRFToken, XCSRF2Cookie, XCSRF2Token)
post_headers = {"User-Agent": user_agent,
                "Referer": "https://accounts.coursera.org/signin",
                "X-Requested-With": "XMLHttpRequest",
                "X-CSRF2-Cookie": XCSRF2Cookie,
                "X-CSRF2-Token": XCSRF2Token,
                "X-CSRFToken": XCSRFToken,
                "Cookie": cookie
                }


At this point the login function is basically working.



Analyze Resource Links



After logging in successfully, we just need to fetch the content of the resource page and then filter out the resource links we need. The address of the resource page is simple: https://class.coursera.org/name/lecture, where name is the course name. For example, for the course comnetworks-002, the resource page address is https://class.coursera.org/comnetworks-002/lecture.
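For completeness, here is a minimal sketch of fetching that page, reusing the coursera_session and user_agent defined in the login section (the course_name value is just an example); the resulting content string is what gets parsed with BeautifulSoup below:


course_name = "comnetworks-002"
lecture_url = "https://class.coursera.org/%s/lecture" % course_name
page_res = coursera_session.get(lecture_url,
                                headers={"User-Agent": user_agent})
content = page_res.text  # HTML of the resource page, parsed in the next section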



After fetching the page, we need to parse the HTML, for which we use BeautifulSoup. BeautifulSoup is a Python library for extracting data from HTML or XML files, and it is quite powerful. The official website has very detailed documentation on its use, so I will not repeat it here. Before using BeautifulSoup, we also need to work out the patterns of the resource links so we can filter them.



Each week's overall topic is in a div tag with class=course-item-list-header, the week's lectures are in a ul tag with class=course-item-list-section-list, each lecture is in an li tag, and each lecture's resources are in a div tag inside that li.



After reviewing several courses, it is easy to find a way to filter resource links, as follows:



PPT and PDF resources: match the links with regular expressions;
subtitle resources: find the tags with title="Subtitles (srt)" and take their href attribute;
video resources: find the tags with title="Video (MP4)" and take their href attribute.



Subtitles and videos could also be filtered with regular expressions, but matching on the title attribute with BeautifulSoup is more readable. PPT and PDF resources have no fixed title attribute, so they have to be matched with regular expressions.



The specific code is as follows:


import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(content)
# weekly topic headers and the corresponding lists of lectures
chapter_list = soup.find_all("div", class_="course-item-list-header")
lecture_resource_list = soup.find_all("ul", class_="course-item-list-section-list")
ppt_pattern = re.compile(r'https://[^"]*\.ppt[x]?')
pdf_pattern = re.compile(r'https://[^"]*\.pdf')
for lecture_item, chapter_item in zip(lecture_resource_list, chapter_list):
    # weekly title
    chapter = chapter_item.h3.text.lstrip()
    # each lecture sits in its own li tag
    for lecture in lecture_item.find_all("li"):
        lecture_name = lecture.a.string.lstrip()
        # get resource links
        ppt_tag = lecture.find(href=ppt_pattern)
        pdf_tag = lecture.find(href=pdf_pattern)
        srt_tag = lecture.find(title="Subtitles (srt)")
        mp4_tag = lecture.find(title="Video (MP4)")
        print ppt_tag["href"], pdf_tag["href"]
        print srt_tag["href"], mp4_tag["href"]


Download Resources



Now that we have the resource links, the download part is easy. Here I choose to download with curl. The idea is simple: write a line of the form curl resource_link -o file_name for each resource into a seed file, for example feed.sh. Then we just give the seed file execute permission and run it.
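Purely as an illustration, a minimal sketch of generating such a seed file might look like this (the resources list of (link, file_name) pairs is a placeholder for the links collected in the previous section):


resources = [("https://example.com/lecture1.pdf", "1-1 Example (15:46).pdf")]  # placeholder data

with open("feed.sh", "w") as feed:
    feed.write("#!/bin/sh\n")
    for link, file_name in resources:
        # quote the URL and file name so characters like & and spaces survive the shell
        feed.write("curl '%s' -o '%s'\n" % (link, file_name))


After that, chmod +x feed.sh followed by ./feed.sh downloads everything.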



To keep the course resources organized by category, we can create a folder named after each week's topic and download all of that week's resources into it. To quickly locate every resource of a given lecture, we can name all of a lecture's files "lecture name.file type". The concrete implementation is quite simple, so I will not give it here. Instead, take a look at part of the feed.sh file from a test run:


mkdir 'Week 1: Introduction, protocols, and layering'
cd 'Week 1: Introduction, protocols, and layering'
curl https://d396qusza40orc.cloudfront.net/comnetworks/lect/1-readings.pdf -o '1-1 Goals and Motivation (15:46).pdf'
curl 'https://class.coursera.org/comnetworks-002/lecture/subtitles?q=25_en&format=srt' -o '1-1 Goals and Motivation (15:46).srt'
curl 'https://class.coursera.org/comnetworks-002/lecture/download.mp4?lecture_id=25' -o '1-1 Goals and Motivation (15:46).mp4'
curl https://d396qusza40orc.cloudfront.net/comnetworks/lect/1-readings.pdf -o '1-2 Uses of Networks (17:12).pdf'
curl 'https://class.coursera.org/comnetworks-002/lecture/subtitles?q=11_en&format=srt' -o '1-2 Uses of Networks (17:12).srt'
curl 'https://class.coursera.org/comnetworks-002/lecture/download.mp4?lecture_id=11' -o '1-2 Uses of Networks (17:12).mp4'


So far, we have successfully achieved the goal of crawling Coursera course resources; the full code is available as a gist. To use it, we just run the program and pass the course name as an argument (not the full course title, but the short name that appears in the URL of the course introduction page; for example, for the Computer Networks course, the course name is comnetworks-002).
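A minimal sketch of such a command-line entry point, with an illustrative function name standing in for the real logic in the gist:


import sys

def crawl_course(course_name):
    # placeholder for the login, parsing, and download steps described above
    print "Crawling resources for course:", course_name

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print "Usage: python coursera_crawler.py <course_name>   e.g. comnetworks-002"
        sys.exit(1)
    crawl_course(sys.argv[1])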



In fact, this program can be seen as a simple little web crawler. Below is a rough introduction to the concept of a crawler.



Crawlers Are Not That Simple



As for what a crawler is, here is what Wikipedia says:


A web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.


The overall structure of a crawler is as follows (image from Wikipedia):






Simply put, the crawler takes initial URLs from the scheduler, downloads the corresponding pages, stores the useful data, analyzes the links in those pages, and adds any links that have not yet been visited to the scheduler's queue of pages waiting to be crawled.
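As a rough illustration of that loop (not the Coursera script above), here is a minimal breadth-first crawler sketch: a deque plays the role of the scheduler, and a naive regular expression stands in for real link extraction:


from collections import deque
import re

import requests

LINK_RE = re.compile(r'href="(https?://[^"]+)"')  # naive link extraction, for illustration only

def crawl(start_urls, max_pages=20):
    queue = deque(start_urls)   # the scheduler: URLs waiting to be crawled
    seen = set(start_urls)      # URLs already scheduled, so we never revisit them
    pages = {}                  # storage: URL -> downloaded HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = requests.get(url).text       # download the page
        pages[url] = html                   # keep the useful data
        for link in LINK_RE.findall(html):  # analyze the links in the page
            if link not in seen:            # enqueue only unvisited links
                seen.add(link)
                queue.append(link)
    return pages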



There are, of course, conventions that constrain crawler behavior; for example, many websites have a robots.txt file that specifies which content may be crawled and which may not.
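Python's standard library can check those rules before fetching a URL; here is a small sketch using the Python 2 robotparser module (the URLs are only examples):


import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.coursera.org/robots.txt")
rp.read()  # download and parse the robots.txt file
print rp.can_fetch("*", "https://www.coursera.org/some/page")  # True if crawling is allowed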



Behind every search engine is a powerful crawler that stretches its tentacles into every corner of the network, collecting useful information and building an index. Search-engine-level crawlers are extremely complex: the number of pages on the web is enormous, and merely traversing them all is already very hard, let alone analyzing the page content and building the index.



In practical applications, we usually only need to crawl a specific site for a small number of resources, so the implementation is much simpler. But there are still plenty of headaches. For example, many page elements are generated by JavaScript, in which case we need a JavaScript engine to render the whole page before filtering it.



Even worse, many sites take measures to stop crawlers from fetching their resources, such as limiting the number of requests from the same IP within a period of time, limiting the interval between two operations, adding CAPTCHAs, and so on. In most cases we do not know exactly how the server side blocks crawlers, so getting a crawler to work just right is genuinely difficult.
