Python: A Detailed Walkthrough of Crawling Coursera Course Resources
Sometimes we want to bookmark classic material and review it from time to time, and some courses on Coursera are undoubtedly classics. Most of Coursera's completed courses provide a full set of teaching resources, including PPTs, videos, and subtitles, which makes offline study very convenient. Obviously, we are not going to download the files one by one by hand; that would be foolish, and programmers are smart people!
So what do smart people do? Write a script to download everything in batches. First, let's analyze the manual download process: log on to your Coursera account (for some courses, you must be logged in and enrolled before you can see the corresponding resources), then find the file links on the course resources page and download them with your favorite tool.
It's easy, right? We can use a program to imitate the above steps and free up our hands. The entire program can be divided into three parts:
1. Log on to Coursera.
2. Find the resource links on the course resources page.
3. Select an appropriate tool based on the resource link and download the resource.
The specific implementation is as follows!
Login
At the beginning, I did not add a login module, thinking that a visitor could download the corresponding course resources. Later I found that, for the course comnetworks-002, the site automatically redirects a visitor to the login page when the resource page is accessed, which is exactly what happens when Chrome visits the course resource page in incognito mode.
To simulate login, first find the login page, then use Chrome's Developer Tools to analyze how the account and password are submitted to the server.
Enter the account and password in the form on the login page, then click Log In. At the same time, keep an eye on Developer Tools > Network to find the URL used to submit the account information. In general, information is submitted to the server with the POST method, so we only need to find the request whose method is POST. The tragedy is that the address for submitting the account information cannot be found in the Network panel when the login succeeds: the browser immediately redirects to the post-login page, and the request we want flashes by.
Therefore, I entered a set of credentials and deliberately failed to log in, which let me find the POST address:
https://accounts.coursera.org/api/v1/login. To see what is submitted to the server, look further at the form data of that POST request:
We can see that there are three fields in total:
email: the account's registration email
password: the account's password
webrequest: an additional field, whose value is true
Next, let's start writing code. I chose Python's Requests library to simulate login. This is the description on the Requests official website:
Requests is an elegant and simple HTTP library for Python, built for human beings.
Requests really is simple and convenient to use; it is an HTTP library designed for humans. Requests provides a Session object, which can be used to share the same data across different requests, for example carrying the same cookie in every request.
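For example (a minimal sketch against the public httpbin.org test service, not part of the original article), a cookie set in one response is sent back automatically on later requests:

import requests

s = requests.Session()
# the server sets a cookie in this response ...
s.get("https://httpbin.org/cookies/set/name/value")
# ... and the Session automatically sends it back here
r = s.get("https://httpbin.org/cookies")
print r.text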
The initial code is as follows:
import requests

signin_url = "https://accounts.coursera.org/api/v1/login"
logininfo = {"email": "...",
             "password": "...",
             "webrequest": "true"
             }
user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/36.0.1985.143 Safari/537.36")
post_headers = {"User-Agent": user_agent,
                "Referer": "https://accounts.coursera.org/signin"
                }
coursera_session = requests.Session()
login_res = coursera_session.post(signin_url,
                                  data=logininfo,
                                  headers=post_headers,
                                  )
if login_res.status_code == 200:
    print "Login Successfully!"
else:
    print login_res.text
Store the content submitted in the form in a dictionary and pass it to Session.post as the data parameter. In general, it is best to add User-Agent and Referer to the request headers: the User-Agent simulates a browser request, and the Referer tells the server which page we jumped from to reach the current page; sometimes the server checks the Referer field of a request to ensure it was redirected from a fixed address.
The running result of the above snippet is very strange: the message Invalid CSRF Token is displayed. Later, I found a Coursera batch-download script on GitHub and noticed that its request carried four more header fields: XCSRF2Cookie, XCSRF2Token, XCSRFToken, and Cookie. So I looked at the request headers of the POST request again and found that these fields are indeed present; presumably the server uses them for some checks.
After logging in a few times in the browser, I found that XCSRF2Token and XCSRFToken are random strings of length 24, and XCSRF2Cookie is "csrf2_token_" followed by a random string of length 8. However, I never figured out how the Cookie is produced; judging from the GitHub code mentioned above, it seems to be a combination of "csrftoken" and the other three fields. I tried it, and it works.
It is enough to add the following parts to the original code:
import random
import string

def randomString(length):
    # a random string of letters and digits of the given length
    return ''.join(random.choice(string.letters + string.digits)
                   for i in xrange(length))

XCSRF2Cookie = 'csrf2_token_%s' % randomString(8)
XCSRF2Token = randomString(24)
XCSRFToken = randomString(24)
cookie = "csrftoken=%s; %s=%s" % (XCSRFToken, XCSRF2Cookie, XCSRF2Token)

post_headers = {"User-Agent": user_agent,
                "Referer": "https://accounts.coursera.org/signin",
                "X-Requested-With": "XMLHttpRequest",
                "X-CSRF2-Cookie": XCSRF2Cookie,
                "X-CSRF2-Token": XCSRF2Token,
                "X-CSRFToken": XCSRFToken,
                "Cookie": cookie
                }
So far, the login function is basically implemented.
Analyzing Resource Links
After a successful login, we only need to fetch the content of the resource page and then filter out the desired resource links. The address of the resource page is very simple: https://class.coursera.org/name/lecture, where name is the course name. For example, for the course comnetworks-002, the resource page address is https://class.coursera.org/comnetworks-002/lecture.
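Concretely, reusing the coursera_session and user_agent from the login step, fetching the page might look like this minimal sketch (course_name is whatever course you want):

course_name = "comnetworks-002"
lecture_url = "https://class.coursera.org/%s/lecture" % course_name
# fetch the resource page with the logged-in session
page_res = coursera_session.get(lecture_url, headers={"User-Agent": user_agent})
content = page_res.text   # the HTML analyzed below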
After fetching the page, we need to analyze the HTML. Here we choose BeautifulSoup, a Python library for extracting data from HTML or XML files; it is quite powerful, and detailed documentation is available on the official website. Before using BeautifulSoup, we have to work out the pattern of the resource links so that we can filter them.
Each weekly title of the course is in a div tag with class=course-item-list-header; the week's lectures are in a ul tag with class=course-item-list-section-list; each lecture is in a li tag, and the lecture's resources are in the div tag inside that li tag.
After checking several courses, it is easy to filter resource links, as shown below:
PPT and PDF resources: match the links with regular expressions. Subtitle resources: find the tag with title="Subtitles (srt)" and take its href attribute. Video resources: find the tag with title="Video (MP4)" and take its href attribute.
Subtitles and videos could also be filtered with regular expressions, but matching the title attribute with BeautifulSoup is easier to understand. The PPT and PDF resources have no fixed title attribute, so they have to be matched with regular expressions.
The Code is as follows:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(content)
# weekly titles and the corresponding lecture lists
chapter_list = soup.find_all("div", class_="course-item-list-header")
lecture_resource_list = soup.find_all("ul", class_="course-item-list-section-list")
ppt_pattern = re.compile(r'https://[^"]*\.ppt[x]?')
pdf_pattern = re.compile(r'https://[^"]*\.pdf')
for lecture_item, chapter_item in zip(lecture_resource_list, chapter_list):
    # weekly title
    chapter = chapter_item.h3.text.lstrip()
    for lecture in lecture_item.find_all("li"):
        lecture_name = lecture.a.string.lstrip()
        # get resource links (a tag may be None if a lecture lacks that resource)
        ppt_tag = lecture.find(href=ppt_pattern)
        pdf_tag = lecture.find(href=pdf_pattern)
        srt_tag = lecture.find(title="Subtitles (srt)")
        mp4_tag = lecture.find(title="Video (MP4)")
        print ppt_tag["href"], pdf_tag["href"]
        print srt_tag["href"], mp4_tag["href"]
Download Resources
Now that we have the resource links, the download part is easy. Here I choose curl to do the downloading. The idea is simple: write each curl resource_link -o file_name command into a seed file, such as feed.sh; then we only need to give the seed file execute permission and run it.
To keep the course resources organized, we can create a folder for each weekly title and download that week's resources into it. To quickly locate all the resources of a lecture, we can name all of a lecture's resource files lecture name.file type. The implementation is relatively simple and the full program is not given here; take a look at part of the feed.sh file from a test example (a rough generation sketch follows the excerpt):
mkdir 'Week 1: Introduction, Protocols, and Layering'
cd 'Week 1: Introduction, Protocols, and Layering'
curl https://d396qusza40orc.cloudfront.net/comnetworks/lect/1-readings.pdf -o '1-1 Goals and Motivation (15:46).pdf'
curl 'https://class.coursera.org/comnetworks-002/lecture/subtitles?q=25_en&format=srt' -o '1-1 Goals and Motivation (15:46).srt'
curl 'https://class.coursera.org/comnetworks-002/lecture/download.mp4?lecture_id=25' -o '1-1 Goals and Motivation (15:46).mp4'
curl https://d396qusza40orc.cloudfront.net/comnetworks/lect/1-readings.pdf -o '1-2 Uses of Networks (17:12).pdf'
curl 'https://class.coursera.org/comnetworks-002/lecture/subtitles?q=11_en&format=srt' -o '1-2 Uses of Networks (17:12).srt'
curl 'https://class.coursera.org/comnetworks-002/lecture/download.mp4?lecture_id=11' -o '1-2 Uses of Networks (17:12).mp4'
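For illustration only (this is not the author's actual script), generating such a seed file from the parsing loop above might look like this, reusing chapter_list, lecture_resource_list, and pdf_pattern:

with open("feed.sh", "w") as feed:
    feed.write("#!/bin/sh\n")
    for lecture_item, chapter_item in zip(lecture_resource_list, chapter_list):
        chapter = chapter_item.h3.text.lstrip()
        # one directory per weekly title
        feed.write("mkdir '%s'\ncd '%s'\n" % (chapter, chapter))
        for lecture in lecture_item.find_all("li"):
            lecture_name = lecture.a.string.lstrip()
            for tag, ext in ((lecture.find(href=pdf_pattern), "pdf"),
                             (lecture.find(title="Subtitles (srt)"), "srt"),
                             (lecture.find(title="Video (MP4)"), "mp4")):
                if tag is not None:   # not every lecture has every resource
                    feed.write("curl '%s' -o '%s.%s'\n"
                               % (tag["href"], lecture_name, ext))
        feed.write("cd ..\n")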
So far, we have achieved the goal of crawling Coursera course resources; the specific code is available as a gist. To use it, just run the program and pass the course name as a parameter (the course name here is not the full name of the course, but the abbreviated name in the address of the course introduction page; for example, for the Computer Networks course, the course name is comnetworks-002).
In fact, this program can be seen as a simple little crawler. The concept of a crawler is briefly described below.
Crawlers are not simple at all.
This is what Wikipedia says about crawlers:
A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.
The overall architecture of a crawler is as follows (image from Wikipedia):
In short, the crawler obtains initial URLs from the Scheduler, downloads the corresponding pages, stores the useful data, and analyzes the links on the pages; any link that has not been visited yet is added to the Scheduler to wait to be crawled.
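As a minimal, illustrative sketch of that loop (not production code; a deque plays the role of the Scheduler, and a naive regex stands in for link extraction):

import re
import requests
from collections import deque

def crawl(seed_urls, max_pages=10):
    scheduler = deque(seed_urls)          # the Scheduler: urls waiting to be crawled
    visited = set()
    link_pattern = re.compile(r'href="(https?://[^"]+)"')
    while scheduler and len(visited) < max_pages:
        url = scheduler.popleft()
        if url in visited:
            continue
        visited.add(url)
        page = requests.get(url).text     # download the page
        # ... store the useful data from the page here ...
        for link in link_pattern.findall(page):
            if link not in visited:       # unvisited links wait to be crawled
                scheduler.append(link)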
Of course, some protocols constrain crawler behavior. For example, many websites have a robots.txt file that specifies which content may be crawled and which may not.
Behind every search engine is a powerful crawler that extends its tentacles to every corner of the network, constantly collecting useful information and building indexes. Such search-engine-level crawlers are very complicated to implement: the number of pages on the web is so huge that merely traversing them all is difficult, let alone analyzing the page information and building indexes.
In practical applications, we usually only need to crawl a specific site and grab a small amount of resources, which is much simpler to implement. However, there are still many headaches; for example, many page elements are generated by JavaScript, in which case we need a JavaScript engine to render the whole page before filtering it.
Worse still, many sites take measures to stop crawlers from grabbing resources, such as limiting the number of visits from one IP address within a period of time, limiting the interval between two operations, or adding CAPTCHAs. In most cases, we do not know how the server side deters crawlers, so making a crawler work really is hard.
Python crawler: how to obtain the list of all Coursera courses?
Coursera's pages contain JavaScript; without executing it, you can only view the page source and will not see what you want, which is a pain. We recommend using WebKit (or a similar rendering engine), or reading up on other ways to handle JavaScript-generated pages.
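For instance, a minimal sketch with Selenium driving a real browser (the course-list URL here is illustrative, not taken from the original answer):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.coursera.org/courses")   # illustrative URL
html = driver.page_source    # the source after JavaScript has run
driver.quit()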