This article walks through the process of crawling Coursera course resources with Python. Some Coursera courses are undoubtedly classics, and most finished courses provide complete teaching resources, including slides (ppt), videos, and subtitles, which makes offline study very convenient. Obviously, downloading the files one by one by hand is tedious; programmers are smart people!
What do smart people do? Write a script to download everything in batches. First, let's analyze the manual download process: log in to your Coursera account (some courses require you to log in and enroll before the resources are visible), find the corresponding file links on the course resources page, and download them with your favorite tool.
It's easy, right? We can use a program to imitate these steps and free up our hands. The whole program can be divided into three parts:
1. Log in to Coursera.
2. Find the resource links on the course resources page.
3. Select an appropriate tool to download the resources according to the link type.
The specific implementation is as follows!
Login
At first I did not add a login module, assuming that a visitor could download the course resources directly. However, for a course such as comnetworks-002, a visitor is automatically redirected to the login page when accessing the resource page, which is exactly what happens when Chrome opens the course resource page in incognito mode.
To simulate login, first find the login page, then use Chrome's Developer Tools to analyze how the account and password are submitted to the server.
Enter the account and password in the form on the login page and click log in, while keeping an eye on Developer Tools > Network to find the URL that the account information is submitted to. In general, information is submitted to the server with the POST method, so we only need to look for a request whose method is POST. The annoying part is that the request submitting the account information does not show up in the Network panel on a successful login: the browser jumps straight to the post-login page, and the request we are looking for flashes by.
Therefore, I entered an account and password and deliberately failed to log in, which revealed the address of the POST request:
https://accounts.coursera.org/api/v1/login
To find out what is submitted to the server, take a closer look at the form data of that POST request:
We can see that there are three fields in total:
email: the registered email address of the account
password: the account password
webrequest: an additional field whose value is true
Next, let's start coding. I chose Python's Requests library to simulate the login. This is the description on the Requests official website:
Requests is an elegant and simple HTTP library for Python, built for human beings.
In fact, Requests is simple and convenient to use; it is an HTTP library designed for human beings. Requests provides a Session object, which can be used to share the same data across different requests, for example carrying the same cookie in every request.
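As a tiny illustration (using the public httpbin.org test service, which is not part of the original article), a cookie set during one request is automatically sent back on the next request made with the same Session:

import requests

s = requests.Session()
s.get("https://httpbin.org/cookies/set/sessioncookie/123456")  # server sets a cookie
r = s.get("https://httpbin.org/cookies")                       # the same cookie is sent back automatically
print r.text  # should show {"cookies": {"sessioncookie": "123456"}}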
The initial code is as follows:
import requests

signin_url = "https://accounts.coursera.org/api/v1/login"
logininfo = {"email": "...",
             "password": "...",
             "webrequest": "true"
             }
user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/36.0.1985.143 Safari/537.36")
post_headers = {"User-Agent": user_agent,
                "Referer": "https://accounts.coursera.org/signin"
                }
# the Session keeps cookies across all subsequent requests
coursera_session = requests.Session()
login_res = coursera_session.post(signin_url,
                                  data=logininfo,
                                  headers=post_headers,
                                  )
if login_res.status_code == 200:
    print "Login Successfully!"
else:
    print login_res.text
Store the content submitted in the form in a dictionary and pass it to Session.post as the data parameter. In general, it is best to add User-Agent and Referer to the request headers: User-Agent is used to simulate a browser request, and Referer tells the server which page the request came from; sometimes the server checks the Referer field to make sure the request was redirected from a fixed address.
Running the snippet above gives a strange result; the response says:
Invalid CSRF Token
Later I found a Coursera batch-download script on GitHub and noticed that its request headers carried four extra fields: XCSRF2Cookie, XCSRF2Token, XCSRFToken, and cookie. Looking at the request headers of the POST request again, these fields are indeed there, presumably used by the server side for some checks.
After logging in a few times in the browser, I found that XCSRF2Token and XCSRFToken are random strings of length 24, and XCSRF2Cookie is "csrf2_token_" followed by a random string of length 8. I never figured out exactly how the Cookie is generated, but judging from the GitHub code, it appears to be a combination of "csrftoken" and the other three, and it worked when I tried it.
It is enough to add the following to the original code:
import random
import string

def randomString(length):
    # random alphanumeric string of the given length
    return ''.join(random.choice(string.letters + string.digits) for i in xrange(length))

XCSRF2Cookie = 'csrf2_token_%s' % randomString(8)
XCSRF2Token = randomString(24)
XCSRFToken = randomString(24)
cookie = "csrftoken=%s; %s=%s" % (XCSRFToken, XCSRF2Cookie, XCSRF2Token)

post_headers = {"User-Agent": user_agent,
                "Referer": "https://accounts.coursera.org/signin",
                "X-Requested-With": "XMLHttpRequest",
                "X-CSRF2-Cookie": XCSRF2Cookie,
                "X-CSRF2-Token": XCSRF2Token,
                "X-CSRFToken": XCSRFToken,
                "Cookie": cookie
                }
At this point, the login function is basically working.
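For clarity, the two snippets can be combined into a single helper. This is just a sketch that reuses signin_url, user_agent and randomString from the code above:

def coursera_login(email, password):
    # build the CSRF-related headers and post the credentials in one go
    csrf2_cookie = 'csrf2_token_%s' % randomString(8)
    csrf2_token = randomString(24)
    csrf_token = randomString(24)
    headers = {"User-Agent": user_agent,
               "Referer": "https://accounts.coursera.org/signin",
               "X-Requested-With": "XMLHttpRequest",
               "X-CSRF2-Cookie": csrf2_cookie,
               "X-CSRF2-Token": csrf2_token,
               "X-CSRFToken": csrf_token,
               "Cookie": "csrftoken=%s; %s=%s" % (csrf_token, csrf2_cookie, csrf2_token)}
    session = requests.Session()
    res = session.post(signin_url,
                       data={"email": email,
                             "password": password,
                             "webrequest": "true"},
                       headers=headers)
    # return the logged-in session, or None if the login failed
    return session if res.status_code == 200 else None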
Analyzing resource links
After a successful login, we only need to fetch the content of the resource page and then filter out the resource links we want. The address of the resource page is very simple:
https://class.coursera.org/name/lecture
where name is the course name. For example, for the course comnetworks-002, the resource page address is https://class.coursera.org/comnetworks-002/lecture.
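Fetching that page with the logged-in session (coursera_session from the login code above) is a one-liner; a small sketch, where course_name is an assumed variable holding the short course name:

course_name = "comnetworks-002"  # the short name used in the URL
lecture_url = "https://class.coursera.org/%s/lecture" % course_name
content = coursera_session.get(lecture_url).text  # HTML handed to BeautifulSoup below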
After fetching the page, we need to analyze the HTML file. Here I chose BeautifulSoup, a Python library for extracting data from HTML or XML files; it is quite powerful and comes with detailed documentation on the official site. Before using BeautifulSoup, we have to figure out the pattern of the resource links so we can filter them.
The weekly topic titles of the course are under p tags with class=course-item-list-header; each week's lectures are under a ul tag with class=course-item-list-section-list; each lecture is in an li tag, and the lecture's resources are in the p tags inside that li tag.
After checking several courses, the resource links are easy to filter, as follows:
ppt and pdf resources: match the links with regular expressions;
subtitle resources: find the tag with title="Subtitles (srt)" and take its href attribute;
video resources: find the tag with title="Video (MP4)" and take its href attribute.
Subtitles and videos could also be filtered with regular expressions, but matching the title attribute with BeautifulSoup is easier to understand. The ppt and pdf resources do not have a fixed title attribute, so they have to be matched with regular expressions.
The code is as follows:
from bs4 import BeautifulSoup
import re

# content is the HTML of the lecture page fetched above
soup = BeautifulSoup(content)
chapter_list = soup.find_all("p", class_="course-item-list-header")
lecture_resource_list = soup.find_all("ul", class_="course-item-list-section-list")

ppt_pattern = re.compile(r'https://[^"]*\.ppt[x]?')
pdf_pattern = re.compile(r'https://[^"]*\.pdf')

for lecture_item, chapter_item in zip(lecture_resource_list, chapter_list):
    # weekly title
    chapter = chapter_item.h3.text.lstrip()
    for lecture in lecture_item.find_all("li"):
        lecture_name = lecture.a.string.lstrip()
        # get resource links; find() returns None when a lecture has no such resource
        ppt_tag = lecture.find(href=ppt_pattern)
        pdf_tag = lecture.find(href=pdf_pattern)
        srt_tag = lecture.find(title="Subtitles (srt)")
        mp4_tag = lecture.find(title="Video (MP4)")
        print ppt_tag["href"], pdf_tag["href"]
        print srt_tag["href"], mp4_tag["href"]
Download resources
Now that we have the resource links, the download part is easy. Here I chose to download with curl. The idea is simple: write lines of the form curl resource_link -o file_name into a seed file, say feed.sh, then make the seed file executable and run it.
To keep the course resources organized, we can create a folder named after each weekly title and download that week's lectures into it. To locate the resources of each lecture quickly, we can name all of a lecture's resource files "lecture name.file type".
The implementation is fairly simple, so the original script is not shown here (a rough sketch of generating such a file appears after the sample below). Part of the feed.sh file from a test run looks like this:
mkdir 'Week 1: Introduction, Protocols, and Layering'
cd 'Week 1: Introduction, Protocols, and Layering'
curl 'https://d396qusza40orc.cloudfront.net/comnetworks/lect/1-readings.pdf' -o '1-1 Goals and Motivation (15:46).pdf'
curl 'https://class.coursera.org/comnetworks-002/lecture/subtitles?q=25_en&format=srt' -o '1-1 Goals and Motivation (15:46).srt'
curl 'https://class.coursera.org/comnetworks-002/lecture/download.mp4?lecture_id=25' -o '1-1 Goals and Motivation (15:46).mp4'
curl 'https://d396qusza40orc.cloudfront.net/comnetworks/lect/1-readings.pdf' -o '1-2 Uses of Networks (17:12).pdf'
curl 'https://class.coursera.org/comnetworks-002/lecture/subtitles?q=11_en&format=srt' -o '1-2 Uses of Networks (17:12).srt'
curl 'https://class.coursera.org/comnetworks-002/lecture/download.mp4?lecture_id=11' -o '1-2 Uses of Networks (17:12).mp4'
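As mentioned above, the original script is not reproduced here, but writing such a seed file could look roughly like the sketch below; chapter and resources are assumed to hold one week's title and a list of (file_name, link) pairs collected by the parsing code above:

# append one week's download commands to the seed file
feed = open("feed.sh", "a")
feed.write("mkdir '%s'\n" % chapter)
feed.write("cd '%s'\n" % chapter)
for file_name, link in resources:
    # quote the link so that '&' in the query string does not break the shell
    feed.write("curl '%s' -o '%s'\n" % (link, file_name))
feed.write("cd ..\n")
feed.close()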
So far, we have achieved the goal of crawling Coursera course resources; the full code is on gist. To use it, we only need to run the program and pass the course name as a parameter (the course name here is not the full name of the course, but the abbreviated name in the address of the course introduction page; for example, for the Computer Networks course, the course name is comnetworks-002).
In fact, this program can be seen as a simple little crawler, so let me briefly introduce the concept of a crawler.
Crawlers are not simple at all.
This is what Wikipedia says about a web crawler:
A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.
The overall architecture of a crawler is as follows (image from Wikipedia):
In short, the crawler takes initial URLs from the Scheduler, downloads the corresponding pages, stores the useful data, and analyzes the links on each page; links that have not been visited yet are added back to the Scheduler, waiting to be crawled.
Of course, there are protocols that constrain crawler behavior; for example, many websites have a robots.txt file that specifies which content may be crawled and which may not.
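To make that loop concrete, here is a minimal sketch (not part of the original article) of a single-site crawler that respects robots.txt; the seed URL and page limit are arbitrary examples:

import robotparser
import urlparse
import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=10):
    queue = [seed_url]          # the "Scheduler": URLs waiting to be fetched
    visited = set()
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(seed_url, "/robots.txt"))
    rp.read()
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited or not rp.can_fetch("*", url):
            continue
        visited.add(url)
        page = requests.get(url).text        # the downloader
        # ... store whatever data is useful here ...
        for a in BeautifulSoup(page).find_all("a", href=True):
            link = urlparse.urljoin(url, a["href"])
            if link not in visited:
                queue.append(link)           # schedule unseen links
    return visited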
Every search engine has powerful crawlers that extend their tentacles to every corner of the network, constantly collecting useful information and building indexes. Search-engine-level crawlers are very complicated to implement, because the number of pages on the web is so huge that merely traversing them is hard enough, let alone analyzing the page content and building indexes.
In practical applications, we usually only need to crawl a specific site and fetch a small amount of resources, which is much simpler to implement. Still, there are plenty of headaches; for example, many page elements are generated by JavaScript, in which case we need a JavaScript engine to render the whole page before filtering it.
Worse, many sites take measures to stop crawlers from fetching their resources, such as limiting the number of visits from one IP address within a period of time, limiting the interval between two operations, or adding CAPTCHAs. In most cases, we do not know how the server side guards against crawlers, so making a crawler do its job well is indeed not easy.