The previous nine articles covered everything from the basics to writing real crawlers. This tenth one rounds off the series by recording, step by step, how a crawler gets written. Read along carefully.
First of all, our school's website:
http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html
Checking grades requires logging in. The site then lists the grade for each course, but it shows only the grades and not the grade point, that is, the weighted average score.
Calculating the grade point by hand is obviously tedious, so we can write a crawler in Python to solve the problem.
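For reference, the weighted average mentioned above can be written out as follows. The symbols s_i (score of course i) and c_i (credits of course i) and the two sample courses are assumptions made up for illustration, not values from the site:

```latex
\[
\text{weighted average} = \frac{\sum_{i=1}^{n} s_i \, c_i}{\sum_{i=1}^{n} c_i},
\qquad
\frac{90 \times 4 + 80 \times 2}{4 + 2} = \frac{520}{6} \approx 86.67
\]
```

Here a 4-credit course with a score of 90 counts twice as much as a 2-credit course with a score of 80.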
1. Eve of the Showdown
First, prepare the tool: the HttpFox plugin.
This is an HTTP protocol analysis plugin that shows the timing and content of page requests and responses, as well as the cookies the browser uses.
In my case it is installed on Firefox, and it looks like this:
It makes the relevant information very easy to inspect.
Click Start to begin capturing, Stop to pause capturing, and Clear to erase the captured content.
Generally, before using it, click Stop to pause, then click Clear to clear the screen, so that the data you see comes only from accessing the current page.
2. Deep Behind Enemy Lines
Next we go to Shandong University's grade query site and see exactly what information is sent during login.
First open the login page, open HttpFox, click Clear, then click Start to begin capturing:
Enter your personal information, make sure HttpFox is running, and click OK to submit the login information.
At this point you can see that HttpFox has captured three entries:
Now click Stop, so that the captured data is exactly the traffic from accessing this page. We will use it later when the crawler simulates the login.
3. Discovering
At first glance we have three entries: two GETs and one POST. But what each one actually is and how to use it, we do not know yet.
So we need to go through the captured content one entry at a time.
Look at the POST entry first:
Since it is a POST, we can look directly at the PostData tab.
You can see that two fields are posted in total: stuid and pwd.
The "Redirect to" entry under Type also shows that once the POST completes, the browser jumps to the bks_login2.loginmessage page.
So this data is the form data submitted after clicking OK.
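The PostData that HttpFox shows is an ordinary URL-encoded form body. As a minimal sketch in Python 3 (the article's own code targets Python 2.7), the same body can be built and decoded with the standard library. The field values below are placeholders, not real credentials:

```python
from urllib.parse import urlencode, parse_qs

# Placeholder values: substitute your own student ID and password.
form_fields = {"stuid": "your_student_id", "pwd": "your_password"}

# urlencode() builds the application/x-www-form-urlencoded body
# that the browser sends when the login form is submitted.
post_body = urlencode(form_fields)
print(post_body)

# parse_qs() goes the other way: it decodes a captured body into fields.
decoded_fields = parse_qs(post_body)
print(decoded_fields)
```

Decoding a captured body this way is a quick check that the crawler will send exactly the fields the browser sent.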
Click the Cookies tab to see the cookie information:
Sure enough, a cookie for the account was received, and it is destroyed automatically when the session ends.
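A cookie that disappears when the session ends is one sent without an expiry date. The following Python 3 sketch parses a made-up Set-Cookie value to show this. The cookie name SESSIONID and its value are assumptions for illustration, not what the site actually sends:

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value; the real name and value come from the server.
set_cookie_value = "SESSIONID=example-session-token; path=/"

cookie = SimpleCookie()
cookie.load(set_cookie_value)
morsel = cookie["SESSIONID"]

# No "expires" or "max-age" attribute means the browser keeps the cookie
# only until the session ends.
is_session_cookie = morsel["expires"] == "" and morsel["max-age"] == ""
print(morsel.value, morsel["path"], is_session_cookie)
```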
So what information was received after submitting?
Let's look at the two GET entries that follow.
First, click the Content tab to see what was received. Satisfying, isn't it? The HTML source is laid completely bare:
This appears to be just the HTML source of the page. Click Cookies to see the cookie information:
Aha, so the content of the HTML page was received only after the cookie was sent.
Now look at the last entry received:
A quick look shows it is just a CSS file called style.css, which does not matter much to us.
4. Calm down
Now that we know what data we send to the server and what data we receive back, the basic process is as follows:
First, we POST the student ID and password ---> the server returns a cookie value ---> we send that cookie back to the server ---> the server returns the page content. Then we fetch the grades page, use regular expressions to extract the scores and credits separately, and calculate the weighted average.
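The cookie round trip in this process can be tried offline. The Python 3 sketch below feeds a made-up login response into a CookieJar and then lets the jar attach the cookie to a second request. FakeResponse, the example.com URLs, and the SESSIONID cookie are all hypothetical stand-ins, not the real site:

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical stand-in for an HTTP response; CookieJar only calls info()."""

    def __init__(self, set_cookie_value):
        self._headers = email.message.Message()
        self._headers["Set-Cookie"] = set_cookie_value

    def info(self):
        return self._headers


jar = http.cookiejar.CookieJar()

# Step 1: the login POST; the (fake) server answers with a cookie.
login_request = urllib.request.Request(
    "http://example.com/login",
    data=b"stuid=your_student_id&pwd=your_password",
)
jar.extract_cookies(FakeResponse("SESSIONID=abc123; path=/"), login_request)

# Step 2: the jar adds the stored cookie to the next request automatically.
grades_request = urllib.request.Request("http://example.com/grades")
jar.add_cookie_header(grades_request)

cookie_header = grades_request.get_header("Cookie")
print(cookie_header)
```

When an opener is built with a cookie processor, it performs both of these steps on every request, which is why the crawler never has to handle the cookie by hand.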
OK, this looks simple enough. Let's give it a try.
But before experimenting, one more problem remains: where exactly is the POST data sent?
Look at the original page again:
It is obviously built with HTML frames, which means the address we see in the address bar is not the address of the form on the right.
So how do we get the real address? Right-click and view the page source:
Right, the frame with name="w_right" is the login page we want.
The original address of the website is:
http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html
So the real address of the login form should be:
http://jwxt.sdu.edu.cn:7777/zhxt_bks/xk_login.html
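The step from the frame to the real address is just resolving a relative link against the outer page. A minimal Python 3 sketch follows. The frameset markup is a hypothetical reconstruction for illustration; only the frame name and the two URLs come from the text above:

```python
import re
from urllib.parse import urljoin

outer_url = "http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html"

# Hypothetical markup: the real page source may differ in layout and attributes.
frameset_html = '<frameset><frame name="w_right" src="xk_login.html"></frameset>'

# Find the src attribute of the frame named w_right.
match = re.search(
    r'<frame[^>]*name="w_right"[^>]*src="([^"]+)"', frameset_html, re.IGNORECASE
)

# Resolve the relative src against the address of the outer page.
frame_url = urljoin(outer_url, match.group(1))
print(frame_url)
```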
Enter it in the address bar and, sure enough:
Wow, it turns out to be Tsinghua University's course selection system... My guess is that our school was too lazy to build its own page and simply borrowed this one, without even changing the title...
But this is still not the page we need, because the POST data is submitted to the address given in the action attribute of the form.
In other words, we need to look at the source code to see where the POST data is actually sent:
Well, by the look of it, this is the address the POST data is submitted to.
Written out in full for the address bar, the complete address should be:
http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login
(Getting it is simple: just click that link in Firefox to see the link's address.)
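Reading the form's action can also be done in code rather than by eye. The Python 3 sketch below uses the standard HTMLParser. The form markup is hypothetical and written to match the addresses and field names in this article; FormFinder is an illustrative helper, not part of any library:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormFinder(HTMLParser):
    """Illustrative helper: records the first form's action and its field names."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attributes.get("action")
        elif tag == "input" and attributes.get("name"):
            self.field_names.append(attributes["name"])


login_page_url = "http://jwxt.sdu.edu.cn:7777/zhxt_bks/xk_login.html"

# Hypothetical markup; the real login page may look different.
login_page_html = """
<form method="post" action="/pls/wwwbks/bks_login2.login">
  <input type="text" name="stuid">
  <input type="password" name="pwd">
  <input type="submit" value="OK">
</form>
"""

finder = FormFinder()
finder.feed(login_page_html)

# The action is relative to the page that contains the form.
submit_url = urljoin(login_page_url, finder.action)
print(submit_url, finder.field_names)
```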
5. A Small Trial Run
The next task is to use Python to simulate sending the POST data and capture the returned cookie value.
For working with cookies, see this blog post:
http://www.jb51.net/article/57144.htm
We first prepare the POST data, then prepare a cookie jar to receive the cookie, and write the source code as follows:
# -*- coding: utf-8 -*-
#---------------------------------------
#   Program:  Shandong University crawler
#   Version:  0.1
#   Author:   why
#   Date:     2013-07-12
#   Language: Python 2.7
#   Usage:    enter the student ID and password
#   Function: output the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib

# The CookieJar stores whatever cookies the server sends back
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# Data to POST
postdata = urllib.urlencode({
    'stuid': '201100300428',
    'pwd': '921030'
})

# Build a custom request
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)

# Open the link
result = opener.open(req)

# Print the returned content
print result.read()
After that, look at the result of running it:
OK, just like that, we have successfully simulated the login.
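For readers on Python 3, where urllib2 and cookielib no longer exist, the login request can be sketched roughly as follows. This is an assumed port, not the author's code: build_login_request is a hypothetical helper, the credentials are placeholders, and the network call is left commented out so the snippet runs offline:

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login"


def build_login_request(student_id, password):
    """Hypothetical helper: build the login POST without sending it."""
    body = urllib.parse.urlencode({"stuid": student_id, "pwd": password})
    # In Python 3 the request body must be bytes, not str.
    return urllib.request.Request(LOGIN_URL, data=body.encode("ascii"))


cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

request = build_login_request("your_student_id", "your_password")
# Supplying data turns the request into a POST.
print(request.get_method(), request.full_url)

# To actually log in (requires network access and valid credentials):
# response = opener.open(request)
```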
6. Bogus
The next task is to use the crawler to get the student's grades.
Let's look at the source site again.
With HttpFox running, click to view your grades and you will find the following data captured:
Click the first GET entry and check its content: it is the grades page we are after.
As for the link to that page, right-click in the page and inspect the element to see where the link leads (in Firefox you only need to right-click and choose "View this frame"):
This gives the following link for viewing the grades:
http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre
7. Everything is ready
Now everything is ready, so just plug the link into the crawler and see whether the grades page can be fetched.
As HttpFox shows, the grades are returned only after we send the cookie, so we use Python to simulate sending the cookie when requesting the grades:
# -*- coding: utf-8 -*-
#---------------------------------------
#   Program:  Shandong University crawler
#   Version:  0.1
#   Author:   why
#   Date:     2013-07-12
#   Language: Python 2.7
#   Usage:    enter the student ID and password
#   Function: output the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib

# Initialize a CookieJar to handle the cookie information
cookie = cookielib.CookieJar()

# Create a new opener that uses our CookieJar
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# Data to POST
postdata = urllib.urlencode({
    'stuid': '201100300428',
    'pwd': '921030'
})

# Build a custom request
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)

# Open the link
result = opener.open(req)

# Print the returned content
print result.read()

# Print the cookie values
for item in cookie:
    print 'Cookie: Name = ' + item.name
    print 'Cookie: Value = ' + item.value

# Open the grades link; the opener sends the stored cookie along
result = opener.open('http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre')

# Print the returned content
print result.read()
Press F5 to run and look at the captured data:
Since that works without any problem, we can use regular expressions to process the data a bit and extract the credits and the corresponding scores.
8. Extremely easy
That much HTML source is obviously awkward to work with, so next we use regular expressions to pull out the data we need.
For a tutorial on regular expressions, see this blog post:
http://www.jb51.net/article/57150.htm
Let's take a look at the source of the grades page:
That being the case, it is easy to use regular expressions.
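To see how such a pattern behaves, here is a Python 3 sketch that runs the crawler's pattern against two made-up table rows. The sample HTML is an assumption shaped to fit the pattern (credits in the fifth <p> cell, score in the seventh); the real grades page may be laid out differently:

```python
import re

# Pattern from the crawler code: skip four <p> cells, capture the fifth,
# skip one more, then capture the seventh.
ROW_PATTERN = re.compile(
    r"<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>",
    re.S,  # let "." match newlines, since a row may span several lines
)

# Hypothetical sample rows, not copied from the real site.
sample_page = """
<TR><td><p>1</p></td><td><p>0101</p></td><td><p>Calculus</p></td>
<td><p>001</p></td><td><p>4</p></td><td><p>2013-01</p></td><td><p>90</p></td></TR>
<TR><td><p>2</p></td><td><p>0102</p></td><td><p>Physical Education</p></td>
<td><p>002</p></td><td><p>2</p></td><td><p>2013-01</p></td><td><p>Excellent</p></td></TR>
"""

rows = ROW_PATTERN.findall(sample_page)
for credits, score in rows:
    print(credits, score)
```

Each match is a (credits, score) pair of strings, and a score can be non-numeric text, which the calculation step has to allow for.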
We'll tidy up the code a little and then use a regular expression to extract the data:
# -*- coding: utf-8 -*-
#---------------------------------------
#   Program:  Shandong University crawler
#   Version:  0.1
#   Author:   why
#   Date:     2013-07-12
#   Language: Python 2.7
#   Usage:    enter the student ID and password
#   Function: output the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib
import re

class SDU_Spider:
    # Declare the related attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'    # login URL
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre'  # URL of the grades page
        self.cookieJar = cookielib.CookieJar()    # CookieJar to handle the cookie information
        self.postdata = urllib.urlencode({'stuid': '201100300428', 'pwd': '921030'})  # POST data
        self.weights = []    # weights, i.e. the credits
        self.points = []     # points, i.e. the scores
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # Initialize the connection and obtain the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)    # build a custom request
        result = self.opener.open(myRequest)        # visit the login page to obtain the required cookie
        result = self.opener.open(self.resultUrl)   # visit the grades page to obtain the grade data
        # Print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.print_data(self.weights)
        self.print_data(self.points)

    # Extract the content from the page source
    def deal_data(self, myPage):
        # Capture the credits (5th <p> cell) and the score (7th <p> cell) of each row
        myItems = re.findall('<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>', myPage, re.S)
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # Print the extracted content
    def print_data(self, items):
        for item in items:
            print item

# Run the crawler
mySpider = SDU_Spider()
mySpider.sdu_init()
My skill is limited, and the regular expression is a bit ugly. The result of running it:
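The decode('gbk') call in the code is there because the page is served in GBK rather than UTF-8. As a small Python 3 illustration, the same Chinese text produces different bytes in the two encodings, so the bytes must be decoded with the encoding the server used. The sample text is an assumed example of a text grade:

```python
# Assumed sample text: a two-character text grade meaning "excellent".
grade_text = "优秀"

gbk_bytes = grade_text.encode("gbk")      # 2 bytes per character
utf8_bytes = grade_text.encode("utf-8")   # 3 bytes per character

# Decoding with the matching codec restores the original text.
decoded = gbk_bytes.decode("gbk")
print(len(gbk_bytes), len(utf8_bytes), decoded == grade_text)
```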
OK, what remains is just a matter of processing the data.
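One way to handle that processing is sketched below in Python 3. It is a variant under stated assumptions rather than the author's method: it parses with float() so that decimal scores are kept, skips entries that are not numbers, and returns None when nothing can be averaged. The function names and sample values are made up for illustration:

```python
def parse_number(text):
    """Return text as a float, or None if it is not a number."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def weighted_average(scores, credits):
    """Average the numeric scores, weighted by credits; skip the rest."""
    total_points = 0.0
    total_credits = 0.0
    for score_text, credit_text in zip(scores, credits):
        score = parse_number(score_text)
        credit = parse_number(credit_text)
        if score is None or credit is None:
            continue  # score not published yet, or given as a text grade
        total_points += score * credit
        total_credits += credit
    if total_credits == 0:
        return None  # nothing to average
    return total_points / total_credits


# Made-up sample data: one text grade is skipped.
result = weighted_average(["90", "Excellent", "85.5"], ["4", "2", "2"])
print(result)
```

A check based on str.isdigit() would also skip "85.5", because the decimal point is not a digit; whether that matters depends on how the site reports scores.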
9. Triumph and Return
The complete code is as follows, and with that a complete crawler project is done.
# -*- coding: utf-8 -*-
#---------------------------------------
#   Program:  Shandong University crawler
#   Version:  0.1
#   Author:   why
#   Date:     2013-07-12
#   Language: Python 2.7
#   Usage:    enter the student ID and password
#   Function: output the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib
import re

class SDU_Spider:
    # Declare the related attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'    # login URL
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre'  # URL of the grades page
        self.cookieJar = cookielib.CookieJar()    # CookieJar to handle the cookie information
        self.postdata = urllib.urlencode({'stuid': '201100300428', 'pwd': '921030'})  # POST data
        self.weights = []    # weights, i.e. the credits
        self.points = []     # points, i.e. the scores
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # Initialize the connection and obtain the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)    # build a custom request
        result = self.opener.open(myRequest)        # visit the login page to obtain the required cookie
        result = self.opener.open(self.resultUrl)   # visit the grades page to obtain the grade data
        # Print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.calculate_data()

    # Extract the content from the page source
    def deal_data(self, myPage):
        # Capture the credits (5th <p> cell) and the score (7th <p> cell) of each row
        myItems = re.findall('<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>', myPage, re.S)
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # Calculate the grade point; a course is skipped if its score has not
    # been published yet or is given as a text grade instead of a number
    def calculate_data(self):
        point = 0.0
        weight = 0.0
        for i in range(len(self.points)):
            if self.points[i].isdigit():
                point += float(self.points[i]) * float(self.weights[i])
                weight += float(self.weights[i])
        if weight > 0:
            print point / weight
        else:
            # Avoid dividing by zero when no numeric score was found
            print 'No numeric scores found'

# Run the crawler
mySpider = SDU_Spider()
mySpider.sdu_init()
That is the complete, detailed record of how this crawler came to be. Magical, isn't it? Ha, just kidding. Friends who need it can use it for reference and extend it freely.