[Python] web crawler (10): the whole process of building a crawler, taking Shandong University's grade point query as an example

Source: Internet
Author: User

Let's talk about our school website:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html

To query your scores, you need to log in. The score for each subject is then displayed, but only the raw scores are shown, not the grade point, that is, the credit-weighted average score.

Obviously, calculating the grade point by hand is very troublesome, so we can write a Python crawler to solve this problem.



1. the eve of the decisive battle

Prepare the HttpFox plug-in.

This is an HTTP traffic analysis plug-in that captures the timing, content, and cookies of the requests and responses the browser makes for a page.

Take me as an example: it is installed on Firefox, where the corresponding information can be viewed intuitively.

Click start to begin capturing, stop to pause it, and clear to wipe the captured content.

Before using it, click stop to pause and then clear to empty the log, to make sure that only the data from visiting the current page is shown.



2. Going deep behind enemy lines

Next, let's go to Shandong University's score query site and see what information is sent when we log in.

Go to the login page, open HttpFox, click clear, and then click start to begin capturing:

Enter your personal information, make sure HttpFox is recording, and click OK to submit the form and log in.

At this point, we can see that HttpFox has detected three pieces of information:

Click stop to make sure what we have captured is the feedback from visiting the page, so that we can reproduce it later when simulating the login in our crawler.

3. Carving up the ox

At first glance, we have three pieces of data, a POST and two GETs among them, but we have no idea what they are or how to use them.

So we need to examine the captured content one piece at a time.

First look at the post information:


Since it is POST information, we can just look at the PostData tab.

We can see that two fields are POSTed: stuid and pwd.

The value "redirect to" of the type field indicates that the page jumps to bks_login2.loginmessage after the POST completes.

From this we can see that this is the form data submitted after clicking OK.
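As an aside, that form body is just standard URL-encoding of the two fields. In Python 3 it can be reproduced with urllib.parse (the credentials below are made up):

```python
from urllib.parse import urlencode

# hypothetical credentials; the real form fields are stuid and pwd
postdata = urlencode({'stuid': '201100000000', 'pwd': '123456'})
print(postdata)  # stuid=201100000000&pwd=123456
```

This produces exactly the string HttpFox shows in the PostData tab.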

Click the cookie tab to view the cookie information:

Sure enough, a cookie for the account was received, one that is destroyed automatically when the session ends.

What information is received after submission?

Let's take a look at the next two get data.

First, click the Content tab to see what was received. Isn't that a familiar sight? The page's HTML source code is laid bare:


It seems this is just the HTML source of the page. Click the Cookie tab to view the cookie information:


Aha, so the HTML page content is only returned after the cookie information has been sent.

Let's take a look at the last received information:

Roughly speaking, it is just a CSS file named style.css, which is of little use to us.




4. Calmly fight

Now that we know what data we send to the server and what data we receive back, the basic process is as follows:

  • First, we POST the student ID and password ---> the server returns the cookie value.
  • Then we send the cookie to the server ---> the server returns the score page.
  • Finally, we extract the scores and credits from the score page with a regular expression and compute the weighted average.

OK, it looks simple enough. Let's try it out.
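The weighted average in the last step is just the credit-weighted mean, sum(score × credit) / sum(credit). A minimal sketch (the sample grades are invented):

```python
def weighted_average(scores, credits):
    # grade point: sum(score * credit) / sum(credit)
    return sum(s * c for s, c in zip(scores, credits)) / sum(credits)

# e.g. a 90 in a 4-credit course and an 80 in a 2-credit course
print(weighted_average([90, 80], [4, 2]))  # ≈ 86.67
```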

However, before experimenting, there is one unresolved question: where does the POST data actually get sent?

Let's take a look at the original page:

Obviously, the page is built with HTML frames; that is, the address we see in the address bar is not the address of the form submitted on the right.

So how do we get the real address? Right-click to view the page source:

Well, the frame with name="w_right" is the login page we want.

The original website address is:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html

Therefore, the actual form submission address should be:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/xk_login.html

Entering that address gives the following result:

The course selection system of Tsinghua University... By the looks of it, my school was too lazy to build its own page and simply borrowed this one, without even changing the title...

However, this page is still not the one we need, because the page the POST data is submitted to is the one given in the form's action attribute.

In other words, we need to check the source code to find out where the POST data is actually sent:


Well, this is the address for submitting post data.

In the address bar, the complete address should be as follows:

http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login

(Getting it is simple: in Firefox, just click the link to view the link address.)
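Instead of reading the source by eye, the form's action can also be pulled out programmatically. A sketch using Python 3's html.parser (the sample markup below is hypothetical, shaped like the login form described above):

```python
from html.parser import HTMLParser

class FormActionFinder(HTMLParser):
    """Collect the action attribute of every <form> tag encountered."""
    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            d = dict(attrs)
            if "action" in d:
                self.actions.append(d["action"])

# hypothetical snippet of the login frame's source
sample = '<form method="post" action="/pls/wwwbks/bks_login2.login">...</form>'
p = FormActionFinder()
p.feed(sample)
print(p.actions[0])  # /pls/wwwbks/bks_login2.login
```

Joining that relative action onto the site root gives the full submission address.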


5. Test the knife

The next task is to use python to simulate sending a post data and obtain the returned cookie value.

For more information about Cookie operations, see this blog:

http://blog.csdn.net/wxg694175346/article/details/8925978

We first prepare the POST data, then prepare a CookieJar to receive the cookie, and then write the source code as follows:

# -*- coding: utf-8 -*-
# ---------------------------------------
#   program: Shandong University crawler
#   version: 0.1
#   author: why
#   date: 2013-07-12
#   language: Python 2.7
#   operation: enter the student ID and password
#   function: output the weighted average of the scores, i.e. the grade point
# ---------------------------------------

import urllib
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# the data to POST
postdata = urllib.urlencode({
    'stuid': '000000',
    'pwd': '000000'
})

# build a custom request
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)

# open the link
result = opener.open(req)

# print the returned content
print result.read()

After that, let's look at the effect of running it:

OK, in this way, we have successfully simulated the login.
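For reference, the same cookie-handling setup moved to Python 3 lives in urllib.request and http.cookiejar. A sketch under that assumption (the request is only built, not sent, since the server from this 2013 article is no longer reachable, and the credentials are placeholders):

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode

# a CookieJar plus an opener that stores any cookies the server sets
cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))

# the POST body; in Python 3 the data argument must be bytes
postdata = urlencode({'stuid': '000000', 'pwd': '000000'}).encode('ascii')
req = urllib.request.Request(
    url='http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data=postdata,
)

# opener.open(req) would perform the POST and capture the session cookie
```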


6. Days for stealing

The next task is to use crawlers to obtain students' scores.

Let's take a look at the source website.

After enabling HttpFox, click to view the scores, and the following data is captured:


Click the first GET data and view its Content; you can see that it is precisely the content of the score page.


For the page link obtained, right-click to view the source (in Firefox you only need to right-click inside the frame and choose "view frame source"), and you can see the page that the link jumps to:


The link to view the score is as follows:

http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre


7. Everything is ready

Now everything is ready; all that's left is to apply the link in the crawler and see whether the score page can be viewed.

From HttpFox we can see that we sent a cookie and got the score information back, so we use Python to simulate sending the cookie and requesting the score information:

# -*- coding: utf-8 -*-
# ---------------------------------------
#   program: Shandong University crawler
#   version: 0.1
#   author: why
#   date: 2013-07-12
#   language: Python 2.7
#   operation: enter the student ID and password
#   function: output the weighted average of the scores, i.e. the grade point
# ---------------------------------------

import urllib
import urllib2
import cookielib

# initialize a CookieJar to handle cookie information
cookie = cookielib.CookieJar()
# build a new opener that uses our CookieJar
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
# the data to POST
postdata = urllib.urlencode({'stuid': '000000', 'pwd': '000000'})
# build a custom request
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)
# open the login link
result = opener.open(req)
# print the returned content
print result.read()
# print the cookie values
for item in cookie:
    print 'cookie: name = ' + item.name
    print 'cookie: value = ' + item.value
# open the score link
result = opener.open('http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre')
# print the returned content
print result.read()

Press F5 to run it. Check the captured data:


No problem there. Next, process the data a little with regular expressions to extract the credits and the corresponding scores.

8. Hand to hand

Such a large chunk of HTML source is obviously not convenient for us to process; next, we use regular expressions to extract the necessary data.

For more information about regular expressions, see this blog:

http://blog.csdn.net/wxg694175346/article/details/8929576

Let's take a look at the source code of the score rows:

In that form, it is easy to handle with a regular expression.
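To see how the pattern in the crawler behaves, here it is run on a hypothetical score row (the row markup is invented, shaped so that the credit sits in the 5th &lt;p&gt; cell and the score in the 7th):

```python
import re

# a hypothetical score row; the real rows came from a screenshot not
# reproduced here
row = ('<tr><td><p>2011001</p></td><td><p>Data Structures</p></td>'
       '<td><p>required</p></td><td><p>exam</p></td>'
       '<td><p>3.5</p></td><td><p>normal</p></td><td><p>89</p></td></tr>')

# skip to the 5th <p> and capture its text (the credit), then skip two
# more <p> tags and capture the 7th (the score)
pattern = '<tr>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</tr>'
print(re.findall(pattern, row, re.S))  # [('3.5', '89')]
```

re.S makes `.` match newlines too, which matters when each cell of the real page sits on its own line.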


We tidy the code up a little and then use the regular expression to extract the data:

# -*- coding: utf-8 -*-
# ---------------------------------------
#   program: Shandong University crawler
#   version: 0.1
#   author: why
#   date: 2013-07-12
#   language: Python 2.7
#   operation: enter the student ID and password
#   function: output the weighted average of the scores, i.e. the grade point
# ---------------------------------------

import urllib
import urllib2
import cookielib
import re

class SDU_Spider:
    # declare related attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'    # login URL
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre'  # score page URL
        self.cookieJar = cookielib.CookieJar()  # initialize a CookieJar to handle cookie information
        self.postdata = urllib.urlencode({'stuid': '000000', 'pwd': '000000'})  # the data to POST
        self.weights = []  # store the weights, i.e. the credits
        self.points = []   # store the scores
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # initialize the link and obtain the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)  # build a custom request
        result = self.opener.open(myRequest)       # visit the login page and obtain the required cookie
        result = self.opener.open(self.resultUrl)  # visit the score page and obtain the score data
        # print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.print_data(self.weights)
        self.print_data(self.points)

    # extract the content from the page source
    def deal_data(self, myPage):
        # the credit is in the 5th <p> cell of each row, the score in the 7th
        myItems = re.findall('<tr>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</tr>', myPage, re.S)
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # print the extracted content
    def print_data(self, items):
        for item in items:
            print item

# call the spider
mySpider = SDU_Spider()
mySpider.sdu_init()

My skill is limited and the regular expression is a little ugly. The running effect:

OK, what remains is just the data processing problem...




9. Triumph

The complete code is as follows. Now a complete crawler project is complete.

# -*- coding: utf-8 -*-
# ---------------------------------------
#   program: Shandong University crawler
#   version: 0.1
#   author: why
#   date: 2013-07-12
#   language: Python 2.7
#   operation: enter the student ID and password
#   function: output the weighted average of the scores, i.e. the grade point
# ---------------------------------------

import urllib
import urllib2
import cookielib
import re
import string

class SDU_Spider:
    # declare related attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'    # login URL
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre'  # score page URL
        self.cookieJar = cookielib.CookieJar()  # initialize a CookieJar to handle cookie information
        self.postdata = urllib.urlencode({'stuid': '000000', 'pwd': '000000'})  # the data to POST
        self.weights = []  # store the weights, i.e. the credits
        self.points = []   # store the scores
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # initialize the link and obtain the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)  # build a custom request
        result = self.opener.open(myRequest)       # visit the login page and obtain the required cookie
        result = self.opener.open(self.resultUrl)  # visit the score page and obtain the score data
        # print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.calculate_date()

    # extract the content from the page source
    def deal_data(self, myPage):
        # the credit is in the 5th <p> cell of each row, the score in the 7th
        myItems = re.findall('<tr>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</tr>', myPage, re.S)
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # calculate the grade point; scores that are not yet published, or that
    # are graded as "excellent" and so on, are not numeric and are skipped
    def calculate_date(self):
        point = 0.0
        weight = 0.0
        for i in range(len(self.points)):
            if self.points[i].isdigit():
                point += string.atof(self.points[i]) * string.atof(self.weights[i])
                weight += string.atof(self.weights[i])
        print point / weight

# call the spider
mySpider = SDU_Spider()
mySpider.sdu_init()
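The filtering idea in calculate_date, i.e. only numeric scores contribute to the weighted average, can be sketched in modern Python 3 like this (the sample grades are invented):

```python
def grade_point(points, weights):
    # skip entries whose score is not a plain number (unpublished grades,
    # or grades recorded as "excellent" and so on), as calculate_date does
    total, credits = 0.0, 0.0
    for p, w in zip(points, weights):
        if p.isdigit():
            total += float(p) * float(w)
            credits += float(w)
    return total / credits

print(grade_point(['90', 'excellent', '80'], ['4', '2', '2']))  # ≈ 86.67
```

The 2-credit "excellent" entry is excluded from both the numerator and the denominator, so it neither raises nor lowers the average.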

