[Python] web crawler (10): The whole process of the birth of a crawler (taking the grade point calculation of Shandong University as an example)

To query your grades you have to log in first; the site then shows the score for each course, but it only shows the raw scores, not the grade point, i.e. the weighted average. Let's look at our school's website:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html


First we prepare the POST data and a CookieJar to receive the cookie, then write the source code as follows:

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Shandong University crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-07-12
#   Language: Python 2.7
#   Operation: enter the student ID and password
#   Function: output the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# the data to POST #
postdata = urllib.urlencode({'stuid': '000000', 'pwd': '000000'})

# customize a request #
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)

# open this link #
result = opener.open(req)

# print the returned content #
print result.read()

After that, run it and check the output.
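A side note, not part of the original article: on Python 3 the urllib2 and cookielib modules were merged into urllib.request and http.cookiejar, so a minimal sketch of the same cookie-enabled login POST would look like this:

# minimal Python 3 sketch of the same cookie-enabled login POST;
# urllib2/cookielib became urllib.request/http.cookiejar in Python 3
import urllib.parse
import urllib.request
import http.cookiejar

cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))

# same placeholder credentials as the Python 2 version above
postdata = urllib.parse.urlencode({'stuid': '000000', 'pwd': '000000'}).encode('ascii')

req = urllib.request.Request(
    url='http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data=postdata
)
result = opener.open(req)
print(result.read().decode('gbk'))    # the site serves GBK-encoded pages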

7. Everything is ready

Now everything is ready; all that's left is to feed the score-page link to the crawler and see whether it can reach the score page.

From HttpFox we can see that a cookie was sent and the score information came back in response, so we use Python to simulate sending that cookie and request the score page:

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Shandong University crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-07-12
#   Language: Python 2.7
#   Operation: enter the student ID and password
#   Function: output the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib

# initialize a CookieJar to handle the cookie information #
cookie = cookielib.CookieJar()

# create a new opener that uses our CookieJar #
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# the data to POST #
postdata = urllib.urlencode({'stuid': '000000', 'pwd': '000000'})

# customize a request #
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)

# open this link #
result = opener.open(req)

# print the returned content #
print result.read()

# print the cookie values #
for item in cookie:
    print 'cookie: Name = ' + item.name
    print 'cookie: Value = ' + item.value

# open the score page #
result = opener.open('http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre')

# print the returned content #
print result.read()
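As another aside, the same log-in-then-fetch flow is much shorter with the third-party requests library (not what this article uses); its Session object stores and resends cookies automatically, which is exactly the job the CookieJar does above. A rough sketch:

# rough sketch using the third-party requests library (an alternative,
# not the article's method); a Session persists cookies automatically
import requests

s = requests.Session()
s.post('http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
       data={'stuid': '000000', 'pwd': '000000'})   # login sets the session cookie
r = s.get('http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre')
print r.content.decode('gbk')                       # score page, GBK-encoded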

Press F5 to run the script above, then check the captured data:

Now let's tidy the code up a little and use a regular expression to extract the data:

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Shandong University crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-07-12
#   Language: Python 2.7
#   Operation: enter the student ID and password
#   Function: output the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib
import re

class SDU_Spider:
    # declare the related attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'    # login url
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre'  # url that displays the scores
        self.cookieJar = cookielib.CookieJar()   # initialize a CookieJar to handle the cookie information
        self.postdata = urllib.urlencode({'stuid': '000000', 'pwd': '000000'})       # the data to POST
        self.weights = []   # store the weights, i.e. the credits
        self.points = []    # store the points, i.e. the scores
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # initialize the connection and obtain the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)  # customize a request
        result = self.opener.open(myRequest)       # visit the login page and obtain the required cookie
        result = self.opener.open(self.resultUrl)  # visit the score page and obtain the score data
        # print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.print_data(self.weights)
        self.print_data(self.points)

    # pick the content out of the page source
    def deal_data(self, myPage):
        # NOTE: the HTML tags inside this pattern were stripped when the article was
        # scraped; the pattern below is a reconstruction that captures the credit and
        # the score from the <p> cells of each <TR> in the score table
        myItems = re.findall('<TR>.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?>(.*?)</p>.*?</TR>', myPage, re.S)
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # print the extracted content
    def print_data(self, items):
        for item in items:
            print item

# invoke the spider
mySpider = SDU_Spider()
mySpider.sdu_init()
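To make the two capture groups concrete, here is a tiny self-contained check of the reconstructed pattern against a made-up table row; the real page source is not shown in the article, so the row layout here is only an assumption:

# tiny check of the reconstructed pattern against a hypothetical table row
# (the real page source is not shown in the article, so this row is made up)
import re

sample = '<TR><td><p>Course</p></td><td><p>2.0</p></td><td><p>91</p></td></TR>'
pattern = '<TR>.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?>(.*?)</p>.*?</TR>'
print re.findall(pattern, sample, re.S)   # prints [('2.0', '91')]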

My skill is limited, so the regular expression is a little ugly. It runs, though.
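If the regular expression feels too fragile, an HTML parser is the usual alternative. Below is a rough sketch of the same extraction with the third-party bs4 (BeautifulSoup) library, which the article does not use, under the same assumed row layout of name, credit and score in <p> cells:

# rough sketch with the third-party bs4 library (an alternative, not the
# article's method); assumes each <tr> holds name/credit/score in <p> cells
from bs4 import BeautifulSoup

def deal_data_soup(myPage):
    weights, points = [], []
    soup = BeautifulSoup(myPage, 'html.parser')
    for row in soup.find_all('tr'):
        cells = row.find_all('p')
        if len(cells) >= 3:                      # skip header or malformed rows
            weights.append(cells[1].get_text())  # credit
            points.append(cells[2].get_text())   # score
    return weights, points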

The complete code is as follows; with it, a complete crawler project is finished.

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Shandong University crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-07-12
#   Language: Python 2.7
#   Operation: enter the student ID and password
#   Function: output the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib
import re
import string

class SDU_Spider:
    # declare the related attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'    # login url
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre'  # url that displays the scores
        self.cookieJar = cookielib.CookieJar()   # initialize a CookieJar to handle the cookie information
        self.postdata = urllib.urlencode({'stuid': '000000', 'pwd': '000000'})       # the data to POST
        self.weights = []   # store the weights, i.e. the credits
        self.points = []    # store the points, i.e. the scores
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # initialize the connection and obtain the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)  # customize a request
        result = self.opener.open(myRequest)       # visit the login page and obtain the required cookie
        result = self.opener.open(self.resultUrl)  # visit the score page and obtain the score data
        # print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.calculate_date()

    # pick the content out of the page source
    def deal_data(self, myPage):
        # reconstructed pattern (the HTML tags were stripped in the scraped copy):
        # capture the credit and the score from the <p> cells of each <TR>
        myItems = re.findall('<TR>.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?>(.*?)</p>.*?</TR>', myPage, re.S)
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # calculate the grade point; if a score is not out yet, or is a non-numeric
    # grade such as "excellent", skip it
    def calculate_date(self):
        point = 0.0
        weight = 0.0
        for i in range(len(self.points)):
            if self.points[i].isdigit():
                point += string.atof(self.points[i]) * string.atof(self.weights[i])
                weight += string.atof(self.weights[i])
        print point / weight

# invoke the spider
mySpider = SDU_Spider()
mySpider.sdu_init()
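As a quick sanity check of the grade-point formula, here is the same loop run on made-up numbers; float() is used in place of string.atof(), which is a deprecated alias for the same conversion:

# quick sanity check of the weighted-average logic with made-up numbers;
# non-numeric grades such as 'excellent' are skipped by isdigit()
points  = ['91', '85', 'excellent']
weights = ['2.0', '3.0', '1.5']

point = 0.0
weight = 0.0
for i in range(len(points)):
    if points[i].isdigit():
        point += float(points[i]) * float(weights[i])
        weight += float(weights[i])
print point / weight   # (91*2.0 + 85*3.0) / (2.0 + 3.0) = 87.4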

