Let's first take a look at our school's website:
http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html
To query your scores you have to log in; the site then lists the score of each course, but it only shows the raw scores, not the grade point, i.e. the credit-weighted average of the scores.
We first prepare the POST data and a CookieJar to receive the cookie, and then write the source code as follows:
# -*- coding: utf-8 -*-
# -------------------------------------
# Program: Shandong University crawler
# Version: 0.1
# Author: why
# Date: 2013-07-12
# Language: Python 2.7
# Operation: enter the student ID and password
# Function: output the weighted average of the scores, i.e. the grade point
# -------------------------------------

import urllib
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# the data to POST #
postdata = urllib.urlencode({
    'stuid': '000000',
    'pwd': '000000'
})

# customize a request #
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)

# access this link #
result = opener.open(req)

# print the returned content #
print result.read()
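As an aside, the header comment says the program should take the student ID and password as input, while the snippet above hardcodes '000000' for both. A minimal sketch of reading them from the keyboard instead, reusing the same 'stuid' and 'pwd' form fields (the prompts themselves are just illustrative):

# minimal sketch: read the credentials from the keyboard instead of hardcoding them
import urllib
import getpass

stuid = raw_input('Student ID: ')    # Python 2: raw_input returns a str
pwd = getpass.getpass('Password: ')  # read the password without echoing it
postdata = urllib.urlencode({'stuid': stuid, 'pwd': pwd})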
After that, let's look at the running effect:
7. Everything is ready
Now everything is in place, so we just need to feed the score link to the crawler and see whether it can fetch the score page.
From HttpFox we can see that a cookie is sent along with the request that returns the score information, so we use Python to simulate sending that cookie and request the scores:
# -*- coding: utf-8 -*-
# -------------------------------------
# Program: Shandong University crawler
# Version: 0.1
# Author: why
# Date: 2013-07-12
# Language: Python 2.7
# Operation: enter the student ID and password
# Function: output the weighted average of the scores, i.e. the grade point
# -------------------------------------

import urllib
import urllib2
import cookielib

# initialize a CookieJar to handle the cookie information #
cookie = cookielib.CookieJar()

# create a new opener that uses our CookieJar #
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# the data to POST #
postdata = urllib.urlencode({
    'stuid': '000000',
    'pwd': '000000'
})

# customize a request #
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)

# access this link #
result = opener.open(req)

# print the returned content #
print result.read()

# print the cookie values #
for item in cookie:
    print 'cookie: Name = ' + item.name
    print 'cookie: Value = ' + item.value

# access the score page; the opener re-sends the cookie automatically #
result = opener.open('http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre')

# print the returned content #
print result.read()
Press F5 to run it, then check the captured data:
Let's tidy up the code a little and then use a regular expression to extract the data:
# -*- coding: utf-8 -*-
# -------------------------------------
# Program: Shandong University crawler
# Version: 0.1
# Author: why
# Date: 2013-07-12
# Language: Python 2.7
# Operation: enter the student ID and password
# Function: output the weighted average of the scores, i.e. the grade point
# -------------------------------------

import urllib
import urllib2
import cookielib
import re

class SDU_Spider:
    # declare the related attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'    # login url
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre'  # score page url
        self.cookieJar = cookielib.CookieJar()  # initialize a CookieJar to handle the cookie information
        self.postdata = urllib.urlencode({'stuid': '000000', 'pwd': '000000'})  # the data to POST
        self.weights = []   # store the weights, i.e. the credits
        self.points = []    # store the scores
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # initialize the link and obtain the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)  # customize a request
        result = self.opener.open(myRequest)       # access the login page and obtain the required cookie
        result = self.opener.open(self.resultUrl)  # access the score page and obtain the score data
        # print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.print_data(self.weights)
        self.print_data(self.points)

    # extract the content from the page source
    def deal_data(self, myPage):
        # the pattern assumes the score table's markup: <TR> rows of <p> cells,
        # with the credit and the score in the two captured groups
        myItems = re.findall('<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>', myPage, re.S)  # obtain the credits and scores
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # print the extracted content
    def print_data(self, items):
        for item in items:
            print item

# call the spider
mySpider = SDU_Spider()
mySpider.sdu_init()
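To see concretely what this findall call returns, here is a small standalone sketch run against two made-up table rows; the HTML below is only an assumed, simplified stand-in for the real score page's markup:

# -*- coding: utf-8 -*-
# standalone sketch: how re.findall with re.S captures two groups per table row
# the sample HTML is an assumed, simplified stand-in for the score table
import re

samplePage = '''
<TR><p>1</p><p>0001</p><p>Data Structures</p><p>required</p><p>3.5</p><p>normal</p><p>91</p></TR>
<TR><p>2</p><p>0002</p><p>Operating Systems</p><p>required</p><p>2.0</p><p>normal</p><p>85</p></TR>
'''

items = re.findall('<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>', samplePage, re.S)
for credit, score in items:
    print 'credit =', credit, ' score =', score

Each tuple holds the two captured groups, which the spider stores as a credit and a score.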
My skills are limited, so the regular expression is a bit ugly. The running effect:
The complete code is as follows; with this, a full crawler project is done.
# -*- coding: utf-8 -*-
# -------------------------------------
# Program: Shandong University crawler
# Version: 0.1
# Author: why
# Date: 2013-07-12
# Language: Python 2.7
# Operation: enter the student ID and password
# Function: output the weighted average of the scores, i.e. the grade point
# -------------------------------------

import urllib
import urllib2
import cookielib
import re
import string

class SDU_Spider:
    # declare the related attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'    # login url
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre'  # score page url
        self.cookieJar = cookielib.CookieJar()  # initialize a CookieJar to handle the cookie information
        self.postdata = urllib.urlencode({'stuid': '000000', 'pwd': '000000'})  # the data to POST
        self.weights = []   # store the weights, i.e. the credits
        self.points = []    # store the scores
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # initialize the link and obtain the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)  # customize a request
        result = self.opener.open(myRequest)       # access the login page and obtain the required cookie
        result = self.opener.open(self.resultUrl)  # access the score page and obtain the score data
        # print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.calculate_date()

    # extract the content from the page source
    def deal_data(self, myPage):
        # the pattern assumes the score table's markup: <TR> rows of <p> cells,
        # with the credit and the score in the two captured groups
        myItems = re.findall('<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>', myPage, re.S)  # obtain the credits and scores
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # calculate the grade point; a course whose score is missing or is a grade
    # word such as "excellent" is not counted
    def calculate_date(self):
        point = 0.0
        weight = 0.0
        for i in range(len(self.points)):
            if self.points[i].isdigit():
                point += string.atof(self.points[i]) * string.atof(self.weights[i])
                weight += string.atof(self.weights[i])
        print point / weight

# call the spider
mySpider = SDU_Spider()
mySpider.sdu_init()
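To make the grade-point arithmetic concrete, here is a quick worked example with made-up numbers: a 3-credit course scored 90 and a 2-credit course scored 80 give (3.0*90 + 2.0*80) / (3.0 + 2.0) = 86.0.

# worked example of the credit-weighted average with made-up numbers
credits = [3.0, 2.0]
scores = [90.0, 80.0]

total = sum(c * s for c, s in zip(credits, scores))  # 3.0*90 + 2.0*80 = 430.0
print total / sum(credits)                           # 430.0 / 5.0 = 86.0

Note that calculate_date only counts a course when its score string passes isdigit(), so grade words such as "excellent" (or an empty score) are skipped, exactly as the comment in the code says.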