Writing a Python Crawler from Scratch: A Complete Record


Let's start with the website of our school:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html

Querying your results requires logging in; the site then shows your academic scores, but only the scores themselves, not the grade point, i.e. the weighted average.

Calculating the grade point by hand is obviously tedious, so let's write a Python crawler to solve the problem.

1. The Eve of Battle

First, prepare our tool: the HttpFox plugin.

This is an HTTP protocol analysis plugin that captures each page request and response, including timing, content, and the cookies the browser uses.

Taking my setup as an example, just install it in Firefox; the effect is as shown:

It gives a very intuitive view of the relevant information.

Click Start to begin capturing, Stop to pause capturing, and Clear to erase the captured content.

Before using it, click Stop and then Clear, so that what you see is only the data from the page you are about to visit.

2. Deep Behind Enemy Lines

Next, let's go to Shandong University's score query site and see exactly what information is sent at login time.

First, go to the login page, open HttpFox, click Clear, then click Start to begin capturing:

Enter your credentials, make sure HttpFox is capturing, and click OK to submit the form and log in.

At this point you can see that HttpFox has captured three requests:

Click the Stop button to freeze the capture; this is the data exchanged during login, which we will reproduce when we write the crawler.

3. Dissecting the Ox

At first glance we have three captures: two GETs and one POST. But what they contain and how they should be used, we don't yet know.

So, we need to look at what we've captured.

First, look at the POST request:


Since it is a POST, we can look directly at its PostData tab.

You can see that two fields were posted: stuid and pwd.

And from the Type column's "redirect to" you can see that after the POST completes, the browser jumps to the bks_login2.loginmessage page.

So this is the form data submitted when we clicked OK.
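Incidentally, these two fields map directly onto a Python dict. Here is a minimal sketch of building that same form body with urllib.urlencode; the values are the placeholders used in the full script later in this post:

import urllib

# The two fields observed in HttpFox's PostData tab
postdata = urllib.urlencode({
    'stuid': '201100300428',
    'pwd': '921030'
})
print postdata  # e.g. stuid=201100300428&pwd=921030 (dict order may vary)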

Click on the Cookie tab to see the cookie information:


Sure enough, a cookie holding the account session was received, set to be destroyed when the session ends.

So what did we receive back after submitting?

Let's look at the two GET requests that follow.

First, click the Content tab to see what was received. Any surprises? The HTML source code is laid bare:


It appears to be just the HTML source of the page. Click the Cookie tab to view the cookie information:



Aha, so the HTML page's content was received only after the cookie information was sent along with the request.

Let's take a look at the last message received:

At a glance it is just a CSS file called style.css, of little use to us.

4. Calm in the Face of Change

Now that we know what data we send to the server and what data we receive, the basic process is as follows:

First we POST the student ID and password, and the server returns the cookie's value ---> then we send the cookie back to the server and it returns the page content ---> finally we take the data from the score page, use regular expressions to separate out the scores and credits, and compute the weighted average.
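That last step is plain arithmetic: the grade point is the sum of credit times score, divided by the sum of credits. A minimal sketch with made-up numbers, assuming the credits and scores have already been parsed into floats:

# Hypothetical example values; the real lists come from the score page
credits = [3.0, 2.0, 4.0]
scores = [85.0, 90.0, 78.0]

# Weighted average: sum(credit * score) / sum(credit)
print sum(c * s for c, s in zip(credits, scores)) / sum(credits)  # prints 83.0

This is exactly what the calculate_date method at the end of this post computes.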

OK, it looks simple enough on paper. Let's try it out below.

But before the experiment, one problem remains: where exactly is the POST data sent?

Take another look at the original page:

It is clearly built with HTML frames, which means the address in the address bar is not the address the form on the right submits to.

So how do we get the real address? Right-click and view the page source:

There it is: the frame with name="w_right" is the login page we want.

The original address of the website is:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html

So the address of the real login form page should be:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/xk_login.html

Open it and, sure enough:


Huh, it is actually Tsinghua University's course selection system... Apparently my school couldn't be bothered to build its own page and borrowed this one directly, without even changing the title...

But this is still not the page we need, because the POST data goes to the URL given in the form's action attribute.

In other words, we need to check the source to see where the POST data is sent:


Right, this is the address the POST data is submitted to.

Expanded into a full address, it reads as follows:

http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login

(Getting it is simple: click the link directly in Firefox and look at the address it resolves to.)
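If you would rather do this discovery in code than with View Source, here is a hedged sketch using the same re module we rely on later. The attribute order assumed in these patterns (name before src, action inside the form tag) is a guess about the markup, so adjust it to what the source actually shows:

import re
import urllib2

# Find the src of the frame named w_right in the frameset page
page = urllib2.urlopen('http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html').read()
frame = re.search(r'name="w_right".*?src="(.*?)"', page, re.S)
if frame:
    print frame.group(1)  # expected to point at xk_login.html

# Find the form's action attribute, i.e. where the POST data goes
login = urllib2.urlopen('http://jwxt.sdu.edu.cn:7777/zhxt_bks/xk_login.html').read()
action = re.search(r'<form.*?action="(.*?)"', login, re.S | re.I)
if action:
    print action.group(1)  # expected to point at bks_login2.login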

5. A Small Test

The next task is to use Python to simulate sending the POST data and capture the returned cookie.

For cookies, take a look at this blog post:

http://www.jb51.net/article/57144.htm

We first prepare the POST data and a cookie jar to receive the cookie, then write the source code as follows:

# -*- coding: utf-8 -*-
#---------------------------------------
#   program: The crawler of Shandong University
#   version: 0.1
#   author: why
#   date: 2013-07-12
#   language: Python 2.7
#   operation: input the student ID and password
#   function: output the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib

# Initialize a CookieJar to handle the cookie information
cookie = cookielib.CookieJar()
# Create an opener that uses our CookieJar
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
# The data to POST
postdata = urllib.urlencode({
    'stuid': '201100300428',
    'pwd': '921030'
})
# Build a custom request
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)
# Open the link
result = opener.open(req)
# Print the returned content
print result.read()
Then run it and look at the result:


OK, just like that, we have successfully simulated the login.
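A small hedged aside: rather than eyeballing the printed HTML, you can replace the final print result.read() with a marker check. What the bks_login2.loginmessage page actually says on success or failure is an assumption here, so adjust the marker to what you see in your own output:

# Decode the response from GBK, as the later scripts in this post do
response = result.read().decode('gbk')
# Hypothetical failure marker; change it to match the real loginmessage text
if u'错误' in response:
    print 'login appears to have failed'
else:
    print 'login response received; the session cookie is now in the jar'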

6. Sneaking In

The next task is to use the crawler to fetch the score data.

Take a look at the source site again.

With HttpFox open, click through to view the scores, and find the following captured data:


Clicking on the first GET entry, you can see that its Content is exactly the score page we obtained.

As for the link to this page: right-click in the page and inspect the source to see which page the link jumps to (in Firefox, just right-click and choose "View this frame"):


This gives us the link for viewing the scores:

http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre

7. Everything Is Ready

Everything is ready now, so we just plug the link into the crawler and see whether it can fetch the score page.

As HttpFox showed, the score page is returned only when we send the cookie, so we use Python to simulate sending the cookie when requesting the scores:

# -*- coding: utf-8 -*-
#---------------------------------------
#   program: The crawler of Shandong University
#   version: 0.1
#   author: why
#   date: 2013-07-12
#   language: Python 2.7
#   operation: input the student ID and password
#   function: output the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib

# Initialize a CookieJar to handle the cookie information
cookie = cookielib.CookieJar()
# Create a new opener that uses our CookieJar
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
# The data to POST
postdata = urllib.urlencode({
    'stuid': '201100300428',
    'pwd': '921030'
})
# Build a custom request
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)
# Open the login link
result = opener.open(req)
# Print the returned content
print result.read()
# Print the cookie values
for item in cookie:
    print 'cookie: name = ' + item.name
    print 'cookie: value = ' + item.value

# Open the score link (the opener sends the stored cookie automatically)
result = opener.open('http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre')
# Print the returned content
print result.read()

Press F5 to run it, then look at the data we captured:


No problems so far; all that remains is to process the data with regular expressions and pull out the credits and their corresponding scores.

8. Easy Pickings

Such a big pile of HTML source is obviously unwieldy; next we use regular expressions to pull out the data we need.

For a tutorial on regular expressions take a look at this blog post:

http://www.jb51.net/article/57150.htm

Let's take a look at the score page's source code:


In that case, regular expressions are a breeze.
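Before tackling the real markup, here is a toy demonstration of the trick on a made-up miniature table: the rows are <TR> blocks, the cells are <p> elements, and a non-greedy pattern with re.S skips the cells we do not want while capturing the ones we do:

import re

# A made-up miniature of the score table: course name, credit, score
mypage = '''<TR>
<p>Data Structures</p><p>3.5</p><p>91</p>
</TR><TR>
<p>Operating Systems</p><p>4.0</p><p>88</p>
</TR>'''
# Skip the first cell, capture the second (credit) and third (score)
items = re.findall('<TR>.*?<p.*?>.*?</p>.*?<p.*?>(.*?)</p>.*?<p.*?>(.*?)</p>.*?</TR>', mypage, re.S)
print items  # [('3.5', '91'), ('4.0', '88')]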

We tidy up the code a little and then use a regular expression to extract the data:

# -*- coding: utf-8 -*-
#---------------------------------------
#   program: The crawler of Shandong University
#   version: 0.1
#   author: why
#   date: 2013-07-12
#   language: Python 2.7
#   operation: input the student ID and password
#   function: output the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib
import re

class SDU_Spider:
    # Declare related attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'    # login URL
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre'  # URL that displays the scores
        self.cookieJar = cookielib.CookieJar()  # initialize a CookieJar to handle the cookie information
        self.postdata = urllib.urlencode({'stuid': '201100300428', 'pwd': '921030'})  # POST data
        self.weights = []  # store the weights, i.e. the credits
        self.points = []   # store the scores, i.e. the grades
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # Initialize the link and fetch the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)  # build a custom request
        result = self.opener.open(myRequest)       # open the login page to obtain the required cookie
        result = self.opener.open(self.resultUrl)  # open the score page to fetch the score data
        # Print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.print_data(self.weights)
        self.print_data(self.points)

    # Pull the content out of the page source
    def deal_data(self, myPage):
        myItems = re.findall('<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>', myPage, re.S)  # grab the credit and score cells
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # Print the extracted content
    def print_data(self, items):
        for item in items:
            print item

# Invoke the spider
mySpider = SDU_Spider()
mySpider.sdu_init()

My regex skills are limited, so it is a bit ugly. The effect of running it is shown in the figure:

OK, the next thing is the data processing problem.

9. Triumphant Return

The complete code follows; with it, a full crawler project is done.

# -*- coding: utf-8 -*-
#---------------------------------------
#   program: The crawler of Shandong University
#   version: 0.1
#   author: why
#   date: 2013-07-12
#   language: Python 2.7
#   operation: input the student ID and password
#   function: output the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib
import re
import string

class SDU_Spider:
    # Declare related attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'    # login URL
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre'  # URL that displays the scores
        self.cookieJar = cookielib.CookieJar()  # initialize a CookieJar to handle the cookie information
        self.postdata = urllib.urlencode({'stuid': '201100300428', 'pwd': '921030'})  # POST data
        self.weights = []  # store the weights, i.e. the credits
        self.points = []   # store the scores, i.e. the grades
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # Initialize the link and fetch the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)  # build a custom request
        result = self.opener.open(myRequest)       # open the login page to obtain the required cookie
        result = self.opener.open(self.resultUrl)  # open the score page to fetch the score data
        # Print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.calculate_date()

    # Pull the content out of the page source
    def deal_data(self, myPage):
        myItems = re.findall('<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>', myPage, re.S)  # grab the credit and score cells
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # Calculate the grade point; scores that are not out yet, or are
    # letter grades such as "excellent" or "good", are skipped
    def calculate_date(self):
        point = 0.0
        weight = 0.0
        for i in range(len(self.points)):
            if self.points[i].isdigit():
                point += string.atof(self.points[i]) * string.atof(self.weights[i])
                weight += string.atof(self.weights[i])
        print point / weight

# Invoke the spider
mySpider = SDU_Spider()
mySpider.sdu_init()

The above is a detailed record of this crawler's birth from start to finish. Not so mysterious after all, is it? Ha, just kidding. Friends who need it are welcome to take it as a reference, and to extend it freely.
