A full record of writing a Python crawler from scratch


Let's start with our school's website:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html

To query your scores, you have to log in first; the system then displays the score for each course, but it shows only the raw scores, not the grade point, that is, the weighted average.

Calculating the grade point by hand is obviously tedious, so we can write a Python crawler to solve the problem.

1. The eve of the decisive battle

Prepare the HttpFox plug-in.

This is an HTTP analysis plug-in that captures the timing, content, and cookies of the browser's page requests and responses.

Taking myself as an example, I installed it on Firefox, where the captured information can be viewed quite intuitively.

Click start to begin capturing, stop to pause, and clear to empty the captured content.

Before using it, click stop to pause and then clear to empty the screen, to make sure the data shown comes only from visiting the current page.

2. Deep behind enemy lines

Next, let's go to Shandong University's score query site and see what information is sent when logging in.

Go to the login page, open HttpFox, click clear, and then click start to begin capturing:

After entering your personal information, make sure HttpFox is still capturing, then click confirm to submit and log in.

At this time, we can see that httpfox has detected three pieces of information:

Click stop at this point, so that what we have captured is the feedback data from visiting the page, which lets us simulate the login later in our crawler.

3. Dissecting the ox

At first glance we have captured three pieces of data, two GETs and one POST, but what exactly they are and how to use them is still unknown.

Therefore, we need to check the captured content one by one.

First look at the POST information:


Since it is a POST, we can just look at its PostData.

We can see that two fields are POSTed: stuid and pwd.

The "Redirect to" shown in the Type row indicates that after the POST completes, the browser is redirected to the bks_login2.loginmessage page.

From this we can see that this is the form data submitted when we click OK.
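As a quick illustration of what that form body looks like once URL-encoded in Python (the field names stuid and pwd come from the capture above; the values are placeholders):

# -*- coding: utf-8 -*-
import urllib

# Encode the two captured form fields; the values here are placeholders.
postdata = urllib.urlencode({'stuid': '123', 'pwd': '123'})
print postdata   # prints something like: stuid=123&pwd=123 (field order may vary)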

Click the cookie tab to view the cookie information:


Yes, a cookie named ACCOUNT is received, and it is destroyed automatically when the session ends.

What information is received after submission?

Let's look at the two GET requests next.

First, click the content tab to view the received content. Doesn't it feel satisfying? The HTML source code is laid bare:


It seems this is just the HTML source of the page. Click the cookie tab to view the cookie information:



Aha, so the HTML page content is only received after the cookie information is sent.

Let's take a look at the last received information:

Generally speaking, it is just a CSS file named style.css, which is of little use to us.

4. Fighting calmly

Now that we know what data we send to the server and what data we receive back, the basic flow is as follows:

First POST the student ID and password ---> the server returns the cookie value ---> send the cookie back to the server ---> the server returns the page ---> grab the data from the score page, use a regular expression to separate the scores and credits, and compute the weighted average.

OK, it looks like a very simple program. Let's give it a try.

Before experimenting, though, one problem remains unsolved: where exactly does the POST data get sent?

Let's take a look at the original page:

Clearly the page is built with an HTML frameset, which means the address we see in the address bar is not the address of the form on the right.

So how do we get the real address? Right-click and view the page source:

Well, the frame with name="w_right" is the login page we want.

The original site address is:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html

So the address of the page that actually contains the form should be:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/xk_login.html

Entering that address, here is what we see:


Tsinghua University's course selection system... Apparently my school was too lazy to build its own page and simply borrowed this one, without even changing the title...

However, this is still not the page we need, because the POST data goes to the page named in the form's ACTION attribute.

That is to say, we need to check the source code to know where the POST data is actually sent:


Well, this is the address for submitting POST data.

Written out as a complete address, it is:

http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login

(Getting it is simple: just click the link in Firefox and look at the link address.)
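As an aside, if you would rather pull the form target out programmatically than read the source by eye, a minimal sketch along these lines would work; it assumes the page uses an ordinary double-quoted action attribute, and the regex is illustrative rather than a robust HTML parser:

# -*- coding: utf-8 -*-
import re
import urllib2

# Fetch the login frame found above and pull out the form's action attribute.
# Assumes a double-quoted action attribute; a real HTML parser would be safer.
page = urllib2.urlopen('http://jwxt.sdu.edu.cn:7777/zhxt_bks/xk_login.html').read()
match = re.search(r'<form[^>]*action="([^"]+)"', page, re.I)
if match:
    print match.group(1)  # the form target, possibly relative to the page URL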

5. Testing the knife

Our next task is to use Python to simulate sending the POST data and obtain the returned cookie value.

For more information about cookie operations, see this blog:

http://www.bkjia.com/article/57144.htm

We first prepare the POST data and a cookie jar to receive the cookie, then write the source code as follows:

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Shandong University crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-07-12
#   Language: Python 2.7
#   Usage: enter the student ID and password
#   Function: prints the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# The data to POST #
postdata = urllib.urlencode({
    'stuid': '123',
    'pwd': '123'
})

# Build a custom request #
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)

# Open the link #
result = opener.open(req)

# Print the returned content #
print result.read()

After that, let's look at the running effect:


OK, so we can now simulate a successful login.

6. Sleight of hand

The next task is to use the crawler to fetch the students' scores.

Let's take a look at the source website.

With HttpFox capturing, click to view the scores; the following data is captured:


Click the first GET entry and view its Content; you can see that it is the content of the score page.

As for the link to that page, right-click to view the page source and you can find the page that the click jumps to (in Firefox, just right-click and choose "View This Frame"):


The link for viewing the scores is:

http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre

7. Everything is ready

Now everything is ready, so we just plug this link into the crawler and see whether it can fetch the score page.

HttpFox shows that we sent a cookie and got the score information back, so we use Python to simulate sending that cookie and request the scores:

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Shandong University crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-07-12
#   Language: Python 2.7
#   Usage: enter the student ID and password
#   Function: prints the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib

# Initialize a CookieJar to handle cookie information #
cookie = cookielib.CookieJar()

# Create an opener that uses the CookieJar #
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# The data to POST #
postdata = urllib.urlencode({
    'stuid': '123',
    'pwd': '123'
})

# Build a custom request #
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)

# Open the login link #
result = opener.open(req)

# Print the returned content #
print result.read()

# Print the cookie values
for item in cookie:
    print 'cookie: name = ' + item.name
    print 'cookie: value = ' + item.value

# Open the score link; the opener resends the cookie automatically #
result = opener.open('http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre')

# Print the returned content #
print result.read()

Press F5 to run it. Check the captured data:


No problems so far. Next we process the data with a regular expression and pull out the credits and the corresponding scores.

8. Hand-to-hand combat

Such a large chunk of HTML source is obviously inconvenient to work with, so below we use regular expressions to extract the data we need.

For more information about regular expressions, see this blog:

http://www.bkjia.com/article/57150.htm

Let's take a look at the source code of the score section:


In that case, regular expressions handle it easily.

We tidy the code up a little and use a regular expression to pull out the data:

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Shandong University crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-07-12
#   Language: Python 2.7
#   Usage: enter the student ID and password
#   Function: prints the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib
import re

class SDU_Spider:
    # Declare the related attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'    # login url
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre'  # score page url
        self.cookieJar = cookielib.CookieJar()  # initialize a CookieJar to handle cookie information
        self.postdata = urllib.urlencode({'stuid': '000000', 'pwd': '000000'})  # POST data
        self.weights = []  # stores the weights, i.e. the credits
        self.points = []   # stores the points, i.e. the scores
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # Initialize the links and obtain the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)  # build a custom request
        result = self.opener.open(myRequest)       # open the login page to obtain the required cookie
        result = self.opener.open(self.resultUrl)  # open the score page to obtain the score data
        # Print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.print_data(self.weights)
        self.print_data(self.points)

    # Extract the credits and scores from the page source
    def deal_data(self, myPage):
        myItems = re.findall('<TR>.*?<P.*?<P.*?<P.*?<P.*?<P.*?>(.*?)</P>.*?<P.*?<P.*?>(.*?)</P>.*?</TR>', myPage, re.S)
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # Print the extracted content
    def print_data(self, items):
        for item in items:
            print item

# Run the spider
mySpider = SDU_Spider()
mySpider.sdu_init()

My skills are limited, so the regular expression is a bit ugly. The running effect:

OK, all that remains is the data processing.

9. Triumph

The complete code is as follows; with this, a complete crawler project is finished.

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Shandong University crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-07-12
#   Language: Python 2.7
#   Usage: enter the student ID and password
#   Function: prints the weighted average of the scores, i.e. the grade point
#---------------------------------------

import urllib
import urllib2
import cookielib
import re
import string

class SDU_Spider:
    # Declare the related attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'    # login url
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre'  # score page url
        self.cookieJar = cookielib.CookieJar()  # initialize a CookieJar to handle cookie information
        self.postdata = urllib.urlencode({'stuid': '000000', 'pwd': '000000'})  # POST data
        self.weights = []  # stores the weights, i.e. the credits
        self.points = []   # stores the points, i.e. the scores
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # Initialize the links and obtain the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)  # build a custom request
        result = self.opener.open(myRequest)       # open the login page to obtain the required cookie
        result = self.opener.open(self.resultUrl)  # open the score page to obtain the score data
        # Print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.calculate_date()

    # Extract the credits and scores from the page source
    def deal_data(self, myPage):
        myItems = re.findall('<TR>.*?<P.*?<P.*?<P.*?<P.*?<P.*?>(.*?)</P>.*?<P.*?<P.*?>(.*?)</P>.*?</TR>', myPage, re.S)
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # Compute the grade point; entries whose score is blank or non-numeric (e.g. "excellent") are skipped
    def calculate_date(self):
        point = 0.0
        weight = 0.0
        for i in range(len(self.points)):
            if self.points[i].isdigit():
                point += string.atof(self.points[i]) * string.atof(self.weights[i])
                weight += string.atof(self.weights[i])
        print point / weight

# Run the spider
mySpider = SDU_Spider()
mySpider.sdu_init()
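As a quick sanity check on calculate_date: with made-up credits of 3 and 2 and scores of 80 and 90, the grade point comes out to (3 * 80 + 2 * 90) / (3 + 2) = 420 / 5 = 84.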

The above is a detailed record of the whole process of this crawler's birth. Doesn't it feel somehow magical? Haha, just kidding. Friends who need it are welcome to use it as a reference and extend it freely.


How to use Python to write crawler programs

A detailed introduction can be found here:

blog.csdn.net/column/details/why-bug.html

In the Scrapy framework, how do you use Python to make a crawler automatically follow to the next page and capture its content?

The crawler can follow the next page by constructing a request for the next-page link and yielding it. For example:

item1 = Item()
yield item1
item2 = Item()
yield item2
req = Request(url='next page link', callback=self.parse)
yield req
Note: when using yield, do not use a return statement in the same method.
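For context, here is a minimal sketch of a complete Scrapy spider whose parse method does exactly that: it yields items and then a Request for the next page. The spider name, start URL, item field, and XPath selectors are all hypothetical placeholders, and it assumes Scrapy 1.0 or later (for extract_first and response.urljoin); it is an illustration, not the answerer's exact code.

from scrapy import Spider, Request
from scrapy.item import Item, Field

class PageItem(Item):
    # A hypothetical one-field item, just for illustration.
    text = Field()

class FollowSpider(Spider):
    name = 'follow'                              # hypothetical spider name
    start_urls = ['http://example.com/page/1']   # hypothetical start page

    def parse(self, response):
        # Yield an item scraped from the current page.
        item = PageItem()
        item['text'] = response.xpath('//title/text()').extract_first()
        yield item
        # Then yield a Request for the next page, handled by this same method.
        next_href = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_href:
            yield Request(response.urljoin(next_href), callback=self.parse)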

