Teaching Yourself Web Crawler Development with Python 3 from Scratch (IV): Login
Source: Jecvay Notes (@Jecvay)
Today's work is interesting: we use Python to log in to a website, use cookies to record the login state, and then crawl information that is only visible after logging in. Today we use Zhihu for the demonstration. Why Zhihu? It's hard to explain, but it is certainly a site successful enough that I don't need to advertise for it at all. Zhihu's login is relatively simple: the username and password are transmitted without encryption. Yet it is still representative, because there is a mandatory flow of jumping from the home page to the login.
I have to say, Fiddler is a piece of software that Tpircsboy told me about. Thanks to him for bringing me such a fun tool.
Step One: Observe browser behavior with Fiddler
Run the browser with Fiddler capturing, enter the URL http://www.zhihu.com, and you can see the captured connection information in Fiddler. Select a 200 connection on the left and open the Inspectors view on the right: the top half shows the request message for that connection, and the bottom half shows the response message.
The Raw tab displays the raw text of the message. The response message at the bottom may not have been decompressed or decoded yet; in that case a small hint appears in the middle, and clicking it decodes the message and shows the original text.
This is what happens when we visit http://www.zhihu.com without being logged in. Now let's enter a username and password and log in to Zhihu, then see what passes between the browser and the server.
After clicking Login, go back to Fiddler and you will see a new 200 connection. Our browser carried my account and password to the server in a POST, whose content is as follows:
POST http://www.zhihu.com/login HTTP/1.1
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Accept: */*
X-Requested-With: XMLHttpRequest
Referer: http://www.zhihu.com/#signin
Accept-Language: en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (Windows NT 6.4; WOW64; Trident/7.0; rv:11.0) like Gecko
Content-Length: 97
DNT: 1
Host: www.zhihu.com
Connection: Keep-Alive
Pragma: no-cache
Cookie: __utma=51854390.1539896551.1412320246.1412320246.1412320246.1; __utmb=51854390.6.10.1412320246; __utmc=51854390; __utmz=51854390.1412320246.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmv=51854390.000--|3=entry_date=20141003=1

_xsrf=4b41f6c7a9668187ccd8a610065b9718&email=(blacked out)%40gmail.com&password=(blacked out)&rememberme=y
As shown above, my browser sent a POST to http://www.zhihu.com/login (note the extra /login); the body contains a username, a password, and a "rememberme" set to yes. Fiddler's WebForms tab lists the POSTed fields in a more organized way. So we can use Python to send the same content and log in. But there is one entry named _xsrf, whose value here is 4b41f6c7a9668187ccd8a610065b9718. We need to obtain that value before we can send the POST.
How does the browser get it? We first visited http://www.zhihu.com/, the home page, and only then did the browser send the login message to http://www.zhihu.com/login. Thinking about this like a detective, you will conclude that the home page must have generated the _xsrf and sent it to us, and we then send that _xsrf back to /login. Indeed, the home-page HTML contains a hidden form field of the form name="_xsrf" value="...". So in a moment we will look for _xsrf in the response to that first GET.
The response to the login POST shows not only that we logged in successfully, but also how the server tells our browser to save the cookie information it hands out. So we will also have Python record these cookies.
With that, the Fiddler work is basically done!
Step Two: Decompress
I simply wrote a GET program to fetch the Zhihu home page, then called decode() on it, and got an error. A closer look revealed that the data handed to us is gzip-compressed, so we need to decompress it first. Decompressing gzip in Python is easy, because the standard library has it built in. The code snippet is as follows:
import gzip

def ungzip(data):
    try:                      # try to decompress
        print('Decompressing...')
        data = gzip.decompress(data)
        print('Decompression done!')
    except OSError:           # not gzip data, return it untouched
        print('Not compressed, nothing to decompress')
    return data
Data read back through the opener's read() is first run through ungzip(), which handles it automatically; after that, decode() gives us the decoded str.
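Putting those two pieces together looks like this (a small sketch; opener here stands for the opener we build in a later step):

op = opener.open('http://www.zhihu.com/')
data = op.read()               # raw bytes, possibly gzip-compressed
html = ungzip(data).decode()   # now a str we can search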
Step Three: Use regular expressions to get the ship of the desert
The value of the key _xsrf guides us through the vast desert of the Internet toward the correct way to log in to Zhihu, so _xsrf may well be called our ship of the desert. Without _xsrf, even with the right username and password we may not be able to log in (I have not tried this on Zhihu, but it is true of my school's academic system). As mentioned above, we can get this ship of the desert from the HTML in the response to the first GET. The following function does exactly that; the returned str is the value of _xsrf.
import re

def getxsrf(data):
    cer = re.compile('name=\"_xsrf\" value=\"(.*)\"', flags=0)
    strlist = cer.findall(data)
    return strlist[0]
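The regex above is greedy and somewhat brittle. A sturdier alternative (my own sketch, not from the original article) is the standard library's html.parser, which finds the hidden input field no matter how its attributes are ordered:

from html.parser import HTMLParser

class XsrfParser(HTMLParser):
    # Collects the value of the hidden <input name="_xsrf"> field.
    def __init__(self):
        super().__init__()
        self.xsrf = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('name') == '_xsrf':
            self.xsrf = attrs.get('value')

# Usage: parser = XsrfParser(); parser.feed(html); then read parser.xsrf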
Step Four: Fire the POST!!
With the three magic weapons _xsrf, ID, and password in hand, we can fire the POST. Once the POST is sent, we are logged in on the server, and the server sends us cookies. Handling cookies by hand would be a hassle, but Python's http.cookiejar library gives us a convenient solution: as long as an HTTPCookieProcessor is put in when the opener is created, the cookie business is no longer something we have to worry about. The following code shows this.
import http.cookiejar
import urllib.request

def getopener(head):
    # deal with the cookies
    cj = http.cookiejar.CookieJar()
    pro = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(pro)
    header = []
    for key, value in head.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header
    return opener
The getopener function receives a head parameter, which is a dictionary. The function converts the dictionary into a list of tuples and installs it on the opener. The opener we built therefore has two major capabilities (see the short usage sketch after this list):
Automatically handles any cookies encountered while the opener is in use
Automatically adds our custom headers to every GET or POST request it issues
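A minimal usage sketch (the header dict here is just a stand-in; the full header we actually send appears in the complete listing at the end):

opener = getopener({'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate'})
resp = opener.open('http://www.zhihu.com/')
# Any Set-Cookie in this response is now stored in the CookieJar and will be
# sent back automatically on the next request made through this opener.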
Step Five: The real run
The real run still needs one more bit of work: we have to get the POST data into a format that opener.open() supports. For that we use the urlencode() function from the urllib.parse library, which converts a dictionary or a collection of tuples into a "&"-joined str.
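For instance (a quick illustration; the values are made up):

import urllib.parse
urllib.parse.urlencode({'email': 'someone@gmail.com', 'rememberme': 'y'})
# -> 'email=someone%40gmail.com&rememberme=y'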
That str is still not usable: it must be turned into bytes by encode() before it can serve as the POST data parameter of opener.open() or urlopen(). The code is as follows:
url = 'http://www.zhihu.com/'
opener = getopener(header)
op = opener.open(url)
data = op.read()
data = ungzip(data)           # decompress
_xsrf = getxsrf(data.decode())

url += 'login'
id = 'fill in your account here'
password = 'fill in your password here'
postdict = {
    '_xsrf': _xsrf,
    'email': id,
    'password': password,
    'rememberme': 'y'
}
postdata = urllib.parse.urlencode(postdict).encode()
op = opener.open(url, postdata)
data = op.read()
data = ungzip(data)
print(data.decode())          # process the crawled data however you like!
After the code runs, we find that the feed of the people we follow (the content that appears on the home page after logging in) has been crawled back. The next step could be a statistical analyzer, an auto-reposter, or an automatic content classifier; a small sketch of the first idea follows.
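Purely as my own sketch (not from the original article), here is a crude word count over the link texts of the logged-in home page; data is the decompressed response bytes from the code above:

import re
from collections import Counter

def link_texts(html):
    # Grab the visible text of every <a ...>text</a> pair; crude, but enough
    # for a first frequency count over the page.
    return re.findall(r'<a[^>]*>([^<]+)</a>', html)

print(Counter(link_texts(data.decode())).most_common(10))

And finally, the complete program: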
import gzip
import re
import http.cookiejar
import urllib.request
import urllib.parse

def ungzip(data):
    try:                      # try to decompress
        print('Decompressing...')
        data = gzip.decompress(data)
        print('Decompression done!')
    except OSError:           # not gzip data, return it untouched
        print('Not compressed, nothing to decompress')
    return data

def getxsrf(data):
    cer = re.compile('name=\"_xsrf\" value=\"(.*)\"', flags=0)
    strlist = cer.findall(data)
    return strlist[0]

def getopener(head):
    # deal with the cookies
    cj = http.cookiejar.CookieJar()
    pro = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(pro)
    header = []
    for key, value in head.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header
    return opener

header = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html, application/xhtml+xml, */*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Accept-Encoding': 'gzip, deflate',
    'Host': 'www.zhihu.com',
    'DNT': '1'
}

url = 'http://www.zhihu.com/'
opener = getopener(header)
op = opener.open(url)
data = op.read()
data = ungzip(data)           # decompress
_xsrf = getxsrf(data.decode())

url += 'login'
id = 'fill in your account here'
password = 'fill in your password here'
postdict = {
    '_xsrf': _xsrf,
    'email': id,
    'password': password,
    'rememberme': 'y'
}
postdata = urllib.parse.urlencode(postdict).encode()
op = opener.open(url, postdata)
data = op.read()
data = ungzip(data)
print(data.decode())