Teaching Yourself Web Crawler Development with Python 3 from Scratch (IV): Login
Source: Jecvay Notes (@Jecvay)
Today's work is interesting: we use Python to log in to a website, use cookies to record the login state, and then crawl information that is only visible after logging in. Today we use Zhihu for the demonstration. Why Zhihu? It's hard to explain, but it is certainly a site successful enough that I don't need to advertise for it at all. Zhihu's login is relatively simple: the username and password are transmitted without encryption. Yet it is still representative, because there is a mandatory flow of jumping from the home page to the login.
I have to say, Fiddler is a piece of software that Tpircsboy told me about. Thanks to him for bringing me such a fun tool.
Step One: Observe browser behavior with Fiddler
Run the browser with Fiddler capturing, enter the URL http://www.zhihu.com, and you can see the captured connection information in Fiddler. Select a 200 connection on the left and open the Inspectors view on the right: the top half shows the request message for that connection, and the bottom half shows the response message.
The Raw tab displays the raw text of the message. The response message at the bottom may not have been decompressed or decoded yet; in that case a small hint appears in the middle, and clicking it decodes the message and shows the original text.
This is what happens when we visit http://www.zhihu.com without being logged in. Now let's enter a username and password and log in to Zhihu, then see what passes between the browser and the server.
After clicking Login, go back to Fiddler and you will see a new 200 connection. Our browser carried my account and password to the server in a POST, whose content is as follows:
POST http://www.zhihu.com/login HTTP/1.1
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Accept: */*
X-Requested-With: XMLHttpRequest
Referer: http://www.zhihu.com/#signin
Accept-Language: en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (Windows NT 6.4; WOW64; Trident/7.0; rv:11.0) like Gecko
Content-Length: 97
DNT: 1
Host: www.zhihu.com
Connection: Keep-Alive
Pragma: no-cache
Cookie: __utma=51854390.1539896551.1412320246.1412320246.1412320246.1; __utmb=51854390.6.10.1412320246; __utmc=51854390; __utmz=51854390.1412320246.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmv=51854390.000--|3=entry_date=20141003=1

_xsrf=4b41f6c7a9668187ccd8a610065b9718&email=(blacked out)%40gmail.com&password=(blacked out)&rememberme=y
As shown above, my browser sent a POST to http://www.zhihu.com/login (note the extra /login); the body contains a username, a password, and a "rememberme" set to yes. Fiddler's WebForms tab lists the POSTed fields in a more organized way. So we can use Python to send the same content and log in. But there is one entry named _xsrf, whose value here is 4b41f6c7a9668187ccd8a610065b9718. We need to obtain that value before we can send the POST.
How does the browser get it? We first visited http://www.zhihu.com/, the home page, and only then did the browser send the login message to http://www.zhihu.com/login. Thinking about this like a detective, you will conclude that the home page must have generated the _xsrf and sent it to us, and we then send that _xsrf back to /login. Indeed, the home-page HTML contains a hidden form field of the form name="_xsrf" value="...". So in a moment we will look for _xsrf in the response to that first GET.
The response to the login POST shows not only that we logged in successfully, but also how the server tells our browser to save the cookie information it hands out. So we will also have Python record these cookies.
With that, the Fiddler work is basically done!
Step Two: Decompress
I simply wrote a GET program to fetch the Zhihu home page, then called decode() on it, and got an error. A closer look revealed that the data handed to us is gzip-compressed, so we need to decompress it first. Decompressing gzip in Python is easy, because the standard library has it built in. The code snippet is as follows:
import gzip

def ungzip(data):
    try:                      # try to decompress
        print('Decompressing...')
        data = gzip.decompress(data)
        print('Decompression done!')
    except OSError:           # not gzip data, return it untouched
        print('Not compressed, nothing to decompress')
    return data
Data read back through the opener's read() is first run through ungzip(), which handles it automatically; after that, decode() gives us the decoded str.
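Putting those two pieces together looks like this (a small sketch; opener here stands for the opener we build in a later step):

op = opener.open('http://www.zhihu.com/')
data = op.read()               # raw bytes, possibly gzip-compressed
html = ungzip(data).decode()   # now a str we can search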
Step Three: Use regular expressions to get the ship of the desert
The value of the key _xsrf guides us through the vast desert of the Internet toward the correct way to log in to Zhihu, so _xsrf may well be called our ship of the desert. Without _xsrf, even with the right username and password we may not be able to log in (I have not tried this on Zhihu, but it is true of my school's academic system). As mentioned above, we can get this ship of the desert from the HTML in the response to the first GET. The following function does exactly that; the returned str is the value of _xsrf.
import re

def getxsrf(data):
    cer = re.compile('name=\"_xsrf\" value=\"(.*)\"', flags=0)
    strlist = cer.findall(data)
    return strlist[0]
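The regex above is greedy and somewhat brittle. A sturdier alternative (my own sketch, not from the original article) is the standard library's html.parser, which finds the hidden input field no matter how its attributes are ordered:

from html.parser import HTMLParser

class XsrfParser(HTMLParser):
    # Collects the value of the hidden <input name="_xsrf"> field.
    def __init__(self):
        super().__init__()
        self.xsrf = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('name') == '_xsrf':
            self.xsrf = attrs.get('value')

# Usage: parser = XsrfParser(); parser.feed(html); then read parser.xsrf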
Step Four: Fire the POST!!
With the three magic weapons _xsrf, ID, and password in hand, we can fire the POST. Once the POST is sent, we are logged in on the server, and the server sends us cookies. Handling cookies by hand would be a hassle, but Python's http.cookiejar library gives us a convenient solution: as long as an HTTPCookieProcessor is put in when the opener is created, the cookie business is no longer something we have to worry about. The following code shows this.
import http.cookiejar
import urllib.request

def getopener(head):
    # deal with the cookies
    cj = http.cookiejar.CookieJar()
    pro = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(pro)
    header = []
    for key, value in head.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header
    return opener
The getopener function receives a head parameter, which is a dictionary. The function converts the dictionary into a list of tuples and installs it on the opener. The opener we built therefore has two major capabilities (see the short usage sketch after this list):
Automatically handles any cookies encountered while the opener is in use
Automatically adds our custom headers to every GET or POST request it issues
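A minimal usage sketch (the header dict here is just a stand-in; the full header we actually send appears in the complete listing at the end):

opener = getopener({'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate'})
resp = opener.open('http://www.zhihu.com/')
# Any Set-Cookie in this response is now stored in the CookieJar and will be
# sent back automatically on the next request made through this opener.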
Step Five: The real run
The real run still needs one more bit of work: we have to get the POST data into a format that opener.open() supports. For that we use the urlencode() function from the urllib.parse library, which converts a dictionary or a collection of tuples into a "&"-joined str.
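For instance (a quick illustration; the values are made up):

import urllib.parse
urllib.parse.urlencode({'email': 'someone@gmail.com', 'rememberme': 'y'})
# -> 'email=someone%40gmail.com&rememberme=y'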
That str is still not usable: it must be turned into bytes by encode() before it can serve as the POST data parameter of opener.open() or urlopen(). The code is as follows:
url = 'http://www.zhihu.com/'
opener = getopener(header)
op = opener.open(url)
data = op.read()
data = ungzip(data)           # decompress
_xsrf = getxsrf(data.decode())

url += 'login'
id = 'fill in your account here'
password = 'fill in your password here'
postdict = {
    '_xsrf': _xsrf,
    'email': id,
    'password': password,
    'rememberme': 'y'
}
postdata = urllib.parse.urlencode(postdict).encode()
op = opener.open(url, postdata)
data = op.read()
data = ungzip(data)
print(data.decode())          # process the crawled data however you like!
After the code runs, we find that the feed of the people we follow (the content that appears on the home page after logging in) has been crawled back. The next step could be a statistical analyzer, an auto-reposter, or an automatic content classifier; a small sketch of the first idea follows.
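Purely as my own sketch (not from the original article), here is a crude word count over the link texts of the logged-in home page; data is the decompressed response bytes from the code above:

import re
from collections import Counter

def link_texts(html):
    # Grab the visible text of every <a ...>text</a> pair; crude, but enough
    # for a first frequency count over the page.
    return re.findall(r'<a[^>]*>([^<]+)</a>', html)

print(Counter(link_texts(data.decode())).most_common(10))

And finally, the complete program: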
import gzip
import re
import http.cookiejar
import urllib.request
import urllib.parse

def ungzip(data):
    try:                      # try to decompress
        print('Decompressing...')
        data = gzip.decompress(data)
        print('Decompression done!')
    except OSError:           # not gzip data, return it untouched
        print('Not compressed, nothing to decompress')
    return data

def getxsrf(data):
    cer = re.compile('name=\"_xsrf\" value=\"(.*)\"', flags=0)
    strlist = cer.findall(data)
    return strlist[0]

def getopener(head):
    # deal with the cookies
    cj = http.cookiejar.CookieJar()
    pro = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(pro)
    header = []
    for key, value in head.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header
    return opener

header = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html, application/xhtml+xml, */*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Accept-Encoding': 'gzip, deflate',
    'Host': 'www.zhihu.com',
    'DNT': '1'
}

url = 'http://www.zhihu.com/'
opener = getopener(header)
op = opener.open(url)
data = op.read()
data = ungzip(data)           # decompress
_xsrf = getxsrf(data.decode())

url += 'login'
id = 'fill in your account here'
password = 'fill in your password here'
postdict = {
    '_xsrf': _xsrf,
    'email': id,
    'password': password,
    'rememberme': 'y'
}
postdata = urllib.parse.urlencode(postdict).encode()
op = opener.open(url, postdata)
data = op.read()
data = ungzip(data)
print(data.decode())