Learning Python 3 Web Crawler Development from Scratch (IV): Login

Original author: Jecvay Notes (@Jecvay), via Bole Online: http://blog.jobbole.com/77878/

Today's task is interesting: we will use Python to log in to a website, use cookies to record the login session, and then crawl content that is only visible after logging in. Today we use Zhihu for the demonstration. Why Zhihu? It's hard to explain, but it is certainly a site successful enough that I don't need to advertise for it at all. Zhihu's login is comparatively simple: the username and password are not encrypted in transit. Yet it is still representative, because there is a mandatory flow of visiting the home page before you can log in.


I have to say, Fiddler is a piece of software Tpircsboy told me about. Thanks to him for introducing me to such a fun tool.



Step One: Observe Browser Behavior with Fiddler


Run the browser while Fiddler is capturing, enter the URL http://www.zhihu.com, and you can see the captured connection information in Fiddler. Select a 200 connection on the left and open the Inspectors view on the right: the top shows the request message for that connection, and the bottom shows the response message.

The Raw tab shows the raw text of the message. The response message at the bottom may still be compressed or encoded; in that case Fiddler shows a small notice in the middle, and clicking it decodes the message and displays the original.





This is what we see when visiting http://www.zhihu.com while not logged in. Now let's enter a username and password to log in to Zhihu, and see what passes between the browser and the server.





After clicking Login, go back to Fiddler and you will see a new 200 connection. Our browser sent a POST carrying my account and password to the server; its content is as follows:



POST http://www.zhihu.com/login HTTP/1.1
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Accept: */*
X-Requested-With: XMLHttpRequest
Referer: http://www.zhihu.com/#signin
Accept-Language: en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (Windows NT 6.4; WOW64; Trident/7.0; rv:11.0) like Gecko
Content-Length: 97
DNT: 1
Host: www.zhihu.com
Connection: Keep-Alive
Pragma: no-cache
Cookie: __utma=51854390.1539896551.1412320246.1412320246.1412320246.1; __utmb=51854390.6.10.1412320246; __utmc=51854390; __utmz=51854390.1412320246.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmv=51854390.000--|3=entry_date=20141003=1

_xsrf=4b41f6c7a9668187ccd8a610065b9718&email=<blacked out>%40gmail.com&password=<not visible>&rememberme=y

My browser sent a POST to http://www.zhihu.com/login (note the extra /login). Its content has been listed above: a username, a password, and a "remember me" set to yes. Fiddler's WebForms tab lists the POST content in a more organized way. So we can use Python to send the same content and log in. But there is one entry named _xsrf, and its value is 4b41f6c7a9668187ccd8a610065b9718. We need to obtain this value before we can send the POST to the server.


How does the browser get it? We first visited http://www.zhihu.com/, the home page, and only then did the browser send the message to http://www.zhihu.com/login. So, thinking about this problem like a detective, you will conclude that it must be the home page that sends us the generated _xsrf, and that we then send this _xsrf to the /login URL. In a moment, therefore, we will look for _xsrf in the response message of the first GET.


The box below shows not only that we logged in successfully, but also that the server tells our browser how to save the cookie information it issues. So we also need Python to record these cookies.


With that, the Fiddler work is basically done!



Step Two: Decompress


We start by writing a simple GET program to fetch the Zhihu home page and then decode() it, and the result is an error. On closer inspection, the data the server passed to us is gzip-compressed, so we need to decompress it first. Decompressing gzip in Python is easy, because the standard library provides it. The code snippet is as follows:



import gzip

def ungzip(data):
    try:                                   # try to decompress
        print('Extracting...')
        data = gzip.decompress(data)
        print('Extraction complete!')
    except Exception:                      # not gzip data; return it unchanged
        print('Not compressed, no need to decompress')
    return data

Data read back through opener.open(...).read() is handled automatically by ungzip(), and then one more decode() gives the decoded str.
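For concreteness, a minimal usage sketch, assuming an opener has already been built (we build one properly in a later step) and that the page is UTF-8 encoded:

data = opener.open('http://www.zhihu.com/').read()   # raw bytes; may be gzip-compressed
text = ungzip(data).decode('utf-8')                  # decompress if needed, then decode to str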



Step Three: Use Regular Expressions to Get the Ship of the Desert


The value of the _xsrf key guides us through the vast desert of the Internet to log in to Zhihu with the correct posture, so _xsrf may well be called the ship of the desert. Without _xsrf, we may be unable to log in even with a username and password (I haven't tried this on Zhihu, but it is true of our school's academic administration system). As mentioned above, we can find this ship of the desert in the HTML code of the first GET's response message. The following function does exactly that; the str it returns is the value of _xsrf.



import re

def getxsrf(data):
    # capture the value attribute that follows name="_xsrf" in the page HTML
    cer = re.compile('name=\"_xsrf\" value=\"(.*)\"', flags=0)
    strlist = cer.findall(data)
    return strlist[0]
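A quick sanity check with a made-up HTML snippet (the tag layout is an assumption about what Zhihu's login form looked like at the time; the value is the one captured earlier):

sample = '<input type="hidden" name="_xsrf" value="4b41f6c7a9668187ccd8a610065b9718"/>'
print(getxsrf(sample))   # prints: 4b41f6c7a9668187ccd8a610065b9718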

Step Four: Launch the POST!


With the three treasures _xsrf, ID, and password assembled, we can launch the POST. Once the POST is sent, we are logged in on the server side, and the server sends us cookies. Dealing with cookies by hand would be a hassle, but Python's http.cookiejar library gives us a convenient solution: as long as an HTTPCookieProcessor is put in when the opener is created, the cookies take care of themselves. The following code demonstrates this.



import http.cookiejar
import urllib.request

def getOpener(head):
    # deal with the cookies
    cj = http.cookiejar.CookieJar()
    pro = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(pro)
    header = []
    for key, value in head.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header
    return opener

The getOpener function receives a head parameter, which is a dictionary. The function converts the dictionary into a collection of tuples and puts it into the opener. The opener we build therefore has two major functions:


Automatically handling the cookies encountered while the opener is in use
Automatically adding the custom header to every GET or POST request it issues
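A minimal usage sketch of getOpener (the header dictionary here is a shortened, made-up example; the full header used in this article appears in the complete program below):

header = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate'}
opener = getOpener(header)
op = opener.open('http://www.zhihu.com/')   # the header is attached automatically; any
                                            # Set-Cookie in the response lands in the CookieJar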


Step Five: The Official Run


One small thing remains before the official run: we have to turn the POST data into a format that opener.open() supports. For that we use the urlencode() function in the urllib.parse library, which converts data of dictionary or tuple-collection type into a str joined by &.
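A quick illustration with made-up values, showing the &-joined string and the URL escaping:

import urllib.parse
print(urllib.parse.urlencode({'email': 'a@b.com', 'rememberme': 'y'}))
# prints: email=a%40b.com&rememberme=y   (note that '@' is escaped to %40)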


This str still cannot be used directly; it must be encoded into bytes with encode() before it can serve as the POST data parameter of opener.open() or urlopen(). The code is as follows:



url = 'http://www.zhihu.com/'
opener = getOpener(header)
op = opener.open(url)
data = op.read()
data = ungzip(data)                        # decompress the home page
_xsrf = getxsrf(data.decode())

url += 'login'
id = 'Fill in your account here'
password = 'Fill in your password here'
postDict = {
    '_xsrf': _xsrf,
    'email': id,
    'password': password,
    'rememberme': 'y'
}
postData = urllib.parse.urlencode(postDict).encode()
op = opener.open(url, postData)
data = op.read()
data = ungzip(data)
print(data.decode())                       # handle the crawled data however you like!

After the code runs, we find that the feed of the people we follow (the content that appears on the home page after logging in) has been crawled back. The next step could be a statistical analyzer, or an auto-pusher, or an automatic content classifier, whatever you like. The complete program:



import gzip
import re
import http.cookiejar
import urllib.request
import urllib.parse

def ungzip(data):
    try:                                   # try to decompress
        print('Extracting...')
        data = gzip.decompress(data)
        print('Extraction complete!')
    except Exception:                      # not gzip data; return it unchanged
        print('Not compressed, no need to decompress')
    return data

def getxsrf(data):
    # capture the value attribute that follows name="_xsrf" in the page HTML
    cer = re.compile('name=\"_xsrf\" value=\"(.*)\"', flags=0)
    strlist = cer.findall(data)
    return strlist[0]

def getOpener(head):
    # deal with the cookies
    cj = http.cookiejar.CookieJar()
    pro = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(pro)
    header = []
    for key, value in head.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header
    return opener

header = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html, application/xhtml+xml, */*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Accept-Encoding': 'gzip, deflate',
    'Host': 'www.zhihu.com',
    'DNT': '1'
}

url = 'http://www.zhihu.com/'
opener = getOpener(header)
op = opener.open(url)
data = op.read()
data = ungzip(data)                        # decompress the home page
_xsrf = getxsrf(data.decode())

url += 'login'
id = 'Fill in your account here'
password = 'Fill in your password here'
postDict = {
    '_xsrf': _xsrf,
    'email': id,
    'password': password,
    'rememberme': 'y'
}
postData = urllib.parse.urlencode(postDict).encode()
op = opener.open(url, postData)
data = op.read()
data = ungzip(data)
print(data.decode())

