How to keep the verification code and cookies in sync when simulating a login with Python


Logging in automatically is often the first step in writing a crawler; if you cannot log in, there is a lot you cannot crawl. This is not the first time I have written an automatic login script that includes verification code (CAPTCHA) recognition, but this one had a few pitfalls, so I am recording them here.
The site to log in to this time is the 2013 Zhuzhou primary and secondary school teachers all-staff training site: /indexpage/index.aspx
Many people find it easy to write an automatic login script for a site that needs no verification code: saving the cookies is enough. For sites that do require a verification code, however, the login keeps failing.
Steps for an automatic login script on a site that requires a verification code (using the site above as the example; the ideas and steps apply to Python and to other languages):
A. Open the login page first to obtain cookies.
B. Then request the verification code address. The verification code is dynamic and differs every time the address is opened.
C. Recognize the verification code. You need to process and recognize the code image just obtained. Pick a CAPTCHA recognition library yourself: in Python you can use pytesser (which calls PIL to process the image) or OpenCV, or read the code yourself and type it in.
D. Build the POST data and request headers, and POST the request to the website.
F. Read the response and run a test to verify that the login succeeded.
Or skip step A:
B. Request the verification code address directly. (Step A can be skipped because opening the verification code address already gives us all the cookies we need along with the verification code, so there is no need to open the login page first.)
C. Recognize the verification code. You need to process and recognize the code image just obtained. Pick a CAPTCHA recognition library yourself: in Python you can use pytesser (which calls PIL to process the image) or OpenCV, or read the code yourself and type it in.
D. Build the POST data and request headers, and POST the request to the website.
F. Read the response and run a test to verify that the login succeeded.
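The shorter procedure (B, C, D, F) can be sketched in Python 3 as follows. This is only an illustration under stated assumptions: it uses the standard library's urllib.request and http.cookiejar rather than the Python 2 modules used in this article's script, solve_captcha is a hypothetical callback standing in for OCR or manual input, and the function names and the txtCode default are placeholders. What it shows is that the verification code request and the login POST go through the same opener, so they share one cookie jar.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, captcha_url, login_url, solve_captcha, form_fields,
          code_field="txtCode"):
    """Run steps B, C, D, F through one shared opener (one cookie jar)."""
    # B: fetch the code image; cookies set by the server land in the jar.
    image_bytes = opener.open(captcha_url).read()
    # C: turn the image into text (OCR, or a person typing it in).
    code = solve_captcha(image_bytes)
    # D: POST the form through the same opener, so the cookies sent
    #    belong to the session that issued this code.
    fields = dict(form_fields)
    fields[code_field] = code
    body = urllib.parse.urlencode(fields).encode("ascii")
    response = opener.open(login_url, body)
    # F: return the page so the caller can check whether login worked.
    return response.read()
```

A caller would create the opener once with make_session() and pass it to login() together with the two URLs, the form fields, and a solver such as a function that saves the image and asks the user to type the code.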
#############################################################################
The two procedures differ only in whether the login page is visited first, which has no effect on the login. Here I use the browser to demonstrate the first procedure (steps A through F); the more specific operations, such as reading and modifying cookies and building the POST data, come later.
Use the following steps to verify the idea (still on the same site):
Tools used:
The Firebug extension for Firefox. You can also use Fiddler, HttpFox, or Wireshark. I find Firebug convenient. Firebug is available for Chrome as well.
1. Open the login page to obtain cookies (equivalent to step A)
First clear Firefox's cache and cookies, open Firebug, and monitor all requests. Then open the login page: /indexpage/index.aspx. We get the first verification code, 2073, and Firebug shows the cookies that were obtained.

2. Open the verification code address to obtain cookies and a verification code (equivalent to steps B and C)

First find the verification code address. Right-click the verification code image, choose "Copy Image Address", and you get the address: /guopeiadmin/login/imagelog.aspx

Check that this address is valid: without closing the login page, open the verification code address in Firefox as well. It does return a verification code, so we now have a second code, 2876. The cookies are still the original cookies.

3. Log in to the site (equivalent to steps D and F)

Now go back to the login page opened earlier and enter the account and password. Which verification code is the correct one to enter? You would probably answer 2073. Wrong: at this point the second code, 2876, is the one that logs in successfully. If you do not believe it, try it a few more times.

Now change the order of operations from 1-2-3 to 2-1-3:

First open the verification code address and get a code, AAAA; then open the login page and get another code, bbbb; then log in with the first code, AAAA. The code is rejected and the login fails.
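One way to read these two experiments is that the server remembers only the most recent code issued to a session, and every request for the image address replaces it. The toy model below is an assumption made for illustration, not the site's actual implementation; the class and method names are hypothetical. It reproduces the observed behaviour: the older code is rejected and the newest one is accepted.

```python
import itertools


class ToyLoginServer:
    """Keeps one expected code per session (assumed server behaviour)."""

    def __init__(self):
        self._codes = itertools.count(2073)  # predictable stand-in codes
        self._expected = {}  # session id -> most recently issued code

    def captcha(self, session_id):
        """Issue a fresh code and overwrite the one stored before."""
        code = str(next(self._codes))
        self._expected[session_id] = code
        return code

    def login(self, session_id, code):
        """Accept only the latest code issued to this session."""
        return self._expected.get(session_id) == code


server = ToyLoginServer()
first = server.captcha("session-1")   # image embedded in the login page
second = server.captcha("session-1")  # image address opened separately
print(server.login("session-1", first))   # False: the code is stale
print(server.login("session-1", second))  # True: the code is current
```

The same model covers the 2-1-3 order: loading the login page requests the image again, so the code obtained earlier from the bare image address is no longer the one the session expects.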

As for the second procedure, I could not find a way to reproduce it with tools alone, for example by using Fiddler to request the verification code address, collect the cookies, and then simulate the POST that logs in. I tried several POST tools, but none of them could complete these steps and the login always failed. So I simulated the process in Python instead:

The required modules and software, pytesser and the components it calls, have to be installed separately; search online for an installation tutorial. I installed mine a long time ago.

Pytesser Download

http://code.google.com/p/pytesser/

Tesseract OCR engine Download:

http://code.google.com/p/tesseract-ocr/

PIL official download

http://www.pythonware.com/products/pil/

Python login simulation code:

# -*- coding: utf-8 -*-

import urllib2
import cookielib
import urllib
import Image
import cStringIO
from pytesser import *
import re
import os

# Avoid "UnicodeEncodeError: 'ascii' codec can't encode character ..." errors
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

# This part is the key: it binds cookies to urllib2.urlopen.
# MozillaCookieJar (LWPCookieJar also works; Firefox is being simulated here, so this one is used) stores the cookie objects and can read and write a cookie file.
cookiejar = cookielib.MozillaCookieJar()
# Bind the cookie jar to a handler that processes HTTP cookies
cookieSupport = urllib2.HTTPCookieProcessor(cookiejar)
# The next two lines are for debugging
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
# Create an opener that keeps the cookies and uses the debugging handler
opener = urllib2.build_opener(cookieSupport, httpsHandler)
# Install the opener so that every later urllib2.urlopen() call goes through it
urllib2.install_opener(opener)

# Login page
loginpage = "http://zhuzhou2013.feixuelixm.teacher.com.cn/IndexPage/Index.aspx"

# URL to POST to
loginurl = "http://zhuzhou2013.feixuelixm.teacher.com.cn/GuoPeiAdmin/Login/Login.aspx"

## Open the login page to get cookies. Opening the verification code page already yields all the cookies, so this step can be skipped; it is optional.
#taobao = urllib2.urlopen(loginpage)
## Print the cookies
#print cookiejar
# The cookie obtained by opening the login page first differs from the cookie held after opening the verification code page.

## Fetch the verification code (enter the code manually)
#vrifycodeUrl = "http://zhuzhou2013.feixuelixm.teacher.com.cn/GuoPeiAdmin/Login/ImageLog.aspx"
#file = urllib2.urlopen(vrifycodeUrl)
#pic = file.read()
#path = "C:code.jpg"
##img = cStringIO.StringIO(file) # constructs a StringIO holding the image; fails with AttributeError: addinfourl instance has no attribute 'seek'
#localpic = open(path, "wb")
#localpic.write(pic)
#localpic.close()
#print "Please %s, open code.jpg" % path
##text = raw_input("Input code:")
#im = Image.open(path)
#text = image_to_string(im)
#print text

# Fetch the verification code image (recognized with pytesser; find an installation tutorial online),
# recognize it with pytesser, assign the result to text, and print it.
vrifycodeUrl = "http://zhuzhou2013.feixuelixm.teacher.com.cn/GuoPeiAdmin/Login/ImageLog.aspx"
file = urllib2.urlopen(vrifycodeUrl).read()
# Wrap the bytes in a StringIO; passing the urlopen() result itself fails with: AttributeError: addinfourl instance has no attribute 'seek'
img = cStringIO.StringIO(file)
im = Image.open(img)
# strip() removes the trailing whitespace that OCR output usually carries
text = image_to_string(im).strip()
print "vrifycode:", text

# Build the Cookie header value, because the POST request headers need to send the cookies back (as one formatted string, not as cookie objects)
cookies = ''
# Build it from the cookie jar
for index, cookie in enumerate(cookiejar):
    #print '[', index, ']';
    #print cookie.name;
    #print cookie.value;
    #print "###########################"
    cookies = cookies + cookie.name + "=" + cookie.value + ";"
print "###########################"
# Drop the trailing semicolon
cookie = cookies[:-1]
print "cookies:", cookie

# Username and password
# Of course, the username and password shown here have been altered
#username = "7879954564555664"
#password = "12313164"

# Username and password
username = "430223198809308045"
password = "56961888"

# Request payload
# The capitalization of the form field names is reconstructed here; check it against the login form's HTML.
# In ASP.NET the __VIEWSTATE value comes from a hidden field in the login page, so copy the current value from there.
postdata = {
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': '/ wepdwukltcymzeymty2nw8wah4ltg9naw5lzfbhz2ufeexvz2luzwrqywdllmfzchgwamypzbyczg8pzbyghgv0axrszqug55so5oi35zcnl+ wtpus5ooeggs/ Ouqvku73or4hlj7ceb29uzm9jdxmfegnozwnrsw5wdxqodghpcykebm9uymx1cguncmvzdg9yzsh0aglzkwqyaquex19db250cm9sc1jlcxvpcmvqb3n0qmfj A0tlev9ffgefc0ltz2j0bkxvz2luckjjpnhruswhtput33uj1dbukvw=',
    'txtUserName': username,
    'txtPassword': password,
    'txtCode': text,
    'ImgbtnLogin.x': 44,
    'ImgbtnLogin.y': 14,
    'ClientScreenWidth': 1180
}

# POST request headers
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-cn,en-us;q=0.8,zh;q=0.5,en;q=0.3',
    # urllib2 does not decompress gzip responses by itself; remove this header if the saved page looks garbled
    'Accept-Encoding': 'gzip, deflate',
    'Host': 'zhuzhou2013.feixuelixm.teacher.com.cn',
    'Cookie': cookie,
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:29.0) Gecko/20100101 Firefox/29.0',
    'Referer': 'http://zhuzhou2013.feixuelixm.teacher.com.cn/GuoPeiAdmin/Login/Login.aspx',
    # 'Content-Type': 'application/x-www-form-urlencoded',
    # 'Content-Length': 474,
    'Connection': 'keep-alive'
}

# Encode the POST data
data = urllib.urlencode(postdata)
print "data:###############"
print data
# Build the request
request = urllib2.Request(loginurl, data, headers)
try:
    # Open the page
    response = urllib2.urlopen(request)
    #cur_url = response.geturl()
    #print "cur_url:", cur_url
    status = response.getcode()
    print status
except urllib2.HTTPError, e:
    print e.code

# Write the response page to a file to make troubleshooting easier
# The page must be decoded first
f = response.read().decode("utf8")
outfile = open("rel_ip.txt", "w")
print >> outfile, "%s" % (f)

# Print the response headers
info = response.info()
print info

# Test whether the login succeeded: testurl can be accessed only after logging in
testurl = "http://zhuzhou2013.feixuelixm.teacher.com.cn/GuoPeiAdmin/Login/LoginedPage.aspx"

try:
    response = urllib2.urlopen(testurl)
except urllib2.HTTPError, e:
    print e.code

# The page is searched for characters below to decide whether the login succeeded, so the characters searched for must use the same encoding as the page; otherwise the conclusion will be wrong. Searching for English text, such as an id or name in the CSS, is recommended.
f = response.read().decode("utf8").encode("utf8")
outfile = open("out_ip.txt", "w")
print >> outfile, "%s" % (f)

# Search the returned page for the greeting "Hello" (two characters on the original Chinese page), which appears only after a successful login; finding it means the login succeeded. Searching for English text is recommended.
tag = 'Hello'.encode("utf8")
if re.search(tag, f):
    # Login succeeded
    print 'Logged in successfully!'
else:
    # Login failed
    print 'Login failed, check the out_ip.txt file for details'

response.close()

# This code is rough but easy to follow; wrap it in functions if you need to reuse it. Also, urlopen() is called many times during login and checking, and read() may time out because of network congestion, so set a timeout on urlopen() or resend the request.
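The closing comment of the script suggests setting a timeout or resending the request. A minimal Python 3 sketch of that idea follows. The name fetch_with_retry and its parameters are hypothetical, and it uses urllib.request.urlopen from the standard library instead of the Python 2 urllib2 used in the script; the opener argument exists only so that another callable with the same signature can be swapped in.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, data=None, attempts=3, timeout=10.0, pause=1.0,
                     opener=urllib.request.urlopen):
    """Return the response body, retrying on timeouts and URL errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            # The socket timeout applies to connecting and to read().
            with opener(url, data=data, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, socket.timeout):
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(pause)  # short pause before resending the request
```

Retrying the verification code request issues a new code each time, so the text that gets recognized should come from the attempt that finally succeeded, not from an earlier one.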

Summary: when writing an automatic login script for a page that requires a verification code, the key is to keep the cookies and the verification code synchronized. If opening the verification code address directly already yields all the required cookies (in which case the code and the cookies are necessarily in sync), there is no need to open the login page for cookies first and then open the verification code address for the code. Why take the extra step?
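The script builds the Cookie header by hand from the cookie jar that was filled when the verification code was fetched. A compact Python 3 sketch of that step is shown below; cookie_header is a hypothetical helper name, and http.cookiejar is the Python 3 counterpart of the cookielib module used in the script.

```python
from http.cookiejar import Cookie, CookieJar


def cookie_header(jar):
    """Format every cookie in the jar as 'name=value; name=value'."""
    return "; ".join("%s=%s" % (c.name, c.value) for c in jar)
```

When the request goes through an opener that has an HTTPCookieProcessor attached and no Cookie header is set explicitly, the processor adds the matching cookies on its own, so building the header by hand is mainly useful for inspecting what will be sent.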

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
