Python Crawler Basics (4): Verification Codes, Part 1 (the verification process, not verification code cracking)
This article walks through how a verification code (captcha) works: how it is implemented, how to obtain it, how to recognize it (by a person in this article; machine recognition comes in the next one), and how to submit it. An example illustrates each step. The target website is http://icp.alexa.cn/index.php, which queries domain name filing (ICP) information.
1). Verification code implementation
Simply put, a verification code is an image with a string drawn on it. How does a website implement it? Anyone with some web background knows that a browser normally holds a cookie for each site, which serves as the unique identifier of the session. Every time you access the site, the browser sends that cookie to the server, and the verification code is bound to it. To see what this means, suppose two people, A and B, access website W at the same time. W returns verification code X to A and Y to B, and both codes are valid. However, if A enters B's code, verification fails. The server tells A and B apart by their cookies. Automatic login works the same way: after you log in to some websites once, the site may log you in automatically on your next visit because the cookie identifies you; if the cookies are cleared, the site can no longer do so. We do not need to worry about how cookies are generated. It is enough to know that a cookie is a long string and that yours differs from everyone else's. (The target website in this example does not use cookies but another mechanism, so it may have some bugs.)
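The scenario with A and B can be sketched in a few lines. This is only an illustration of the binding idea, not how any particular website does it: the in-memory dictionary and the function names new_session, issue_captcha, and verify_captcha are all hypothetical, and the sketch uses Python 3.

```python
import random
import secrets
import string

# Hypothetical server-side store: cookie value -> expected verification code.
captcha_by_session = {}

def new_session():
    """Create a cookie value: a long random string unique to one visitor."""
    return secrets.token_hex(16)

def issue_captcha(session_id, length=4):
    """Generate a random digit string and bind it to the visitor's cookie."""
    code = "".join(random.choice(string.digits) for _ in range(length))
    captcha_by_session[session_id] = code
    return code

def verify_captcha(session_id, submitted):
    """Accept the code only if it matches the one bound to this cookie."""
    expected = captcha_by_session.get(session_id)
    return expected is not None and submitted == expected

a, b = new_session(), new_session()
x = issue_captcha(a)  # W returns X to A
y = issue_captcha(b)  # W returns Y to B
print(verify_captcha(a, x))  # A enters its own code
print(verify_captcha(b, y))  # B enters its own code
print(verify_captcha(a, y))  # A enters B's code: fails unless X and Y happen to coincide
```

The server never compares the submitted string against "any valid code"; it looks up the code that belongs to the cookie sent with the request.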
Generating a verification code on the server is easy to understand: generate a random string, bind it to the cookie, draw it onto an image, and return the image to the client. So how is the image itself generated? Below is a simple program that generates a verification code image:
from PIL import Image, ImageDraw, ImageFont
import random

width = 80
height = 40
# Adjust the path to a font file that exists on your system
font = ImageFont.truetype('C:\\Windows\\Fonts\\AdobeFangsongStd-Regular.otf', 28)
image = Image.new("RGB", (width, height), (0, 0, 0))
draw = ImageDraw.Draw(image)
# Draw four random digits, 20 pixels apart
for t in range(4):
    draw.text((20 * t, 10), str(random.randint(0, 9)), font=font, fill=(255, 255, 255))
image.show()
Code Description:
A). PIL (the Python Imaging Library) is a third-party module that must be installed separately.
B). ImageFont.truetype() selects the font and its size.
C). Image.new("RGB", (width, height), (0, 0, 0)) creates an image with a black background, (0, 0, 0). For another color, look up its RGB value; the Paint program that ships with Windows shows them.
D). random.randint(0, 9) returns a random integer greater than or equal to 0 and less than or equal to 9.
E). In draw.text(position, text, font=font, fill=(255, 255, 255)), the first argument is the position, the second is the text to draw, the third is the font, and the fourth is the text color.
F). image.show() displays the image. The first time you run it, you may be asked to choose a default image viewer.
The running result is as follows:
2). Obtain the verification code
A) Analyze the target website. The verification code appears only when you click the verification code input box.
So which request fetches the verification code, and when is it sent? (The moment the code is displayed is not necessarily the moment it is fetched; for example, it could be loaded together with the page and kept hidden.) In Firefox, press F12 to open the developer tools, switch to the Network tab, and refresh the page:
No request for the verification code appears. Click the verification code input box and one more request shows up. That is the request that fetches the verification code.
B) Analyze the request. Click the request to see its details:
Complete URL: http://icp.alexa.cn/captcha.php?q=sina.com.cn&sid=82&icp_host=sxcainfo. It carries three parameters:
q=sina.com.cn: the domain name being queried
sid=82: the meaning of this ID is not yet clear; it shows up later in the JS source code.
icp_host=sxcainfo: unknown for the moment.
Which of the three parameters are required? Test them one by one: delete a parameter, resend the request, and see what comes back. An image is returned even when all three parameters are omitted, but that verification code is unusable: it is not bound to anything cookie-like, so it is just a picture and cannot be verified. My tests (they are simple, but rely on techniques introduced later in this article) show that all three parameters are required. They also show that sid is a random number with no special meaning, so any value will do, as the page's JavaScript source shows.
icp_host takes many values: sccainfo, ahcainfo, jscainfo, and so on. Do you see the pattern? Each value has eight letters, and the last six are always cainfo. What do the first two stand for? sc = Sichuan, ah = Anhui, js = Jiangsu, so we can guess that they are province abbreviations. What is this value for? First, if you pass a string that does not follow the pattern (for example, aaaaaaaa), no verification code is returned. Second, the verification code is bound to this value: if you obtain the code with sccainfo, you must also verify it with sccainfo.
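The parameter rules observed above can be captured in a small helper. This is a sketch in Python 3 under stated assumptions: the function name build_captcha_url and the regular expression are illustrative, and the pattern encodes only what the tests suggested (two lowercase letters followed by cainfo), not a documented rule of the site.

```python
import re
from urllib.parse import urlencode

CAPTCHA_URL = "http://icp.alexa.cn/captcha.php"
# Observed pattern: a two-letter province abbreviation followed by "cainfo".
ICP_HOST_PATTERN = re.compile(r"[a-z]{2}cainfo")

def build_captcha_url(domain, icp_host, sid=0):
    """Build the captcha request URL from the three required parameters."""
    if not ICP_HOST_PATTERN.fullmatch(icp_host):
        raise ValueError("icp_host does not follow the xxcainfo pattern: %r" % icp_host)
    query = urlencode({"q": domain, "sid": sid, "icp_host": icp_host})
    return CAPTCHA_URL + "?" + query

print(build_captcha_url("sina.com.cn", "sxcainfo", sid=82))
```

Rejecting a malformed icp_host locally simply mirrors the first observation above, that the site returns no verification code for such values.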
After the parameters, analyze the request headers in the same way: delete them one at a time and see whether a correct result still comes back. At this point you can write your own Python code for testing:
# encoding=utf8
import urllib2
from PIL import Image
import cStringIO

getCode_url = "http://icp.alexa.cn/captcha.php?q=163.com&sid=0&icp_host=hncainfo"
header = {"Referer": "http://icp.alexa.cn/captcha.php?q=163.com&sid=0&icp_host=hncainfo"}
# Comment header fields in or out to test which ones are required
# header['Host'] = "icp.alexa.cn"
# header['User-Agent'] = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"
header['Cache-Control'] = "max-age=0"
request = urllib2.Request(getCode_url, headers=header)
res = urllib2.urlopen(request).read()
# Wrap the downloaded bytes in a file-like object so PIL can open them
image = Image.open(cStringIO.StringIO(res))
image.show()
Code Description:
A). cStringIO is Python's in-memory stream module. It wraps any data (image, text, audio, or video) in a file-like object. Here it turns the downloaded image bytes into a stream that Image.open() can read.
B). You can add fields to the header with header['...'] = "...", which makes it easy to test them. Which fields are actually needed is something you have to find out by testing.
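The delete-one-at-a-time test can also be automated instead of commenting lines in and out by hand. The sketch below uses Python 3 (where urllib2 became urllib.request); the helper header_variants is a hypothetical name, and the requests are only built here, not sent.

```python
from urllib.request import Request

def header_variants(headers):
    """Yield (removed_name, remaining_headers), leaving out one header each time."""
    for name in headers:
        remaining = {k: v for k, v in headers.items() if k != name}
        yield name, remaining

url = "http://icp.alexa.cn/captcha.php?q=163.com&sid=0&icp_host=hncainfo"
candidate_headers = {
    "Referer": url,
    "Cache-Control": "max-age=0",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0",
}

for removed, remaining in header_variants(candidate_headers):
    request = Request(url, headers=remaining)  # built, but not sent
    print("without", removed, "->", sorted(remaining))
    # To run the experiment, pass `request` to urllib.request.urlopen()
    # and check whether a usable verification code image comes back.
```

A header is required if the response stops being usable once it is removed.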
Running result:
3). Verify the verification code
A) Analyze the target website to find the request that checks the verification code. Enter the correct verification code in the input box and click the filing query button:
One more request appears in the console.
Click the request to view its details: http://icp.alexa.cn/index.php?q=163.com&code=65a89c&icp_host=lncainfo
Three parameters:
q=163.com: the domain name being queried, required
code=65a89c: the verification code, required
icp_host=lncainfo: must match the value used to obtain the verification code
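A captured URL like the one above can be taken apart programmatically, which makes it easy to resend it with one parameter removed. This is a Python 3 sketch; drop_param is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def drop_param(url, name):
    """Return `url` with the query parameter `name` removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))

observed = "http://icp.alexa.cn/index.php?q=163.com&code=65a89c&icp_host=lncainfo"
params = dict(parse_qsl(urlsplit(observed).query))
print(params)

# One test URL per parameter, each with that parameter left out.
for name in params:
    print(name, "->", drop_param(observed, name))
```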
Then analyze the headers with the same method as above. The two tests differ in one respect. When testing how the verification code is obtained, you fetch the code with your own script and then submit it on the website; it verifies correctly as long as icp_host is the same in both places. When testing how the verification code is checked, you obtain the code on the website and enter it in your script to see whether the parameters are correct.
The code for checking the verification code is as follows:
# encoding=utf8
import urllib2

checkcode_url = "http://icp.alexa.cn/index.php?q=163.com&code=N3PE37&icp_host=hncainfo"
header = {}
# Comment header fields in or out to test which ones are required
# header['Pragma'] = "Pragma"
# header['Referer'] = "http://icp.alexa.cn/index.php?q=163.com&code=CUXWDV&icp_host=sccainfo"
header['User-Agent'] = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"
request = urllib2.Request(checkcode_url, headers=header)
res = urllib2.urlopen(request).read()
print res
Code Description: The icp_host value (hncainfo here) must be the same as the one used to obtain the verification code.
Running result:
If the verification code is wrong or the icp_host values do not match, the following is returned:
This completes the analysis, but obtaining and checking the code still live in two separate scripts. How do we combine them? The code is as follows:
# encoding=utf8
import urllib2
from PIL import Image
import cStringIO
import BeautifulSoup


def getCode(domain):
    print "get verification code..."
    getcode_url = "http://icp.alexa.cn/captcha.php?q=" + domain + "&sid=0&icp_host=hncainfo"
    getcode_headers = {}
    getcode_headers['Referer'] = "http://icp.alexa.cn/captcha.php?q=163.com&sid=0&icp_host=hncainfo"
    getcode_headers['Cache-Control'] = "max-age=0"
    getcode_request = urllib2.Request(getcode_url, headers=getcode_headers)
    getcode_res = urllib2.urlopen(getcode_request).read()
    image = Image.open(cStringIO.StringIO(getcode_res))
    print "Verification Code obtained successfully"
    image.show()


def checkcode(domain, code):
    # print "the verification code you entered is:" + code
    print "to check the verification code..."
    checkcode_url = "http://icp.alexa.cn/index.php?q=" + domain + "&code=" + code + "&icp_host=hncainfo"
    checkcode_headers = {}
    checkcode_headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"
    checkcode_request = urllib2.Request(checkcode_url, headers=checkcode_headers)
    checkcode_res = urllib2.urlopen(checkcode_request).read()
    # The marker string is shown here in translation; the live page
    # presumably uses its own label for the organizer name.
    if checkcode_res.count("organizer name") > 0:
        print "verified"
        checkcode_soup = BeautifulSoup.BeautifulSoup(checkcode_res)
        # Second cell of the first row of the first table holds the unit name
        print "unit name:" + checkcode_soup.findAll("table")[0].findAll("tr")[0].findAll("td")[1].text.encode("utf8")
    else:
        print "Verification Failed"


domain = raw_input("Enter the domain name:")
getCode(domain)
code = raw_input("Enter the Verification code:")
checkcode(domain, code)
Code Description:
A). def getCode(domain) declares a function: getCode is the function name and domain is its parameter.
B). raw_input() reads a line of user input.
C). Both when obtaining and when checking the code, icp_host is hard-coded as hncainfo to keep the two requests consistent.
D). encode("utf8") encodes the text as UTF-8.
E). The verification code must be recognized and typed in by a person.
Running result:
This completes obtaining and checking the verification code. Recognizing the verification code automatically is covered in the next article.
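For readers on Python 3, where urllib2 and cStringIO no longer exist, the same two-step flow might look like the sketch below. Several things here are assumptions rather than part of the original script: the function names, the fetch parameter (added so the logic can be exercised without network access), the success_marker argument (the text that signals a successful query must be supplied by the caller), the use of the request URL itself as the Referer, and UTF-8 decoding of the response.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://icp.alexa.cn"
ICP_HOST = "hncainfo"  # a single constant keeps both requests consistent
USER_AGENT = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"

def http_get(url, headers):
    """Send a GET request and return the raw response bytes."""
    with urlopen(Request(url, headers=headers)) as response:
        return response.read()

def get_code_image(domain, fetch=http_get):
    """Return the verification code image bytes for `domain`."""
    query = urlencode({"q": domain, "sid": 0, "icp_host": ICP_HOST})
    url = BASE_URL + "/captcha.php?" + query
    return fetch(url, {"Referer": url, "Cache-Control": "max-age=0"})

def check_code(domain, code, success_marker, fetch=http_get):
    """Submit the code; succeed if `success_marker` occurs in the returned page."""
    query = urlencode({"q": domain, "code": code, "icp_host": ICP_HOST})
    url = BASE_URL + "/index.php?" + query
    body = fetch(url, {"User-Agent": USER_AGENT})
    # Assumption: the page is decoded as UTF-8; undecodable bytes are replaced.
    return success_marker in body.decode("utf-8", errors="replace")
```

To view the image, the bytes returned by get_code_image() can be wrapped in io.BytesIO and passed to Image.open(), the Python 3 counterpart of the cStringIO step above.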
Note:
A). The code is for learning and discussion only.
B). If you find any errors, please point them out.
C). Please credit the source when reprinting.