Python uses KNN for verification code recognition.

Source: Internet
Author: User

Preface

I previously built a campus dating app. One piece of its logic confirms that a user is a college student by way of the school's educational administration system: a crawler logs in with the user's account and password and checks the result. Many educational administration systems, however, protect the login with a verification code. At the time, our server downloaded the verification code and forwarded it to the client; the user typed it in and submitted it together with the account and password, and the server then simulated a login to the educational administration system to see whether it succeeded. The verification code clearly defeated our goal of letting users authenticate quickly, but back then there was no way around it. Having recently read some machine learning material, I think the simple verification codes used by most schools can be cracked with KNN, so I organized my thoughts and rolled up my sleeves.

Analysis

Our school's verification code is produced simply by rotating each character and adding some weak noise. To recognize it, we reverse that process: first remove the noise by binarization, then split the image into single characters, and finally rotate each character back to the upright orientation. Templates are then selected from the processed images. Each new verification code is processed in the same way and its characters are compared with the templates; the character of the closest template is taken as the result (K = 1 in this article). The steps follow.

Get Verification Code

First, we need a large number of verification codes, which a crawler can collect. The code is as follows:

    # -*- coding: UTF-8 -*-
    # Python 2 script (it relies on urllib2 and cookielib)
    import urllib, urllib2, cookielib, string, Image

    def getchk(number):
        # create the cookie object
        cookie = cookielib.LWPCookieJar()
        cookieSupport = urllib2.HTTPCookieProcessor(cookie)
        opener = urllib2.build_opener(cookieSupport, urllib2.HTTPHandler)
        urllib2.install_opener(opener)
        # first connection to the educational administration system, to obtain the cookie
        # headers that imitate a browser
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip,deflate',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'
        }
        req0 = urllib2.Request(
            url='http://mis.teach.ustc.edu.cn',
            headers=headers  # request headers
        )
        # catch HTTP errors
        try:
            result0 = urllib2.urlopen(req0)
        except urllib2.HTTPError, e:
            print e.code
        # extract the cookie
        getcookie = ['', ]
        for item in cookie:
            getcookie.append(item.name)
            getcookie.append("=")
            getcookie.append(item.value)
        getcookie = "".join(getcookie)
        # modify the headers
        headers["Origin"] = "http://mis.teach.ustc.edu.cn"
        headers["Referer"] = "http://mis.teach.ustc.edu.cn/userinit.do"
        headers["Content-Type"] = "application/x-www-form-urlencoded"
        headers["Cookie"] = getcookie
        for i in range(number):
            req = urllib2.Request(
                url="http://mis.teach.ustc.edu.cn/randomImage.do?date='2013'",
                headers=headers  # request headers
            )
            response = urllib2.urlopen(req)
            status = response.getcode()
            picData = response.read()
            if status == 200:
                localPic = open("./source/" + str(i) + ".jpg", "wb")
                localPic.write(picData)
                localPic.close()
            else:
                print "failed to get Check Code"

    if __name__ == '__main__':
        getchk(500)
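
The script above targets Python 2. For readers on Python 3, the same download loop might look roughly like the sketch below. It is only an illustration: the function name fetch_captchas and the shortened header set are assumptions made here, the URLs are the ones from the original script, and it has not been checked against the live site.

```python
import os
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = "http://mis.teach.ustc.edu.cn"  # taken from the original script
CAPTCHA_URL = BASE_URL + "/randomImage.do?date='2013'"  # taken from the original script


def fetch_captchas(number, out_dir="./source"):
    """Download `number` verification code images into out_dir (hypothetical helper)."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # A reduced header set; the original script sends a fuller browser-like set.
    headers = {"User-Agent": "Mozilla/5.0", "Referer": BASE_URL + "/userinit.do"}
    # First request: lets the server store its session cookie in `jar`.
    opener.open(urllib.request.Request(BASE_URL, headers=headers))
    os.makedirs(out_dir, exist_ok=True)
    for i in range(number):
        # The opener re-sends the cookies held in `jar` automatically.
        request = urllib.request.Request(CAPTCHA_URL, headers=headers)
        with opener.open(request) as response:
            if response.status == 200:
                with open(os.path.join(out_dir, "%d.jpg" % i), "wb") as f:
                    f.write(response.read())
            else:
                print("failed to get check code")
```

Calling fetch_captchas(500) would mirror getchk(500). Because the cookie processor handles the session cookie, the sketch does not build the Cookie header by hand as the Python 2 version does.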

500 verification codes are downloaded to the source directory.

Binarization

MATLAB's rich image processing functions can save us a lot of time, so the remaining steps are written in MATLAB. We traverse the source folder, binarize each verification code image, and store the processed image in the bw directory. The code is as follows:

    mydir = './source/';
    bw = './bw/';
    if mydir(end) ~= '\'
        mydir = [mydir, '\'];
    end
    DIRS = dir([mydir, '*.jpg']);  % extension
    n = length(DIRS);
    for i = 1:n
        if ~DIRS(i).isdir
            img = imread(strcat(mydir, DIRS(i).name));
            img = rgb2gray(img);  % grayscale
            img = im2bw(img);     % 0-1 binarization
            name = strcat(bw, DIRS(i).name)
            imwrite(img, name);
        end
    end
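
To make the two MATLAB calls concrete, here is a minimal pure-Python sketch of the same idea: convert RGB pixels to gray levels, then threshold them. The function names to_gray and binarize are made up for this illustration, images are plain nested lists, and the 0.5 level mirrors the default level of im2bw; it is an approximation of the behavior, not a port of the MATLAB functions.

```python
def to_gray(rgb):
    """Convert a grid of (r, g, b) pixels (0-255) to gray levels (0-255)."""
    # Common luma weights; rgb2gray uses approximately these coefficients.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]


def binarize(gray, level=0.5):
    """Map pixels brighter than level * 255 to 1 and all others to 0."""
    threshold = level * 255
    return [[1 if pixel > threshold else 0 for pixel in row] for row in gray]
```

With weak noise that is lighter than the character strokes, a single global threshold like this already removes much of it, which is why binarization comes first in the pipeline.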

Processing result

Split

    mydir = './bw/';
    letter = './letter/';
    if mydir(end) ~= '\'
        mydir = [mydir, '\'];
    end
    DIRS = dir([mydir, '*.jpg']);  % extension
    n = length(DIRS);
    for i = 1:n
        if ~DIRS(i).isdir
            img = imread(strcat(mydir, DIRS(i).name));
            img = im2bw(img);  % binarization
            img = 1 - img;     % invert the colors so the characters become connected regions, which makes noise easy to remove
            for ii = 0:3
                region = [ii*20+1, 1, 19, 20];  % divide a verification code into four 20*20 character images
                subimg = imcrop(img, region);
                imlabel = bwlabel(subimg);
                % imshow(imlabel);
                if max(max(imlabel)) > 1  % more than one region: there is noise that must be removed
                    % imshow(subimg);
                    stats = regionprops(imlabel, 'Area');
                    area = cat(1, stats.Area);
                    maxindex = find(area == max(area));
                    area(maxindex) = 0;
                    secondindex = find(area == max(area));
                    imindex = ismember(imlabel, secondindex);
                    % remove the second-largest connected region; noise cannot be
                    % larger than the character, so the second largest is the noise
                    subimg(imindex == 1) = 0;
                end
                name = strcat(letter, DIRS(i).name(1:length(DIRS(i).name)-4), '_', num2str(ii), '.jpg')
                imwrite(subimg, name);
            end
        end
    end
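
The splitting and noise-removal idea can also be sketched in pure Python. Note the deliberate simplification: the MATLAB code removes only the second-largest connected region, whereas this sketch keeps only the largest one and drops everything else. The function names are invented for the illustration, images are nested lists of 0/1, and regions are 8-connected, which is the default of bwlabel for 2-D images.

```python
from collections import deque


def split_columns(img, count=4, width=20):
    """Cut a binary image (list of rows) into `count` tiles of `width` columns."""
    return [[row[k * width:(k + 1) * width] for row in img] for k in range(count)]


def components(img):
    """Return the 8-connected foreground regions as lists of (row, col)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    found = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, region = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                found.append(region)
    return found


def keep_largest(img):
    """Return a copy of img in which only the largest region survives."""
    regions = components(img)
    out = [[0] * len(img[0]) for _ in img]
    if regions:
        for y, x in max(regions, key=len):
            out[y][x] = 1
    return out
```

Keeping only the largest region rests on the same assumption as the original: a noise blob is never larger than the character itself.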

Processing result

Rotate

What criterion should guide the rotation? Observation shows that these characters are never rotated by more than 60 degrees, so we try every angle between -60 and +60 degrees and keep the rotation at which the character is narrowest. The code is as follows:

    mydir = './letter/';
    rotate = './rotate/';
    if mydir(end) ~= '\'
        mydir = [mydir, '\'];
    end
    DIRS = dir([mydir, '*.jpg']);  % extension
    n = length(DIRS);
    for i = 1:n
        if ~DIRS(i).isdir
            img = imread(strcat(mydir, DIRS(i).name));
            img = im2bw(img);
            minwidth = 20;
            for angle = -60:60
                imgr = imrotate(img, angle, 'bilinear', 'crop');  % 'crop' keeps the image size unchanged
                imlabel = bwlabel(imgr);
                stats = regionprops(imlabel, 'Area');
                area = cat(1, stats.Area);
                maxindex = find(area == max(area));
                imindex = ismember(imlabel, maxindex);  % the largest connected region is set to 1
                [y, x] = find(imindex == 1);
                width = max(x) - min(x) + 1;
                if width < minwidth
                    minwidth = width;
                    imgrr = imgr;
                end
            end
            name = strcat(rotate, DIRS(i).name)
            imwrite(imgrr, name);
        end
    end
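
The narrowest-width criterion is easy to see on coordinates alone. The sketch below rotates the foreground pixel coordinates instead of resampling the image as imrotate does, so it only illustrates the selection rule; the function names are hypothetical and the sign convention of the angle may differ from MATLAB's.

```python
import math


def rotated_width(points, angle_deg):
    """Horizontal extent of (x, y) points after rotating them by angle_deg."""
    a = math.radians(angle_deg)
    xs = [x * math.cos(a) - y * math.sin(a) for x, y in points]
    return max(xs) - min(xs)


def best_angle(points, limit=60):
    """Integer angle in [-limit, limit] that makes the point set narrowest."""
    return min(range(-limit, limit + 1), key=lambda ang: rotated_width(points, ang))
```

For a thin stroke tilted by 30 degrees, best_angle picks the rotation that stands the stroke upright again, since any other angle leaves it with a larger horizontal extent.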

Processing result: the rotate folder now holds 2000 character images (500 verification codes with four characters each).

Template Selection

Now select templates from the rotate folder so that every character is covered. Several images can be chosen for one character: even after all the preprocessing, there is no guarantee that a character ends up in only one final form, so choosing several helps ensure coverage. Save the selected template images to the samples folder. The test code takes the first character of each template's file name as its label, so name the files accordingly. This step is time-consuming and laborious, so you may want to ask for help.

Test

Each test verification code first goes through the preceding steps, and each of its characters is then compared with the selected templates. The template with the smallest difference determines the character assigned to the test sample. The code is as follows:

    % The template with the minimum difference is used as the answer
    mydir = './test/';
    samples = './samples/';
    if mydir(end) ~= '\'
        mydir = [mydir, '\'];
    end
    if samples(end) ~= '\'
        samples = [samples, '\'];
    end
    DIRS = dir([mydir, '*.jpg']);     % extension
    DIRS1 = dir([samples, '*.jpg']);  % extension
    n = length(DIRS);   % total number of verification code images
    singleerror = 0;    % number of wrong characters
    uniterror = 0;      % number of verification codes containing an error
    for i = 1:n
        if ~DIRS(i).isdir
            realcodes = DIRS(i).name(1:4);  % the file name starts with the four true characters
            fprintf('Actual verification code character: %s\n', realcodes);
            img = imread(strcat(mydir, DIRS(i).name));
            img = rgb2gray(img);
            img = im2bw(img);
            img = 1 - img;  % invert the colors so the characters become connected regions
            subimgs = [];
            for ii = 0:3
                % odd that this divides evenly: imcrop includes both end pixels,
                % so a width of 19 spans 20 columns
                region = [ii*20+1, 1, 19, 20];
                subimg = imcrop(img, region);
                imlabel = bwlabel(subimg);
                if max(max(imlabel)) > 1  % more than one region: there is noise
                    stats = regionprops(imlabel, 'Area');
                    area = cat(1, stats.Area);
                    maxindex = find(area == max(area));
                    area(maxindex) = 0;
                    secondindex = find(area == max(area));
                    imindex = ismember(imlabel, secondindex);
                    subimg(imindex == 1) = 0;  % remove the second-largest connected region
                end
                subimgs = [subimgs; subimg];
            end
            codes = [];
            for ii = 0:3
                region = [ii*20+1, 1, 19, 20];
                subimg = imcrop(img, region);
                minwidth = 20;
                for angle = -60:60
                    imgr = imrotate(subimg, angle, 'bilinear', 'crop');  % 'crop' keeps the image size unchanged
                    imlabel = bwlabel(imgr);
                    stats = regionprops(imlabel, 'Area');
                    area = cat(1, stats.Area);
                    maxindex = find(area == max(area));
                    imindex = ismember(imlabel, maxindex);  % the largest connected region is set to 1
                    [y, x] = find(imindex == 1);
                    width = max(x) - min(x) + 1;
                    if width < minwidth
                        minwidth = width;
                        imgrr = imgr;
                    end
                end
                mindiffv = 1000000;
                for jj = 1:length(DIRS1)
                    imgsample = imread(strcat(samples, DIRS1(jj).name));
                    imgsample = im2bw(imgsample);
                    diffv = abs(imgsample - imgrr);
                    alldiffv = sum(sum(diffv));  % total difference over all pixels
                    if alldiffv < mindiffv
                        mindiffv = alldiffv;
                        code = DIRS1(jj).name;
                        code = code(1);  % the first character of the template file name is its label
                    end
                end
                codes = [codes, code];
            end
            fprintf('Verification Code test character: %s\n', codes);
            num = codes - realcodes;
            num = length(find(num ~= 0));
            singleerror = singleerror + num;
            if num > 0
                uniterror = uniterror + 1;
            end
            fprintf('Error count: %d\n', num);
        end
    end
    fprintf('\n----- The result is as follows -----\n\n');
    fprintf('Number of verification code characters tested: %d\n', n*4);
    fprintf('Number of character errors: %d\n', singleerror);
    fprintf('Single-character recognition accuracy: %.2f%%\n', (1 - singleerror/(n*4))*100);
    fprintf('Number of verification code images: %d\n', n);
    fprintf('Number of verification code images with errors: %d\n', uniterror);
    fprintf('Probability of getting a whole verification code right: %.2f%%\n', (1 - uniterror/n)*100);
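
The matching step at the heart of the test code is a 1-nearest-neighbour search with the sum of absolute pixel differences as the distance. A minimal pure-Python sketch of just that step, assuming equal-sized binary images stored as nested lists and using invented names (distance, classify):

```python
def distance(a, b):
    """Sum of absolute pixel differences between two equal-sized binary images."""
    return sum(abs(p - q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))


def classify(img, templates):
    """Return the label of the template closest to img (K = 1).

    `templates` is a list of (label, image) pairs; a label may appear
    several times, once per template image chosen for that character.
    """
    label, _ = min(templates, key=lambda pair: distance(img, pair[1]))
    return label
```

Because only the single closest template decides the answer, adding more template images per character directly widens the set of shapes that are recognized.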

Result:

Actual verification code character: 2B4E
Verification Code test character: 2B4F
Error count: 1
Actual verification code character: 4572
Verification Code test character: 4572
Error count: 0
Actual verification code character: 52CY
Verification Code test character: 52LY
Error count: 1
Actual verification code character: 83QG
Verification Code test character: 85QG
Error count: 1
Actual verification code character: 9992
Verification Code test character: 9992
Error count: 0
Actual verification code character: A7Y7
Verification Code test character: A7Y7
Error count: 0
Actual verification code character: D993
Verification Code test character: D995
Error count: 1
Actual verification code character: F549
Verification Code test character: F5A9
Error count: 1
Actual verification code character: FMC6
Verification Code test character: FMLF
Error count: 2
Actual verification code character: R4N4
Verification Code test character: R4N4
Error count: 0

----- The result is as follows -----

Number of verification code characters tested: 40
Number of character errors: 7
Single-character recognition accuracy: 82.50%
Number of verification code images: 10
Number of verification code images with errors: 6
Probability of getting a whole verification code right: 40.00%

The accuracy for single characters is relatively high, but the accuracy for whole verification codes is still poor. The misrecognized characters are ones that are easily confused with each other, such as E and F, C and L, 5 and 3, and 4 and A, so what we can do is add more template samples to reduce the confusion.
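
The summary figures follow directly from the ten (actual, recognized) pairs listed in the first run. A small sketch of the arithmetic, with a hypothetical helper named score:

```python
def score(pairs):
    """Return (wrong characters, wrong codes, character accuracy, code accuracy)."""
    char_errors = sum(sum(a != b for a, b in zip(real, guess)) for real, guess in pairs)
    code_errors = sum(real != guess for real, guess in pairs)
    total_chars = sum(len(real) for real, _ in pairs)
    return (char_errors, code_errors,
            1 - char_errors / total_chars, 1 - code_errors / len(pairs))


# (actual, recognized) pairs copied from the first test run
first_run = [("2B4E", "2B4F"), ("4572", "4572"), ("52CY", "52LY"), ("83QG", "85QG"),
             ("9992", "9992"), ("A7Y7", "A7Y7"), ("D993", "D995"), ("F549", "F5A9"),
             ("FMC6", "FMLF"), ("R4N4", "R4N4")]
```

For first_run this gives 7 wrong characters out of 40 and 6 wrong codes out of 10, which is where the 82.50% and 40.00% figures come from. A code counts as wrong if any one of its four characters is wrong, so the whole-code accuracy is always the lower of the two.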

After adding a few dozen more template samples, we run the test again:

Actual verification code character: 2B4E
Verification Code test character: 2B4F
Error count: 1
Actual verification code character: 4572
Verification Code test character: 4572
Error count: 0
Actual verification code character: 52CY
Verification Code test character: 52LY
Error count: 1
Actual verification code character: 83QG
Verification Code test character: 83QG
Error count: 0
Actual verification code character: 9992
Verification Code test character: 9992
Error count: 0
Actual verification code character: A7Y7
Verification Code test character: A7Y7
Error count: 0
Actual verification code character: D993
Verification Code test character: D993
Error count: 0
Actual verification code character: F549
Verification Code test character: F5A9
Error count: 1
Actual verification code character: FMC6
Verification Code test character: FMLF
Error count: 2
Actual verification code character: R4N4
Verification Code test character: R4N4
Error count: 0

----- The result is as follows -----

Number of verification code characters tested: 40
Number of character errors: 5
Single-character recognition accuracy: 87.50%
Number of verification code images: 10
Number of verification code images with errors: 4
Probability of getting a whole verification code right: 60.00%

It can be seen that both the recognition accuracy of a single character and the probability of the entire verification code being correct have been improved. It is foreseeable that the accuracy rate will continue to increase as the number of templates increases.

Summary

This method scales poorly and only applies to simple verification codes. It is not up to something like the verification codes used by 12306.
