Some ideas on using Python for verification code recognition

Last Update:2015-02-05 Source: Internet

Author: User

Tags tesseract ocr

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Reprint: @ Xiao Wu Yi Http://www.cnblogs.com/xiaowuyi

Python Plus "Verification Code" for keywords in Baidu Search, you can find a lot of verification code identification of the article. I have a general look, the main methods have several categories: one is through the processing of pictures, and then use the Font feature matching method, a kind of image processing after the establishment of character dictionary, there is a kind of direct use of OCR module to identify. Regardless of the method, you need to first deal with the image, so try to analyze the following verification code.
First, image processing

The main factor in this verification code is the middle curve, first consider removing the curve from the picture. Two kinds of algorithms are considered:
The first is to take the position of the curve head first, that is, the position of the black point when x=0. Then move back the X value, observe the location of the black dots under each X, and determine the distance between the two adjacent black dots, if the distance is within a certain range, you can basically judge the point is the point on the curve, and finally the point on the curve is painted white. Try this method, the results obtained by the picture effect is very general, the curve can not be completely removed, and the capacity of the character line removal.
The second consideration is to calculate the density of points within the unit area. So first calculate the number of points within the unit area, the unit area within the number of points less than a specified number of areas to remove, the remaining part is basically the verification code character part. In this case, for ease of operation, the 5*5 is taken as the unit range and the standard density of the points within the unit area is adjusted to 11. The effect after processing:

Second, character verification
The method I use here is to use Pytesser for OCR recognition, but because of the irregularity of this kind of verification code character, the accuracy of the verification result is not very high. Specifically which Daniel, there is any good way, hope to give advice.
III. Preparation Work and code examples
1, PIL, Pytesser, tesseract
(1) Installation pil::http://www.pythonware.com/products/pil/
(2) pytesser::http://code.google.com/p/pytesser/, after the download decompression directly placed in the same Code folder, you can use.
(3) Tesseract OCR engine Download: http://code.google.com/p/tesseract-ocr/, unzip after download, find Tessdata folder, Use it to replace the Pytesser extracted Tessdata folder.
2. Specific code

#encoding =utf-8## #利用点的密度计算import image,imageenhance,imagefilter,imagedrawimport sysfrom pytesser Import *# Count the number of points within the range def numpoint (IM): w,h = im.size data = List (Im.getdata ()) mumpoint=0 for X in range (W): For Y in range (h): if data[y*w + x]!=255: #255是白色 mumpoint+=1 return Mumpoint #计            Calculates the density of points within the 5*5 range def pointmidu (IM): W,h = Im.size p=[] for y in range (0,h,5): For x in Range (0,w,5): box = (x, y, x+5,y+5) im1=im.crop (box) a=numpoint (IM1) if a<11:# #如果5 less than 11 points in the range, then the part is fully                Part to White. For I in Range (x,x+5): for J in Range (y,y+5): Im.putpixel ((i,j), 255) Im.save (R ' img.jpg ') def ocrend (): # #识别 image_name = "img.jpg" im = Image.open (image_name) im = Im.filter (imagefilt Er. Medianfilter ()) enhancer = Imageenhance.contrast (im) im = enhancer.enhance (2) im = Im.convert (' 1 ') Im.save ("1. TIF ") printImage_file_to_string (' 1.tif ') if __name__== ' __main__ ': image_name = "1.png" im = Image.open (image_name) IM     = Im.filter (imagefilter.detail) im = Im.filter (imagefilter.medianfilter ()) enhancer = Imageenhance.contrast (IM) im = Enhancer.enhance (2) im = Im.convert (' 1 ') # #a =remove_point (IM) Pointmidu (IM) ocrend ()

I this method, the final recognition rate is really not high, write out, which Master has good ideas or practices, hope to enlighten!

Some ideas on using Python for verification code recognition

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More