Reprinted from: @Xiao Wu Yi http://www.cnblogs.com/xiaowuyi
Searching Baidu for "Python" plus "verification code" turns up many articles on verification code (CAPTCHA) recognition. From a quick read, the approaches fall into three groups: process the image and then match it against font features; process the image and then build a dictionary of character samples; or pass the image directly to an OCR module. Whichever method is used, the image has to be processed first, so this post starts by analyzing one particular verification code.
I. Image processing
The main source of interference in this verification code is the curve running through the middle, so the first step is to remove that curve from the picture. I considered two algorithms:
The first locates the head of the curve, that is, the position of the black point at x=0. It then steps x to the right, looks at the black points in each column, and measures the distance between the black points in adjacent columns. If the distance is within a certain range, the point is judged to lie on the curve, and in the end every point on the curve is painted white. I tried this method and the result was mediocre: the curve is not removed completely, and strokes of the characters are easily removed along with it.
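The curve-tracing idea can be sketched roughly as follows. This is not the author's code (the post does not include it); it is a minimal illustration that works on a plain list of rows instead of a PIL image. The function name remove_curve, the max_jump tolerance, and the assumptions that the curve is one pixel thick and that the first black pixel in column 0 belongs to the curve are all hypothetical.

```python
WHITE, BLACK = 255, 0


def remove_curve(grid, max_jump=2):
    """Trace a thin curve from left to right and paint it white, in place.

    grid is a list of rows; grid[y][x] is 0 (black) or 255 (white).
    max_jump (assumed parameter) is the largest vertical distance allowed
    between curve points in adjacent columns.
    Returns the list of (x, y) points that were erased.
    """
    height, width = len(grid), len(grid[0])

    # Curve head: assume the first black pixel in column x = 0 is on the curve.
    prev_y = next((y for y in range(height) if grid[y][0] == BLACK), None)
    if prev_y is None:
        return []

    erased = []
    for x in range(width):
        # Black pixels in this column that are close enough to the last point.
        candidates = [y for y in range(height)
                      if grid[y][x] == BLACK and abs(y - prev_y) <= max_jump]
        if not candidates:
            continue  # gap in the curve: keep the last known height
        y = min(candidates, key=lambda c: abs(c - prev_y))
        grid[y][x] = WHITE
        erased.append((x, y))
        prev_y = y
    return erased
```

Where the curve crosses a character, the nearest black pixel may belong to the character rather than the curve, which is consistent with the stroke loss described above.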
The second computes the density of points per unit area. Count the points in each unit area and clear every area that contains fewer than a specified number of points; what remains is essentially the characters of the verification code. For ease of implementation, a 5*5 block is used as the unit area and the density threshold is set to 11 points per block. [Image: the verification code after processing]
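To see why a thin curve disappears while solid strokes survive, the block test can be tried on a small synthetic grid without any image library. The sketch below is an illustration under assumptions, not the post's implementation: density_filter, block and min_points are hypothetical names, and the grid is a plain list of rows with 0 for black and 255 for white.

```python
WHITE, BLACK = 255, 0


def density_filter(grid, block=5, min_points=11):
    """Paint white every block that holds fewer than min_points dark pixels.

    grid is a list of rows; grid[y][x] is 0 (black) or 255 (white).
    Blocks at the right and bottom edges are clipped to the grid.
    """
    height, width = len(grid), len(grid[0])
    for top in range(0, height, block):
        for left in range(0, width, block):
            rows = range(top, min(top + block, height))
            cols = range(left, min(left + block, width))
            count = sum(1 for y in rows for x in cols if grid[y][x] != WHITE)
            if count < min_points:
                for y in rows:
                    for x in cols:
                        grid[y][x] = WHITE
    return grid


# A 10*10 grid with a solid 4*4 "stroke" and a one-pixel horizontal "curve".
demo = [[WHITE] * 10 for _ in range(10)]
for y in range(4):
    for x in range(4):
        demo[y][x] = BLACK
for x in range(10):
    demo[7][x] = BLACK
density_filter(demo)
```

A one-pixel line contributes at most 5 points to a 5*5 block when it runs straight across, below the threshold of 11, while the solid 4*4 patch contributes 16 and is kept. The threshold would need tuning for thicker curves or thinner fonts.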
II. Character recognition
Here I use pytesser for OCR recognition, but because the characters in this kind of verification code are irregular, the recognition accuracy is not very high. If any expert knows a better approach, I would welcome the advice.
III. Preparation and code example
1. PIL, pytesser, Tesseract
(1) Install PIL: http://www.pythonware.com/products/pil/
(2) pytesser: http://code.google.com/p/pytesser/. After downloading, unzip it and place it directly in the same folder as the code; it is then ready to use.
(3) Tesseract OCR engine download: http://code.google.com/p/tesseract-ocr/. After downloading, unzip it, find the tessdata folder, and use it to replace the tessdata folder extracted from pytesser.
2. The code
#encoding=utf-8
# Remove noise by computing the density of points
from PIL import Image, ImageEnhance, ImageFilter
from pytesser import image_file_to_string


# Count the non-white points in an image
def numpoint(im):
    w, h = im.size
    data = list(im.getdata())
    mumpoint = 0
    for x in range(w):
        for y in range(h):
            if data[y * w + x] != 255:  # 255 is white
                mumpoint += 1
    return mumpoint


# Compute the density of points in each 5*5 block
def pointmidu(im):
    w, h = im.size
    for y in range(0, h, 5):
        for x in range(0, w, 5):
            # Clamp the block to the image so edge blocks stay in bounds
            x2, y2 = min(x + 5, w), min(y + 5, h)
            im1 = im.crop((x, y, x2, y2))
            a = numpoint(im1)
            if a < 11:  # fewer than 11 points in the block: paint it all white
                for i in range(x, x2):
                    for j in range(y, y2):
                        im.putpixel((i, j), 255)
    im.save(r'img.jpg')


# Run OCR on the cleaned image
def ocrend():
    image_name = "img.jpg"
    im = Image.open(image_name)
    im = im.filter(ImageFilter.MedianFilter())
    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(2)
    im = im.convert('1')
    im.save("1.tif")
    print(image_file_to_string('1.tif'))


if __name__ == '__main__':
    image_name = "1.png"
    im = Image.open(image_name)
    im = im.filter(ImageFilter.DETAIL)
    im = im.filter(ImageFilter.MedianFilter())
    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(2)
    im = im.convert('1')
    # a = remove_point(im)
    pointmidu(im)
    ocrend()
With this method the final recognition rate is honestly not high. I am writing it up anyway; if anyone has better ideas or practices, I would be glad to hear them!
Some ideas on using Python for verification code recognition