Some ideas on using Python for verification code recognition
Searching Baidu for the keywords "Python" plus "verification code" turns up many articles on verification code recognition. After a quick look, the main approaches fall into a few categories: one processes the image and then matches characters against font features; another processes the image and then builds a dictionary of characters; a third uses an OCR module directly. Whichever method is used, the image has to be processed first, so let's start by analyzing the following verification code.
I. Image processing
The main interference in this verification code is the curve running through the middle, so the first step is to remove that curve from the image. I considered two algorithms:
The first finds the head of the curve, that is, the position of the black point at x=0. It then steps x forward, looks at the positions of the black points in each column, and measures the distance between the black points in adjacent columns; if the distance is within a certain range, the point can basically be taken as a point on the curve. Finally, the points on the curve are painted white. I tried this method and the resulting image was mediocre: the curve could not be removed completely, and strokes of the characters tended to be removed as well.
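The first algorithm can be sketched roughly as follows. This is only an illustration under several assumptions, not the code actually used for the experiment: the image is represented as a plain nested list (0 for black, 255 for white) instead of a PIL image so the sketch has no dependencies, the names remove_curve and max_jump are hypothetical, and picking the nearest black point in each column is one possible reading of "distance within a certain range".

```python
BLACK, WHITE = 0, 255

def remove_curve(grid, max_jump=2):
    """Trace a curve from the left edge and paint it white.

    grid is a list of rows, each row a list of pixel values
    (0 = black, 255 = white). The grid is modified in place and the
    list of removed (x, y) points is returned.
    """
    height = len(grid)
    width = len(grid[0])

    # Curve head: the first black pixel in column x = 0.
    heads = [y for y in range(height) if grid[y][0] == BLACK]
    if not heads:
        return []  # no curve touches the left edge
    prev_y = heads[0]
    grid[prev_y][0] = WHITE
    removed = [(0, prev_y)]

    for x in range(1, width):
        # Black pixels in this column that lie close to the last curve point.
        candidates = [y for y in range(height)
                      if grid[y][x] == BLACK and abs(y - prev_y) <= max_jump]
        if not candidates:
            break  # the trace is lost, so the rest of the curve stays
        prev_y = min(candidates, key=lambda y: abs(y - prev_y))
        grid[prev_y][x] = WHITE
        removed.append((x, prev_y))
    return removed

# Demo: a 5-row, 6-column image with a one-pixel curve and one
# "character" pixel that is too far from the curve to be traced.
demo = [[WHITE] * 6 for _ in range(5)]
curve = [(0, 2), (1, 2), (2, 3), (3, 3), (4, 2), (5, 1)]
for x, y in curve:
    demo[y][x] = BLACK
demo[0][3] = BLACK
removed = remove_curve(demo)
```

The sketch also makes the weaknesses easy to see: where the curve crosses a character, a stroke pixel near the curve can be picked instead of the curve itself, and a gap wider than max_jump ends the trace early.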
The second computes the density of points per unit area. First count the points in each unit area, then remove every area whose count is below a specified number; what remains is basically the character part of the verification code. Here, for ease of operation, a 5*5 block is taken as the unit area and the density threshold is set to 11 points per block. The result after processing:
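To make the density rule concrete, here is a minimal sketch on a tiny image. It assumes the image is a plain nested list (0 for black, 255 for white) rather than a PIL image, and the names density_filter, block and min_points are hypothetical; only the 5*5 block size and the threshold of 11 come from the text above.

```python
BLACK, WHITE = 0, 255

def density_filter(grid, block=5, min_points=11):
    """Whiten every block x block tile holding fewer than min_points dark pixels."""
    height = len(grid)
    width = len(grid[0])
    for top in range(0, height, block):
        for left in range(0, width, block):
            # Clamp the tile so tiles on the right and bottom edges stay in bounds.
            bottom = min(top + block, height)
            right = min(left + block, width)
            count = sum(1
                        for y in range(top, bottom)
                        for x in range(left, right)
                        if grid[y][x] != WHITE)
            if count < min_points:
                for y in range(top, bottom):
                    for x in range(left, right):
                        grid[y][x] = WHITE
    return grid

# Demo: a 5x10 image made of two 5x5 tiles.
demo = [[WHITE] * 10 for _ in range(5)]
for x in range(5):
    demo[2][x] = BLACK        # left tile: a thin line, 5 dark pixels
for y in range(1, 4):
    for x in range(5, 10):
        demo[y][x] = BLACK    # right tile: a dense 3x5 blob, 15 dark pixels
density_filter(demo)
```

After the call the thin line in the left tile is gone, while the dense blob in the right tile is untouched. Note that any curve pixels passing through a dense tile are kept as well, since the rule only looks at the count per tile.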
II. Character recognition
The method I use here is OCR with pytesser, but because the characters in this kind of verification code are irregular, the recognition accuracy is not very high. If any expert knows a better way, I would appreciate the advice.
III. Preparation and code examples
1. PIL, pytesser, Tesseract
(1) Install PIL: http://www.pythonware.com/products/pil/
(2) pytesser: http://code.google.com/p/pytesser/. After downloading, unzip it and place it in the same folder as your code; it can then be used directly.
(3) Tesseract OCR engine download: http://code.google.com/p/tesseract-ocr/. After downloading, unzip it, find the tessdata folder, and use it to replace the tessdata folder extracted from pytesser.
2. The code
# encoding=utf-8
# Clean the image using point density, then run OCR
# Classic PIL imports; with Pillow use: from PIL import Image, ImageEnhance, ImageFilter
import Image, ImageEnhance, ImageFilter
from pytesser import *

# Count the non-white points in an image
def numpoint(im):
    w, h = im.size
    data = list(im.getdata())
    count = 0
    for x in range(w):
        for y in range(h):
            if data[y * w + x] != 255:  # 255 is white
                count += 1
    return count

# Compute the point density in each 5*5 block
def pointmidu(im):
    w, h = im.size
    for y in range(0, h, 5):
        for x in range(0, w, 5):
            # Clamp the block to the image so edge blocks stay in bounds
            x2, y2 = min(x + 5, w), min(y + 5, h)
            im1 = im.crop((x, y, x2, y2))
            a = numpoint(im1)
            if a < 11:  # fewer than 11 points in the 5*5 block: paint the block white
                for i in range(x, x2):
                    for j in range(y, y2):
                        im.putpixel((i, j), 255)
    im.save(r'img.jpg')

# Recognition
def ocrend():
    image_name = "img.jpg"
    im = Image.open(image_name)
    im = im.filter(ImageFilter.MedianFilter())
    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(2)
    im = im.convert('1')
    im.save("1.tif")
    print(image_file_to_string('1.tif'))

if __name__ == '__main__':
    image_name = "1.png"
    im = Image.open(image_name)
    im = im.filter(ImageFilter.DETAIL)
    im = im.filter(ImageFilter.MedianFilter())
    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(2)
    # The original listing is cut off inside the next call. The mode '1'
    # and the two calls after it are an assumed completion based on the
    # functions defined above.
    im = im.convert('1')
    pointmidu(im)
    ocrend()
With this method the final recognition rate is really not high. I am writing it up anyway; if any expert has good ideas or practices, I hope you will share them!