Some ideas on using Python for verification code recognition

Source: Internet
Author: User
Tags: tesseract, ocr

Reprinted from: @Xiao Wu Yi http://www.cnblogs.com/xiaowuyi


Searching Baidu for the keywords "Python" plus "verification code" turns up many articles on verification code recognition. From a quick read, the main methods fall into a few categories: one processes the image and then matches font features; another processes the image and then builds a character dictionary; a third applies an OCR module directly. Whichever method is used, the image has to be processed first, so let's try analyzing the verification code below.
I. Image processing


The main interference in this verification code is the curve running through the middle, so the first step is to remove the curve from the image. I considered two algorithms:
The first locates the head of the curve, that is, the position of the black point at x=0. It then steps x forward, looks at the black points in each column, and measures the distance between the black points in adjacent columns. If the distance is within a certain range, the point can basically be judged to lie on the curve, and in the end the points on the curve are painted white. I tried this method and the resulting image was mediocre: the curve could not be removed completely, and the strokes of the characters were easily removed along with it.
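The curve-tracing idea can be sketched roughly as follows. This is a minimal illustration, not the author's original code: the function name `remove_curve`, the `max_jump` parameter, and the representation of the image as a list of rows (0 for black, 255 for white) are all assumptions made for the example.

```python
WHITE, BLACK = 255, 0


def remove_curve(pixels, max_jump=2):
    """Trace a curve from the left edge and paint it white.

    pixels is a list of rows (pixels[y][x]) holding 0 for black and
    255 for white. A new grid is returned; the input is not modified.
    max_jump (an assumed parameter) is the largest vertical distance
    allowed between curve points in adjacent columns.
    """
    out = [row[:] for row in pixels]
    height = len(out)
    width = len(out[0]) if height else 0
    if width == 0:
        return out

    # Head of the curve: the first black point in column x = 0.
    start = [y for y in range(height) if out[y][0] == BLACK]
    if not start:
        return out
    prev_y = start[0]
    out[prev_y][0] = WHITE

    for x in range(1, width):
        # Black points in this column that are close to the previous curve point.
        candidates = [y for y in range(height)
                      if out[y][x] == BLACK and abs(y - prev_y) <= max_jump]
        if not candidates:
            continue  # gap in the curve: keep the last known position
        y = min(candidates, key=lambda c: abs(c - prev_y))
        out[y][x] = WHITE
        prev_y = y
    return out
```

The sketch also shows why the result is mediocre: it removes only one point per column, so a curve thicker than one pixel leaves residue, and wherever the curve crosses a character the nearest black point may belong to the character stroke.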
The second computes the density of points per unit area. First count the number of points in each unit area, then remove the areas whose count is below a specified number; what remains is basically the character part of the verification code. In this example, for ease of operation, a 5*5 block is taken as the unit area and the density threshold is set to 11 points. The effect after processing:
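To see why a threshold of 11 separates the curve from the characters, consider that a thin curve crossing a 5*5 block contributes only a handful of points, while a thick character stroke fills much more of the block. The sketch below illustrates this on a plain list of rows (0 for black, 255 for white); the function name `remove_sparse_blocks` and its parameters are assumptions for the example, with defaults taken from the values in the text.

```python
WHITE = 255


def remove_sparse_blocks(pixels, block=5, min_points=11):
    """Whiten every block*block tile with fewer than min_points non-white points.

    pixels is a list of rows (pixels[y][x]). A new grid is returned;
    the input is not modified. Tiles at the right and bottom edges are
    clipped to the image size.
    """
    out = [row[:] for row in pixels]
    height = len(out)
    width = len(out[0]) if height else 0
    for top in range(0, height, block):
        for left in range(0, width, block):
            ys = range(top, min(top + block, height))
            xs = range(left, min(left + block, width))
            count = sum(1 for y in ys for x in xs if out[y][x] != WHITE)
            if count < min_points:
                for y in ys:
                    for x in xs:
                        out[y][x] = WHITE
    return out


if __name__ == "__main__":
    # Synthetic 5*10 image: a 1-pixel line across both tiles,
    # plus a 3-pixel-thick stroke in the right tile only.
    image = [[WHITE] * 10 for _ in range(5)]
    for x in range(10):
        image[2][x] = 0
    for y in (1, 2, 3):
        for x in range(5, 10):
            image[y][x] = 0
    for row in remove_sparse_blocks(image):
        print("".join("#" if v == 0 else "." for v in row))
```

In the synthetic image the left tile holds 5 points and is cleared, while the right tile holds 15 and is kept. The part of the line that passes through a dense tile survives, which is a limitation of this approach.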



II. Character recognition
The method I use here is OCR recognition with pytesser, but because the characters in this kind of verification code are irregular, the accuracy of the recognition result is not very high. If any expert knows a better way, I would appreciate your advice.
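One small step that may help when working with OCR output is to clean the returned string before using it, since it can contain whitespace or stray punctuation. The sketch below is only an illustration: the helper name `clean_ocr_text` is hypothetical, and it assumes the verification code consists of exactly four letters or digits, which the article does not state.

```python
import string

# Assumption: the verification code uses only ASCII letters and digits.
ALLOWED = set(string.ascii_letters + string.digits)


def clean_ocr_text(raw, expected_length=4):
    """Keep only letters and digits from the OCR output.

    Returns the cleaned string, or None when its length differs from
    expected_length (an assumed value), so the caller can treat the
    result as a failed recognition and retry.
    """
    text = "".join(ch for ch in raw if ch in ALLOWED)
    return text if len(text) == expected_length else None
```

A caller could pass the string returned by the OCR call to this helper and request a fresh verification code whenever it returns None.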
III. Preparation and code examples
1. PIL, pytesser, and Tesseract
(1) Install PIL: http://www.pythonware.com/products/pil/
(2) pytesser: http://code.google.com/p/pytesser/. After downloading, unzip it and place it directly in the same folder as your code; it is then ready to use.
(3) Download the Tesseract OCR engine: http://code.google.com/p/tesseract-ocr/. After downloading, unzip it, find the tessdata folder, and use it to replace the tessdata folder extracted from pytesser.
2. Specific code

# encoding=utf-8
# Remove interference by computing the density of points
from PIL import Image, ImageEnhance, ImageFilter
from pytesser import *


# Count the number of non-white points in the image
def numpoint(im):
    w, h = im.size
    data = list(im.getdata())
    mumpoint = 0
    for x in range(w):
        for y in range(h):
            if data[y * w + x] != 255:  # 255 is white
                mumpoint += 1
    return mumpoint


# Compute the density of points within each 5*5 block
def pointmidu(im):
    w, h = im.size
    for y in range(0, h, 5):
        for x in range(0, w, 5):
            # Clip the block to the image so edge blocks are counted correctly
            box = (x, y, min(x + 5, w), min(y + 5, h))
            im1 = im.crop(box)
            a = numpoint(im1)
            if a < 11:  # fewer than 11 points in the block: paint the whole block white
                for i in range(box[0], box[2]):
                    for j in range(box[1], box[3]):
                        im.putpixel((i, j), 255)
    im.save(r'img.jpg')


# Recognition
def ocrend():
    image_name = "img.jpg"
    im = Image.open(image_name)
    im = im.filter(ImageFilter.MedianFilter())
    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(2)
    im = im.convert('1')
    im.save("1.tif")
    print(image_file_to_string('1.tif'))


if __name__ == '__main__':
    image_name = "1.png"
    im = Image.open(image_name)
    im = im.filter(ImageFilter.DETAIL)
    im = im.filter(ImageFilter.MedianFilter())
    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(2)
    im = im.convert('1')
    # a = remove_point(im)
    pointmidu(im)
    ocrend()

With this method the final recognition rate is really not high. I am writing it up anyway; if any expert has better ideas or practices, I hope you will enlighten me!
