Some ideas on recognizing verification codes (CAPTCHAs) in Python

Source: Internet
Author: User
Tags: tesseract, ocr
Searching Baidu for "Python" plus "verification code" turns up many articles on verification code recognition. From a quick look, the methods fall into a few categories: processing the image and then matching against font features; processing the image and then building a character dictionary; and passing the image directly to an OCR module. Whichever method is used, the image has to be processed first, so I tried analyzing the following verification code.

First, image processing

The main interference in this verification code is the curve running through the middle, so the first step is to remove that curve from the picture. I considered two algorithms:
The first finds the head of the curve, that is, the position of the black point at x=0. It then advances x, looks at the black points in each column, and measures the distance from the previous curve point; if the distance is within a certain range, the point can be judged to lie on the curve, and every such point is finally painted white. I tried this method and the result was mediocre: the curve could not be removed completely, and strokes of the characters were easily removed along with it.
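The curve-tracing idea can be sketched without any imaging library. This is only an illustration under assumptions, not the code the author ran: the image is assumed to be a list of rows of 0 (black) and 255 (white) pixels, and the function name `remove_curve` and the parameter `max_jump` are hypothetical.

```python
WHITE, BLACK = 255, 0


def remove_curve(grid, max_jump=2):
    """Trace a curve starting in column 0 and paint its points white.

    grid is a list of rows; each pixel is 0 (black) or 255 (white).
    max_jump (a hypothetical tuning parameter) is the largest vertical
    distance allowed between curve points in adjacent columns.
    Returns a new grid; the input is left unchanged.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    out = [row[:] for row in grid]

    # Head of the curve: the first black pixel found in column 0.
    prev_y = next((y for y in range(height) if grid[y][0] == BLACK), None)
    if prev_y is None:
        return out  # nothing starts at x=0, so there is nothing to trace
    out[prev_y][0] = WHITE

    for x in range(1, width):
        # Black pixels in this column that are close to the previous point.
        candidates = [y for y in range(height)
                      if grid[y][x] == BLACK and abs(y - prev_y) <= max_jump]
        if not candidates:
            continue  # gap in the curve: keep the last known height
        # Follow the nearest candidate and erase it.
        prev_y = min(candidates, key=lambda y: abs(y - prev_y))
        out[prev_y][x] = WHITE
    return out
```

Because only one pixel per column is erased, a curve thicker than one pixel leaves remnants, and wherever the curve crosses a character the erased pixel may belong to the character instead. Both effects are consistent with the mediocre result described above.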
The second calculates the density of points per unit area. First count the points in each unit area, then remove the areas whose count is below a specified number; what remains is essentially the characters of the verification code. For ease of operation, I took 5*5 as the unit area and, after tuning, set the density threshold to 11 points per unit area.
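To see why the density test separates a thin curve from character strokes, here is a small sketch on a plain pixel grid. It assumes the same representation as a binarized image (0 for black, 255 for white); the 5*5 block and the threshold of 11 come from the text, while the function name `remove_sparse_blocks` is hypothetical.

```python
WHITE, BLACK = 255, 0


def remove_sparse_blocks(grid, block=5, min_points=11):
    """Whiten each block x block tile with fewer than min_points dark pixels.

    grid is a list of rows; each pixel is 0 (black) or 255 (white).
    Returns a new grid; the input is left unchanged.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    out = [row[:] for row in grid]
    for top in range(0, height, block):
        for left in range(0, width, block):
            # Clamp the tile so partial tiles at the edges stay in bounds.
            rows = range(top, min(top + block, height))
            cols = range(left, min(left + block, width))
            count = sum(1 for y in rows for x in cols if grid[y][x] != WHITE)
            if count < min_points:
                for y in rows:
                    for x in cols:
                        out[y][x] = WHITE
    return out


# A 5x10 image: a one-pixel-thick line on the left, a thick stroke on the right.
demo = [[WHITE] * 10 for _ in range(5)]
for x in range(5):
    demo[2][x] = BLACK          # thin line: 5 dark pixels in its tile
for y in range(5):
    for x in range(6, 9):
        demo[y][x] = BLACK      # thick stroke: 15 dark pixels in its tile
cleaned = remove_sparse_blocks(demo)
```

In this toy image the left tile holds only 5 dark pixels and is wiped, while the right tile holds 15 and survives. A thin stroke of a real character that falls into a sparse tile would be wiped in the same way, so the threshold needs tuning per verification code.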

Second, character recognition

Here I use Pytesser for OCR recognition, but because the characters in this kind of verification code are irregular, the recognition accuracy is not very high. If any expert knows a better approach, I would appreciate the advice.

Third, preparation work and code examples

1. PIL, Pytesser, Tesseract

(1) Install PIL. Download address: http://www.pythonware.com/products/pil/
(2) Pytesser. Download address: http://code.google.com/p/pytesser/. After downloading, extract it directly into the same folder as your code and it is ready to use.
(3) Tesseract OCR engine. Download address: http://code.google.com/p/tesseract-ocr/. After downloading, unzip it, find the tessdata folder, and use it to replace the tessdata folder extracted from Pytesser.

2. Specific code

# encoding=utf-8
# Remove noise using the density of points, then run OCR
from PIL import Image, ImageEnhance, ImageFilter
from pytesser import *


# Count the number of non-white points in an image
def numpoint(im):
    w, h = im.size
    data = list(im.getdata())
    mumpoint = 0
    for x in range(w):
        for y in range(h):
            if data[y * w + x] != 255:  # 255 is white
                mumpoint += 1
    return mumpoint


# Compute the density of points within each 5*5 range
def pointmidu(im):
    w, h = im.size
    for y in range(0, h, 5):
        for x in range(0, w, 5):
            # Clamp the box to the image so edge blocks stay in bounds
            box = (x, y, min(x + 5, w), min(y + 5, h))
            im1 = im.crop(box)
            a = numpoint(im1)
            if a < 11:  # fewer than 11 points in the 5*5 range: paint it all white
                for i in range(box[0], box[2]):
                    for j in range(box[1], box[3]):
                        im.putpixel((i, j), 255)
    im.save(r'img.jpg')


def ocrend():  # recognition
    image_name = "img.jpg"
    im = Image.open(image_name)
    im = im.filter(ImageFilter.MedianFilter())
    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(2)
    im = im.convert('1')
    im.save("1.tif")
    print(image_file_to_string('1.tif'))


if __name__ == '__main__':
    image_name = "1.png"
    im = Image.open(image_name)
    im = im.filter(ImageFilter.DETAIL)
    im = im.filter(ImageFilter.MedianFilter())
    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(2)
    im = im.convert('1')
    # a = remove_point(im)
    pointmidu(im)
    ocrend()

The final recognition rate of my method is really not high. I am writing it up anyway; if any expert has better ideas or practices, I hope you will enlighten me!
