Because of a need at work, I recently started learning CAPTCHA (verification code) recognition.
I chose tesseract-ocr, which is said to have once ranked among the top three OCR engines; it was developed by HP and is now open source. The latest version so far is 3.0.2.
Of course, we still need to do some preprocessing on the CAPTCHA to make it friendlier to the machine and improve the recognition rate.
The steps are basically as follows.
The first step is to convert the CAPTCHA to grayscale and binarize it.
This needs the PIL library, which can be installed with pip.
The code is as follows:
def binarization(image):
    # Convert the input PIL image to grayscale
    imgry = image.convert('L')
    # Binarize; the threshold can be adjusted to suit the CAPTCHA
    threshold = 128
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)
    out = imgry.point(table, '1')
    return out
Then comes denoising. The CAPTCHAs I studied basically do not need denoising, so I skip it; if yours do, please Google it yourself.
Next is tilt correction, for which a rotation-search (rotate-and-measure) algorithm is recommended.
The principle is to rotate the picture from -30 degrees to 30 degrees; the angle that gives the largest width is generally the upright one. (That is what is said online. I tried it: it works for most characters, but for a few, such as C, it does not seem to work well.) A rough sketch is given below.
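For reference, here is a minimal sketch of that rotation search using PIL. The way the width is measured, the threshold, and the fillcolor argument (which needs a reasonably recent Pillow) are my assumptions, not the original code:

def deskew(imgry, angles=range(-30, 31), threshold=128):
    # Try every angle from -30 to 30 degrees on the grayscale image and keep
    # the one whose dark pixels span the widest horizontal extent, following
    # the width heuristic described above.
    best_angle, best_width = 0, -1
    for angle in angles:
        rotated = imgry.rotate(angle, expand=True, fillcolor=255)
        px = rotated.load()
        w, h = rotated.size
        cols = [x for x in range(w)
                if any(px[x, y] < threshold for y in range(h))]
        width = (cols[-1] - cols[0] + 1) if cols else 0
        if width > best_width:
            best_width, best_angle = width, angle
    return imgry.rotate(best_angle, expand=True, fillcolor=255)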
Normalization
The CAPTCHA strokes can be thinned with an erosion algorithm.
For the erosion algorithm itself, please Google it; a rough PIL-based sketch is given below.
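Here is one minimal way to erode black-on-white strokes with PIL's rank filters; this is my assumption of how to do it, not the original code, and it expects a grayscale ('L') image (convert('L') first if you are working with the '1' mode output of the binarization step):

from PIL import ImageFilter

def erode(imgry, iterations=1):
    # A MaxFilter replaces each pixel with the brightest value in its 3x3
    # neighbourhood, so dark strokes shrink by one pixel per iteration.
    out = imgry
    for _ in range(iterations):
        out = out.filter(ImageFilter.MaxFilter(3))
    return out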
The second step is to segment (cut) the CAPTCHA.
Different CAPTCHAs call for different segmentation algorithms.
So far, I have only studied the following kinds.
One
Vertical pixel histogram
The principle is to count the black pixels in each column (each x): start a cut where the count rises above a certain value and end it where the count falls below that value. It suits CAPTCHAs whose characters are separated or have large gaps; for characters that stick together the result is poor. A rough sketch is given below.
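A minimal sketch of this idea, assuming a binarized ('1' mode) PIL image where black pixels read as 0; the threshold value is just a placeholder of mine:

def vertical_projection_split(binary_img, threshold=1):
    px = binary_img.load()
    w, h = binary_img.size
    # Number of black pixels in each column
    counts = [sum(1 for y in range(h) if px[x, y] == 0) for x in range(w)]
    pieces, start = [], None
    for x, c in enumerate(counts):
        if c >= threshold and start is None:
            start = x                                   # entering a character
        elif c < threshold and start is not None:
            pieces.append(binary_img.crop((start, 0, x, h)))
            start = None                                # leaving a character
    if start is not None:                               # character touches the right edge
        pieces.append(binary_img.crop((start, 0, w, h)))
    return pieces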
Two
Mean division method
The principle is to find the x and y extents that actually contain black pixels (ignoring empty rows and columns), crop to that region, and then divide it evenly into n equal parts. It suits CAPTCHAs with a fixed size and number of characters, and handles touching characters better than the previous method. A rough sketch is given below.
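A minimal sketch of the mean division, assuming the number of characters n is known and the image is a binarized black-on-white PIL image; the details are my assumptions:

def mean_split(binary_img, n):
    px = binary_img.load()
    w, h = binary_img.size
    # Columns that contain at least one black pixel
    cols = [x for x in range(w) if any(px[x, y] == 0 for y in range(h))]
    if not cols:
        return []
    left, right = cols[0], cols[-1] + 1
    step = (right - left) / float(n)
    # Cut the occupied region into n slices of equal width
    return [binary_img.crop((int(left + i * step), 0,
                             int(left + (i + 1) * step), h))
            for i in range(n)]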
Three
Trough division method
The principle is similar to the vertical pixel histogram: record the number of black pixels in each column, find the local minima (the troughs), and cut there. It suits CAPTCHAs whose characters are separated or have large gaps, and handles touching characters better than the vertical pixel histogram. A rough sketch is given below.
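A minimal sketch of the trough method under the same assumptions; picking exactly n-1 valleys and the tie-breaking between equally low valleys are my simplifications:

def trough_split(binary_img, n):
    px = binary_img.load()
    w, h = binary_img.size
    counts = [sum(1 for y in range(h) if px[x, y] == 0) for x in range(w)]
    # Local minima of the column histogram (troughs between characters)
    valleys = [x for x in range(1, w - 1)
               if counts[x] <= counts[x - 1] and counts[x] <= counts[x + 1]]
    # Keep the n-1 valleys with the fewest black pixels as cut positions
    cuts = sorted(sorted(valleys, key=lambda x: counts[x])[:n - 1])
    bounds = [0] + cuts + [w]
    return [binary_img.crop((bounds[i], 0, bounds[i + 1], h))
            for i in range(len(bounds) - 1)]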
Four
Drop-fall (drip) algorithm
The principle is to simulate a water drop flowing down through the image, record the path it takes, and then cut along that path. The important point is choosing the starting point; it works well for characters that stick together. A rough sketch is given below.
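A very rough sketch of tracing one drop's path; the neighbour priority, the oscillation guard, and the 1 = black / 0 = white matrix convention are all my assumptions, and choosing the starting columns is left out:

def drop_fall_path(binary, start_x):
    # binary: 2D list where 1 = black (stroke) and 0 = white (background)
    h, w = len(binary), len(binary[0])
    x, y = start_x, 0
    path = [(x, y)]
    prev = None
    while y < h - 1:
        # Candidate moves in priority order
        candidates = [('down', x, y + 1),
                      ('down-left', x - 1, y + 1),
                      ('down-right', x + 1, y + 1),
                      ('left', x - 1, y),
                      ('right', x + 1, y)]
        moved = False
        for name, nx, ny in candidates:
            if 0 <= nx < w and binary[ny][nx] == 0:
                if (name, prev) in (('left', 'right'), ('right', 'left')):
                    continue                    # avoid oscillating sideways
                x, y, prev = nx, ny, name
                moved = True
                break
        if not moved:
            y += 1                              # all neighbours black: penetrate downwards
            prev = 'down'
        path.append((x, y))
    return path

Cutting along the recorded path then separates two touching characters.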
I will post my code for the four algorithms above in another post.
The third step is to recognize the CAPTCHA.
Finally, the main event.
You need to import pytesser and call image_to_string(image) to recognize the image.
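A minimal usage sketch (the module layout of the old pytesser package varies between downloads, so the import line may need adjusting; pytesseract.image_to_string works the same way on newer setups, and the file name is just a placeholder):

from PIL import Image
from pytesser import image_to_string   # or: from pytesseract import image_to_string

img = Image.open('captcha.png')         # placeholder file name
img = binarization(img)                 # the binarization function from the first step
print(image_to_string(img))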
But the recognition rate is really poor.
So we need to train it.
The following is a brief introduction to how the training is done.
First, download tesseract-ocr itself; without it nothing can be recognized properly.
Collect as many CAPTCHA samples as possible, preferably already binarized or cut up according to the steps above.
The following is excerpted from http://www.cnblogs.com/wolfray/p/5547267.html
For convenience, name the tif files in the format [lang].[fontname].exp[num].tif
lang is the language
fontname is the font name
For example, suppose we train a custom library whose language is ec and whose font name is ufont.
Then we rename the tif file to ec.ufont.exp0.tif
Generate a .box file:
tesseract ec.ufont.exp0.tif ec.ufont.exp0 batch.nochop makebox
Or, to create the .box file using an already trained font:
tesseract ec.ufont.exp0.tif ec.ufont.exp0 -l ufont batch.nochop makebox
1. Create the character feature file (.tr)
tesseract ec.ufont.exp0.tif ec.ufont.exp0 nobatch box.train
This step produces an ec.ufont.exp0.tr file and an ec.ufont.exp0.txt file; the txt file does not seem to be needed.
2. Compute the character set (generate the unicharset file)
unicharset_extractor ec.ufont.exp0.box
3. Define the font properties file
For tesseract-ocr 3.01 and above, you need to create a font properties file before training.
Manually create a file named font_properties.txt
with content such as: ufont 0 0 0 0 0
Note: the name must match the font name used in the training file names; the five values that follow are all 0 here, meaning the font is not italic, bold, and so on.
4. Cluster the character features
1) shapeclustering -F font_properties.txt -U unicharset ec.ufont.exp0.tr
Note: if font_properties.txt is not supplied, an error may occur.
2) mftraining -F font_properties.txt -U unicharset -O ufont.unicharset ec.ufont.exp0.tr
This uses the unicharset produced in the previous step to generate the character set file for the new language (ufont.unicharset). It also produces the shape prototype file inttemp and the per-character feature file pffmtable. The most important one is inttemp, which contains the shape prototypes of all the characters to be trained.
3) cntraining ec.ufont.exp0.tr
This step produces the character shape normalization feature file normproto.
The shapeclustering step is not strictly necessary; if it is skipped, it will be done automatically during mftraining.
5. Rename the files
Prefix the five files unicharset, inttemp, pffmtable, shapetable and normproto with ufont. (e.g. ufont.unicharset, ufont.inttemp, and so on).
6. Run combine_tessdata ufont.
Then put the resulting ufont.traineddata into the tessdata directory.
7. Testing
In the output of combine_tessdata, make sure that the values for entries 1, 3, 4 and 5 are not -1; if so, the new language data has been generated.
tesseract ec.ufont.exp0.tif papapa -l ufont
It is also suggested that tesseract be run with multiple language data combined, so that the newly trained data can be used together with the original data, for example: tesseract input.tif output -l eng+newfont.
cntraining and mftraining can only take up to 32 .tr files, so for the same font you have to cat together all of that font's files from the different languages, handling each font separately, to keep the total within the 32-file limit. The .tr and .box files must be fed to cntraining/mftraining and to unicharset_extractor separately for each font, but in the same order. You could write a program to do this and to pick out the common characters in the character set table; that would make things easier.
When writing the batch (.bat) commands for all this, the fill-down feature in Excel is a flexible way to generate the repetitive lines.
Finally, thanks to the many experts whose answers and notes online were a great help to my learning. Thank you.