Using jTessBoxEditor to train Tesseract 3.02.02 on samples and improve the verification code recognition rate
1. Background
The previous article briefly introduced the installation and basic use of the Tesseract OCR engine. It mentioned using the -l eng parameter to restrict the language library, which can improve both recognition accuracy and recognition efficiency.
This article trains Tesseract on verification code samples from one website to build a custom language library and raise the recognition rate for that site's verification codes.
2. Preparing the tools
Tesseract has an official description of the sample training process at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#run-tesseract-for-training, but it is entirely in English. Personally, I think that page is best used for looking up details when a problem comes up; reading the whole text in English is still difficult for many people.
There are two ways to train: 1) use a third-party tool, or 2) work entirely on the command line. Third-party tools are mainly listed at https://github.com/tesseract-ocr/tesseract/wiki/AddOns. This article uses jTessBoxEditor, so download it to your local machine first.
Note that this tool runs on a Java virtual machine, so a Java virtual machine must also be downloaded and installed: http://download.oracle.com/otn-pub/java/jdk/8u91-b14/jdk-8u91-windows-x64.exe?AuthParam=1463733597_1161f2d895aa7606ed260b43b83d5f86.
To summarize, the tools are:
1. Java virtual machine, version 1.8.0_91, 64-bit (from the Oracle website)
2. jTessBoxEditor, version 1.5 (from the jTessBoxEditor official website). Its interface looks like this when running:
[Screenshot: jTessBoxEditor interface]
3. Usage example
1) Prepare the sample pictures
Refresh the website's verification code repeatedly, either by hand or with a small program, and save 101 sample files named 1.png, 2.png, ..., 101.png.
The verification codes have these characteristics: a) a fixed length of 4 characters; b) digits only; c) background interference, though fairly simple; d) a red font.
To improve the recognition rate, the samples were first converted to grayscale and then all saved as TIF files named 1.tif, 2.tif, ..., 101.tif, stored together under D:\python\lnypcg.
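The grayscale conversion and TIF export can be scripted. The following is only a minimal sketch, assuming the Pillow library is installed; the function names are illustrative and are not part of the original workflow, which does not say which tool was used for this step.

```python
from pathlib import Path


def tif_path_for(png_path):
    """Map e.g. 1.png to 1.tif in the same directory."""
    return Path(png_path).with_suffix(".tif")


def convert_to_gray_tif(png_path):
    """Convert one sample to 8-bit grayscale and save it as a TIFF file."""
    # Assumption: Pillow is installed (pip install Pillow). It is imported
    # lazily so that tif_path_for() can be used without it.
    from PIL import Image

    out_path = tif_path_for(png_path)
    with Image.open(png_path) as img:
        img.convert("L").save(out_path, format="TIFF")
    return out_path


def convert_all(sample_dir):
    """Convert every *.png file in sample_dir; return the TIFF paths written."""
    return [convert_to_gray_tif(p) for p in sorted(Path(sample_dir).glob("*.png"))]


# Example call, using the directory from this article:
# convert_all(r"D:\python\lnypcg")
```

Mode "L" is Pillow's 8-bit grayscale mode. Further cleanup (for example thresholding away the background) is a separate decision and is not shown here.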
2) Merge the sample pictures
Open jTessBoxEditor, click Tools -> Merge TIFF, shift-select the 101 TIF files mentioned above, and save the merged TIF in the new directory D:\python\lnypcg\new under the name langyp.fontyp.exp0.tif.
Note: langyp is the language name I chose and fontyp is the font name I chose. Both are used in later steps, and you can change them to any names you like.
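For readers who prefer a script to the GUI, the merge can be approximated with Pillow's multi-page TIFF support. This is a sketch under assumptions: Pillow is installed, and the helper names are hypothetical. It is not what jTessBoxEditor does internally. The files are sorted by the number in their names so that 2.tif comes before 10.tif.

```python
import re
from pathlib import Path


def numeric_key(path):
    """Sort key based on the first number in the file name (2.tif < 10.tif)."""
    match = re.search(r"\d+", Path(path).stem)
    return int(match.group()) if match else -1


def merge_tiffs(tif_paths, out_path):
    """Write the given single-page TIFFs into one multi-page TIFF."""
    # Assumption: Pillow is installed; imported lazily on purpose.
    from PIL import Image

    ordered = sorted(tif_paths, key=numeric_key)
    pages = [Image.open(p) for p in ordered]
    try:
        # save_all + append_images writes every image as its own page.
        pages[0].save(out_path, format="TIFF", save_all=True,
                      append_images=pages[1:])
    finally:
        for page in pages:
            page.close()
    return ordered


# Example call with the names used in this article:
# merge_tiffs(Path(r"D:\python\lnypcg").glob("*.tif"),
#             r"D:\python\lnypcg\new\langyp.fontyp.exp0.tif")
```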
3) Generate the box file
Run the following command to generate the langyp.fontyp.exp0.box file:
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
D:\python\lnypcg\new>tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 1 of 101
Page 2 of 101
Page 3 of 101
...
Page 101 of 101

D:\python\lnypcg\new>dir
 Volume in drive D has no label.
 Volume Serial Number is 36D9-CDC7

 Directory of D:\python\lnypcg\new

2016-06-03  14:37    <DIR>          .
2016-06-03  14:37    <DIR>          ..
2016-06-03  14:30             6,327 langyp.fontyp.exp0.box
2016-06-03  13:07           126,056 langyp.fontyp.exp0.tif
               2 File(s)        132,383 bytes
               2 Dir(s)  24,869,994,496 bytes free
4) Modify the box file
Switch to the Box Editor tab of jTessBoxEditor and click Open to open the TIFF file langyp.fontyp.exp0.tif from the previous step; the tool automatically loads the corresponding box file.
Check the box data. In the case shown, the digit 8 was misread as the letter H; manually change H to 8 and save.
Click the button marked with a red box (in the screenshot) to check the box data for the rest of the TIF file, and save once the check is complete.
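Because these verification codes contain only digits, any non-digit character in the box file is a recognition error that needs correcting. The sketch below lists such entries so they are easier to find in the editor. It assumes the Tesseract 3.x box line layout of character, left, bottom, right, top and page; the function names and the sample coordinates in the usage comment are hypothetical.

```python
from collections import namedtuple

Box = namedtuple("Box", "char left bottom right top page")


def parse_box_line(line):
    """Parse one box line: <char> <left> <bottom> <right> <top> <page>."""
    char, *numbers = line.split()
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(char, left, bottom, right, top, page)


def find_suspect_boxes(lines, allowed="0123456789"):
    """Return (line_number, Box) pairs whose character is not allowed."""
    allowed_chars = set(allowed)
    suspects = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        box = parse_box_line(line)
        if box.char not in allowed_chars:
            suspects.append((number, box))
    return suspects


# Usage with made-up coordinates: the first box was misread as H.
# find_suspect_boxes(["H 5 3 14 18 0", "0 16 3 25 18 0"])
# To check the real file, pass its lines instead:
# open("langyp.fontyp.exp0.box", encoding="utf-8").read().splitlines()
```

The page field is zero-based, so a suspect on page 27 belongs to the 28th image in the merged TIFF.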
5) Generate font_properties
Run the echo command to generate the font_properties file:
echo fontyp 0 0 0 0 0 >font_properties
You can also create a new text file named font_properties by hand (note that the file has no extension). Its content is the font name fontyp followed by five 0s, which are flags for the font's italic, bold, fixed, serif and fraktur attributes; here they are all 0.
D:\python\lnypcg\new>echo fontyp 0 0 0 0 0 >font_properties

D:\python\lnypcg\new>type font_properties
fontyp 0 0 0 0 0
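The same file can be written from a script. This small sketch builds the line from named flags in the order italic, bold, fixed, serif, fraktur; the function names are illustrative only.

```python
def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one font_properties line: <font> followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(flag)) for flag in flags])


def write_font_properties(path, font, **flags):
    """Write a single-font font_properties file (no file extension)."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        handle.write(font_properties_line(font, **flags) + "\n")


# Example matching this article:
# write_font_properties("font_properties", "fontyp")
```

The font name must match the middle part of the training file name (fontyp in langyp.fontyp.exp0.tif).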
6) Generate the training file
Run the following command to generate the langyp.fontyp.exp0.tr training file:
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
D:\python\lnypcg\new>tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 1 of 101
row xheight=8.66667, but median xheight = 10
APPLY_BOXES:
   Boxes read from boxfile: 4
   Found 4 good blobs.
Generated training data for 1 words
...
Page 101 of 101
row xheight=8.66667, but median xheight = 10
APPLY_BOXES:
   Boxes read from boxfile: 4
   Found 4 good blobs.
Generated training data for 1 words

 Directory of D:\python\lnypcg\new

2016-06-03  16:34    <DIR>          .
2016-06-03  16:34    <DIR>          ..
2016-06-03  16:05                   font_properties
2016-06-03  14:30             6,327 langyp.fontyp.exp0.box
2016-06-03  13:07           126,056 langyp.fontyp.exp0.tif
2016-06-03  16:20           618,844 langyp.fontyp.exp0.tr
2016-06-03  16:20               202 langyp.fontyp.exp0.txt
               5 File(s)        751,445 bytes
               2 Dir(s)  24,869,101,568 bytes free
7) Generate the character set file
Run the following command to generate a character set file named unicharset:
unicharset_extractor langyp.fontyp.exp0.box
D:\python\lnypcg\new>unicharset_extractor langyp.fontyp.exp0.box
Extracting unicharset from langyp.fontyp.exp0.box
Wrote unicharset file ./unicharset.

D:\python\lnypcg\new>dir
 Volume in drive D has no label.
 Volume Serial Number is 36D9-CDC7

 Directory of D:\python\lnypcg\new

2016-06-03  16:41    <DIR>          .
2016-06-03  16:41    <DIR>          ..
2016-06-03  16:05                   font_properties
2016-06-03  14:30             6,327 langyp.fontyp.exp0.box
2016-06-03  13:07           126,056 langyp.fontyp.exp0.tif
2016-06-03  16:20           618,844 langyp.fontyp.exp0.tr
2016-06-03  16:20               202 langyp.fontyp.exp0.txt
2016-06-03  16:41               712 unicharset
               6 File(s)        752,157 bytes
               2 Dir(s)  24,869,171,200 bytes free
8) Generate the shape file
Run the following command to generate the shape file, shapetable:
shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
D:\python\lnypcg\new>shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
Reading langyp.fontyp.exp0.tr ...
Building master shape table
Computing shape distances ...
Stopped with 0 merged, Min Dist 999.000000
Computing shape distances ... 0
Stopped with 0 merged, Min Dist 999.000000
...
Computing shape distances ... 0 1 2 3 4 5 6 7 8 9 10
Stopped with 0 merged, Min Dist 0.057803
Master shape_table:Number of shapes = 11 max unichars = 1 number with multiple unichars = 0

D:\python\lnypcg\new>dir
 Volume in drive D has no label.
 Volume Serial Number is 36D9-CDC7

 Directory of D:\python\lnypcg\new

2016-06-03  17:24    <DIR>          .
2016-06-03  17:24    <DIR>          ..
2016-06-03  17:20                   font_properties
2016-06-03  14:30             6,327 langyp.fontyp.exp0.box
2016-06-03  13:07           126,056 langyp.fontyp.exp0.tif
2016-06-03  17:23           618,844 langyp.fontyp.exp0.tr
2016-06-03  17:23               202 langyp.fontyp.exp0.txt
2016-06-03  17:24               723 langyp.unicharset
2016-06-03  17:24               202 shapetable
2016-06-03  17:24               712 unicharset
               8 File(s)        753,085 bytes
               2 Dir(s)  24,868,278,272 bytes free
9) Generate the clustered character feature files
Run the following command to generate three character feature files: a unicharset file (written to langyp.unicharset, as given by -O), inttemp and pffmtable.
mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
D:\python\lnypcg\new>mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
Read shape table shapetable of shapes
Reading langyp.fontyp.exp0.tr ...
Done!
10) Generate the character normalization feature file
Run the following command to generate the normalization feature file, normproto:
cntraining langyp.fontyp.exp0.tr
D:\python\lnypcg\new>cntraining langyp.fontyp.exp0.tr
Reading langyp.fontyp.exp0.tr ...
Clustering ...
11) Rename
Run the following commands to rename the feature files generated in steps 7 to 10, giving each the fontyp. prefix:
rename normproto fontyp.normproto
rename inttemp fontyp.inttemp
rename pffmtable fontyp.pffmtable
rename unicharset fontyp.unicharset
rename shapetable fontyp.shapetable
D:\python\lnypcg\new>rename normproto fontyp.normproto

D:\python\lnypcg\new>rename inttemp fontyp.inttemp

D:\python\lnypcg\new>rename pffmtable fontyp.pffmtable

D:\python\lnypcg\new>rename unicharset fontyp.unicharset

D:\python\lnypcg\new>rename shapetable fontyp.shapetable
12) Merge the training files
Run the following command to generate the fontyp.traineddata file:
combine_tessdata fontyp.
Notes:
a) The fontyp.traineddata file must eventually be copied into the tessdata directory under the Tesseract installation directory so that Tesseract can find it.
b) The command line must end with a period.
c) In the output, the lines for types 1, 3, 4, 5 and 13 must show a value (not -1); only then has the command succeeded.
D:\python\lnypcg\new>combine_tessdata fontyp.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 852
Offset for type 4 is 137760
Offset for type 5 is 137850
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 139352
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
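The check in note c) can be automated by parsing the command's output. The following sketch assumes the output consists of "Offset for type N is M" lines as in the transcript above; the function names are hypothetical, and the list of required types is taken from the note rather than from Tesseract's documentation.

```python
import re

# Types that must have a real offset, per note c) in this article.
REQUIRED_TYPES = (1, 3, 4, 5, 13)

OFFSET_RE = re.compile(r"Offset for type\s+(\d+)\s+is\s+(-?\d+)", re.IGNORECASE)


def parse_offsets(output):
    """Return {type_number: offset} parsed from combine_tessdata output."""
    return {int(kind): int(offset) for kind, offset in OFFSET_RE.findall(output)}


def missing_types(output, required=REQUIRED_TYPES):
    """Return the required types whose offset is -1 or absent."""
    offsets = parse_offsets(output)
    return [kind for kind in required if offsets.get(kind, -1) < 0]


# Usage sketch: capture the output with subprocess and test it.
# import subprocess
# result = subprocess.run(["combine_tessdata", "fontyp."],
#                         capture_output=True, text=True)
# problems = missing_types(result.stdout + result.stderr)
```

An empty list means every required component was packed into fontyp.traineddata.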
13) Test
In 28.tif, for example, the digit 8 was previously misread as the letter S. Let's see whether the new language data still makes that mistake.
D:\python\lnypcg>tesseract 28.tif output -l eng -psm 7
Tesseract Open Source OCR Engine v3.02 with Leptonica

D:\python\lnypcg>type output.txt
S094
#1 With the default eng language, 8 is recognized as S

D:\python\lnypcg>tesseract 28.tif output -l fontyp -psm 7
Error opening data file C:\Program Files (x86)\tesseract-ocr\tessdata/fontyp.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'fontyp'
Tesseract couldn't load any languages!
Could not initialize tesseract.
#2 With the new fontyp language, Tesseract cannot find the fontyp language data

D:\python\lnypcg>copy .\new\fontyp.traineddata "C:\Program Files (x86)\tesseract-ocr\tessdata"
        1 file(s) copied.
#3 Copy fontyp.traineddata into the tessdata subdirectory of the Tesseract installation directory

D:\python\lnypcg>tesseract 28.tif output -l fontyp -psm 7
Tesseract Open Source OCR Engine v3.02 with Leptonica

D:\python\lnypcg>type output.txt
8094
#4 With the fontyp language, 8094 is recognized correctly
4. Summary
Overall, jTessBoxEditor is a basically mature third-party sample training tool whose job is to run the script commands above automatically. In actual use it still has shortcomings: for example, the PSM parameter cannot be added, and the program frequently crashes when generating the shape file. For that reason, this article still relies mainly on the command line.
Tesseract is a very powerful OCR engine. After targeted training, the verification code recognition rate can reach roughly 95% or more, and with some checking logic added in the program it can basically meet the needs of automatic login for a crawler. In a later post I plan to write a crawler that automatically recognizes the verification code of a certain site.
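One simple form of the checking logic mentioned above is to accept an OCR result only if it matches what is known about these codes, namely exactly four digits, and otherwise fetch a new code and try again. The sketch below illustrates that idea; the function names are hypothetical and it is not taken from the original program.

```python
import re

# These verification codes are always four digits (see section 3, step 1).
CODE_RE = re.compile(r"[0-9]{4}")


def clean_ocr_text(raw):
    """Remove all whitespace that Tesseract may leave in or around the text."""
    return "".join(raw.split())


def is_plausible_code(raw):
    """True if the cleaned OCR result is exactly four digits."""
    return CODE_RE.fullmatch(clean_ocr_text(raw)) is not None


def first_plausible(candidates):
    """Return the first cleaned candidate that passes the check, else None."""
    for raw in candidates:
        if is_plausible_code(raw):
            return clean_ocr_text(raw)
    return None
```

A result such as S094 is rejected immediately, so the crawler can refresh the verification code instead of submitting a value that is certain to fail.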
Condensing the steps above into a single list:
1. Merge the pictures
2. Generate the box file
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
3. Modify the box file
4. Generate font_properties
echo fontyp 0 0 0 0 0 >font_properties
5. Generate the training file
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
6. Generate the character set file
unicharset_extractor langyp.fontyp.exp0.box
7. Generate the shape file
shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
8. Generate the clustered character feature files
mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
9. Generate the character normalization feature file
cntraining langyp.fontyp.exp0.tr
10. Rename
rename normproto fontyp.normproto
rename inttemp fontyp.inttemp
rename pffmtable fontyp.pffmtable
rename unicharset fontyp.unicharset
rename shapetable fontyp.shapetable
11. Merge the training files to generate fontyp.traineddata
combine_tessdata fontyp.
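Once the box file has been corrected and font_properties exists, steps 5 to 11 of this list need no manual work and can be driven from one script. The following is a sketch only: it assumes the Tesseract 3.02 training tools are on the PATH, uses the option spellings -F, -U and -O, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path

# Files produced by steps 6 to 9 that need the font prefix (step 10).
FEATURE_FILES = ("normproto", "inttemp", "pffmtable", "unicharset", "shapetable")


def train_commands(lang="langyp", font="fontyp", exp=0, psm=7):
    """Argument lists for steps 5 to 9, run after the box file is corrected."""
    stem = f"{lang}.{font}.exp{exp}"
    tr_file = f"{stem}.tr"
    cluster_opts = ["-F", "font_properties", "-U", "unicharset",
                    "-O", f"{lang}.unicharset", tr_file]
    return [
        ["tesseract", f"{stem}.tif", stem, "-l", "eng", "-psm", str(psm),
         "nobatch", "box.train"],
        ["unicharset_extractor", f"{stem}.box"],
        ["shapeclustering", *cluster_opts],
        ["mftraining", *cluster_opts],
        ["cntraining", tr_file],
    ]


def rename_pairs(font="fontyp"):
    """(old_name, new_name) pairs for step 10."""
    return [(name, f"{font}.{name}") for name in FEATURE_FILES]


def run_training(workdir, lang="langyp", font="fontyp"):
    """Run steps 5 to 11 inside workdir; raises if any command fails."""
    workdir = Path(workdir)
    for argv in train_commands(lang, font):
        subprocess.run(argv, cwd=workdir, check=True)
    for old_name, new_name in rename_pairs(font):
        # On Windows this fails if new_name is left over from an earlier run.
        (workdir / old_name).rename(workdir / new_name)
    # The trailing period is required (see note b in step 12).
    subprocess.run(["combine_tessdata", f"{font}."], cwd=workdir, check=True)


# Example call with the directory from this article:
# run_training(r"D:\python\lnypcg\new")
```

Passing argument lists instead of one command string avoids quoting problems with paths that contain spaces.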
That's all!