Using jTessBoxEditor to train Tesseract 3.02.02 on samples and improve the verification code recognition rate
1. Background
The previous article briefly introduced the installation and basic use of the Tesseract OCR engine. It mentioned that using the -l eng parameter to restrict the language library can improve recognition accuracy and efficiency.
This article trains Tesseract on samples of one website's verification codes, building a custom language library to improve the verification code recognition rate.
2. Prepare tools
The official Tesseract documentation describes the sample training process.
There are two approaches: 1. use a third-party tool (jTessBoxEditor), or 2. use only the command-line tools.
Note that jTessBoxEditor runs on the Java virtual machine, so a Java virtual machine must also be downloaded and installed: http://download.oracle.com/otn-pub/java/jdk/8u91-b14/jdk-8u91-windows-x64.exe?AuthParam=1463733597_4151f2d895aa7606ed260b43b83d5f86.
Summary:
1. Tool 2: Java Virtual Machine, version 1.8.0_91, 64-bit (Oracle official website)
2. Tool 1: jTessBoxEditor, version 1.5 (jTessBoxEditor official website)
3. Usage example
1) Prepare the sample images
Refresh the website's verification code repeatedly, either by hand or with a program, and save 101 sample verification code images named 1.png, 2.png, ......, 101.png.
The verification codes have several features: a) a fixed length of 4 characters; b) digits only; c) background interference, but relatively simple; d) a red font.
To improve the recognition rate, the first task is grayscale processing. All files are converted to grayscale tif files named 1.tif, 2.tif, ......, 101.tif and stored in D:\python\lnypcg.
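The grayscale conversion can be scripted. The following is a minimal sketch, assuming the third-party Pillow library is installed; the article does not say which tool was actually used, and the function names here are illustrative only.

```python
from pathlib import Path


def tif_path_for(png_path):
    """Return the .tif path that sits next to the given .png file."""
    return Path(png_path).with_suffix(".tif")


def convert_samples_to_gray_tif(sample_dir):
    """Convert every PNG in sample_dir to an 8-bit grayscale TIFF."""
    # Assumption: Pillow (pip install Pillow) is available. It is imported
    # here so that tif_path_for stays usable without it.
    from PIL import Image

    for png in sorted(Path(sample_dir).glob("*.png")):
        with Image.open(png) as img:
            # Mode "L" is 8-bit grayscale; the .tif suffix selects TIFF output.
            img.convert("L").save(tif_path_for(png))
```

Calling convert_samples_to_gray_tif(r"D:\python\lnypcg") would write 1.tif, 2.tif, and so on next to the PNG files.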
2) Merge sample images
Open jTessBoxEditor, click Tools > Merge TIFF, hold Shift to select all of the tif files prepared above (from the first to the last), and save the merged tif in the new directory D:\python\lnypcg\new under the name langyp.fontyp.exp0.tif.
Note: langyp is a language name I defined myself, and fontyp is a font name I defined myself; both are used later. You can change them to names you prefer.
3) Generate the box file
Run the following command to generate the langyp.fontyp.exp0.box file.
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
D:\python\lnypcg\new>tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 1 of 101
Page 2 of 101
Page 3 of 101
......
Page 101 of 101
D:\python\lnypcg\new>dir
The volume in drive D has no label.
The volume serial number is 36D9-CDC7
Directory of D:\python\lnypcg\new
<DIR> .
2016-06-03 <DIR> ..
6,327 langyp.fontyp.exp0.box
126,056 langyp.fontyp.exp0.tif
2 files 132,383 bytes
2 directories 24,869,994,496 bytes free
4) Modify the box file
Switch to the Box Editor tab of jTessBoxEditor and click Open to open the merged tif file langyp.fontyp.exp0.tif. The tool automatically loads the corresponding box file.
Check the box data. In this example, the digit 8 was misrecognized as the letter H; change it to 8 manually and save.
Use the page navigation buttons to check the box data of every page of the tif file one by one, then save once everything is correct.
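Because every code in this sample set is exactly four digits, the pages that need attention can be listed before opening the editor. The sketch below assumes the Tesseract 3.x box file layout of one symbol per line in the form "symbol left bottom right top page"; the function names are hypothetical, and the coordinates in the usage note are made up for illustration.

```python
def parse_box_line(line):
    """Split one box-file line into (symbol, left, bottom, right, top, page)."""
    symbol, left, bottom, right, top, page = line.split()
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def suspicious_pages(box_lines, expected_length=4):
    """Return the sorted page numbers whose text is not exactly four digits."""
    text_by_page = {}
    for line in box_lines:
        if not line.strip():
            continue  # ignore blank lines
        symbol, *_coords, page = parse_box_line(line)
        text_by_page[page] = text_by_page.get(page, "") + symbol
    return sorted(
        page
        for page, text in text_by_page.items()
        if len(text) != expected_length or not set(text) <= set("0123456789")
    )
```

For example, a page whose boxes read "H 10 5 20 25 0", "0 22 5 32 25 0", "9 34 5 44 25 0", "4 46 5 56 25 0" is reported as page 0, since H is not a digit. Pages with no boxes at all never appear in the file, so they still have to be found by hand.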
5) Generate font_properties
Run the echo command to generate font_properties.
echo fontyp 0 0 0 0 0 >font_properties
You can also create a text file named font_properties by hand (note that the file has no extension). Its content is fontyp followed by five zeros, which are flags for font attributes such as italic and bold; here all of them are 0.
D:\python\lnypcg\new>echo fontyp 0 0 0 0 0 >font_properties
D:\python\lnypcg\new>type font_properties
fontyp 0 0 0 0 0
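The meaning of the five flags is easier to see when the line is built from named values. This is a small sketch, assuming the flag order documented for Tesseract 3.x training (italic, bold, fixed, serif, fraktur); the helper names are illustrative and not part of any Tesseract tool.

```python
def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one font_properties entry: the font name followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(flag)) for flag in flags])


def write_font_properties(path, font, **flags):
    """Write a single-font font_properties file (the file has no extension)."""
    with open(path, "w", newline="\n") as handle:
        handle.write(font_properties_line(font, **flags) + "\n")
```

With the defaults, font_properties_line("fontyp") yields the same line as the echo command above.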
6) Generate the training file
Run the command to generate the langyp.fontyp.exp0.tr training file.
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
D:\python\lnypcg\new>tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 1 of 101
row xheight = 8.66667, but median xheight = 10
APPLY_BOXES:
Boxes read from boxfile: 4
Found 4 good blobs.
Generated training data for 1 words
..................
Page 101 of 101
row xheight = 8.66667, but median xheight = 10
APPLY_BOXES:
Boxes read from boxfile: 4
Found 4 good blobs.
Generated training data for 1 words
Directory of D:\python\lnypcg\new
<DIR> .
<DIR> ..
font_properties
2016-06-03 6,327 langyp.fontyp.exp0.box
2016-06-03 126,056 langyp.fontyp.exp0.tif
618,844 langyp.fontyp.exp0.tr
202 langyp.fontyp.exp0.txt
5 files 751,445 bytes
2 directories 24,869,101,568 bytes free
7) Generate the character set file
Run the command to generate a character set file named unicharset.
unicharset_extractor langyp.fontyp.exp0.box
D:\python\lnypcg\new>unicharset_extractor langyp.fontyp.exp0.box
Extracting unicharset from langyp.fontyp.exp0.box
Wrote unicharset file ./unicharset.
D:\python\lnypcg\new>dir
The volume in drive D has no label.
The volume serial number is 36D9-CDC7
Directory of D:\python\lnypcg\new
<DIR> .
<DIR> ..
font_properties
2016-06-03 6,327 langyp.fontyp.exp0.box
2016-06-03 126,056 langyp.fontyp.exp0.tif
2016-06-03 618,844 langyp.fontyp.exp0.tr
2016-06-03 202 langyp.fontyp.exp0.txt
712 unicharset
6 files 752,157 bytes
2 directories 24,869,171,200 bytes free
8) Generate the shape file
Run the command to generate the shape file.
shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
D:\python\lnypcg\new>shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
Reading langyp.fontyp.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4 5 6 7 8 9 10
Stopped with 0 merged, min dist 0.057803
Master shape_table: Number of shapes = 11 max unichars = 1 number with multiple unichars = 0
D:\python\lnypcg\new>dir
The volume in drive D has no label.
The volume serial number is 36D9-CDC7
Directory of D:\python\lnypcg\new
<DIR> .
<DIR> ..
font_properties
2016-06-03 6,327 langyp.fontyp.exp0.box
2016-06-03 126,056 langyp.fontyp.exp0.tif
2016-06-03 618,844 langyp.fontyp.exp0.tr
2016-06-03 202 langyp.fontyp.exp0.txt
723 langyp.unicharset
202 shapetable
712 unicharset
8 files 753,085 bytes
2 directories 24,868,278,272 bytes free
9) Generate the feature files of the clustered characters
Run the command to generate three character feature files: unicharset, inttemp, and pffmtable.
mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
D:\python\lnypcg\new>mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
Read shape table shapetable of 11 shapes
Reading langyp.fontyp.exp0.tr ...
Done!
10) Generate the character normalization feature file
Run the command to generate the normalization feature file normproto.
cntraining langyp.fontyp.exp0.tr
D:\python\lnypcg\new>cntraining langyp.fontyp.exp0.tr
Reading langyp.fontyp.exp0.tr ...
Clustering ...
11) Rename
Run the following commands to rename the files generated in steps 7 to 10, prefixing each with the font name.
rename normproto fontyp.normproto
rename inttemp fontyp.inttemp
rename pffmtable fontyp.pffmtable
rename unicharset fontyp.unicharset
rename shapetable fontyp.shapetable
D:\python\lnypcg\new>rename normproto fontyp.normproto
D:\python\lnypcg\new>rename inttemp fontyp.inttemp
D:\python\lnypcg\new>rename pffmtable fontyp.pffmtable
D:\python\lnypcg\new>rename unicharset fontyp.unicharset
D:\python\lnypcg\new>rename shapetable fontyp.shapetable
12) Merge the training files
Run the command to generate the fontyp.traineddata file.
combine_tessdata fontyp.
Note:
A. The fontyp.traineddata file must be copied to the tessdata directory under the tesseract installation directory before tesseract can find it.
B. The command line must end with a period (.).
C. In the output, the offsets for types 1, 3, 4, 5, and 13 must be actual values rather than -1; this indicates that the command succeeded.
D:\python\lnypcg\new>combine_tessdata fontyp.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 852
Offset for type 4 is 137760
Offset for type 5 is 137850
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 139352
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
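The check described in note C can be automated by scanning the captured output of combine_tessdata. The sketch below assumes the output has the "Offset for type N is M" form shown above; the function names and the REQUIRED_TYPES constant are illustrative.

```python
import re

# Component types that, according to note C, must have a real offset.
REQUIRED_TYPES = (1, 3, 4, 5, 13)


def parse_offsets(output):
    """Map each component type number to the offset reported for it."""
    pattern = re.compile(r"Offset for type\s+(\d+) is (-?\d+)")
    return {int(kind): int(offset) for kind, offset in pattern.findall(output)}


def missing_components(output, required=REQUIRED_TYPES):
    """Return the required types whose offset is -1 or not reported at all."""
    offsets = parse_offsets(output)
    return [kind for kind in required if offsets.get(kind, -1) == -1]
```

An empty list from missing_components means all five required components made it into fontyp.traineddata.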
13) Test and use
In the previous article, the 8 in 28.tif was misrecognized as the letter S. Use the new language to check whether the error persists.
D:\python\lnypcg>tesseract 28.tif output -l eng -psm 7
Tesseract Open Source OCR Engine v3.02 with Leptonica
D:\python\lnypcg>type output.txt
S094
#1 With the default eng language, 8 is recognized as S.
D:\python\lnypcg>tesseract 28.tif output -l fontyp -psm 7
Error opening data file C:\Program Files (x86)\Tesseract-OCR\tessdata/fontyp.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'fontyp'
Tesseract couldn't load any languages!
Could not initialize tesseract.
#2 With the new fontyp language, tesseract cannot find the fontyp language data.
D:\python\lnypcg>copy .\new\fontyp.traineddata "C:\Program Files (x86)\Tesseract-OCR\tessdata"
1 file(s) copied.
#3 Copy fontyp.traineddata to the tessdata subdirectory of the tesseract installation directory.
D:\python\lnypcg>tesseract 28.tif output -l fontyp -psm 7
Tesseract Open Source OCR Engine v3.02 with Leptonica
D:\python\lnypcg>type output.txt
8094
#4 With the fontyp language, 8094 is recognized correctly.
4. Conclusion:
jTessBoxEditor is essentially a basic third-party sample training tool that can run the commands above automatically. In practice, however, it still has some rough edges: for example, you cannot add the -psm parameter, and the program crashes when generating the shape file. For that reason, this article relies mainly on the command line.
Tesseract is a very powerful OCR engine. After targeted training in particular, the verification code recognition rate can reach roughly 95% or more; with some validation logic added to the program, that is basically enough for a crawler's automatic login. I will write a crawler program with automatic verification code recognition in a later article.
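One simple form of such validation logic is to reject any result that is not exactly four digits and fetch a fresh code instead. The following is only a sketch: it assumes tesseract is on the PATH and fontyp.traineddata is installed, it reuses the command from step 13, and the function names are hypothetical.

```python
import re
import subprocess


def is_plausible_code(text):
    """True if the OCR result is exactly four digits, like this site's codes."""
    return re.fullmatch(r"[0-9]{4}", text.strip()) is not None


def recognize(image_path, lang="fontyp"):
    """Run tesseract as in step 13 and return the recognized text."""
    # tesseract appends ".txt" to the output base name "output".
    subprocess.run(
        ["tesseract", image_path, "output", "-l", lang, "-psm", "7"],
        check=True,
    )
    with open("output.txt", encoding="utf-8") as handle:
        return handle.read().strip()
```

A crawler could call recognize on each downloaded code and request a new image whenever is_plausible_code returns False.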
The steps of this article can be condensed into the following list:
1. Merge the images.
2. Generate the box file:
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
3. Modify the box file.
4. Generate font_properties:
echo fontyp 0 0 0 0 0 >font_properties
5. Generate the training file:
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
6. Generate the character set file:
unicharset_extractor langyp.fontyp.exp0.box
7. Generate the shape file:
shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
8. Generate the clustered character feature files:
mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
9. Generate the character normalization feature file:
cntraining langyp.fontyp.exp0.tr
10. Rename the files:
rename normproto fontyp.normproto
rename inttemp fontyp.inttemp
rename pffmtable fontyp.pffmtable
rename unicharset fontyp.unicharset
rename shapetable fontyp.shapetable
11. Merge the training files to generate fontyp.traineddata:
combine_tessdata fontyp.
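Everything from step 5 onward in this list runs without manual intervention, so it can be driven from a script. The sketch below assumes the box file has already been corrected, font_properties already exists, and the Tesseract training tools are on the PATH; the function names and default arguments are illustrative.

```python
import os
import subprocess

# Files produced by steps 6 to 9 that step 10 prefixes with the font name.
FEATURE_FILES = ("normproto", "inttemp", "pffmtable", "unicharset", "shapetable")


def training_commands(lang="langyp", font="fontyp", exp="exp0"):
    """Return the commands for steps 5 to 9 as argument lists."""
    base = f"{lang}.{font}.{exp}"
    shared = ["-F", "font_properties", "-U", "unicharset",
              "-O", f"{lang}.unicharset", f"{base}.tr"]
    return [
        ["tesseract", f"{base}.tif", base, "-l", "eng", "-psm", "7",
         "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["shapeclustering"] + shared,
        ["mftraining"] + shared,
        ["cntraining", f"{base}.tr"],
    ]


def run_training(workdir, lang="langyp", font="fontyp"):
    """Run steps 5 to 11 inside workdir, stopping at the first failure."""
    for command in training_commands(lang, font):
        subprocess.run(command, cwd=workdir, check=True)
    for name in FEATURE_FILES:
        os.replace(os.path.join(workdir, name),
                   os.path.join(workdir, f"{font}.{name}"))
    # The trailing period is required, as in step 11.
    subprocess.run(["combine_tessdata", f"{font}."], cwd=workdir, check=True)
```

Calling run_training(r"D:\python\lnypcg\new") would reproduce steps 5 to 11. Passing argument lists rather than a single command string avoids shell quoting problems with the Windows paths.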
That's all!