To improve the recognition rate of the Tesseract library, it can be trained on Chinese characters.
1. Install Tesseract first. Note: use the installer version, because the installed program includes the training tools, whereas the standalone compiled version does not.
2. Download the jTessBoxEditor tool. It is written in Java and requires the JRE to run. It is mainly used to edit the box file while proofreading the text. The figure shows the tool's directory; double-click the file marked with the red box to start the program.
The goal this time is to have the library recognize the two characters 取消 ("cancel"), so 5 sample images were prepared:
3. Generate the TIF file
It is best to put the images in the Tesseract installation directory and do all the work there. In jTessBoxEditor, open the Tools menu and click Merge TIFF. Then select all 5 samples and click Open. A save dialog appears for the TIF file we want. The TIF file must follow the naming rule [lang].[fontname].exp[num].tif, where lang is the language and fontname is the font; set both as needed. Click Save, and the TIF file will appear in the directory.
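The naming rule can be sketched as a small helper. This is only an illustration: the function name `training_tif_name` is hypothetical, and rejecting dots or spaces inside `lang` and `fontname` is an assumption made here so that the dot-separated parts stay unambiguous.

```python
def training_tif_name(lang: str, fontname: str, num: int = 0) -> str:
    """Build a file name of the form [lang].[fontname].exp[num].tif."""
    for part in (lang, fontname):
        # Assumption: a dot or space inside a part would make the name ambiguous.
        if not part or "." in part or " " in part:
            raise ValueError(f"invalid name component: {part!r}")
    return f"{lang}.{fontname}.exp{num}.tif"


if __name__ == "__main__":
    print(training_tif_name("chi", "myself"))  # chi.myself.exp0.tif
```

With lang "chi" and fontname "myself", this gives the file name used in the rest of this walkthrough.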
4. Generate the box file
First open a command line, change to the Tesseract directory, and enter the command:
tesseract.exe chi.myself.exp0.tif chi.myself.exp0 batch.nochop makebox
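If this step needs to be repeated for several TIF files, the command can be assembled from the file name. The following is a minimal sketch, assuming tesseract.exe is reachable from the working directory; `makebox_command` and `run_makebox` are hypothetical helper names, not part of Tesseract.

```python
import subprocess
from pathlib import Path


def makebox_command(tif_name, exe="tesseract.exe"):
    """Return the argument list for the box-generation command."""
    # Path.stem drops only the last suffix: chi.myself.exp0.tif -> chi.myself.exp0
    output_base = Path(tif_name).stem
    return [exe, tif_name, output_base, "batch.nochop", "makebox"]


def run_makebox(tif_name, workdir="."):
    """Run the command in workdir; raises CalledProcessError if it fails."""
    subprocess.run(makebox_command(tif_name), cwd=workdir, check=True)


if __name__ == "__main__":
    print(" ".join(makebox_command("chi.myself.exp0.tif")))
```

Building the argument list separately from running it makes it easy to print and check the command before anything is executed.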
5. Proofread the text
Use jTessBoxEditor to open the TIF file you just generated.
You will find that the recognized text it displays is incorrect.
We need to correct the characters in the Char column for every image. Tesseract has currently split the text into four parts, so there are four rows, numbered 1 to 4. We need to merge them into two rows whose characters spell 取消 ("cancel"). Follow these steps:
Now the two parts are merged, but the Char column shows H and should be changed to 取 ("fetch"). Follow these steps:
The other characters are handled the same way, and the final result looks like this:
I have 5 images in total; after all of them have been corrected, click Save. If we now open the chi.myself.exp0.box file in Notepad, we can see that it contains the corrections.
Note: The corrections in this step can also be made by editing the box file directly, but that is error-prone.
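To see what the merge and relabel operations do to the box file, here is a minimal sketch. It assumes the Tesseract 3.x box format, in which each line holds a character followed by left, bottom, right, and top pixel coordinates and a page number. The coordinates in the example are made up, and the helper names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: <char> <left> <bottom> <right> <top> <page>."""
    char, left, bottom, right, top, page = line.split()
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def merge_boxes(a: Box, b: Box, char: str) -> Box:
    """Return the union of two boxes on one page, labelled with the corrected char."""
    if a.page != b.page:
        raise ValueError("boxes are on different pages")
    return Box(char, min(a.left, b.left), min(a.bottom, b.bottom),
               max(a.right, b.right), max(a.top, b.top), a.page)


def format_box(box: Box) -> str:
    """Serialize a box back into box-file syntax."""
    return " ".join(str(field) for field in box)


if __name__ == "__main__":
    # Hypothetical coordinates for two fragments of one character.
    first = parse_box_line("H 10 20 30 60 0")
    second = parse_box_line("I 30 20 50 60 0")
    print(format_box(merge_boxes(first, second, "取")))
```

When reading or writing a real box file this way, open it with UTF-8 encoding so the Chinese characters survive the round trip.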
6. Generate the .tr file
tesseract.exe chi.myself.exp0.tif chi.myself.exp0 nobatch box.train
7. Generate the unicharset file
unicharset_extractor chi.myself.exp0.box
8. Create the font_properties file
Create a new plain-text font_properties file in Notepad (saved as font_properties.txt) with the following format:
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
In Notepad, enter for example: myself 0 0 0 0 0. Remember that there are five 0s.
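The same line can be generated programmatically, which avoids miscounting the flags. This is a sketch only; both function names are hypothetical, and each flag is written as 1 when set and 0 otherwise, matching the example above.

```python
def font_properties_line(fontname, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Return '<fontname> <italic> <bold> <fixed> <serif> <fraktur>'."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(flag)) for flag in flags])


def write_font_properties(path, fontname, **flags):
    """Write a one-line font_properties file for a single font."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(font_properties_line(fontname, **flags) + "\n")


if __name__ == "__main__":
    print(font_properties_line("myself"))  # myself 0 0 0 0 0
```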
9. Run the following three commands:
shapeclustering.exe -F font_properties.txt -U unicharset chi.myself.exp0.tr
mftraining.exe -F font_properties.txt -U unicharset -O unicharset chi.myself.exp0.tr
cntraining.exe chi.myself.exp0.tr
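The three commands must run in this order, and a failure in one should stop the rest. The sketch below shows one way to script that. It assumes the executables are reachable from the working directory and uses the uppercase -F, -U, and -O option spellings of the Tesseract 3.x training tools; `training_commands` and `run_training` are hypothetical helper names.

```python
import subprocess


def training_commands(tr_file, font_properties="font_properties.txt",
                      unicharset="unicharset"):
    """Return the three training commands, in the order they must run."""
    return [
        ["shapeclustering.exe", "-F", font_properties, "-U", unicharset, tr_file],
        ["mftraining.exe", "-F", font_properties, "-U", unicharset,
         "-O", unicharset, tr_file],
        ["cntraining.exe", tr_file],
    ]


def run_training(tr_file, workdir="."):
    """Run the commands one by one; check=True stops at the first failure."""
    for command in training_commands(tr_file):
        subprocess.run(command, cwd=workdir, check=True)


if __name__ == "__main__":
    for command in training_commands("chi.myself.exp0.tr"):
        print(" ".join(command))
```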
10. Rename the files
Add the prefix "myself." to the five files unicharset, inttemp, pffmtable, shapetable, and normproto in the directory. Note the dot after myself. The result is shown in the figure:
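The renaming can also be scripted. Here is a minimal sketch, assuming the five files sit in one directory; `add_prefix` and `TRAINING_OUTPUTS` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# The five training outputs that need the language prefix.
TRAINING_OUTPUTS = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def add_prefix(directory, prefix="myself."):
    """Rename each training output to <prefix><name>; return the new names."""
    directory = Path(directory)
    missing = [name for name in TRAINING_OUTPUTS if not (directory / name).exists()]
    if missing:
        # Check everything first so a missing file never leaves a half-renamed set.
        raise FileNotFoundError(f"missing training outputs: {missing}")
    renamed = []
    for name in TRAINING_OUTPUTS:
        target = directory / (prefix + name)
        (directory / name).rename(target)
        renamed.append(target.name)
    return renamed
```

Note that on Windows `Path.rename` raises an error if the target already exists, so leftovers from an earlier run need to be removed first.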
Then run the command (the trailing dot is part of it):
combine_tessdata myself.
If the myself.traineddata file is generated, the training has succeeded.
Copy the file into the tessdata directory, and you can test it using the