When using a Tesseract-OCR recognition library in practice, the first version of the library often has a less than ideal recognition rate, so more samples have to be added gradually later.
This article shows how to combine multiple modified box files into a single recognition library.
First, you need sample images (.tif files) and their position files (.box files). As long as you have both files for every sample, you can merge them into one library.
Suppose you already have the following sample images and corrected box files:
image.font.1.tif image.font.1.box
image.font.2.tif image.font.2.box
image.font.3.tif image.font.3.box
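Before training, it can help to confirm that every .tif really has a matching .box file. The following is only a minimal sketch in shell: the function name check_pairs is made up for this example, and it assumes the samples sit together in one directory.

```shell
# check_pairs [DIR]: report every .tif in DIR that has no matching .box file.
# Returns 0 when all samples are paired, 1 otherwise.
# (check_pairs is a hypothetical helper, not part of Tesseract.)
check_pairs() {
    dir=${1:-.}
    missing=0
    for tif in "$dir"/*.tif; do
        [ -e "$tif" ] || continue      # the glob matched nothing
        box="${tif%.tif}.box"
        if [ ! -f "$box" ]; then
            echo "missing box file for $tif" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_pairs . in the sample directory prints the name of each image that still lacks a box file.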
1. First generate the corresponding .tr files
tesseract image.font.1.tif image.font.1 nobatch box.train
tesseract image.font.2.tif image.font.2 nobatch box.train
tesseract image.font.3.tif image.font.3 nobatch box.train
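With more than a few samples, the same command can be run in a loop. This is a sketch under the assumption that all .tif files in the directory are training samples; make_tr_files is a hypothetical name chosen for this example.

```shell
# make_tr_files [DIR]: run tesseract in box-training mode on every .tif in DIR.
# Uses the same arguments as the manual commands, so each image.font.N.tif
# produces image.font.N.tr next to it.
# (make_tr_files is a hypothetical helper, not part of Tesseract.)
make_tr_files() {
    dir=${1:-.}
    for tif in "$dir"/*.tif; do
        [ -e "$tif" ] || continue      # the glob matched nothing
        base="${tif%.tif}"
        tesseract "$tif" "$base" nobatch box.train || return 1
    done
}
```

The function stops at the first sample for which tesseract fails, so a broken box file is noticed early.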
2. Extracting characters
unicharset_extractor image.font.1.box image.font.2.box image.font.3.box
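3. Create a font properties file
Create a new file named font_properties (note that it has no extension) and add one line for each font used in the box files. Each line is the font name followed by five 0/1 flags: italic, bold, fixed, serif, fraktur. The font name here is "font", the middle part of image.font.1.box.
font 0 0 0 0 0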
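If the box files cover several fonts, the entries can be derived from the file names. The sketch below assumes the naming pattern used in this article (image.font.1.box, where the middle part is the font name) and writes all flags as 0, which is only right for a plain, non-italic, non-bold font; make_font_properties is a hypothetical name.

```shell
# make_font_properties [DIR]: write DIR/font_properties with one all-zero
# entry per distinct font name found in DIR/*.box.
# Assumes names like image.font.1.box; the flags are left at 0 and may need
# manual editing for italic, bold, fixed, serif or fraktur fonts.
# (make_font_properties is a hypothetical helper, not part of Tesseract.)
make_font_properties() {
    dir=${1:-.}
    for box in "$dir"/*.box; do
        [ -e "$box" ] || continue          # the glob matched nothing
        name=$(basename "$box" .box)       # image.font.1
        name=${name#*.}                    # font.1
        echo "${name%.*}"                  # font
    done | sort -u | while read -r font; do
        echo "$font 0 0 0 0 0"
    done > "$dir/font_properties"
}
```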
4. Run the following command
mftraining -F font_properties -U unicharset image.font.1.tr image.font.2.tr image.font.3.tr
5. Cluster all .tr files
cntraining image.font.1.tr image.font.2.tr image.font.3.tr
6. Renaming files
Rename the following files by adding the library name as a prefix. I use "ck" here.
unicharset
inttemp
normproto
pffmtable
shapetable (many tutorials leave this file out; if it is not renamed, creating the recognition library fails with an error)
After renaming, the file names are as follows:
ck.unicharset
ck.inttemp
ck.normproto
ck.pffmtable
ck.shapetable
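The renaming can also be scripted so that no file is forgotten. This is a minimal sketch; add_prefix is a hypothetical name, and it checks that all five files exist before touching any of them.

```shell
# add_prefix PREFIX [DIR]: rename the five training outputs in DIR
# to PREFIX.<name>, e.g. unicharset -> ck.unicharset.
# Nothing is renamed unless all five files are present.
# (add_prefix is a hypothetical helper, not part of Tesseract.)
add_prefix() {
    prefix=$1
    dir=${2:-.}
    for f in unicharset inttemp normproto pffmtable shapetable; do
        if [ ! -f "$dir/$f" ]; then
            echo "expected file not found: $dir/$f" >&2
            return 1
        fi
    done
    for f in unicharset inttemp normproto pffmtable shapetable; do
        mv "$dir/$f" "$dir/$prefix.$f" || return 1
    done
}
```

For the example in this article the call would be add_prefix ck.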
7. Combine all files into a single recognition library file
combine_tessdata ck.
Done. The generated ck.traineddata can now be used for recognition.
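To try the merged library, the .traineddata file written by combine_tessdata has to be placed in Tesseract's tessdata directory. The sketch below only copies the file; install_traineddata is a hypothetical name, and the tessdata location differs between installations, so it is passed in as an argument.

```shell
# install_traineddata PREFIX TESSDATA_DIR
# Copies PREFIX.traineddata (the file combine_tessdata writes) to TESSDATA_DIR.
# (install_traineddata is a hypothetical helper, not part of Tesseract.)
install_traineddata() {
    src="$1.traineddata"
    dest=$2
    if [ ! -f "$src" ]; then
        echo "$src not found; did combine_tessdata succeed?" >&2
        return 1
    fi
    cp "$src" "$dest/"
}
```

After something like install_traineddata ck /path/to/tessdata (the path is a placeholder), a sample can be recognized with tesseract sample.tif result -l ck, where sample.tif stands for any test image.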
Tesseract-OCR recognition results after merging