Tesseract-OCR: Training on Chinese Characters

Source: Internet
Author: User

To improve the recognition rate of the Tesseract library, it can be trained on Chinese characters.

1. Install Tesseract first. Note that it must be installed with the installer, because the installed program includes the training tools; the compiled version does not include them.


2. Download the jTessBoxEditor tool. It is written in Java and requires a JRE to run. It is mainly used to edit the box file in order to proofread the recognized text. The following figure shows the tool's directory; click the file marked with the red box to run the program.


In this example we want the library to recognize the two characters 取消 ("cancel"), so 5 sample images were prepared:


3. Generate the TIF file

It is best to put the images in the Tesseract installation directory and do all the work there. Click the Merge TIFF button in jTessBoxEditor, select all 5 samples, and click Open. A save dialog then appears for the merged TIF file we want. The naming rule for the TIF file is [lang].[fontname].exp[num].tif, where lang is the language and fontname is the font name; set them according to your needs. Click Save, and the TIF file will appear in the directory.
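
The naming rule can be illustrated with a small Python sketch. The helper names `training_tif_name` and `parse_training_tif_name` are made up for this example and are not part of Tesseract or jTessBoxEditor; the sketch only assumes that lang and fontname contain no dots.

```python
import re

# Hypothetical helpers for the [lang].[fontname].exp[num].tif naming rule.
_TIF_NAME = re.compile(r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tif$")


def training_tif_name(lang: str, fontname: str, num: int = 0) -> str:
    """Build a file name such as 'chi.myself.exp0.tif'."""
    return f"{lang}.{fontname}.exp{num}.tif"


def parse_training_tif_name(name: str) -> tuple[str, str, int]:
    """Split a training TIF name into (lang, fontname, num).

    Raises ValueError if the name does not follow the rule.
    """
    match = _TIF_NAME.match(name)
    if match is None:
        raise ValueError(f"not a [lang].[fontname].exp[num].tif name: {name!r}")
    return match["lang"], match["font"], int(match["num"])


print(training_tif_name("chi", "myself"))  # chi.myself.exp0.tif
```

Checking the name before saving helps because the later commands all derive their file names from it.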




4. Generate the box file

First open a command prompt, change to the Tesseract directory, and enter the command:

tesseract.exe chi.myself.exp0.tif chi.myself.exp0 batch.nochop makebox
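
If the step is to be scripted rather than typed, the same command could be assembled in Python roughly as follows. This is only a sketch: `makebox_command` and `run_makebox` are hypothetical names, and it assumes tesseract.exe can be found from the working directory.

```python
import subprocess


def makebox_command(base: str, exe: str = "tesseract.exe") -> list[str]:
    """Return the argument list for the box-file command.

    `base` is the file name without extension, e.g. 'chi.myself.exp0'.
    """
    return [exe, f"{base}.tif", base, "batch.nochop", "makebox"]


def run_makebox(base: str, cwd: str = ".") -> None:
    """Run the command in `cwd`; raises CalledProcessError if Tesseract fails."""
    subprocess.run(makebox_command(base), cwd=cwd, check=True)


# Only print the command here; call run_makebox() to actually execute it.
print(" ".join(makebox_command("chi.myself.exp0")))
```

Passing the arguments as a list avoids the stray spaces that easily creep into a hand-typed command line.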



5. Proofread the text

Use jTessBoxEditor to open the TIF file you just generated.


We will find that the recognized text it displays is incorrect.


We need to correct every entry in the Char column for each image. Tesseract has split the image into four parts, so there are four rows (1, 2, 3, 4). We need to merge them into two rows, and the characters should read 取消 ("cancel"). Follow these steps:



Now the two parts are merged. However, the Char column shows H, which should be changed to 取. Follow these steps:


The other characters are corrected in the same way, and the final result looks like this:


I have 5 images in total. After all of them have been corrected, click Save. If we now open the chi.myself.exp0.box file in Notepad, we can see that the corrections have been written to it.


Note: the corrections can also be made by editing the box file directly, without the tool, but this is error-prone.
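
For readers who do edit the box file by hand or by script, the following Python sketch shows one way to read and rewrite a line. It assumes the usual Tesseract 3.x box layout of one symbol per line followed by left, bottom, right, top and page number; the `Box` class, the function names and the coordinates are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one line of the form '<char> <left> <bottom> <right> <top> <page>'."""
    char, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(char, left, bottom, right, top, page)


def format_box_line(box: Box) -> str:
    """Turn a Box back into a box-file line (without the trailing newline)."""
    return f"{box.char} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


# Correct a misrecognized character, as in the proofreading step.
box = parse_box_line("H 12 30 58 76 0")  # made-up coordinates
box.char = "取"
print(format_box_line(box))
```

When reading or writing the real file, open it with `encoding="utf-8"` so the Chinese characters survive the round trip.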

6. Generate the .tr file

tesseract.exe chi.myself.exp0.tif chi.myself.exp0 nobatch box.train

7. Generate the unicharset file

unicharset_extractor chi.myself.exp0.box
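
Before running the extractor, it can be useful to confirm that the corrected box file contains only the intended characters. The Python sketch below is a rough sanity check, not a reimplementation of unicharset_extractor; the function name and the sample lines are hypothetical.

```python
def distinct_box_chars(box_lines):
    """Return the distinct leading symbols of box-file lines, in first-seen order."""
    seen = []
    for line in box_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        char = line.split(" ", 1)[0]
        if char not in seen:
            seen.append(char)
    return seen


# Made-up sample lines in the '<char> <left> <bottom> <right> <top> <page>' layout.
sample = ["取 12 30 58 76 0", "消 60 30 106 76 0", "取 12 30 58 76 1"]
print(distinct_box_chars(sample))  # ['取', '消']
```

For the real file, the lines could be read with `open("chi.myself.exp0.box", encoding="utf-8")`. Any leftover symbol such as H indicates a row that was missed during proofreading.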



8. Create the font_properties file

Create a new plain-text file in Notepad (the commands in the next step refer to it as font_properties.txt), with one line per font in the following format:

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

For example: myself 0 0 0 0 0 (remember that there are five 0s).
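
The same line can be produced programmatically, which avoids miscounting the flags. The Python sketch below is only an illustration; `font_properties_line` and `write_font_properties` are hypothetical names, and each flag is written as 0 or 1 in the order given by the format above.

```python
def font_properties_line(fontname: str, italic: bool = False, bold: bool = False,
                         fixed: bool = False, serif: bool = False,
                         fraktur: bool = False) -> str:
    """Format '<fontname> <italic> <bold> <fixed> <serif> <fraktur>'."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname, *(str(int(flag)) for flag in flags)])


def write_font_properties(path: str, fontname: str) -> None:
    """Write a one-line font_properties file with all five flags set to 0."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(font_properties_line(fontname) + "\n")


print(font_properties_line("myself"))  # myself 0 0 0 0 0
```

Calling `write_font_properties("font_properties.txt", "myself")` would create the file used in this walkthrough.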

9. Run the following three commands:

shapeclustering.exe -F font_properties.txt -U unicharset chi.myself.exp0.tr
mftraining.exe -F font_properties.txt -U unicharset -O unicharset chi.myself.exp0.tr
cntraining.exe chi.myself.exp0.tr
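
The three commands must run in this order, and a failure in one makes the following ones pointless. A Python sketch of that sequencing is shown below. It assumes the flags are the uppercase -F, -U and -O of the Tesseract 3.x training tools and that the executables are on the path; `training_commands` and `run_training` are hypothetical names.

```python
import subprocess


def training_commands(tr_file: str,
                      font_properties: str = "font_properties.txt",
                      unicharset: str = "unicharset") -> list[list[str]]:
    """Return the three clustering/training commands as argument lists, in order."""
    return [
        ["shapeclustering.exe", "-F", font_properties, "-U", unicharset, tr_file],
        ["mftraining.exe", "-F", font_properties, "-U", unicharset,
         "-O", unicharset, tr_file],
        ["cntraining.exe", tr_file],
    ]


def run_training(tr_file: str, cwd: str = ".") -> None:
    """Run the commands one after another, stopping at the first failure."""
    for command in training_commands(tr_file):
        subprocess.run(command, cwd=cwd, check=True)


# Only print the commands here; call run_training() to actually execute them.
for command in training_commands("chi.myself.exp0.tr"):
    print(" ".join(command))
```

With `check=True`, a non-zero exit code raises `CalledProcessError`, so a failed clustering step is not silently followed by the next tool.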



10. Rename the files

In the working directory, add the prefix "myself." to the five files unicharset, inttemp, pffmtable, shapetable, and normproto. Note the dot after the prefix. See the following figure:


Run the command:

combine_tessdata myself.

If the myself.traineddata file is generated, the training has succeeded.
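
The renaming step is easy to get wrong by hand, so it could also be scripted. The Python sketch below is an illustration under the assumption that the five output files sit in one directory; `add_prefix` is a hypothetical helper, and the demonstration runs on empty placeholder files in a temporary directory rather than on real training output.

```python
import tempfile
from pathlib import Path

TRAINING_OUTPUTS = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def add_prefix(directory, fontname="myself"):
    """Rename each training output to '<fontname>.<name>' and return the new names."""
    renamed = []
    for name in TRAINING_OUTPUTS:
        source = Path(directory) / name
        target = source.with_name(f"{fontname}.{name}")
        source.rename(target)  # FileNotFoundError if a training step was skipped
        renamed.append(target.name)
    return renamed


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    for name in TRAINING_OUTPUTS:
        (Path(workdir) / name).touch()
    print(add_prefix(workdir))
```

After the rename, combine_tessdata is run in the same directory with the prefix, including the trailing dot, as its argument.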


Copy the file into the tessdata directory, and you can then test it.


