Tesseract-OCR Training for Chinese


Last Update:2018-08-21 Source: Internet

Author: User


To improve the recognition rate of the Tesseract library, you can train it on Chinese characters.

1. Install Tesseract first. Note that you need to use the installer: the installed package includes the training programs, whereas the compiled version does not have these tools.

2. Download the jTessBoxEditor tool. It is written in Java and requires a JRE to run. It is mainly used to edit the box file and proofread the text. The following figure shows the tool's directory; click the file marked with the red box to run the program.

This time the goal is to have the library recognize the two characters 取消 ("cancel"), and I prepared 5 sample images:

3. Generate the TIF file

It is best to put the images in the Tesseract installation directory and do all the work there. In jTessBoxEditor, click Tools, then Merge TIFF. Select all 5 sample images and click Open. Another dialog box pops up for saving the merged TIF file. The naming rule for the TIF file is [lang].[fontname].exp[num].tif, where lang is the language and fontname is the font name; set them as you need. Click Save, and the TIF file appears in the directory.
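The naming rule can be checked with a small script before the file is saved. The following is a minimal sketch in Python; the helper names `tif_name` and `parse_tif_name` are made up for illustration and are not part of Tesseract or jTessBoxEditor.

```python
import re
from typing import NamedTuple


class TrainingName(NamedTuple):
    lang: str
    fontname: str
    num: int


def tif_name(lang: str, fontname: str, num: int = 0) -> str:
    """Build a file name of the form [lang].[fontname].exp[num].tif."""
    return f"{lang}.{fontname}.exp{num}.tif"


# lang and fontname must not contain dots, otherwise the parts are ambiguous.
_PATTERN = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<fontname>[^.]+)\.exp(?P<num>\d+)\.tif$"
)


def parse_tif_name(name: str) -> TrainingName:
    """Split a training TIF name into its parts; raise ValueError on a mismatch."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a [lang].[fontname].exp[num].tif name: {name!r}")
    return TrainingName(match["lang"], match["fontname"], int(match["num"]))


print(tif_name("chi", "myself"))  # chi.myself.exp0.tif
print(parse_tif_name("chi.myself.exp0.tif"))
```

With the values used in this article (lang `chi`, font name `myself`, number 0), the result is the file name chi.myself.exp0.tif that the later commands expect.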

4. Generate the box file

First open the command line, go to the Tesseract directory, and enter the command:

tesseract.exe chi.myself.exp0.tif chi.myself.exp0 batch.nochop makebox
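If you prefer to drive this step from a script, the same command can be assembled as an argument list. This is only a sketch: `makebox_command` is a hypothetical helper, and it assumes that tesseract.exe is reachable from the current directory, as in the step above.

```python
import subprocess


def makebox_command(base: str, exe: str = "tesseract.exe") -> list:
    """Return the argument list for the box-file command.

    base is the TIF file name without its extension, e.g. "chi.myself.exp0".
    """
    return [exe, f"{base}.tif", base, "batch.nochop", "makebox"]


if __name__ == "__main__":
    cmd = makebox_command("chi.myself.exp0")
    print(" ".join(cmd))
    # Uncomment to actually run it from the Tesseract directory:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids the stray spaces that easily creep into a hand-typed command line.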

5. Proofread the text

Use jTessBoxEditor to open the TIF file you just generated.

You will find that the recognized text it displays is incorrect.

We need to correct the characters in the Char column for every image. At this point Tesseract has recognized the image as four parts, so there are four rows (1, 2, 3, 4). We need to merge them into two rows, and the characters should be 取消. Follow these steps:

Now the two parts are merged. However, the Char column shows H, and it should be changed to 取. Follow these steps:

The other characters are handled the same way, and the final result is this:

I have 5 images in total. After all of them have been corrected, click Save. If we now open the chi.myself.exp0.box file (for example in Notepad), we can see that the corrections are there.

Note: instead of using the tool, you can also make these corrections directly in the box file, but that is error-prone.
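To make the direct-editing route less error-prone, the box file can be handled with a short script. The sketch below assumes the Tesseract 3.x box format, in which each line reads `<char> <left> <bottom> <right> <top> <page>` with the origin at the bottom-left corner of the image. The helper names and the sample coordinates are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one line of the form '<char> <left> <bottom> <right> <top> <page>'."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    char, *nums = parts
    return Box(char, *map(int, nums))


def format_box(box: Box) -> str:
    """Turn a Box back into a box-file line."""
    return f"{box.char} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


def merge_boxes(a: Box, b: Box, char: str) -> Box:
    """Combine two boxes on the same page into one bounding box labelled char."""
    if a.page != b.page:
        raise ValueError("boxes are on different pages")
    return Box(char, min(a.left, b.left), min(a.bottom, b.bottom),
               max(a.right, b.right), max(a.top, b.top), a.page)


# Made-up sample: one character that was split into two wrong boxes.
lines = ["H 10 12 30 60 0", "l 31 12 50 60 0"]
first, second = (parse_box_line(text) for text in lines)
print(format_box(merge_boxes(first, second, "取")))  # 取 10 12 50 60 0
```

When reading or writing a real box file this way, open it with `encoding="utf-8"` so that the Chinese characters survive the round trip.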

6. Generate the .tr file

tesseract.exe chi.myself.exp0.tif chi.myself.exp0 nobatch box.train

7. Generate the unicharset file

unicharset_extractor chi.myself.exp0.box

8. Create the font_properties file

In Notepad, create a new plain-text font_properties file (the commands in the next step refer to it as font_properties.txt) with the following content:

myself 0 0 0 0 0

Remember that there are five 0s.
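The line consists of the font name followed by five 0/1 flags; in the Tesseract 3.x training documentation these flags are italic, bold, fixed, serif and fraktur, in that order. The sketch below generates the line so that the count of flags cannot go wrong. Both helper names are hypothetical.

```python
from pathlib import Path


def font_properties_line(fontname, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Return '<fontname> <italic> <bold> <fixed> <serif> <fraktur>' with 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(flag)) for flag in flags])


def write_font_properties(path, fontnames):
    """Write one all-zero line per font name to path (hypothetical helper)."""
    text = "".join(font_properties_line(name) + "\n" for name in fontnames)
    Path(path).write_text(text, encoding="utf-8")


print(font_properties_line("myself"))  # myself 0 0 0 0 0
```

Calling `write_font_properties("font_properties.txt", ["myself"])` would produce the file described in this step.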

9. Run the following three commands:

shapeclustering.exe -F font_properties.txt -U unicharset chi.myself.exp0.tr

mftraining.exe -F font_properties.txt -U unicharset -O unicharset chi.myself.exp0.tr

cntraining.exe chi.myself.exp0.tr
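The three commands have to run in this order, and each one should succeed before the next starts. A minimal sketch of that sequence in Python follows; it assumes the upper-case `-F`, `-U` and `-O` options of the Tesseract 3.x training tools and the file names used in this article. `training_commands` and `run_all` are hypothetical helpers.

```python
import subprocess
from typing import List


def training_commands(tr_file: str,
                      font_props: str = "font_properties.txt",
                      unicharset: str = "unicharset") -> List[List[str]]:
    """Return the three training commands as argument lists, in run order."""
    return [
        ["shapeclustering.exe", "-F", font_props, "-U", unicharset, tr_file],
        ["mftraining.exe", "-F", font_props, "-U", unicharset,
         "-O", unicharset, tr_file],
        ["cntraining.exe", tr_file],
    ]


def run_all(commands, runner=subprocess.run):
    """Run the commands in order; check=True stops at the first failure."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        runner(cmd, check=True)


if __name__ == "__main__":
    for cmd in training_commands("chi.myself.exp0.tr"):
        print(" ".join(cmd))
    # Uncomment to run the tools from the Tesseract directory:
    # run_all(training_commands("chi.myself.exp0.tr"))
```

The `runner` parameter exists only so the sequence can be tried out without the Tesseract tools installed, for example by passing a function that just records the commands.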

10. Rename the files

Add the prefix myself. to five files in the directory: unicharset, inttemp, pffmtable, shapetable, and normproto. Note the trailing dot in the prefix. See the following figure:

Run the command:

combine_tessdata myself.

If the myself.traineddata file is generated, the training has succeeded.
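Renaming five files by hand makes it easy to forget the dot, so the renaming can also be scripted. The sketch below assumes the five training outputs are named unicharset, inttemp, pffmtable, shapetable and normproto and sit in one directory; `add_prefix` is a hypothetical helper, and `combine_tessdata` is expected to be reachable from that directory.

```python
from pathlib import Path
from typing import List

TRAINING_FILES = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def add_prefix(directory: Path, prefix: str = "myself.") -> List[Path]:
    """Rename each training file to prefix + name, e.g. myself.unicharset."""
    # Check everything first so a missing file does not leave a half-renamed set.
    missing = [name for name in TRAINING_FILES if not (directory / name).exists()]
    if missing:
        raise FileNotFoundError(f"missing training files: {missing}")
    renamed = []
    for name in TRAINING_FILES:
        target = directory / (prefix + name)
        (directory / name).rename(target)
        renamed.append(target)
    return renamed


def combine_command(prefix: str = "myself.") -> List[str]:
    """Argument list for combine_tessdata; the prefix keeps its trailing dot."""
    return ["combine_tessdata", prefix]


if __name__ == "__main__":
    print(" ".join(combine_command()))  # combine_tessdata myself.
```

After `add_prefix(Path("."))` has run in the working directory, the printed command is the one to execute.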

Copy the file into the tessdata directory, and you can test it using the
