Tesseract-OCR character recognition-sample training

Last Update:2018-12-05 Source: Internet

Author: User


Tesseract is an open-source OCR (Optical Character Recognition) engine that recognizes image files in multiple formats and converts them to text. It currently supports more than 60 languages (including Chinese). Tesseract was initially developed by HP and subsequently maintained by Google. It is currently hosted on Google Code at http://code.google.com/p/tesseract-ocr/.

Use the default language library for recognition

1. Install Tesseract

For test purposes, the Windows installer tesseract-ocr-setup-3.02.02.exe was downloaded and run directly. After a successful installation, a Tesseract-OCR directory is created on the chosen disk. The tesseract.exe program in that directory performs the character recognition.

2. Prepare an image to be recognized. In this example, a series of digits was written with a drawing tool and saved as number.jpg, as shown in the figure.

3. Open the command line, locate the Tesseract-OCR directory, and enter the command:

 tesseract.exe number.jpg result -l eng

Here, result is the base name of the output text file (result.txt), and eng specifies that the English language data is used for recognition.
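The command above can also be assembled from a script. The following is a minimal Python sketch, not part of the original article; the function name build_recognize_command and its parameters are hypothetical, and it only builds the argument list without running Tesseract.

```python
def build_recognize_command(image, output_base, lang="eng", exe="tesseract.exe"):
    """Return the argument list for: tesseract.exe <image> <output_base> -l <lang>.

    output_base is the output name without extension; Tesseract appends .txt.
    """
    return [exe, image, output_base, "-l", lang]


cmd = build_recognize_command("number.jpg", "result")
print(" ".join(cmd))  # tesseract.exe number.jpg result -l eng
```

To actually run it, the list can be passed to subprocess.run(cmd, check=True) with the working directory set to the Tesseract-OCR directory.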

4. Open the result.txt file in the Tesseract-OCR directory. The recognition result is 7542315857, which contains 3 misrecognized characters, so the recognition rate is not very high. Is there any way to improve it? Tesseract provides a sample-training procedure for generating a custom language library. The following describes how to train on samples.
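One simple way to put a number on the recognition rate is to compare the recognized text with the known text position by position. The sketch below is an illustration only: the function recognition_rate is hypothetical, and the two strings are made-up placeholders, not the contents of number.jpg (the original image is not reproduced here).

```python
def recognition_rate(expected, recognized):
    """Fraction of positions in `expected` whose character was recognized correctly.

    Characters missing from a shorter `recognized` string count as errors;
    extra trailing characters in `recognized` are ignored.
    """
    if not expected:
        raise ValueError("expected text must not be empty")
    matches = sum(1 for e, r in zip(expected, recognized) if e == r)
    return matches / len(expected)


# Placeholder strings (assumption): 10 digits, 3 of them misrecognized.
expected = "0123456789"
recognized = "0128456139"
print(recognition_rate(expected, recognized))  # 0.7
```

With 3 errors in 10 characters, as in the article's result, the rate works out to 70%.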

The official Tesseract-OCR website provides a detailed description of sample training: http://code.google.com/p/tesseract-ocr/wiki/trainingtesseract3. The following simple example illustrates the procedure.

1. Download the jTessBoxEditor tool.

2. Obtain the sample images. Five sample images of the digits 0-9 were drawn with a drawing tool (the more samples, the better), as shown in the figure.

3. Merge the sample images. Run the jTessBoxEditor tool. On the menu bar, choose Tools > Merge TIFF. In the dialog box that opens, select the sample images (hold Shift to select multiple images) and merge them into the file num.font.exp0.tif.

4. Generate the box file. Open the command line and execute the command:

  tesseract.exe num.font.exp0.tif num.font.exp0 batch.nochop makebox

The generated box file is num.font.exp0.box. It contains the characters recognized by Tesseract and their coordinates.

Note: The command format for generating a box file is:

  tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

Here, lang is the language name, fontname is the font name, and num is the serial number.
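The naming convention can be captured in a small helper. This is a hedged Python sketch; the function names training_base_name and build_makebox_command are hypothetical, and the code only constructs the command shown above rather than executing it.

```python
def training_base_name(lang, fontname, num):
    """Build the [lang].[fontname].exp[num] base name used by the training files."""
    return f"{lang}.{fontname}.exp{num}"


def build_makebox_command(lang, fontname, num, exe="tesseract"):
    """Return the argument list for the makebox command in the format given above."""
    base = training_base_name(lang, fontname, num)
    return [exe, base + ".tif", base, "batch.nochop", "makebox"]


print(" ".join(build_makebox_command("num", "font", 0, exe="tesseract.exe")))
# tesseract.exe num.font.exp0.tif num.font.exp0 batch.nochop makebox
```

Generating the names from one place helps keep the .tif, .box, and later training files consistent with each other.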

5. Correct the text. Run the jTessBoxEditor tool and open the num.font.exp0.tif file (the .box file generated in the previous step and the .tif sample file must be in the same directory), as shown in the figure. Some characters are recognized incorrectly. Use the tool to manually correct the misrecognized characters on each page, then save the changes.
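Before or after correcting in jTessBoxEditor, it can help to list what the box file currently contains. The sketch below assumes the Tesseract 3.x box format of one character per line, written as "char left bottom right top page"; the names BoxEntry and parse_box_lines and the sample coordinates are hypothetical.

```python
from typing import List, NamedTuple


class BoxEntry(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines) -> List[BoxEntry]:
    """Parse box-file lines of the assumed form: <char> <left> <bottom> <right> <top> <page>."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split from the right so the five numeric fields are always separated.
        char, left, bottom, right, top, page = line.rsplit(None, 5)
        entries.append(BoxEntry(char, int(left), int(bottom), int(right), int(top), int(page)))
    return entries


# Made-up sample lines (assumption), not taken from the article's box file.
sample = ["7 10 20 30 50 0", "5 35 20 55 50 0"]
entries = parse_box_lines(sample)
print("".join(e.char for e in entries))  # 75
```

To read a real file, pass an open file object, for example parse_box_lines(open("num.font.exp0.box", encoding="utf-8")).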

6. Define the font properties file. Tesseract-OCR 3.01 and later require a font properties file named font_properties to be created before training.

The font_properties file must not contain a BOM header. Each line has the following format:

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

Here, fontname is the font name and must match the fontname in [lang].[fontname].exp[num].box. Each of <italic>, <bold>, <fixed>, <serif>, and <fraktur> is 1 or 0, indicating whether the font has that attribute.

Create a file named font_properties in the directory that contains the sample images, open it in Notepad, and enter the following content:

font 0 0 0 0 0

All values are 0, indicating that the font has none of these attributes (for example, it is neither bold nor italic).
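Because the file must not start with a BOM, writing it from a script with an explicit encoding is one way to avoid an editor adding one. The following Python sketch is an illustration under that assumption; write_font_properties is a hypothetical helper, and the demo writes to a temporary directory rather than the real training directory.

```python
import os
import tempfile


def write_font_properties(directory, fontname, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Write a one-line font_properties file: <fontname> <italic> <bold> <fixed> <serif> <fraktur>."""
    line = f"{fontname} {italic} {bold} {fixed} {serif} {fraktur}\n"
    path = os.path.join(directory, "font_properties")
    # Plain ASCII output never contains a byte-order mark.
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write(line)
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = write_font_properties(tmp, "font")
    with open(path, "rb") as f:
        print(f.read())  # b'font 0 0 0 0 0\n'
```

In practice the directory argument would be the folder holding num.font.exp0.tif and num.font.exp0.box.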

7. Generate the language file. Create a batch file in the directory that contains the sample images and enter the following content:

rem Before running this batch file, create the font_properties file in this directory

echo Run Tesseract for Training..
tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train

echo Compute the Character Set..
unicharset_extractor.exe num.font.exp0.box
mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr

echo Clustering..
cntraining.exe num.font.exp0.tr

echo Rename Files..
rename normproto num.normproto
rename inttemp num.inttemp
rename pffmtable num.pffmtable
rename shapetable num.shapetable

echo Create Tessdata..
combine_tessdata.exe num.
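The same sequence of steps can be generated for any language and font name instead of hard-coding num and font. This is a sketch only, mirroring the batch file's commands; the function names are hypothetical, and nothing is executed here.

```python
def build_training_commands(lang, fontname, num):
    """Return the training commands from the batch file as argument lists, in order."""
    base = f"{lang}.{fontname}.exp{num}"
    return [
        ["tesseract.exe", base + ".tif", base, "nobatch", "box.train"],
        ["unicharset_extractor.exe", base + ".box"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", base + ".tr"],
        ["cntraining.exe", base + ".tr"],
    ]


def rename_plan(lang):
    """Return (old, new) pairs that prefix the generated files with the language name."""
    return [(name, f"{lang}.{name}")
            for name in ("normproto", "inttemp", "pffmtable", "shapetable")]


def build_combine_command(lang):
    """Return the final command; note the trailing dot after the language name."""
    return ["combine_tessdata.exe", f"{lang}."]


for cmd in build_training_commands("num", "font", 0):
    print(" ".join(cmd))
for old, new in rename_plan("num"):
    print("rename", old, new)
print(" ".join(build_combine_command("num")))
```

Each argument list could be run with subprocess.run(cmd, check=True) and each rename applied with os.rename, so that a failing step stops the run.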

Run the batch file from the command line. The execution result is as follows:

Make sure that the offsets for types 1, 3, 4, 5, and 13 in the printed result are not -1. If so, a new language file has been generated successfully.
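This check can be automated if the printed result is captured as text. The sketch below assumes that combine_tessdata prints lines of the form "Offset for type N is V"; that format, the function names, and the sample values are assumptions for illustration and should be compared against the actual output.

```python
import re

# Assumed line format: "Offset for type <N> is <V>".
OFFSET_LINE = re.compile(r"Offset for type\s+(\d+).*?is\s+(-?\d+)")


def parse_offsets(output):
    """Map each reported type number to its offset value."""
    return {int(t): int(v) for t, v in OFFSET_LINE.findall(output)}


def missing_required(offsets, required=(1, 3, 4, 5, 13)):
    """Return the required types whose offset is -1 or that were not reported at all."""
    return [t for t in required if offsets.get(t, -1) == -1]


# Made-up sample output (assumption), not copied from a real run.
sample = """Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 3 is 3962
Offset for type 4 is 20000
Offset for type 5 is 20300
Offset for type 13 is -1
"""
print(missing_required(parse_offsets(sample)))  # [13]
```

An empty list means all five required components were packed into the language file.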

The final language file is num.traineddata. Copy it to the tessdata directory under Tesseract-OCR, after which it can be used for character recognition.
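The copy step can be scripted as well. This is a minimal sketch with a hypothetical helper, install_traineddata; the demo uses a temporary directory and a placeholder file in place of a real Tesseract-OCR installation.

```python
import os
import shutil
import tempfile


def install_traineddata(traineddata_path, tesseract_dir):
    """Copy a .traineddata file into <tesseract_dir>/tessdata and return the new path."""
    tessdata = os.path.join(tesseract_dir, "tessdata")
    if not os.path.isdir(tessdata):
        raise FileNotFoundError(f"tessdata directory not found: {tessdata}")
    return shutil.copy(traineddata_path, tessdata)


# Demo with stand-in paths (assumption): a temp folder plays the Tesseract-OCR directory.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "tessdata"))
    src = os.path.join(tmp, "num.traineddata")
    with open(src, "wb") as f:
        f.write(b"placeholder")
    print(os.path.basename(install_traineddata(src, tmp)))  # num.traineddata
```

The language name passed to -l is the file name without the .traineddata extension.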

Use the trained language library for recognition

To recognize number.jpg with the trained language library, open the command line, change to the Tesseract-OCR directory, and enter the command:

tesseract.exe number.jpg result -l num

As the recognition result shows, the recognition rate improves considerably. Custom training samples can also be used for tasks such as recognizing graphical verification codes and license plate numbers. Interested readers can explore these further.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
