Training Method for Tesseract 3 Language Data (http://blog.csdn.net/dragoo1/article/details/8439373)


Note: I downloaded the source code from Google Code and built the lib_debug configuration, then dll_debug, so I copied the binaries directly from E:\buildfolder\Tesseract-OCR\vs2008\lib_debug to E:\buildfolder\Tesseract-OCR\Testing.

Steps:

1.1. Make box files

E:\buildfolder\Tesseract-OCR\Testing>tesseract-dlld ABC.Roman.exp0.tif ABC.Roman.exp0 -l eng batch.nochop makebox
Tesseract Open Source OCR Engine v3.02 with Leptonica

1.2. Fix the box file

Use cowboxer to correct the contents of the box file; see its Help for how to use it.
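
For orientation while editing: each line of a Tesseract 3 box file holds one symbol followed by its bounding box and a page number, in the form `<symbol> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image. The following is a minimal Python sketch, not part of the original workflow, that parses such lines and flags boxes with zero or negative width or height. The names `Box`, `parse_box_line`, and `is_degenerate`, and the sample values, are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so the symbol field is kept whole.
    symbol, left, bottom, right, top, page = line.rstrip("\r\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def is_degenerate(box: Box) -> bool:
    """True if the box has zero or negative width or height."""
    return box.right <= box.left or box.top <= box.bottom


# Hypothetical sample lines, not taken from the article's data.
sample = ["A 10 20 30 50 0", "B 35 20 35 50 0"]
boxes = [parse_box_line(s) for s in sample]
bad = [b.symbol for b in boxes if is_degenerate(b)]
print(bad)
```

A check like this can be run over a saved box file before training to catch entries that were damaged during manual editing.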

1.3. Run tesseract for training

E:\buildfolder\Tesseract-OCR\Testing>tesseract-dlld ABC.Roman.exp0.tif ABC.Roman.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
Apply_boxes:
Boxes read from boxfile: 14
Found 14 good blobs.
Training... font name = Roman
Generated training data for 2 words
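
In the output above, the number of boxes read (14) equals the number of good blobs found (14). As a hedged illustration, the Python sketch below compares the two counts in a captured log so that a mismatch can be spotted automatically. It assumes the log wording shown in this article; `check_apply_boxes` is a hypothetical helper name, and what a mismatch means for a given image still has to be checked by hand.

```python
import re


def check_apply_boxes(log: str) -> bool:
    """Return True when the count of boxes read equals the count of good blobs."""
    read = re.search(r"Boxes read from boxfile:\s*(\d+)", log, re.IGNORECASE)
    good = re.search(r"Found\s+(\d+)\s+good blobs", log, re.IGNORECASE)
    if read is None or good is None:
        raise ValueError("expected apply_boxes counts not found in log")
    return int(read.group(1)) == int(good.group(1))


# Log text copied from the training run shown above.
log = """Boxes read from boxfile: 14
Found 14 good blobs.
"""
print(check_apply_boxes(log))
```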

1.4. Compute the character set

E:\buildfolder\Tesseract-OCR\Testing>unicharset_extractord ABC.Roman.exp0.box
Extracting unicharset from ABC.Roman.exp0.box
Wrote unicharset file ./unicharset.

1.5. Clustering

In this step, first create a font_properties.txt file with one line per font, in the following format:

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

My file contains:

Roman 0 0 0 0 0
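
The font name in font_properties.txt has to match the font name that the training tools take from the training file name (the run in step 1.3 reports "font name = Roman"). The Python sketch below derives the font name from a file name and formats the corresponding entry. It assumes the `<lang>.<fontname>.exp<N>.tr` naming pattern, which is an assumption about the file names used in this walkthrough; `font_name_from_tr` and `font_properties_line` are hypothetical helper names.

```python
def font_name_from_tr(filename: str) -> str:
    """Extract the font name from '<lang>.<fontname>.exp<N>.tr' (assumed pattern)."""
    parts = filename.split(".")
    if len(parts) != 4 or not parts[2].startswith("exp"):
        raise ValueError(f"unexpected training file name: {filename}")
    return parts[1]


def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False) -> str:
    """Format one font_properties.txt entry: name followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(flag)) for flag in flags])


font = font_name_from_tr("ABC.Roman.exp0.tr")
print(font_properties_line(font))
```

For a bold serif font, the same helper would be called as `font_properties_line(font, bold=True, serif=True)`.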

E:\buildfolder\Tesseract-OCR\Testing>mftrainingd -F font_properties.txt -U unicharset ABC.Roman.exp0.tr
Warning: no shape table file present: shapetable
Reading ABC.Roman.exp0.tr...
Flat shape table Summary: Number of shapes = 12 max unichars = 1 number with multiple unichars = 0
Done!

E:\buildfolder\Tesseract-OCR\Testing>cntrainingd ABC.Roman.exp0.tr
Reading ABC.Roman.exp0.tr...
Clustering...

Writing normproto...

1.6. Combine

At this point several files have been generated in the directory. Add the prefix "Roman." to the four files unicharset, inttemp, normproto, and pffmtable (for example, Roman.unicharset), then run:

E:\buildfolder\Tesseract-OCR\Testing>combine_tessdatad Roman.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 939
Offset for type 4 is 140232
Offset for type 5 is 140335
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 141961
Offset for type 14 is -1
Offset for type 15 is -1
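
In this listing, an offset of -1 marks a component that was not included in the combined file, while any other value is the position of a component that was. The Python sketch below, offered only as an illustration, extracts the entries that are present from captured output. The article does not state which file each type index corresponds to, so the sketch does not assume a mapping; `present_types` is a hypothetical helper name and the sample is an excerpt of the output above.

```python
import re


def present_types(output: str) -> dict:
    """Map each type index whose offset is not -1 to that offset."""
    pattern = re.compile(r"Offset for type\s+(\d+)\s+is\s*(-?\d+)", re.IGNORECASE)
    found = {}
    for match in pattern.finditer(output):
        index, offset = int(match.group(1)), int(match.group(2))
        if offset != -1:
            found[index] = offset
    return found


# Excerpt of the combine output shown above.
output = """Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 939
Offset for type 4 is 140232
Offset for type 5 is 140335
Offset for type 13 is 141961
"""
print(sorted(present_types(output)))
```

If an entry that should be present shows -1, a likely first thing to check is whether the corresponding file was given the "Roman." prefix before combining.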

1.7. Test

Copy the generated Roman.traineddata to E:\buildfolder\Tesseract-OCR\Testing\tessdata, then run:

tesseract ABC.Roman.exp0.tif result -l Roman -psm 7 nobatch

At this point recognition works.

References:

http://blog.wudilabs.org/entry/f25efc5f?lang=zh-CN
http://www.lixin.me/blog/2012/05/26/29536
http://wenku.baidu.com/view/5eafc201e87101f69e3195f4.html
http://www.84kf.com/html/22453.html
http://blog.csdn.net/fengbingchun/article/details/7022421

------------------------------------------------------------

From: http://blog.wudilabs.org/entry/f25efc5f?lang=zh-CN

Programs used

(1) Tesseract 3.00
(2) Tesseract 3.00 bugfix
(3) cowboxer 1.01
(4) Universal Extractor 1.61 (not required)

Use Universal Extractor to unpack the Tesseract installation package, then replace the original main program with the tesseract.exe from the bugfix. Tesseract is then ready to use. cowboxer is a program for editing box files.

Generate the first box file

In this walkthrough, Tesseract is extracted to the E:\Tesseract-OCR directory. A build directory is then created inside it to hold the source images and the files generated during training. There are three source images (test.001.tif to test.003.tif).

First, generate the box file for the first image, test.001.tif. The official eng language data is used here for the initial recognition:

E:\Tesseract-OCR\build>..\tesseract test.001.tif test.001 -l eng batch.nochop makebox
Tesseract Open Source OCR Engine with Leptonica
Number of found pages: 1.

This command generates test.001.box in the build directory. Open the box file with cowboxer, which automatically finds the tif file with the same name.

For how to use cowboxer, see the instructions under Help -> About. When the corrections are complete, choose File -> Save box file to save the file.

Generate the initial traineddata

Next, use this box file to generate a traineddata. Using this traineddata when generating box files for the other images improves recognition accuracy and reduces the number of corrections needed.

..\tesseract test.001.tif test.001 nobatch box.train
..\Training\unicharset_extractor test.001.box
..\Training\mftraining -U unicharset -O test.unicharset test.001.tr
..\Training\cntraining test.001.tr
rename normproto test.normproto
rename microfeat test.microfeat
rename inttemp test.inttemp
rename pffmtable test.pffmtable
..\Training\combine_tessdata test.

Running these commands in the build directory produces a usable test.traineddata.
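
The four rename commands above can also be done programmatically, which helps when the language prefix changes between runs. The following Python sketch is only an illustration under that assumption: `add_lang_prefix` is a hypothetical helper name, and the demonstration uses empty placeholder files in a temporary directory instead of real training output.

```python
import os
import tempfile

# File names taken from the rename commands above.
TRAINING_OUTPUTS = ["normproto", "microfeat", "inttemp", "pffmtable"]


def add_lang_prefix(directory: str, lang: str, names=TRAINING_OUTPUTS) -> list:
    """Rename '<name>' to '<lang>.<name>' for each listed file that exists."""
    renamed = []
    for name in names:
        src = os.path.join(directory, name)
        if os.path.isfile(src):
            os.rename(src, os.path.join(directory, f"{lang}.{name}"))
            renamed.append(f"{lang}.{name}")
    return renamed


# Demonstration in a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("normproto", "inttemp", "pffmtable"):
        open(os.path.join(workdir, name), "w").close()
    result = add_lang_prefix(workdir, "test")
    print(result)
```

Files that are missing are skipped without an error, so the returned list shows which outputs were actually present.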

Generate other box files

Move the test.traineddata generated in the previous step to the Tesseract-OCR\tessdata directory. The remaining box files can then be generated with the -l test parameter:

..\tesseract test.002.tif test.002 -l test batch.nochop makebox
..\tesseract test.003.tif test.003 -l test batch.nochop makebox

Only three source images are used here as an example. With a real training set, generate intermediate traineddata files as often as the situation calls for. The purpose of an intermediate traineddata is to improve recognition accuracy so that box files generated later need fewer corrections.

Generate the final traineddata

Once all the box files have been corrected, the final traineddata can be generated.

..\tesseract test.001.tif test.001 nobatch box.train
..\tesseract test.002.tif test.002 nobatch box.train
..\tesseract test.003.tif test.003 nobatch box.train
..\Training\unicharset_extractor test.001.box test.002.box test.003.box
..\Training\mftraining -U unicharset -O test.unicharset test.001.tr test.002.tr test.003.tr
..\Training\cntraining test.001.tr test.002.tr test.003.tr
rename normproto test.normproto
rename microfeat test.microfeat
rename inttemp test.inttemp
rename pffmtable test.pffmtable
..\Training\combine_tessdata test.

When there are many files, you can write a program to generate a script like this and then run it.
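
One way to follow this suggestion is a short generator such as the Python sketch below. It only builds the text of the batch script, mirroring the command sequence above, and does not run anything. `training_script` is a hypothetical helper name, the relative paths are those used in this walkthrough, and `-U` is written as it appears in Tesseract's training documentation.

```python
def training_script(lang: str, bases: list) -> str:
    """Build the batch commands for the final training pass.

    `bases` are image names without extension, for example 'test.001'.
    """
    # One box.train run per image.
    lines = [rf"..\tesseract {b}.tif {b} nobatch box.train" for b in bases]
    boxes = " ".join(f"{b}.box" for b in bases)
    trs = " ".join(f"{b}.tr" for b in bases)
    lines += [
        rf"..\Training\unicharset_extractor {boxes}",
        rf"..\Training\mftraining -U unicharset -O {lang}.unicharset {trs}",
        rf"..\Training\cntraining {trs}",
    ]
    # Prefix the clustering outputs with the language name.
    for name in ("normproto", "microfeat", "inttemp", "pffmtable"):
        lines.append(f"rename {name} {lang}.{name}")
    lines.append(rf"..\Training\combine_tessdata {lang}.")
    return "\n".join(lines) + "\n"


bases = [f"test.{i:03d}" for i in range(1, 4)]
script = training_script("test", bases)
print(script)
```

The returned text can be written to a .bat file in the build directory; for a larger set, only the range used to build `bases` needs to change.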
