Training language data for Tesseract 3

Source: Internet
Author: User

required programs

(1) Tesseract 3.00
(2) Tesseract 3.00 bugfix
(3) Cowboxer 1.01
(4) Universal Extractor 1.61 (not required)

Use Universal Extractor to unpack the Tesseract installation package, then overwrite the original main program with the tesseract.exe from the bugfix package. Tesseract is then ready to use. Cowboxer is the program used to edit box files.

generate the first box file

This walkthrough extracts Tesseract to the E:\tesseract-ocr directory. A build directory is then created inside it to hold the source images and the files generated during training. There are 3 source images, test.001.tif through test.003.tif.

First, create the box file for the first image, test.001.tif, using the official eng language data for recognition:

E:\tesseract-ocr\build>..\tesseract test.001.tif test.001 -l eng batch.nochop makebox
Tesseract Open Source OCR Engine with Leptonica
Number of found pages: 1.
This command generates test.001.box in the build directory. When you open the box file in Cowboxer, it automatically finds and displays the TIF file with the same name.
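
Before and after editing in Cowboxer, it can help to sanity-check the box file. The following is a minimal Python sketch, not part of the original workflow. It assumes each line has the form "character left bottom right top [page]" with integer coordinates measured from the bottom-left corner of the image; the names Box and parse_box_lines are hypothetical.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: List[str]) -> List[Box]:
    """Parse lines of the assumed form 'char left bottom right top [page]'."""
    boxes = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (5, 6):
            raise ValueError(f"line {number}: expected 5 or 6 fields, got {len(fields)}")
        left, bottom, right, top = (int(value) for value in fields[1:5])
        # Assumption: a missing page field means page 0.
        page = int(fields[5]) if len(fields) == 6 else 0
        # Assumption: the origin is bottom-left, so left < right and bottom < top.
        if left >= right or bottom >= top:
            raise ValueError(f"line {number}: empty or inverted box")
        boxes.append(Box(fields[0], left, bottom, right, top, page))
    return boxes
```

To check a real file, pass it the lines of test.001.box, for example parse_box_lines(open("test.001.box", encoding="utf-8").read().splitlines()).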



Instructions for using Cowboxer can be found under Help > About. When you have finished editing, save the box file from the File menu.

generate the initial traineddata

Next, build a traineddata file from this box file. Using that traineddata when generating box files for the other images improves recognition accuracy and reduces the number of corrections needed.

..\tesseract test.001.tif test.001 nobatch box.train
..\training\unicharset_extractor test.001.box
..\training\mftraining -U unicharset -O test.unicharset test.001.tr
..\training\cntraining test.001.tr
rename normproto test.normproto
rename Microfeat test.Microfeat
rename inttemp test.inttemp
rename pffmtable test.pffmtable
..\training\combine_tessdata test.
Running this series of commands in the build directory produces a usable test.traineddata.
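
The four rename commands are easy to get wrong by hand, and a missing file only shows up later when combine_tessdata runs. As an illustration, here is a small Python sketch of the same renaming step that first checks that all four outputs exist. The function name prefix_training_outputs is hypothetical, and the file names are taken from the rename commands above.

```python
import os

# Output files named in the rename commands above.
TRAINING_OUTPUTS = ["normproto", "Microfeat", "inttemp", "pffmtable"]


def prefix_training_outputs(build_dir: str, lang: str) -> list:
    """Rename each training output in build_dir to '<lang>.<name>'."""
    missing = [
        name for name in TRAINING_OUTPUTS
        if not os.path.exists(os.path.join(build_dir, name))
    ]
    if missing:
        # Stop before renaming anything so the directory is left unchanged.
        raise FileNotFoundError("missing training outputs: " + ", ".join(missing))
    renamed = []
    for name in TRAINING_OUTPUTS:
        target = f"{lang}.{name}"
        os.replace(os.path.join(build_dir, name), os.path.join(build_dir, target))
        renamed.append(target)
    return renamed
```

Calling prefix_training_outputs(".", "test") from the build directory would stand in for the four rename lines.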

generate the rest of the box files

Move the test.traineddata generated in the previous step to the tesseract-ocr\tessdata directory. You can then select it with the -l test parameter when generating the remaining box files.

..\tesseract test.002.tif test.002 -l test batch.nochop makebox
..\tesseract test.003.tif test.003 -l test batch.nochop makebox
This example uses only 3 source images. In real training, when to generate an intermediate traineddata depends on the situation. Its only purpose is to improve recognition accuracy so that the box files generated later need fewer corrections.
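
With more images, the makebox lines can be produced rather than typed. The sketch below is one possible way to do that in Python; makebox_commands is a hypothetical helper, and the default ..\tesseract path assumes the commands are run from the build directory as in this walkthrough.

```python
def makebox_commands(images, lang, tesseract=r"..\tesseract"):
    """Build one makebox command line per image, using the given language data."""
    commands = []
    for image in images:
        # "test.002.tif" -> output base "test.002"
        base = image.rsplit(".", 1)[0]
        commands.append(f"{tesseract} {image} {base} -l {lang} batch.nochop makebox")
    return commands


for line in makebox_commands(["test.002.tif", "test.003.tif"], "test"):
    print(line)
```

Passing "eng" instead of "test" gives the command used for the first image.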

generate the final traineddata

After all the box files have been made, the final traineddata can be generated.

..\tesseract test.001.tif test.001 nobatch box.train
..\tesseract test.002.tif test.002 nobatch box.train
..\tesseract test.003.tif test.003 nobatch box.train
..\training\unicharset_extractor test.001.box test.002.box test.003.box
..\training\mftraining -U unicharset -O test.unicharset test.001.tr test.002.tr test.003.tr
..\training\cntraining test.001.tr test.002.tr test.003.tr
rename normproto test.normproto
rename Microfeat test.Microfeat
rename inttemp test.inttemp
rename pffmtable test.pffmtable
..\training\combine_tessdata test.
When there are many files, this command script can be generated by a program.
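
As a rough illustration of generating the script by program, the Python sketch below builds the same command sequence for any number of images. The function training_commands and its parameters are hypothetical; the paths assume the layout used here, with the commands run from the build directory.

```python
def training_commands(lang, basenames,
                      tesseract=r"..\tesseract", training_dir=r"..\training"):
    """Return the training command lines for image bases such as 'test.001'."""
    box_files = " ".join(f"{base}.box" for base in basenames)
    tr_files = " ".join(f"{base}.tr" for base in basenames)

    # One box.train run per image produces the matching .tr file.
    commands = [f"{tesseract} {base}.tif {base} nobatch box.train" for base in basenames]
    commands.append(f"{training_dir}\\unicharset_extractor {box_files}")
    commands.append(f"{training_dir}\\mftraining -U unicharset -O {lang}.unicharset {tr_files}")
    commands.append(f"{training_dir}\\cntraining {tr_files}")
    # Prefix the outputs with the language name before combining them.
    for name in ("normproto", "Microfeat", "inttemp", "pffmtable"):
        commands.append(f"rename {name} {lang}.{name}")
    commands.append(f"{training_dir}\\combine_tessdata {lang}.")
    return commands


bases = [f"test.{index:03d}" for index in range(1, 4)]
print("\n".join(training_commands("test", bases)))
```

Redirecting the printed output to a .bat file in the build directory gives a script equivalent to the one listed above.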
