Simple use and training of Tesseract-OCR


Tesseract is an open-source OCR (optical character recognition) engine originally developed by HP Labs and now maintained by Google. It is comparable to Microsoft Office Document Imaging (MODI). We can keep training its data so that its ability to convert images to text keeps improving, and a team with deeper needs can also use it as a base for developing an OCR engine that fits its own requirements.

The source code is at: https://github.com/tesseract-ocr/tesseract

The EXE installer is at: http://download.csdn.net/download/whatday/7740469

Next, we will install tesseract on Windows, run a simple conversion, and then train it.

1, using tesseract:

General process: install tesseract -> open the command line -> generate the target file

Install tesseract

Download the tesseract-ocr-setup-3.02.02.exe installation package. After a successful installation there will be a Tesseract-OCR folder on the corresponding disk.

Open command line

Open the command line, type tesseract, and press Enter. Tesseract prints a summary of its usage and general options.

Generate target file

First prepare an image file, such as test.png

Switch the command line to the directory containing the image. For example, to convert test.png (a variety of image formats are allowed) located in C:\Users\Lian\Desktop\test, change to that directory and then enter on the command line

tesseract test.png output_1 -l eng

"Syntax": tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile ...]

imagename is the file name of the target image, and the extension is required; outputbase is the base name of the result file (tesseract appends .txt); lang is the language name (the tessdata folder in Tesseract-OCR contains the language file eng.traineddata, whose name starts with eng). If -l is not given, eng is used by default.
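The command syntax above can also be assembled from a script. The following is a minimal Python sketch, assuming tesseract is on the PATH; the helper name build_tesseract_command is hypothetical and not part of Tesseract. It only builds the argument list and leaves running it to the caller.

```python
def build_tesseract_command(image_name, output_base, lang=None, psm=None):
    """Build the argument list for:
    tesseract imagename outputbase [-l lang] [-psm pagesegmode]
    (hypothetical helper; option names follow the syntax line above).
    """
    command = ["tesseract", image_name, output_base]
    if lang is not None:
        command += ["-l", lang]
    if psm is not None:
        command += ["-psm", str(psm)]
    return command


# Show the command used in this article.
print(" ".join(build_tesseract_command("test.png", "output_1", lang="eng")))
# prints: tesseract test.png output_1 -l eng
```

To actually run the conversion, the list can be passed to subprocess.run(command, check=True) from the directory that contains the image.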

Open the file output_1.txt and you will find that tesseract converted the image to 152408.

That is worth celebrating: the veteran Tesseract is still quite capable. But the result is still a little inaccurate, so is there any way to improve the accuracy of tesseract's character recognition? Next, we will use the companion training tool jTessBoxEditor to train on samples and improve the accuracy.

2, tesseract training:

The general process is: install jTessBoxEditor -> get sample files -> merge sample files -> generate box file -> define character configuration file -> correct characters -> execute batch file -> put the generated traineddata into tessdata

Install jTessBoxEditor

Download jTessBoxEditor from https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/ and unzip it to get jTessBoxEditor. Since it is developed in Java, make sure the JRE (Java Runtime Environment) is installed before running jTessBoxEditor.

Get sample files

We can draw the sample files with a drawing tool; the more samples the better. I drew 5 pictures of my own.

"Note": The sample images must be in tif/tiff format; otherwise a "couldn't seek" error will occur when merging the sample files.

Merge sample file
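Before merging, it can be worth checking that every sample really is a TIFF file, since a file that was only renamed to .tif is still not a TIFF. The following is a minimal Python sketch under that assumption; the function names are hypothetical, and the check only looks at the 4-byte header that classic TIFF files start with.

```python
# Classic TIFF files start with "II*\0" (little-endian) or "MM\0*" (big-endian).
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(path):
    """Return True if the file begins with a classic TIFF header."""
    with open(path, "rb") as handle:
        return handle.read(4) in TIFF_MAGIC


def non_tiff_samples(paths):
    """Return the sample paths whose content is not TIFF."""
    return [path for path in paths if not looks_like_tiff(path)]
```

Any path returned by non_tiff_samples should be re-saved as TIFF in an image editor before it is selected in Merge TIFF.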

Open jTessBoxEditor, choose Tools -> Merge TIFF, select all the sample files, and save the merged file as num.font.exp0.tif

Generate box file

Open the command line, switch to the directory containing num.font.exp0.tif, and enter the following command, which generates a file named num.font.exp0.box

tesseract num.font.exp0.tif num.font.exp0 batch.nochop makebox

"Syntax": tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

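The naming pattern in the syntax line can be checked before running tesseract. The following is a minimal Python sketch, assuming the three parts are separated by dots exactly as shown; the regular expression and the function name parse_training_name are illustrative, not something Tesseract provides.

```python
import re

# lang.fontname.exp<num>.tif, as in the syntax line above.
TRAINING_NAME = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<fontname>[^.]+)\.exp(?P<num>\d+)\.tif$"
)


def parse_training_name(file_name):
    """Split a training image name into (lang, fontname, num)."""
    match = TRAINING_NAME.match(file_name)
    if match is None:
        raise ValueError("not in lang.fontname.expN.tif form: " + file_name)
    return match.group("lang"), match.group("fontname"), int(match.group("num"))


print(parse_training_name("num.font.exp0.tif"))
# prints: ('num', 'font', 0)
```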
lang is the language name, fontname is the font name, and num is the sequence number. Tesseract is strict about this naming format, so follow it exactly.

Define the character configuration file

Create a text file named font_properties in the target folder, with the content

font 0 0 0 0 0
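A line like this can also be produced by a script. The following is a minimal Python sketch; the helper names are hypothetical. Each flag is written as 1 or 0 in the order italic, bold, fixed, serif, fraktur, and the font name should be the same one used in the training file name (font in num.font.exp0.tif).

```python
def font_properties_line(fontname, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Return one font_properties entry: name followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + ["1" if flag else "0" for flag in flags])


def write_font_properties(path, fontname, **flags):
    """Write a single-entry font_properties file to the given path."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        handle.write(font_properties_line(fontname, **flags) + "\n")


print(font_properties_line("font"))
# prints: font 0 0 0 0 0
```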

"Syntax":<fontname> <italic> <bold> <fixed> <serif> <fraktur>

fontname is the font name; italic, bold, fixed (fixed-pitch, i.e. monospaced), serif, and fraktur (German blackletter) describe the style of the font. Each takes 1 (the font has the property) or 0 (it does not), so the font can be described as finely as needed.

Character correction

Open jTessBoxEditor, choose Box Editor -> Open, and open num.font.exp0.tif. Correct the characters in the <Char> field, and remember that <Page> shows how many pages there are, so check every page.

Remember to save after making changes.

Execute batch file

Create a batch file in the target directory with the following content

echo Run tesseract for training.
tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train

echo Compute the Character Set.
unicharset_extractor.exe num.font.exp0.box
mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr

echo Clustering.
cntraining.exe num.font.exp0.tr

echo Rename Files.
rename normproto num.normproto
rename inttemp num.inttemp
rename pffmtable num.pffmtable
rename shapetable num.shapetable

echo Create tessdata.
combine_tessdata.exe num.

echo. & pause

Save the batch file and run it. When it finishes, the target folder contains the files generated by training, including num.traineddata.
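The steps of the batch file above can be sketched in Python for any language and font name. This is a minimal sketch, not a replacement for the batch file: the function names are hypothetical, the tools are assumed to be on the PATH, and the final combine_tessdata step is the usual Tesseract 3.x command for packing the renamed files into a traineddata file.

```python
RENAMED_OUTPUTS = ("normproto", "inttemp", "pffmtable", "shapetable")


def training_commands(lang, fontname, num=0):
    """Return the training commands (as argument lists) in execution order."""
    base = "{}.{}.exp{}".format(lang, fontname, num)
    return [
        # Produce base.tr from the image and its corrected box file.
        ["tesseract", base + ".tif", base, "nobatch", "box.train"],
        # Compute the character set (writes a file named unicharset).
        ["unicharset_extractor", base + ".box"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", lang + ".unicharset", base + ".tr"],
        # Clustering.
        ["cntraining", base + ".tr"],
    ]


def rename_plan(lang):
    """Return (old_name, new_name) pairs that add the language prefix."""
    return [(name, lang + "." + name) for name in RENAMED_OUTPUTS]


def combine_command(lang):
    """Return the command that packs lang.* files into lang.traineddata."""
    return ["combine_tessdata", lang + "."]


for command in training_commands("num", "font") + [combine_command("num")]:
    print(" ".join(command))
```

Each list can be passed to subprocess.run(command, check=True), with os.rename applied to the pairs from rename_plan between the cntraining and combine_tessdata steps.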

Put the generated traineddata into tessdata

Finally, copy num.traineddata to the tessdata folder in Tesseract-OCR.
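The copy step can be scripted as well. The following is a minimal Python sketch; install_traineddata is a hypothetical helper, and the Tesseract-OCR installation directory has to be supplied by the caller because it depends on where the installer put it.

```python
import os
import shutil


def install_traineddata(traineddata_path, tesseract_dir):
    """Copy a .traineddata file into <tesseract_dir>/tessdata.

    Returns the path of the copied file.
    """
    tessdata_dir = os.path.join(tesseract_dir, "tessdata")
    if not os.path.isdir(tessdata_dir):
        raise FileNotFoundError("tessdata folder not found: " + tessdata_dir)
    return shutil.copy(traineddata_path, tessdata_dir)
```

For example, install_traineddata("num.traineddata", tesseract_dir) with tesseract_dir set to the Tesseract-OCR folder created by the installer.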

3, the final test

As in the earlier steps, enter on the command line

tesseract test.png output_2 -l num

We can see that the newly generated file output_2.txt contains 762408, which is completely correct. Careful readers will notice that in this last command we used [-l num] instead of [-l eng]. This means the last conversion used the newly generated num language data instead of the default eng language data.
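The improvement between the two runs can be put into a number. The following is a minimal Python sketch that compares the recognized text with the expected text position by position; this simple measure is an assumption made for illustration (it does not handle inserted or dropped characters), and the function name is hypothetical.

```python
def character_accuracy(expected, recognized):
    """Fraction of character positions where recognized matches expected."""
    if not expected:
        raise ValueError("expected text must not be empty")
    matches = sum(1 for e, r in zip(expected, recognized) if e == r)
    # Divide by the longer length so extra or missing characters count as errors.
    return matches / max(len(expected), len(recognized))


# Results reported in this article: eng gave 152408, num gave 762408.
print(round(character_accuracy("762408", "152408"), 2))  # prints: 0.67
print(character_accuracy("762408", "762408"))            # prints: 1.0
```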

We can see that after simple training, the accuracy of converting digits has improved. Besides being able to keep learning, Tesseract is an open-source program written in C++, so it can be called and modified from C# or C++, which is an important advantage.

About Tesseract, about OCR, about computers in general, there is still a great deal worth learning, and I hope to keep recording it here.

If you find any mistakes or have suggestions, please feel free to point them out.

Sophomore Summer Internship

2016/8/12

Author: Xiao Lian

Source: http://www.cnblogs.com/cnlian/

Copyright of this article belongs to the author and cnblogs (Blog Park). Reprinting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original must be given in a prominent place on the article page. If you have questions, you can contact the author by email (671266128@qq.com).
