Tesseract is an open-source OCR (Optical Character Recognition) engine that recognizes text in image files of multiple formats and converts it to plain text. It currently supports more than 60 languages (including Chinese). Tesseract was initially developed by HP and subsequently maintained by Google. It is currently hosted on Google Code at http://code.google.com/p/tesseract-ocr/.
Use the default language library for recognition
1. Install Tesseract
For test purposes, the Windows installer tesseract-ocr-setup-3.02.02.exe was downloaded and run directly. After a successful installation, a Tesseract-OCR directory is created on the chosen disk. The tesseract.exe program in that directory recognizes the characters in an image.
2. Prepare an image to be recognized. In this example, a series of digits was written with a drawing tool and saved as number.jpg.
3. Open the command line, locate the Tesseract-OCR directory, and enter the command:
tesseract.exe number.jpg result -l eng
Here, result is the base name of the output file (Tesseract writes result.txt), and eng indicates that the English language file is used for recognition.
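The same command can also be driven from a script. The following is a minimal Python sketch and not part of the original walkthrough: the helper names build_ocr_command and recognize are hypothetical, and it assumes tesseract.exe can be found from the working directory or on PATH.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image, output_base, lang="eng", exe="tesseract.exe"):
    """Return the argument list for: tesseract.exe <image> <output_base> -l <lang>."""
    return [exe, str(image), str(output_base), "-l", lang]


def recognize(image, output_base="result", lang="eng", exe="tesseract.exe"):
    """Run Tesseract and return the text it wrote to <output_base>.txt."""
    subprocess.run(build_ocr_command(image, output_base, lang, exe), check=True)
    # Tesseract appends ".txt" to the output base name on its own.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8").strip()
```

Calling recognize("number.jpg") corresponds to the command line shown above.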
4. Open the result.txt file in the Tesseract-OCR directory. The recognition result is 7542315857, which contains 3 character recognition errors, so the recognition rate is not very high. Is there any way to improve the recognition rate? Tesseract provides a way to train on samples and generate a custom recognition language library. The following describes how to train with samples.
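To put a number on the recognition rate, the recognized text can be compared with the text that was actually drawn. If the image holds ten digits, as the ten-character result suggests, three misread characters correspond to a rate of 7/10 = 70%. The sketch below is one possible way to compute this in Python; the function name recognition_rate is hypothetical, and the expected text has to be supplied by hand because Tesseract does not know it.

```python
from difflib import SequenceMatcher


def recognition_rate(expected, recognized):
    """Fraction of the expected characters that also appear, in order, in the output."""
    if not expected:
        raise ValueError("expected text must not be empty")
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    # Sum the lengths of all in-order matching runs between the two strings.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(expected)
```

Using in-order matching rather than a position-by-position comparison keeps the result meaningful when Tesseract drops or inserts a character.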
The official Tesseract-OCR site describes the training process in detail: http://code.google.com/p/tesseract-ocr/wiki/trainingtesseract3. Here is a simple example to illustrate how to train with samples.
1. Download the jTessBoxEditor tool.
2. Obtain the sample images. Five sample images containing the digits 0-9 were drawn with a drawing tool (the more samples, the better).
3. Merge the sample images. Run the jTessBoxEditor tool. On the menu bar, choose Tools > Merge TIFF. In the dialog box that opens, select the sample images (hold Shift to select several) and merge them into the file num.font.exp0.tif.
4. Generate the box file. Open the command line and execute the command:
tesseract.exe num.font.exp0.tif num.font.exp0 batch.nochop makebox
The generated box file is num.font.exp0.box. It contains the characters recognized by Tesseract and their coordinates.
Note: the command format for generating a box file is:
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
Here lang is the language name, fontname is the font name, and num is the serial number.
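The naming convention is easy to get wrong by hand, so it can help to build the names in code. The following Python sketch is an illustration only; the helper names are hypothetical, and the pattern assumes that neither the language name nor the font name contains a dot.

```python
import re

# [lang].[fontname].exp[num], for example num.font.exp0
NAME_PATTERN = re.compile(r"^(?P<lang>[^.]+)\.(?P<fontname>[^.]+)\.exp(?P<num>\d+)$")


def training_base_name(lang, fontname, num):
    """Build the base name shared by the .tif, .box and .tr files."""
    return f"{lang}.{fontname}.exp{num}"


def parse_training_base_name(base):
    """Split a base name back into (lang, fontname, num)."""
    match = NAME_PATTERN.match(base)
    if match is None:
        raise ValueError(f"not a [lang].[fontname].exp[num] name: {base!r}")
    return match["lang"], match["fontname"], int(match["num"])


def makebox_command(lang, fontname, num, exe="tesseract.exe"):
    """Argument list for the box file command shown above."""
    base = training_base_name(lang, fontname, num)
    return [exe, f"{base}.tif", base, "batch.nochop", "makebox"]
```

For this example, makebox_command("num", "font", 0) yields the same arguments as the command used in step 4.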
5. Correct the text. Run the jTessBoxEditor tool and open the num.font.exp0.tif file (the .box file generated in the previous step and the .tif sample file must be in the same directory). Some characters will have been recognized incorrectly. Use the tool to manually correct the wrongly recognized characters on each page, and save when you are done.
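jTessBoxEditor edits the box file for you, but it is useful to know what it contains. In Tesseract 3.x box files, each line holds one symbol followed by its left, bottom, right and top pixel coordinates and the page number, with the origin at the bottom-left corner of the image. The Python sketch below parses and relabels such lines; the Box type and helper names are hypothetical, and the coordinates in the usage note are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box file line: <symbol> <left> <bottom> <right> <top> <page>."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    symbol, *numbers = parts
    return Box(symbol, *(int(n) for n in numbers))


def format_box_line(box):
    """Turn a Box back into the text form used in the box file."""
    return " ".join(str(field) for field in box)


def relabel(box, symbol):
    """Return a copy of the box with a corrected symbol and the same coordinates."""
    return box._replace(symbol=symbol)
```

For example, a line such as "7 10 20 30 60 0" that should have been read as 1 becomes "1 10 20 30 60 0" after relabel and format_box_line.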
6. Define the font properties file. Tesseract-OCR 3.01 and later require a font properties file named font_properties to be created before training.
The font_properties file must not contain a BOM header. Each line has the following format:
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
Here fontname is the font name and must match the name used in [lang].[fontname].exp[num].box. The values of <italic>, <bold>, <fixed>, <serif>, and <fraktur> are each 1 or 0, indicating whether the font has that attribute.
Create a file named font_properties in the directory where the sample images are located, open it in Notepad, and enter the following content:
font 0 0 0 0 0
All five values are 0, indicating that the font is a plain one: not italic, bold, fixed-pitch, serif, or Fraktur.
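Because a BOM is invisible in most editors, writing the file from a script is one way to be sure none is present. This is a minimal Python sketch, not the method used in the original walkthrough; the function names are hypothetical. It relies on the fact that Python's "utf-8" codec, unlike "utf-8-sig", writes no BOM.

```python
import codecs
from pathlib import Path


def write_font_properties(directory, fontname, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Write a one-line font_properties file without a BOM and return its path."""
    line = f"{fontname} {italic} {bold} {fixed} {serif} {fraktur}\n"
    path = Path(directory) / "font_properties"
    # encoding="utf-8" emits no BOM; newline="\n" disables platform newline translation.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(line)
    return path


def has_bom(path):
    """Report whether an existing file starts with the UTF-8 BOM bytes."""
    return Path(path).read_bytes().startswith(codecs.BOM_UTF8)
```

has_bom can also be used to check a file that was created by hand in an editor.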
7. Generate the language file. Create a batch file in the directory where the sample images are located, with the following content:
rem Create the font_properties file in this directory before running this batch file
echo Run Tesseract for Training..
tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train
echo Compute the Character Set..
unicharset_extractor.exe num.font.exp0.box
mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr
echo Clustering..
cntraining.exe num.font.exp0.tr
echo Rename Files..
rename normproto num.normproto
rename inttemp num.inttemp
rename pffmtable num.pffmtable
rename shapetable num.shapetable
echo Create Tessdata..
combine_tessdata.exe num.
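For a different language or font name, every file name in the batch file has to be edited consistently. The Python sketch below generates the same sequence of steps from the three naming parts. It is an illustration under assumptions: the function names are hypothetical, the tools are assumed to be on PATH, and run_training has not been verified against a real Tesseract 3.02 installation.

```python
import subprocess
from pathlib import Path


def training_commands(lang, fontname, num):
    """The training tool invocations from the batch file, as argument lists."""
    base = f"{lang}.{fontname}.exp{num}"
    return [
        ["tesseract.exe", f"{base}.tif", base, "nobatch", "box.train"],
        ["unicharset_extractor.exe", f"{base}.box"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining.exe", f"{base}.tr"],
    ]


def rename_pairs(lang):
    """Output files that must be prefixed with the language name."""
    names = ("normproto", "inttemp", "pffmtable", "shapetable")
    return [(name, f"{lang}.{name}") for name in names]


def combine_command(lang):
    # The trailing dot is intentional: combine_tessdata takes the prefix "num."
    return ["combine_tessdata.exe", f"{lang}."]


def run_training(lang, fontname, num, workdir="."):
    """Run all steps in order, stopping at the first tool that fails."""
    for command in training_commands(lang, fontname, num):
        subprocess.run(command, cwd=workdir, check=True)
    for old, new in rename_pairs(lang):
        (Path(workdir) / old).rename(Path(workdir) / new)
    subprocess.run(combine_command(lang), cwd=workdir, check=True)
```

With check=True the script stops as soon as one tool reports an error, whereas the batch file carries on to the next line.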
Run the batch file from the command line.
In the printed output, make sure that Offset 1, 3, 4, 5, and 13 are not -1. If so, the new language file has been generated successfully.
The final language file is num.traineddata. Copy it to the tessdata directory under Tesseract-OCR, after which it can be used for character recognition.
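The copy step can be scripted as well. The sketch below is a small Python example under assumptions: install_language_file is a hypothetical helper, and the location of the Tesseract-OCR directory depends on where the installer put it, so it is passed in as a parameter instead of being hard-coded.

```python
import shutil
from pathlib import Path


def install_language_file(traineddata, tesseract_dir):
    """Copy <lang>.traineddata into <tesseract_dir>/tessdata and return the new path."""
    source = Path(traineddata)
    if source.suffix != ".traineddata":
        raise ValueError(f"not a .traineddata file: {source}")
    tessdata = Path(tesseract_dir) / "tessdata"
    if not tessdata.is_dir():
        raise FileNotFoundError(f"tessdata directory not found: {tessdata}")
    # shutil.copy2 returns the path of the file it created inside the directory.
    return Path(shutil.copy2(source, tessdata))
```

The part of the file name before .traineddata (here num) is the value to pass to the -l option.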
Use the trained language library for recognition
To recognize the number.jpg file with the trained language library, open the command line, go to the Tesseract-OCR directory, and enter the command:
tesseract.exe number.jpg result -l num
The recognition rate is much improved with the trained library. Custom training samples can also be used to recognize graphic verification codes (CAPTCHAs) and license plate numbers, which interested readers may want to explore.