Training and use of tesseract-ocr3

Source: Internet
Author: User

As we all know, this is an excellent character recognition software. This open-source project can be downloaded at http://code.google.com/p/tesseract-ocr/downloads/list.

We recommend that you use 3 instead of 2 for use. For some reasons, 2 can be used directly in the project, but for some obvious bugs and other reasons, many causes the program to fail or even crash. Therefore, we recommend that you use command line version 3.

In addition to downloading the tesseract installer, you can also download some language libraries on the download page. Of course, you can also select some language libraries for installation during the installation process.

 

I. Training

In many cases, the default font can be fully recognized with high accuracy, but sometimes we need to train our own library for use. The training procedure is as follows:

Note:

A. Take the doscommand as an example. Save the following commands in each step as. BAT, or run CMD to enter the directory where tesseract is located.

B. ensure the consistency of file names (where all files are DDT)

0. Copy all files in the training Directory to the directory where tesseract3 is located.

Copy. \ training \ *. exe .\

1. Mark the border

Tesseract ddt. tif ddt-l eng digits batch. nochop makebox

Explain

Ddt. tif is the file to be recognized. jpg, gif, tiff, and other formats are supported. We recommend that you use tif.

Ddt is the name of the file to be saved (the extension. box is automatically added)

-L eng library. This parameter allows us to select which font library to mark the border

The configuration file is followed, that is, other parameters of tesseract are loaded as files instead of directly entering parameters.

Digits specifies that only 0-9 digits are recognized (you can edit it to include more characters). When you do not need to specify this parameter, you must remove it, however, the use of this character set limitation can minimize the amount of data you edit by error identification. the probability that the box is dizzy.

Note:

This step is critical, but problems often occur, even if you are. You can write commands to the software for graphical purposes. Therefore, you must edit the generated ddt. box file to ensure that all characters are recognized and the border is correct.

Here is also a small trick, for example, I have done such a tif: 1.2-34567089. In this step, I only recognized the 2-9 part, so I changed tif to 001.2-34567089, all are identified. It may give you some inspiration.

2. forming a language library

Tesseract ddt. tif ddt-l eng digits nobatch box. train
Unicharset_extractor ddt. box
Rename unicharset ddt. unicharset
Mftraining-U unicharset-O ddt. unicharset ddt. tr
Rename inttemp ddt. inttemp
Rename pffmtable ddt. pffmtable
Rename Microfeat ddt. Microfeat
Cntraining ddt. tr
Rename normproto ddt. normproto
Combine_tessdata ddt.

There are several steps in this section, but the "tutorials" of others have become too cool and will not be cool.

Note: The rename is necessary because only the file extension is generated. If you pay attention to this, it will be okay.

3. Test Language Library

Copy ddt. traineddata. \ tessdata \ ddt. traineddata

Tesseract ddt. tif ddt-l ddt
Notepad ddt.txt

If the test fails, check:

A. Is tif width too small? If yes, I suggest you add A line below, that is, change line 1 to Line 2. What should you add and add some characters in your font library at will, but it is better to be as wide as the image.

B. If no correct identification is available, check your ddt. box carefully.

 

 

If you fail to clear the previously generated files, run the following command:

Copy ddt. tif tmp. tif
Del ddt. */f/s
Copy tmp. tif ddt. tif
Del tmp. tif

And then start from step 1.

 

Ii. Use

When using the image, you only need to note that for images with a single line and a small number of characters, if not recognized, it is best to add a useless line below and ensure that the line basically reaches the image width.

Note:

In use, the font library may not be found (especially when you uninstall and reinstall tesseract). In this case, modify

The value of HKEY_CURRENT_USER \ Environment \ TESSDATA_PREFIX is the directory where your tesseract is located.

 

 

Iii. Example

Finally, a sample code for tesseract3 in VB. NET is provided.

 

Code

Public Class TessOCR

Dim path As String = My. Application. Info. DirectoryPath & "\ tesseract3 \"

Sub New ()
My. Computer. Registry. CurrentUser. OpenSubKey ("Environment", True). SetValue ("TESSDATA_PREFIX", path)
End Sub

Public Function Tess3OCR (ByVal Rect As Rectangle, ByVal clr As Integer) As String
'Create an image. Note that SourceCopy is used to format images that meet OCR requirements during screen copying. Otherwise, an error occurs or the image is directly disabled.
Dim bmp As Bitmap = New Bitmap (Rect. Width, Rect. Height * 2)
Dim gr As Graphics = Graphics. FromImage (bmp)
Gr. Clear (Color. White)
Gr. CopyFromScreen (Rect. Location, Point. Empty, Rect. Size, CopyPixelOperation. SourceCopy)
'Corrected to black and white characters
For y As Integer = 0 To bmp. Height-1
For x As Integer = 0 To bmp. Width-1
If bmp. GetPixel (x, y). ToArgb = clr Then bmp. SetPixel (x, y, Color. Black) Else bmp. SetPixel (x, y, Color. White)
Next
Next
Dim str As String = IIf (clr = AngleColor, "45.000000", "0.000000 ")
Gr. DrawString (str, New Font ("Arial Black", 14), Brushes. Black, 0, Rect. Height)

Bmp. Save (path & "tmp. tif", System. Drawing. Imaging. ImageFormat. Tiff)
Shell (path & "tesseract" & path & "tmp. tif" & path & "tmp-l ddt digits", AppWinStyle. Hide, True)
My. Computer. FileSystem. DeleteFile (path & "tmp. tif ")
Dim ret As String = My. Computer. FileSystem. ReadAllText (path & "tmp.txt"). Split (vbCrLf) (0)
My. Computer. FileSystem. DeleteFile (path & "tmp.txt ")
Return ret
End Function
End Class

 

 

In the new function of the code, I modified the registry to prevent errors. A better way is to record the original value and restore it when the class is destroyed. Then I pointed out some problems that may exist during screen replication. Of course, if you are using the verification code, you don't have to worry about it. Then, we make a simple correction for the image. It should be noted that the image must be corrected to a white or black character. Otherwise, the image will not be recognized. Then, I added a line of useless text below and processed it properly when the returned value was returned. Note that the last parameter of the shell function indicates that the process is waiting for completion. If you want to use it in vb6, here we need to use APIs to implement wait-instead of using sleep and other timed wait functions, which will make your program not robust enough.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.