As we all know, this is an excellent character recognition software. This open-source project can be downloaded at http://code.google.com/p/tesseract-ocr/downloads/list.
We recommend that you use 3 instead of 2 for use. For some reasons, 2 can be used directly in the project, but for some obvious bugs and other reasons, many causes the program to fail or even crash. Therefore, we recommend that you use command line version 3.
In addition to downloading the tesseract installer, you can also download some language libraries on the download page. Of course, you can also select some language libraries for installation during the installation process.
I. Training
In many cases, the default font can be fully recognized with high accuracy, but sometimes we need to train our own library for use. The training procedure is as follows:
Note:
A. Take the doscommand as an example. Save the following commands in each step as. BAT, or run CMD to enter the directory where tesseract is located.
B. ensure the consistency of file names (where all files are DDT)
0. Copy all files in the training Directory to the directory where tesseract3 is located.
Copy. \ training \ *. exe .\
1. Mark the border
Tesseract ddt. tif ddt-l eng digits batch. nochop makebox
Explain
Ddt. tif is the file to be recognized. jpg, gif, tiff, and other formats are supported. We recommend that you use tif.
Ddt is the name of the file to be saved (the extension. box is automatically added)
-L eng library. This parameter allows us to select which font library to mark the border
The configuration file is followed, that is, other parameters of tesseract are loaded as files instead of directly entering parameters.
Digits specifies that only 0-9 digits are recognized (you can edit it to include more characters). When you do not need to specify this parameter, you must remove it, however, the use of this character set limitation can minimize the amount of data you edit by error identification. the probability that the box is dizzy.
Note:
This step is critical, but problems often occur, even if you are. You can write commands to the software for graphical purposes. Therefore, you must edit the generated ddt. box file to ensure that all characters are recognized and the border is correct.
Here is also a small trick, for example, I have done such a tif: 1.2-34567089. In this step, I only recognized the 2-9 part, so I changed tif to 001.2-34567089, all are identified. It may give you some inspiration.
2. forming a language library
Tesseract ddt. tif ddt-l eng digits nobatch box. train
Unicharset_extractor ddt. box
Rename unicharset ddt. unicharset
Mftraining-U unicharset-O ddt. unicharset ddt. tr
Rename inttemp ddt. inttemp
Rename pffmtable ddt. pffmtable
Rename Microfeat ddt. Microfeat
Cntraining ddt. tr
Rename normproto ddt. normproto
Combine_tessdata ddt.
There are several steps in this section, but the "tutorials" of others have become too cool and will not be cool.
Note: The rename is necessary because only the file extension is generated. If you pay attention to this, it will be okay.
3. Test Language Library
Copy ddt. traineddata. \ tessdata \ ddt. traineddata
Tesseract ddt. tif ddt-l ddt
Notepad ddt.txt
If the test fails, check:
A. Is tif width too small? If yes, I suggest you add A line below, that is, change line 1 to Line 2. What should you add and add some characters in your font library at will, but it is better to be as wide as the image.
B. If no correct identification is available, check your ddt. box carefully.
If you fail to clear the previously generated files, run the following command:
Copy ddt. tif tmp. tif
Del ddt. */f/s
Copy tmp. tif ddt. tif
Del tmp. tif
And then start from step 1.
Ii. Use
When using the image, you only need to note that for images with a single line and a small number of characters, if not recognized, it is best to add a useless line below and ensure that the line basically reaches the image width.
Note:
In use, the font library may not be found (especially when you uninstall and reinstall tesseract). In this case, modify
The value of HKEY_CURRENT_USER \ Environment \ TESSDATA_PREFIX is the directory where your tesseract is located.
Iii. Example
Finally, a sample code for tesseract3 in VB. NET is provided.
Code
Public Class TessOCR
Dim path As String = My. Application. Info. DirectoryPath & "\ tesseract3 \"
Sub New ()
My. Computer. Registry. CurrentUser. OpenSubKey ("Environment", True). SetValue ("TESSDATA_PREFIX", path)
End Sub
Public Function Tess3OCR (ByVal Rect As Rectangle, ByVal clr As Integer) As String
'Create an image. Note that SourceCopy is used to format images that meet OCR requirements during screen copying. Otherwise, an error occurs or the image is directly disabled.
Dim bmp As Bitmap = New Bitmap (Rect. Width, Rect. Height * 2)
Dim gr As Graphics = Graphics. FromImage (bmp)
Gr. Clear (Color. White)
Gr. CopyFromScreen (Rect. Location, Point. Empty, Rect. Size, CopyPixelOperation. SourceCopy)
'Corrected to black and white characters
For y As Integer = 0 To bmp. Height-1
For x As Integer = 0 To bmp. Width-1
If bmp. GetPixel (x, y). ToArgb = clr Then bmp. SetPixel (x, y, Color. Black) Else bmp. SetPixel (x, y, Color. White)
Next
Next
Dim str As String = IIf (clr = AngleColor, "45.000000", "0.000000 ")
Gr. DrawString (str, New Font ("Arial Black", 14), Brushes. Black, 0, Rect. Height)
Bmp. Save (path & "tmp. tif", System. Drawing. Imaging. ImageFormat. Tiff)
Shell (path & "tesseract" & path & "tmp. tif" & path & "tmp-l ddt digits", AppWinStyle. Hide, True)
My. Computer. FileSystem. DeleteFile (path & "tmp. tif ")
Dim ret As String = My. Computer. FileSystem. ReadAllText (path & "tmp.txt"). Split (vbCrLf) (0)
My. Computer. FileSystem. DeleteFile (path & "tmp.txt ")
Return ret
End Function
End Class
In the new function of the code, I modified the registry to prevent errors. A better way is to record the original value and restore it when the class is destroyed. Then I pointed out some problems that may exist during screen replication. Of course, if you are using the verification code, you don't have to worry about it. Then, we make a simple correction for the image. It should be noted that the image must be corrected to a white or black character. Otherwise, the image will not be recognized. Then, I added a line of useless text below and processed it properly when the returned value was returned. Note that the last parameter of the shell function indicates that the process is waiting for completion. If you want to use it in vb6, here we need to use APIs to implement wait-instead of using sleep and other timed wait functions, which will make your program not robust enough.