As we all know, Tesseract is an excellent character recognition (OCR) engine. This open source project can be downloaded from http://code.google.com/p/tesseract-ocr/downloads/list.
Version 3 is recommended over version 2. Version 2 can be used directly inside a project, but because of some obvious bugs and other problems it often prevents the program from running or even crashes it. So we recommend using the command-line version of Tesseract 3.
In addition to downloading the Tesseract installer, you can also download some language libraries from the download page, or select some language libraries during the installation process.
First, training
In many cases the default language data is already highly accurate, but sometimes we need to train a library of our own. The training steps are as follows:
Note:
A. DOS commands are used as the example. Save the commands of each step to a .bat file and run it, or open cmd and change to the directory where Tesseract is installed.
B. Keep the file names consistent (ddt is used here).
0. Copy all files in the training directory to the tesseract3 directory
copy .\training\*.exe .\
1. Mark borders
tesseract ddt.tif ddt -l eng digits batch.nochop makebox
Explanation:
ddt.tif is the file to be recognized. JPG, GIF, TIFF and other formats are supported; TIF is recommended.
ddt is the name of the output file (the .box extension is added automatically).
-l eng selects the library to use; this parameter lets us choose which language data is used to mark the borders.
The remaining arguments are configuration files, that is, other Tesseract parameters loaded from files rather than typed directly on the command line.
digits restricts recognition to the digits 0-9 (of course you can edit it to include more characters). When you don't need the restriction, be sure to remove this argument. Limiting the character set does, however, greatly reduce the number of misidentified characters, which otherwise make editing ddt.box a dizzying job.
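As a rough sketch of what editing the digits config amounts to: to my knowledge the stock digits file sets the Tesseract variable tessedit_char_whitelist, so a config of your own can be generated like this. The function name and the idea of saving it under tessdata\configs with a name such as mydigits are my own assumptions for illustration.

```python
def whitelist_config(chars):
    """Return the text of a config file that limits recognition to `chars`."""
    # A config line is "name value", so the value must not contain whitespace.
    if not chars or any(c.isspace() for c in chars):
        raise ValueError("whitelist must be non-empty and contain no whitespace")
    return f"tessedit_char_whitelist {chars}\n"
```

Saved as, say, tessdata\configs\mydigits (a hypothetical name), the file would be passed on the command line in place of digits.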
Note:
This step is critical but often goes wrong, even if you download bbtesseract from http://code.google.com/p/bbtesseract/downloads/list. I feel I should write a border recognition tool myself, but I have had no time to do it. It is entirely possible to wrap these commands in software with a graphical interface. So edit the generated ddt.box file carefully to make sure the characters are recognized correctly and the borders are right.
There is also a small trick. I once made a TIF containing 1.2-34567089, and this step only recognized the part 2-9. So I changed the TIF to 001.2-34567089, and then everything was recognized. Maybe this will give you some inspiration.
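Checking ddt.box by eye is tedious, so here is a minimal sketch of an automatic sanity check. It assumes each box line has the form "character left bottom right top", optionally followed by a page number; the function name and the problem messages are mine, not part of Tesseract.

```python
def check_box_lines(lines, allowed="0123456789"):
    """Return (line_number, problem) pairs for suspicious box file entries."""
    allowed_set = set(allowed)
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) not in (5, 6):
            problems.append((number, "expected 5 or 6 fields"))
            continue
        char = fields[0]
        try:
            left, bottom, right, top = (int(v) for v in fields[1:5])
        except ValueError:
            problems.append((number, "non-numeric coordinate"))
            continue
        if char not in allowed_set:
            problems.append((number, f"unexpected character {char!r}"))
        elif left >= right or bottom >= top:
            problems.append((number, "empty or inverted box"))
    return problems
```

It could be run as check_box_lines(open("ddt.box", encoding="utf-8")), and any reported line is one to look at in the editor. It does not tell you whether a border is in the right place, only whether the entry is plausible.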
2. Build the language library
tesseract ddt.tif ddt -l eng digits nobatch box.train
unicharset_extractor ddt.box
rename unicharset ddt.unicharset
mftraining -U unicharset -O ddt.unicharset ddt.tr
rename inttemp ddt.inttemp
rename pffmtable ddt.pffmtable
rename Microfeat ddt.Microfeat
cntraining ddt.tr
rename normproto ddt.normproto
combine_tessdata ddt.
This step consists of several commands, but other tutorials already explain them at length, so I won't repeat that here.
Note: the rename commands are necessary because each generated file is named with only what should be its extension (for example inttemp rather than ddt.inttemp). As long as you pay attention to this, there is no problem.
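If you train more than one library, typing these commands with a different name each time gets error-prone. The sketch below only generates the command list of step 2 for a given name, in the same order as above; it does not run anything. The function names are my own.

```python
def training_steps(name="ddt", base_lang="eng", config="digits"):
    """Return the step 2 commands as argument lists, in the order given above."""
    first = ["tesseract", f"{name}.tif", name, "-l", base_lang]
    if config:  # drop the character set limit when it is not wanted
        first.append(config)
    first += ["nobatch", "box.train"]
    return [
        first,
        ["unicharset_extractor", f"{name}.box"],
        ["rename", "unicharset", f"{name}.unicharset"],
        ["mftraining", "-U", "unicharset", "-O", f"{name}.unicharset", f"{name}.tr"],
        ["rename", "inttemp", f"{name}.inttemp"],
        ["rename", "pffmtable", f"{name}.pffmtable"],
        ["rename", "Microfeat", f"{name}.Microfeat"],
        ["cntraining", f"{name}.tr"],
        ["rename", "normproto", f"{name}.normproto"],
        ["combine_tessdata", f"{name}."],
    ]


def to_batch(steps):
    """Join the steps into the text of a .bat file, one command per line."""
    return "\r\n".join(" ".join(step) for step in steps) + "\r\n"
```

Writing to_batch(training_steps("ddt")) to a file such as train.bat (a hypothetical name) gives a batch file equivalent to the commands listed in this step.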
3. Test the language library
copy ddt.traineddata .\tessdata\ddt.traineddata
tesseract ddt.tif ddt -l ddt
notepad ddt.txt
If the test fails, you should check:
A. Whether the TIF is too narrow. If so, I suggest adding a line below, turning 1 row into 2. The added line can contain any characters from your font, but it should preferably be as wide as the image.
B. If characters are still not identified correctly, go back and check your ddt.box.
If the test fails, remember to clean up the files generated earlier. You can use these commands:
copy ddt.tif tmp.tif
del ddt.* /f /s
copy tmp.tif ddt.tif
del tmp.tif
Then start over from step 1.
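The same cleanup can be sketched without the temporary copy: delete every ddt.* file except the TIF. Like del with /s, this sketch also descends into subdirectories, so a ddt.traineddata already copied into tessdata is removed too. The function name and its parameters are my own.

```python
from pathlib import Path


def clean_training_files(workdir, name="ddt", keep_suffix=".tif"):
    """Delete name.* under workdir (recursively) except the source image."""
    removed = []
    for path in Path(workdir).rglob(f"{name}.*"):
        if path.is_file() and path.suffix.lower() != keep_suffix:
            path.unlink()
            removed.append(path.relative_to(workdir).as_posix())
    return sorted(removed)  # relative paths of the deleted files
```

The returned list makes it easy to see what was deleted before starting over.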
Second, use
In use, note that an image with a single line and only a few characters may not be recognized. In that case it is best to add a line of filler characters below it and make sure that line spans roughly the full image width.
Note:
Tesseract may fail to find the language data when you use it (especially after uninstalling and reinstalling Tesseract). In that case, set the value of HKEY_CURRENT_USER\Environment\TESSDATA_PREFIX to the directory where your Tesseract is installed.
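An alternative that avoids touching the registry is to set TESSDATA_PREFIX only for the child process that runs Tesseract. This is a minimal sketch, assuming the tesseract executable and the tessdata folder both sit in tess_dir; the function names are mine.

```python
import os
import subprocess


def tesseract_call(tess_dir, image, out_base, lang="ddt", config="digits"):
    """Build the argument list and environment for one Tesseract run."""
    env = dict(os.environ)  # copy, so the caller's own environment is untouched
    # Keep a trailing separator, like the "\tesseract3\" style of path.
    env["TESSDATA_PREFIX"] = os.path.join(tess_dir, "")
    argv = [os.path.join(tess_dir, "tesseract"), image, out_base, "-l", lang]
    if config:
        argv.append(config)
    return argv, env


def run_ocr(tess_dir, image, out_base, **kwargs):
    """Run Tesseract and wait for it; returns the process exit code."""
    argv, env = tesseract_call(tess_dir, image, out_base, **kwargs)
    return subprocess.call(argv, env=env)
```

Because the variable is set only in the copied environment, there is nothing to record and restore afterwards.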
Third, example
Finally, here is sample code that uses Tesseract 3 from VB.NET.
Public Class TessOCR
    Dim path As String = My.Application.Info.DirectoryPath & "\tesseract3\"
    Sub New()
        My.Computer.Registry.CurrentUser.OpenSubKey("Environment", True).SetValue("TESSDATA_PREFIX", path)
    End Sub
    Public Function Tess3Ocr(ByVal rect As Rectangle, ByVal clr As Integer) As String
        ' Create the image. Note: copy the screen with SourceCopy so the image is in a format OCR accepts; otherwise Tesseract reports an error or exits immediately
        Dim bmp As Bitmap = New Bitmap(rect.Width, rect.Height * 2)
        Dim gr As Graphics = Graphics.FromImage(bmp)
        gr.Clear(Color.White)
        gr.CopyFromScreen(rect.Location, Point.Empty, rect.Size, CopyPixelOperation.SourceCopy)
        ' Convert to black text on a white background
        For y As Integer = 0 To bmp.Height - 1
            For x As Integer = 0 To bmp.Width - 1
                If bmp.GetPixel(x, y).ToArgb = clr Then bmp.SetPixel(x, y, Color.Black) Else bmp.SetPixel(x, y, Color.White)
            Next
        Next
        ' Draw a line of filler text in the lower half of the image
        Dim str As String = IIf(clr = AngleColor, "45.000000", "0.000000")
        ' The font size argument is garbled in the source; supply a size here
        gr.DrawString(str, New Font("Arial Black", +), Brushes.Black, 0, rect.Height)
        bmp.Save(path & "tmp.tif", System.Drawing.Imaging.ImageFormat.Tiff)
        ' The last argument (True) makes Shell wait until Tesseract exits
        Shell(path & "tesseract " & path & "tmp.tif " & path & "tmp -l ddt digits", AppWinStyle.Hide, True)
        My.Computer.FileSystem.DeleteFile(path & "tmp.tif")
        ' Keep only the first line; the second line is the filler text
        Dim ret As String = My.Computer.FileSystem.ReadAllText(path & "tmp.txt").Split(vbCrLf)(0)
        My.Computer.FileSystem.DeleteFile(path & "tmp.txt")
        Return ret
    End Function
End Class
In the constructor (Sub New) I modify the registry to prevent errors; a better practice would be to record the original value and restore it when the class is destroyed. The first comment in the function points out a problem that can arise with screen copying; of course, if you are working from a downloaded verification code image instead, you don't have to worry about it. Next comes a simple correction of the image. Note that it must be converted to black text on a white background first, otherwise it will not be recognized. I then add a line of filler text below the original and deal with it when returning the value by keeping only the first line. One more thing to note: the last parameter of the Shell function tells it to wait until the called process ends. If you want to do this in VB6, implement the wait with the API rather than waiting with Sleep and the like, which would make your program less robust.
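The image handling described here is independent of VB.NET. As a minimal sketch of the three ideas (binarize to black on white, add room below for a filler line, keep only the first line of the result), here they are on a plain grid of pixel values. Drawing the filler text itself needs an imaging library and is left out; all names are my own.

```python
def to_black_on_white(pixels, text_color):
    """Map pixels equal to text_color to black (0) and all others to white (255)."""
    return [[0 if value == text_color else 255 for value in row] for row in pixels]


def pad_below(grid, extra_rows, fill=255):
    """Append blank rows so a filler line of text can be drawn under the original."""
    width = len(grid[0]) if grid else 0
    return grid + [[fill] * width for _ in range(extra_rows)]


def first_line(text):
    """Return the first line of the OCR output, dropping the filler line."""
    lines = text.splitlines()
    return lines[0] if lines else ""
```

Padding by as many rows as the image is high corresponds to the doubled bitmap height in the VB.NET code.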
Transferred from: http://blog.csdn.net/foxwit/article/details/6547465
How to use the OCR engine Tesseract
I have been working with OCR recently and studied Google's OCR engine Tesseract, which is a good recognition tool. Tesseract 3.0 supports layout analysis and is very powerful. Leptonica and libtiff are optional and can be installed before Tesseract, but it is recommended to install both libraries first. Without libtiff installed, only BMP files can be processed.
This section only describes how to recognize Chinese. After installing libtiff, Leptonica and Tesseract in turn, download the training data for Simplified Chinese and Traditional Chinese, which can be found on the Tesseract download page, and put it in the tessdata folder of some directory. Then set the environment variable TESSDATA_PREFIX to the directory that contains the tessdata folder. Then create a new file ocr.cpp and write the following code:
#include <mfcpch.h>
#include <ctype.h>
#include <sys/time.h>
#include "applybox.h"
#include "control.h"
#include "tessvars.h"
#include "tessedit.h"
#include "baseapi.h"
#include "thresholder.h"
#include "pageres.h"
#include "imgs.h"
#include "varabled.h"
#include "tprintf.h"
#include "stderr.h"
#include "notdll.h"
#include "mainblk.h"
#include "output.h"
#include "globals.h"
#include "helpers.h"
#include "blread.h"
#include "tfacep.h"
#include "callnet.h"
#include "allheaders.h"
int main(int argc, char **argv) {
    if (argc != 3) {
        printf("usage:%s <bmp file> <txt file>\n", argv[0]);
        return -1;
    }
    char *image_file = argv[1];
    char *txt_file = argv[2];
    STRING text_out;
    struct timeval beg, end;
    tesseract::TessBaseAPI api;
    IMAGE image;
    api.Init(argv[0], "chi_sim", NULL, 0, false); // Initialize the API object
    api.SetPageSegMode(tesseract::PSM_AUTO); // Set up automatic layout analysis
    api.SetAccuracyVSpeed(tesseract::AVS_FASTEST); // Request the fastest speed
    if (image.read_header(image_file) < 0) { // Read the meta information of the BMP file
        printf("Read of file %s failed.\n", image_file);
        exit(1);
    }
    if (image.read(image.get_ysize()) < 0) { // Read the BMP file
        printf("Read of image %s error\n", image_file);
        exit(1);
    }
    invert_image(&image); // Invert every pixel of the image: 1 becomes 0 and 0 becomes 1
    int bytes_per_line = check_legal_image_size(image.get_xsize(),
                                                image.get_ysize(),
                                                image.get_bpp()); // Number of bytes per row of pixels
    api.SetImage(image.get_buffer(), image.get_xsize(), image.get_ysize(),
                 image.get_bpp() / 8, bytes_per_line); // Set the image
    gettimeofday(&beg, NULL);
    char *text = api.GetUTF8Text(); // Recognize the text in the image
    gettimeofday(&end, NULL);
    printf("%s:recognize sec=%f\n", argv[0],
           end.tv_sec - beg.tv_sec + (double)(end.tv_usec - beg.tv_usec) / 1000000.0); // Print the recognition time
    text_out += text;
    delete[] text;
    FILE *fout = fopen(txt_file, "w");
    fwrite(text_out.string(), 1, text_out.length(), fout); // Write the recognition result to the output file
    fclose(fout);
}
Then write a makefile as follows:
all: ocr
CFLAGS=-Wall -g
LDFLAGS=-lz -lm -ltesseract_textord \
-ltesseract_wordrec -ltesseract_classify -ltesseract_dict -ltesseract_ccstruct \
-ltesseract_ccstruct -ltesseract_cutil -ltesseract_viewer -ltesseract_ccutil \
-ltesseract_api -ltesseract_image -ltesseract_main -llept
LD_LIBRARY_PATH =
INCLUDES=-I/usr/local/include/tesseract/ -I/usr/local/include/leptonica/
%.o: %.cpp
	g++ -c $(CFLAGS) $(INCLUDES) $(SOURCE) -o $@ $<
ocr: ocr.o
	g++ -o $@ $^ -g $(LD_LIBRARY_PATH) $(LDFLAGS)
clean:
	rm ocr.o
Running make in this directory compiles the executable ocr. Running ./ocr 1.bmp 1.txt writes the recognition result for the image 1.bmp to 1.txt, and the program prints the recognition time. It is worth noting that Tesseract's Chinese recognition is very slow; a run of several minutes is normal. Does any expert know how to tune it?
More depressing is that Tesseract does not support multi-threading and cannot run multiple instances in the same process.
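One workaround I can think of is to run several separate tesseract processes side by side, one per image, instead of several instances in one process. The sketch below assumes the tesseract command-line tool is on the PATH; the function names and the run parameter (there so the launcher can be swapped out) are my own.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ocr_command(image, lang="chi_sim"):
    """Command line that writes the result for image.bmp to image.txt."""
    base = os.path.splitext(image)[0]
    return ["tesseract", image, base, "-l", lang]


def ocr_many(images, lang="chi_sim", workers=2, run=subprocess.call):
    """Recognize several images at once, each in its own tesseract process."""
    commands = [ocr_command(image, lang) for image in images]
    # Each worker thread only waits for its child process to finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(run, commands))
    return dict(zip(images, codes))  # image name -> exit code
```

This does not make a single page any faster; it only keeps several processor cores busy when there are many pages.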
About the training and use of Tesseract-OCR 3