Training and using Tesseract-OCR 3

Source: Internet
Author: User

Tesseract is an excellent open source character recognition engine. It can be downloaded from http://code.google.com/p/tesseract-ocr/downloads/list.

I recommend version 3 rather than version 2. Version 2 can be linked directly into a project, but it has some obvious bugs and other problems that often stop the program from running or even crash it. So the command-line version of 3 is recommended.

Besides the Tesseract installer, you can also download language data from the download page, or select languages during installation.

I. Training

In many cases the default language data is highly accurate, but sometimes you need to train your own. The training steps are as follows:

Note:

A. The commands below are DOS commands. Save the commands of each step to a .bat file and run it, or open cmd and change to the directory where tesseract is installed.

B. Keep the file names consistent (ddt is used here).

0. Copy all files in the training directory to the tesseract3 directory

copy .\training\*.exe .\

1. Mark borders

tesseract ddt.tif ddt -l eng digits batch.nochop makebox

Explanation:

ddt.tif is the file to be recognized. JPG, GIF, TIFF and other formats are supported; TIF is recommended.

ddt is the base name of the output file (the .box extension is added automatically).

-l eng selects the language data used to mark the borders.

The remaining arguments are config files: the other Tesseract parameters are loaded from files rather than passed directly on the command line.

digits restricts recognition to the digits 0-9 (you can of course edit it to include more characters). Remove it when you do not need the restriction, but limiting the character set keeps to a minimum the misrecognized characters you will have to fix when editing ddt.box.
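The argument order explained above can be sketched in Python. This is only an illustration: the helper name tesseract_command is hypothetical, and the arguments are the ones from the step 1 command.

```python
def tesseract_command(image, output_base, lang, configs=()):
    """Return the argv list: tesseract <image> <output_base> -l <lang> [configs...]."""
    # Hypothetical helper: it only assembles the arguments described in the text.
    return ["tesseract", image, output_base, "-l", lang, *configs]


# The step 1 command: mark borders on ddt.tif with the eng data, digits only.
makebox = tesseract_command("ddt.tif", "ddt", "eng",
                            ["digits", "batch.nochop", "makebox"])
print(" ".join(makebox))  # tesseract ddt.tif ddt -l eng digits batch.nochop makebox
```

Leaving configs empty gives the unrestricted form of the command, without the digits limit.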

Note:

This step is critical but often goes wrong, even if you use bbtesseract from http://code.google.com/p/bbtesseract/downloads/list. I feel I ought to write a border-marking tool myself, but I have not had time. Wrapping these commands in a graphical tool is entirely feasible. Until then, edit the generated ddt.box carefully so that every character is identified and every border is correct.

There is also a small trick. I once made a TIF containing 1.2-34567089, and in this step only the 2-9 part was recognized. I changed the TIF to 001.2-34567089 and everything was recognized. Maybe this gives you some ideas.
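When editing ddt.box by hand, a quick sanity check can catch broken borders. The following is a minimal Python sketch, not part of the original workflow. It assumes the usual Tesseract box line layout, "character left bottom right top" with an optional page number at the end, which the article itself does not spell out. The function name and the messages are hypothetical.

```python
def check_box_lines(lines, width, height):
    """Return a list of (line_number, message) for suspicious box entries."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 5:
            problems.append((number, "expected: char left bottom right top"))
            continue
        try:
            left, bottom, right, top = (int(v) for v in fields[1:5])
        except ValueError:
            problems.append((number, "coordinates must be integers"))
            continue
        if left >= right or bottom >= top:
            problems.append((number, "empty or inverted box"))
        elif left < 0 or bottom < 0 or right > width or top > height:
            problems.append((number, "box lies outside the image"))
    return problems
```

To use it, read ddt.box into a list of lines and pass the pixel width and height of ddt.tif. It only checks the geometry; whether each character is the right one still has to be checked by eye.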

2. Build the language library

tesseract ddt.tif ddt -l eng digits nobatch box.train
unicharset_extractor ddt.box
rename unicharset ddt.unicharset
mftraining -U unicharset -O ddt.unicharset ddt.tr
rename inttemp ddt.inttemp
rename pffmtable ddt.pffmtable
rename Microfeat ddt.Microfeat
cntraining ddt.tr
rename normproto ddt.normproto
combine_tessdata ddt.

This involves quite a few steps, but other people's tutorials already explain them at length, so I will not repeat that here.

Note: the rename commands are necessary because the generated files carry only the bare name (for example inttemp) without the ddt. prefix. As long as you pay attention to this, there is no problem.
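Since every command in step 2 has to use the same base name, one way to keep the names consistent is to generate the command list from a single variable. This Python sketch simply reproduces the commands of step 2 for an arbitrary base name. The function name training_commands is hypothetical and the sketch does not run anything.

```python
def training_commands(name, lang="eng", configs=("digits",)):
    """Return the step 2 command lines for the base name `name`."""
    train = ["tesseract", f"{name}.tif", name, "-l", lang, *configs,
             "nobatch", "box.train"]
    return [
        " ".join(train),
        f"unicharset_extractor {name}.box",
        f"rename unicharset {name}.unicharset",
        f"mftraining -U unicharset -O {name}.unicharset {name}.tr",
        # The trainers write files without the base-name prefix, hence the renames.
        f"rename inttemp {name}.inttemp",
        f"rename pffmtable {name}.pffmtable",
        f"rename Microfeat {name}.Microfeat",
        f"cntraining {name}.tr",
        f"rename normproto {name}.normproto",
        f"combine_tessdata {name}.",
    ]
```

Joining the result with line breaks and saving it as a .bat file gives the same script as above, with the name changed in one place only.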

3. Test the language library

copy ddt.traineddata .\tessdata\ddt.traineddata

tesseract ddt.tif ddt -l ddt
notepad ddt.txt

If the test fails, check the following:

A. Whether the TIF is too narrow. If so, I suggest adding a line below the text, that is, turning one row into two. Fill it with any characters from your font, but the line should preferably be as wide as the image.

B. If the text is still not recognized correctly, go back and check your ddt.box.

If it fails, remember to clean up the files generated earlier. You can use these commands:

copy ddt.tif tmp.tif
del ddt.* /f /s
copy tmp.tif ddt.tif
del tmp.tif

Then start over from the first step.
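The same cleanup can be sketched in Python. This is an illustration under stated assumptions, not the author's script: the function name clean_training_files is hypothetical, and unlike del with /s it does not descend into subdirectories, so a copy of ddt.traineddata under tessdata is left alone.

```python
from pathlib import Path


def clean_training_files(directory, name):
    """Delete every <name>.* file in `directory` except <name>.tif.

    Returns the names of the removed files in sorted order.
    """
    keep = f"{name}.tif".lower()
    removed = []
    for path in sorted(Path(directory).glob(f"{name}.*")):
        if path.is_file() and path.name.lower() != keep:
            path.unlink()
            removed.append(path.name)
    return removed
```

Because the source image is skipped instead of being copied away and back, no temporary tmp.tif is needed.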

II. Use

In use, note that an image with a single line and only a few characters may not be recognized. In that case it is best to add a row of filler text below it and make sure that row spans roughly the full image width.

Note:

At run time the language data may not be found (especially if you uninstalled and then reinstalled tesseract). In that case, set the value of HKEY_CURRENT_USER\Environment\TESSDATA_PREFIX to the directory where your tesseract resides.
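As a variation on the registry setting, a program that launches tesseract can set TESSDATA_PREFIX only in its own process environment, which child processes inherit, and restore the previous value afterwards. Note that this is a different technique from editing the registry, and the context manager name tessdata_prefix is hypothetical. A minimal Python sketch:

```python
import os
from contextlib import contextmanager


@contextmanager
def tessdata_prefix(directory):
    """Temporarily set TESSDATA_PREFIX for this process and its children."""
    previous = os.environ.get("TESSDATA_PREFIX")
    os.environ["TESSDATA_PREFIX"] = directory
    try:
        yield
    finally:
        # Restore the original state, including "variable was not set".
        if previous is None:
            os.environ.pop("TESSDATA_PREFIX", None)
        else:
            os.environ["TESSDATA_PREFIX"] = previous
```

Any tesseract command started inside the with block sees the new value, and nothing is changed permanently for the user.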

III. Examples

Finally, here is sample code that uses Tesseract 3 from VB.NET.

Public Class TESSOCR

    Dim path As String = My.Application.Info.DirectoryPath & "\tesseract3\"

    Sub New()
        ' Point TESSDATA_PREFIX at the tesseract3 directory next to the application
        My.Computer.Registry.CurrentUser.OpenSubKey("Environment", True).SetValue("TESSDATA_PREFIX", path)
    End Sub

    Public Function TESS3OCR(ByVal rect As Rectangle, ByVal clr As Integer) As String
        ' Create the image. The screen must be copied with SourceCopy to get an image format the OCR accepts; otherwise tesseract reports an error or simply exits
        Dim bmp As Bitmap = New Bitmap(rect.Width, rect.Height * 2)
        Dim gr As Graphics = Graphics.FromImage(bmp)
        gr.Clear(Color.White)
        gr.CopyFromScreen(rect.Location, Point.Empty, rect.Size, CopyPixelOperation.SourceCopy)
        ' Convert to black text on a white background
        For y As Integer = 0 To bmp.Height - 1
            For x As Integer = 0 To bmp.Width - 1
                If bmp.GetPixel(x, y).ToArgb = clr Then bmp.SetPixel(x, y, Color.Black) Else bmp.SetPixel(x, y, Color.White)
            Next
        Next
        ' Draw a line of filler text in the lower half of the image.
        ' AngleColor is not defined in this excerpt; it is assumed to be declared elsewhere.
        Dim str As String = IIf(clr = AngleColor, "45.000000", "0.000000")
        ' Assumption: the font size is garbled in the source ("+"); 12 is only a placeholder.
        Dim fontSize As Single = 12
        gr.DrawString(str, New Font("Arial Black", fontSize), Brushes.Black, 0, rect.Height)

        bmp.Save(path & "tmp.tif", System.Drawing.Imaging.ImageFormat.Tiff)
        ' The last argument (True) makes Shell wait until tesseract has finished
        Shell(path & "tesseract " & path & "tmp.tif " & path & "tmp -l ddt digits", AppWinStyle.Hide, True)
        My.Computer.FileSystem.DeleteFile(path & "tmp.tif")
        ' Keep only the first line; the second line is the filler text
        Dim ret As String = My.Computer.FileSystem.ReadAllText(path & "tmp.txt").Split(vbCrLf)(0)
        My.Computer.FileSystem.DeleteFile(path & "tmp.txt")
        Return ret
    End Function

End Class

In the constructor (New) I modify the registry to prevent errors. A better practice would be to record the original value and restore it when the class is destroyed. The first comment in TESS3OCR points out a problem that can occur when copying from the screen; if you are reading a verification code image instead, you do not need to worry about it. Next comes a simple correction of the image. Note that it must be converted to black text on a white background, otherwise it will not be recognized. I then add a line of filler text below and discard it when returning the value. One more thing to note: the last parameter of the Shell function tells it to wait until the called process has finished. If you want to do this in VB6, implement the wait with the API rather than with Sleep or similar, which would make your program less robust.
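The two processing steps described here, converting the image to black text on white and discarding the filler line from the result, can be sketched in Python without any imaging library. This is only an illustration: pixels are assumed to be a list of rows of integer color values, and the function names binarize and first_line are hypothetical.

```python
def binarize(pixels, text_color, black=0x000000, white=0xFFFFFF):
    """Map pixels equal to text_color to black and all others to white."""
    return [[black if p == text_color else white for p in row] for row in pixels]


def first_line(text):
    """Return the first line of the OCR output, dropping the filler line below it."""
    lines = text.splitlines()
    return lines[0] if lines else ""
```

An exact color match like this suits rendered text such as a screen capture. A scanned or anti-aliased image would need a tolerance instead, which this sketch does not attempt.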

Transferred from: http://blog.csdn.net/foxwit/article/details/6547465

How to use the Tesseract OCR engine

I have been working with OCR recently and studied Google's OCR engine Tesseract, which is a good recognition tool. Tesseract 3.0 supports layout analysis and is very powerful. Leptonica and libtiff are optional and are installed before Tesseract, but I recommend installing both libraries first. Without libtiff, only BMP files can be processed.

Here I only describe how to recognize Chinese. After installing libtiff, Leptonica and Tesseract in that order, download the training data for Simplified Chinese and Traditional Chinese, which can be found on the Tesseract download page, and put it in a tessdata folder under some directory. Then set the environment variable TESSDATA_PREFIX to the directory that contains tessdata. Then create a new file ocr.cpp and write the following code:

#include <mfcpch.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "applybox.h"
#include "control.h"
#include "tessvars.h"
#include "tessedit.h"
#include "baseapi.h"
#include "thresholder.h"
#include "pageres.h"
#include "imgs.h"
#include "varabled.h"
#include "tprintf.h"
#include "stderr.h"
#include "notdll.h"
#include "mainblk.h"
#include "output.h"
#include "globals.h"
#include "helpers.h"
#include "blread.h"
#include "tfacep.h"
#include "callnet.h"
#include "allheaders.h"

int main(int argc, char **argv) {
  if (argc != 3) {
    printf("usage:%s <bmp file> <txt file>\n", argv[0]);
    return -1;
  }
  char *image_file = argv[1];
  char *txt_file = argv[2];
  STRING text_out;
  struct timeval beg, end;
  tesseract::TessBaseAPI api;
  IMAGE image;

  api.Init(argv[0], "chi_sim", NULL, 0, false);  // Initialize the API object
  api.SetPageSegMode(tesseract::PSM_AUTO);  // Use automatic layout analysis
  api.SetAccuracyVSpeed(tesseract::AVS_FASTEST);  // Ask for the fastest speed

  if (image.read_header(image_file) < 0) {  // Read the BMP header
    printf("Read of file %s failed.\n", image_file);
    exit(1);
  }
  if (image.read(image.get_ysize()) < 0) {  // Read the BMP pixel data
    printf("Read of image %s error\n", image_file);
    exit(1);
  }
  invert_image(&image);  // Invert every pixel: 1 becomes 0 and 0 becomes 1

  // Calculate the number of bytes per row of pixels
  int bytes_per_line = check_legal_image_size(image.get_xsize(),
                                              image.get_ysize(),
                                              image.get_bpp());
  api.SetImage(image.get_buffer(), image.get_xsize(), image.get_ysize(),
               image.get_bpp() / 8, bytes_per_line);  // Set the image

  gettimeofday(&beg, NULL);
  char *text = api.GetUTF8Text();  // Recognize the text in the image
  gettimeofday(&end, NULL);
  // Print the recognition time
  printf("%s:recognize sec=%f\n", argv[0],
         end.tv_sec - beg.tv_sec + (double)(end.tv_usec - beg.tv_usec) / 1000000.0);

  text_out += text;
  delete [] text;

  FILE *fout = fopen(txt_file, "w");
  fwrite(text_out.string(), 1, text_out.length(), fout);  // Write the recognition result to the output file
  fclose(fout);
}

Then write a makefile as follows:

all: ocr

CFLAGS = -Wall -g
LDFLAGS = -lz -lm -ltesseract_textord \
	-ltesseract_wordrec -ltesseract_classify -ltesseract_dict -ltesseract_ccstruct \
	-ltesseract_ccstruct -ltesseract_cutil -ltesseract_viewer -ltesseract_ccutil \
	-ltesseract_api -ltesseract_image -ltesseract_main -llept
LD_LIBRARY_PATH =
INCLUDES = -I/usr/local/include/tesseract/ -I/usr/local/include/leptonica/

%.o: %.cpp
	g++ -c $(CFLAGS) $(INCLUDES) $(SOURCE) -o $@ $<

ocr: ocr.o
	g++ -o $@ $^ -g $(LD_LIBRARY_PATH) $(LDFLAGS)

clean:
	rm ocr.o

Running make in this directory compiles the executable ocr. Running ./ocr 1.bmp 1.txt writes the recognition result for the image 1.bmp to 1.txt, and the program prints the recognition time. Note that Tesseract's Chinese recognition is very slow; a run of several minutes is normal. Does any expert know how to tune it?
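The recognition time that the program prints is the difference between two gettimeofday readings, each made of whole seconds and microseconds. The arithmetic can be checked with a short Python sketch. The function name elapsed_seconds is hypothetical, and the pairs stand in for the tv_sec and tv_usec fields.

```python
def elapsed_seconds(beg, end):
    """Elapsed time between two (tv_sec, tv_usec) pairs, in seconds."""
    # The microsecond difference may be negative; adding it to the
    # whole-second difference still gives the correct total.
    return (end[0] - beg[0]) + (end[1] - beg[1]) / 1_000_000.0
```

For example, a start of 10 s plus 900000 µs and an end of 12 s plus 100000 µs gives 2 - 0.8, about 1.2 seconds.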

More frustrating is that Tesseract does not support multithreading and cannot run multiple instances in the same process.

