Tesseract 3.x: OCR architecture and principle analysis

The history of Tesseract

Tesseract is an open-source OCR engine that was developed at Hewlett-Packard's Bristol lab between 1984 and 1994. It was originally a text-recognition engine for HP's flatbed scanners. Tesseract received wide attention after the 1995 UNLV OCR accuracy test. HP later gave up the OCR market, and development of Tesseract stopped after 1994.

In 2005, HP contributed Tesseract to the open-source community. The Information Science Research Institute at the University of Nevada, Las Vegas (UNLV) obtained the source, and Google began to extend and optimize Tesseract. Tesseract has since found new life as an open-source project hosted on Google Code. At the time of writing, the latest version of Tesseract is 3.02, which supports more than 60 languages and provides both an engine and a command-line tool. The official download address was linked in the original post (Guzhenping's portal).

Tesseract architecture analysis

The Tesseract engine is powerful, and it can broadly be divided into two parts: image layout analysis, and character segmentation and recognition.

Image layout analysis is the preparation for character recognition. Its job is to use a hybrid, tab-stop-based page layout analysis method to tell apart the tables, text, pictures and other content in the image.

Character segmentation and recognition is the design goal of Tesseract as a whole, and it is the most complicated part of the work. Character cutting comes first, and Tesseract uses a two-step strategy. In the first step, it uses the gaps between characters to do a rough segmentation, which yields most characters but also leaves some joined (touching) characters and some wrongly split characters. A first recognition pass is run at this point: the type of each character region is judged, and the region is compared against the character library to recognize it. In the second step, based on the characters identified, the joined characters are split and the wrongly split pieces are merged, which completes the fine segmentation of characters.
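The rough, gap-based step can be illustrated with a small sketch. This is not Tesseract's implementation; it is a minimal Python illustration that assumes a text line is given as a tiny binary bitmap ('#' for ink, '.' for background). The names rough_segments, flag_suspects, LINE and the ratio threshold are all hypothetical. It cuts the line wherever a column contains no ink, then flags unusually wide pieces as candidates for the later fine segmentation.

```python
import statistics


def rough_segments(bitmap):
    """Split a binary text-line bitmap at blank columns.

    bitmap: list of equal-length strings, '#' = ink, '.' = background.
    Returns (start, end) column ranges, with end exclusive.
    """
    if not bitmap:
        return []
    width = len(bitmap[0])
    # A column "has ink" if any row contains '#' in that column.
    has_ink = [any(row[x] == "#" for row in bitmap) for x in range(width)]

    segments = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a new piece begins
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column ends the piece
            start = None
    if start is not None:
        segments.append((start, width))    # piece runs to the right edge
    return segments


def flag_suspects(segments, ratio=1.5):
    """Return pieces much wider than the median width.

    Such pieces may be joined characters that a second, finer
    segmentation pass would need to split (assumed heuristic).
    """
    if not segments:
        return []
    median = statistics.median(end - start for start, end in segments)
    return [(s, e) for s, e in segments if (e - s) > ratio * median]


# Two isolated glyphs followed by two glyphs that touch each other.
LINE = [
    "###..###..######",
    "#.#..#.#..#.##.#",
    "###..###..######",
]

pieces = rough_segments(LINE)
suspects = flag_suspects(pieces)
```

Here pieces holds three column ranges, and the last, double-width range is the only one flagged. In the real engine the decision to split or merge relies on recognition results rather than on width alone.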

There is, of course, another way of describing it, in which the engine is divided into four parts: analyzing connected components, finding block regions, finding text lines and words, and recognizing the text.
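To make the first of these four parts concrete, here is a minimal sketch of connected-component analysis on a tiny binary bitmap. It is a generic breadth-first labelling, not Tesseract's own code, and the function name connected_components, the PAGE sample and the choice of 8-connectivity are assumptions made for illustration.

```python
from collections import deque


def connected_components(bitmap):
    """Return bounding boxes of 8-connected ink regions.

    bitmap: list of equal-length strings, '#' = ink, '.' = background.
    Each box is (left, top, right, bottom) with inclusive coordinates,
    listed in the order the regions are first met in a row-by-row scan.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []

    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != "#" or seen[y][x]:
                continue
            # Flood-fill one region and track its extent.
            queue = deque([(x, y)])
            seen[y][x] = True
            left = right = x
            top = bottom = y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] == "#"
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


PAGE = [
    "##....#",
    "##....#",
    ".......",
    "...##..",
]

boxes = connected_components(PAGE)
```

The resulting boxes are the raw material for the later parts: nearby boxes are grouped into blocks, then into text lines and words, before any recognition takes place.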

Figure: The four main parts of Tesseract (this represents only Guzhenping's personal view; please do not plagiarize)


Here is a detailed example:
PS: This example also appears in Ray Smith's paper "Adapting the Tesseract Open Source OCR Engine for Multilingual OCR".

I would rather not paste the text, so here is the figure directly:

The Tesseract architecture cannot be explained fully in these few words of mine. Comments with additions and corrections are welcome. Thank you for your cooperation.

Tesseract implementation principles

The principles in this area are quite complex, so this post only covers what relates to TessBaseAPI. More will be added in follow-up posts.

TessBaseAPI is a core class of the Tesseract engine. The source code of this class was linked in the original post (Guzhenping's portal). Let's look at how the functions of this class operate, and use that to work out the implementation principles of the Tesseract engine. The mechanism is as follows: call the Init() method to initialize the engine, call the SetImage() method to set the image data, and then obtain the text through the GetUTF8Text() method. The result is passed to Recognizedtext, where the text is checked for correctness and then output. At this point its own trim() and length() methods are called to do the corresponding processing.

About the Init() method, the official API documentation says:

Instances are now mostly thread-safe and totally independent, but some global parameters remain. Basically it is safe to use multiple TessBaseAPIs in different threads in parallel, unless: you use SetVariable on some of the Params in classify and textord. If you do, then the effect will be to change it for all your instances.

Start tesseract. Returns zero on success and -1 on failure. Note that the only members that may be called before Init are those listed above here in the class definition.

The datapath must be the name of the parent directory of tessdata and must end in /. Any name after the last / will be stripped. The language is (usually) an ISO 639-3 string or NULL will default to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change language, or just to reset the classifier. The language may be a string of the form [~]<lang>[+[~]<lang>]* indicating that multiple languages are to be loaded. Eg hin+eng will load Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. Eg if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages will impact both speed and accuracy, as there is more work to do to decide on the applicable language, and there is more chance of hallucinating incorrect words. Warning: on changing languages, all Tesseract parameters are reset back to their default values. (Which may vary between languages.) If you have a rare need to set a Variable that controls initialization for a second call to Init you should explicitly call End() and then use SetVariable before Init. This is only a very rare use case, since there are very few uses that require any parameters to be set before Init.
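The language string format described above can be illustrated with a short sketch. This is not Tesseract's parser; it is a minimal Python model of the documented rules, and the names parse_languages, resolve and the implied mapping (which languages a language asks to load alongside itself) are hypothetical.

```python
def parse_languages(language):
    """Split a string of the form [~]<lang>[+[~]<lang>]*.

    Returns (to_load, not_to_load). An empty or None string falls back
    to "eng", mirroring the documented NULL default.
    """
    if not language:
        language = "eng"
    to_load, not_to_load = [], []
    for token in language.split("+"):
        if token.startswith("~"):
            name, target = token[1:], not_to_load   # "~eng": do not load eng
        else:
            name, target = token, to_load
        if name and name not in target:
            target.append(name)
    return to_load, not_to_load


def resolve(language, implied=None):
    """Expand a language string into the final list of languages.

    implied: hypothetical mapping from a language to the extra
    languages it asks to load with it. Extras listed with "~" in the
    language string are skipped.
    """
    implied = implied or {}
    to_load, not_to_load = parse_languages(language)
    result = list(to_load)
    i = 0
    while i < len(result):
        for extra in implied.get(result[i], []):
            if extra not in result and extra not in not_to_load:
                result.append(extra)
        i += 1
    return result


# Assumed for illustration: hin is configured to load eng by default.
IMPLIED = {"hin": ["eng"]}

both = resolve("hin", IMPLIED)          # hin pulls in eng
only_hin = resolve("hin+~eng", IMPLIED)  # "~eng" overrides the default
```

With the assumed mapping, the first call yields both languages while the second yields only hin, which matches the hin+~eng example in the documentation text.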

If set_only_non_debug_params is true, only params that do not contain "debug" in the name will be set.

The datapath must be the name of the data directory (no ending /) or some other file in which the data directory resides (for instance argv[0]). The language is (usually) an ISO 639-3 string or NULL will default to eng. If numeric_mode is true, then only digits and Roman numerals will be returned. Returns 0 on success and -1 on initialization failure.

For the other functions, please read the API documentation yourself; I will not post them here. The API documentation was linked in the original post (API portal).

Figure: Some of the built-in methods of TessBaseAPI

The end

I feel I have not explained this clearly, so comments are welcome.
Criticism from experts and teachers is especially welcome.
Thank you.

This content is from Guzhenping's blog. Please respect the original and indicate the source when reproducing it.
Thank you ~
