Reprint: Tesseract-OCR Learning Series

Source: Internet
Author: User

Reprint address: http://www.jianshu.com/p/a53c732d8da3

Tesseract-OCR Learning Series (3): A Simple Example

Tesseract API Basic Example with CMake Configuration

Reference document: https://github.com/tesseract-ocr/tesseract/wiki/APIExample

The API provided by Tesseract can be found in the baseapi.h file. However, without any examples to guide us, it is quite difficult to figure out how to call the Tesseract API.

We know that to use a third-party library, we need to add the following to the project properties:

    1. The location of the third-party library's header files.
    2. The location of the third-party library's library files.
    3. The file names of the third-party .lib files to link against.

Debug and Release also need to be configured separately. Manual configuration is very tedious. Worse, even once everything is configured, you must reconfigure if you move the third-party library; reconfigure if you hand the project to someone whose third-party libraries live in a different location; and reconfigure again if you switch to another operating system for development. Is there a way to go through this trouble only once and have it easy ever after? The answer is yes, and the tool is CMake. See my other article, a brief CMake tutorial. Here I give an example of how to use CMake to add a third-party library.

First, we need to gather everything the third-party library Tesseract provides into one place. For example, I created a tesseract folder under extralib on the F: drive. It contains four subfolders: bin, include, lib, and tessdata, where:

    • bin: stores the .dll files.
    • include: stores the .h files.
    • lib: stores the .lib files.
    • tessdata: stores the .traineddata files.

[Figure: bin folder]

[Figure: include folder]

The Tesseract .h files are fairly scattered across the source tree, so I simply searched for every .h file in the original tesseract source and copied them all here.

[Figure: lib folder]

[Figure: tessdata folder]

Here chi_sim stands for Simplified Chinese and eng, needless to say, for English. The contents of the tessdata folder can be downloaded from the official website.

Now that we have this folder, the next goal is for CMake to find it. To achieve this, first write a file named TesseractConfig.cmake and put it in the tesseract folder you just created. The tesseract folder then ends up looking like this:

[Figure: tesseract folder]

If CMake can find TesseractConfig.cmake, the find_package command can be used to obtain the path of each folder inside tesseract. But how does CMake find TesseractConfig.cmake in the first place? On Windows, there are two ways:

    1. Add the folder containing TesseractConfig.cmake to the PATH system environment variable.
    2. Set the folder manually in the CMake GUI.

Before the formal walkthrough, let's look at what goes into TesseractConfig.cmake:

    # ===================================================================================
    #  The Tesseract CMake configuration file
    #
    #  Usage from an external project:
    #    In your CMakeLists.txt, add these lines:
    #
    #    find_package(Tesseract REQUIRED)
    #    target_link_libraries(MY_TARGET_NAME ${Tesseract_LIBS})
    #
    #    This file will define the following variables:
    #      - Tesseract_LIBS          : The list of libraries to link against.
    #      - Tesseract_LIB_DIR       : The directory(es) where lib files are. Calling
    #                                  link_directories with this path is NOT needed.
    #      - Tesseract_INCLUDE_DIRS  : The Tesseract include directories.
    #      - Tesseract_VERSION       : The version of this Tesseract build. Example: "2.4.0"
    #      - Tesseract_VERSION_MAJOR : Major version part of Tesseract_VERSION. Example: "2"
    #      - Tesseract_VERSION_MINOR : Minor version part of Tesseract_VERSION. Example: "4"
    #      - Tesseract_VERSION_PATCH : Patch version part of Tesseract_VERSION. Example: "0"
    #
    #    Advanced variables:
    #      - Tesseract_CONFIG_PATH
    #
    # ===================================================================================

    set(Tesseract_VERSION_MAJOR 3)
    set(Tesseract_VERSION_MINOR 4)
    set(Tesseract_VERSION_PATCH 1)
    set(Tesseract_VERSION ${Tesseract_VERSION_MAJOR}.${Tesseract_VERSION_MINOR}.${Tesseract_VERSION_PATCH})

    get_filename_component(Tesseract_CONFIG_PATH "${CMAKE_CURRENT_LIST_FILE}" PATH CACHE)

    set(Tesseract_LIB_DIR "${Tesseract_CONFIG_PATH}/lib")
    set(Tesseract_INCLUDE_DIRS "${Tesseract_CONFIG_PATH}/include")

    set(Tesseract_LIBS_DBG
        "liblept171d.lib"
        "libtesseract304d.lib")
    set(Tesseract_LIBS_OPT
        "liblept171.lib"
        "libtesseract304.lib")

    foreach(__tesslib ${Tesseract_LIBS_DBG})
        list(APPEND Tesseract_LIBS debug "${Tesseract_LIB_DIR}/${__tesslib}")
    endforeach()
    foreach(__tesslib ${Tesseract_LIBS_OPT})
        list(APPEND Tesseract_LIBS optimized "${Tesseract_LIB_DIR}/${__tesslib}")
    endforeach()

    set(Tesseract_FOUND TRUE CACHE BOOL "" FORCE)

With that preparation done, we can start building the sample program Basic-example. First create a new folder named samples. Then, inside samples, create a new folder named Basic-example and a new file named CMakeLists.txt.

[Figure: samples folder]

The CMakeLists.txt here can be very simple (it could of course be more complicated, but for an example, simpler is better).

    cmake_minimum_required(VERSION 3.0)
    project(tesseract-api-examples)
    add_subdirectory(Basic-example)

The first statement sets the minimum required CMake version to 3.0 (versions older than 3.0 refuse to configure the project). The second defines a project (a solution, in Visual Studio terms) named tesseract-api-examples. The third adds the subdirectory Basic-example. Adding a subdirectory really means executing the CMakeLists.txt in that subdirectory, so whenever you use add_subdirectory, make sure the subdirectory contains a CMakeLists.txt file.

Now go into the Basic-example folder and create two new files: Basic-example.cpp and CMakeLists.txt.

In Basic-example.cpp, we paste in the example code that the experts have posted online:

    #include <tesseract/baseapi.h>
    #include <leptonica/allheaders.h>

    int main()
    {
        char *outText;

        tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
        // Initialize tesseract-ocr with English, without specifying tessdata path
        if (api->Init(NULL, "eng")) {
            fprintf(stderr, "Could not initialize tesseract.\n");
            exit(1);
        }

        // Open input image with leptonica library
        Pix *image = pixRead("D:\\open_source\\tesseract-3.04.01\\tesseract\\testing\\phototest.tif");
        api->SetImage(image);
        // Get OCR result
        outText = api->GetUTF8Text();
        printf("OCR output:\n%s", outText);

        // Destroy used object and release memory
        api->End();
        delete [] outText;
        pixDestroy(&image);

        return 0;
    }

In CMakeLists.txt, six statements are enough:

    set(the_target "Basic-example")
    find_package(Tesseract REQUIRED)
    aux_source_directory(. SRC_LIST)
    include_directories(${Tesseract_INCLUDE_DIRS})
    add_executable(${the_target} ${SRC_LIST})
    target_link_libraries(${the_target} ${Tesseract_LIBS})

where:

    • The first line sets the variable the_target to "Basic-example".
    • The second line looks for the third-party library Tesseract.
    • The third line collects all the .c and .cpp files in the current folder and stores their names in SRC_LIST.
    • The fourth line adds the third-party include directories, Tesseract_INCLUDE_DIRS.
    • The fifth line makes the build target Basic-example an executable.
    • The sixth line links the third-party libraries the target depends on.

Everything is ready, so let's build! Open cmake-gui.

Set the source path and the build path in CMake. If these two paths are unclear, please see the brief CMake tutorial.

Click Configure.

A dialog box appears asking which C++ compiler you are using. I'm using VS2012. Click Finish.

After a period of waiting, the following interface appears:

Notice the Tesseract_DIR line. It was found automatically on my machine because I had added this path to the PATH environment variable. You can either add your path to the environment variable or select the directory manually here. If you select it manually, the directory is saved in the cache, so you will not need to select it again the next time you configure.

Click Configure again.

The red highlighting disappears and the message area shows "Configuring done". At this point, click Generate.

Generation succeeded! Next, open the solution file tesseract-api-examples.sln in the build folder.

Set Basic-example as the startup project. Build: success!

Run! Uh-oh!

Sorry, I got too excited and forgot something! We also need to add the tesseract bin folder to the PATH environment variable so that the program can find the DLL files.

Now you can start debugging the program.


[Figure: phototest.tif]

OK, run the program.

It runs successfully!

Let's go back over this example program and see what it does.

    #include <tesseract/baseapi.h>
    #include <leptonica/allheaders.h>

    int main()
    {
        char *outText;

        tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
        // Initialize tesseract-ocr with English, without specifying tessdata path
        if (api->Init(NULL, "eng")) {
            fprintf(stderr, "Could not initialize tesseract.\n");
            exit(1);
        }

        // Open input image with leptonica library
        Pix *image = pixRead("D:\\open_source\\tesseract-3.04.01\\tesseract\\testing\\phototest.tif");
        api->SetImage(image);
        // Get OCR result
        outText = api->GetUTF8Text();
        printf("OCR output:\n%s", outText);

        // Destroy used object and release memory
        api->End();
        delete [] outText;
        pixDestroy(&image);

        return 0;
    }

Two header files are included first:

    #include <tesseract/baseapi.h>
    #include <leptonica/allheaders.h>

This shows that the sample program uses two libraries: Tesseract and Leptonica. Tesseract performs the OCR, while Leptonica handles the basic image processing.

Next, in the main function, an object is created:

    new tesseract::TessBaseAPI();

Here tesseract is the namespace and TessBaseAPI is the class name. The comment on this class reads:

    /**
     * Base class for all tesseract APIs.
     * Specific classes can add ability to work on different inputs or produce
     * different outputs.
     * This class is mostly an interface layer on top of the Tesseract instance
     * class to hide the data types so that users of this class don't have to
     * include any other Tesseract headers.
     */

In other words:

All of the Tesseract APIs are in this class.

So once we understand this class, we know every way to call the Tesseract API. That's good news! We will come back to this class later. For now, let's read through the code first.

    // Initialize tesseract-ocr with English, without specifying tessdata path
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

Take a look at the comment on Init:

    /**
     * Instances are now mostly thread-safe and totally independent,
     * but some global parameters remain. Basically it is safe to use multiple
     * TessBaseAPIs in different threads in parallel, UNLESS:
     * you use SetVariable on some of the Params in classify and textord.
     * If you do, then the effect will be to change it for all your instances.
     *
     * Start tesseract. Returns zero on success and -1 on failure.
     * NOTE that the only members that may be called before Init are those
     * listed above here in the class definition.
     *
     * The datapath must be the name of the parent directory of tessdata and
     * must end in / . Any name after the last / will be stripped.
     * The language is (usually) an ISO 639-3 string or NULL will default to eng.
     * It is entirely safe (and eventually will be efficient too) to call
     * Init multiple times on the same instance to change language, or just
     * to reset the classifier.
     * The language may be a string of the form [~]<lang>[+[~]<lang>]* indicating
     * that multiple languages are to be loaded. Eg hin+eng will load Hindi and
     * English. Languages may specify internally that they want to be loaded
     * with one or more other languages, so the ~ sign is available to override
     * that. Eg if hin were set to load eng by default, then hin+~eng would force
     * loading only hin. The number of loaded languages is limited only by
     * memory, with the caveat that loading additional languages will impact
     * both speed and accuracy, as there is more work to do to decide on the
     * applicable language, and there is more chance of hallucinating incorrect
     * words.
     * WARNING: On changing languages, all Tesseract parameters are reset
     * back to their default values. (Which may vary between languages.)
     * If you have a rare need to set a Variable that controls
     * initialization for a second call to Init you should explicitly
     * call End() and then use SetVariable before Init. This is only a very
     * rare use case, since there are very few uses that require any parameters
     * to be set before Init.
     *
     * If set_only_non_debug_params is true, only params that do not contain
     * "debug" in the name will be set.
     */

Reading such a long English comment is tiring, so here is a paraphrase:

Instances are thread-safe in most cases and completely independent, although some global parameters remain. It is basically safe to use multiple TessBaseAPI instances in parallel in different threads, unless you use SetVariable to change certain parameters (some of those in classify and textord). If you do, the change takes effect for all of your instances.

Start Tesseract. Returns 0 on success and -1 on failure. Note that the only member functions that may be called before Init are those listed before Init in the class definition.

The datapath must be the parent directory of tessdata and must end with /. Anything after the last / is stripped. The language parameter is usually an ISO 639-3 string; if it is NULL, it defaults to eng. It is entirely safe to call Init multiple times on the same instance to change the language or just to reset the classifier (and eventually this will be efficient too).
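The datapath rule described above (anything after the last / is stripped) can be sketched in a few lines of C++. This is only an illustration of the rule as the comment states it, not Tesseract's actual code: normalizeDataPath is a hypothetical helper name, and only forward slashes are handled.

```cpp
#include <cassert>
#include <string>

// Sketch of the rule quoted in the Init comment: whatever follows the last
// '/' is dropped, so a non-empty result always ends in '/'.
// normalizeDataPath is a hypothetical helper, not a Tesseract function.
std::string normalizeDataPath(const std::string& datapath) {
    const std::string::size_type lastSlash = datapath.rfind('/');
    if (lastSlash == std::string::npos) {
        // No '/' at all: under this rule nothing usable remains.
        return std::string();
    }
    std::string result = datapath.substr(0, lastSlash + 1);
    assert(result.back() == '/');  // postcondition of the rule
    return result;
}
```

Under this sketch, a path that already ends in / comes back unchanged, while a trailing name such as tessdata after the last / is removed.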

The language parameter can be written in the form [~]<lang>[+[~]<lang>]*, which indicates that multiple languages are to be loaded. For example, hin+eng loads Hindi and English. A language may specify internally that it wants to be loaded together with one or more other languages, so the ~ sign is available to override that. For example, if hin were set to load eng by default, hin+~eng would force only hin to be loaded. The number of languages that can be loaded is limited only by memory, but loading additional languages affects both speed and accuracy, because there is more work to do to decide which language applies and a greater chance of producing incorrect words.

Warning: once the language is changed, all Tesseract parameters are reset to their default values (which may vary between languages).
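To make the [~]<lang>[+[~]<lang>]* syntax more concrete, here is a minimal sketch that splits such a string into the languages to load and the ones marked with ~. It only illustrates the syntax; Tesseract does its own parsing inside Init, and LanguageSpec and parseLanguageString are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Result of splitting a language string of the form [~]<lang>[+[~]<lang>]*.
struct LanguageSpec {
    std::vector<std::string> toLoad;     // plain entries, e.g. "hin", "eng"
    std::vector<std::string> notToLoad;  // entries that were prefixed with '~'
};

// Hypothetical helper that illustrates the syntax only.
LanguageSpec parseLanguageString(const std::string& languages) {
    LanguageSpec spec;
    std::string::size_type start = 0;
    while (start <= languages.size()) {
        // Each item runs up to the next '+' or to the end of the string.
        std::string::size_type end = languages.find('+', start);
        if (end == std::string::npos) {
            end = languages.size();
        }
        assert(end >= start);
        const std::string item = languages.substr(start, end - start);
        if (!item.empty() && item[0] == '~') {
            if (item.size() > 1) {
                spec.notToLoad.push_back(item.substr(1));
            }
        } else if (!item.empty()) {
            spec.toLoad.push_back(item);
        }
        start = end + 1;
    }
    return spec;
}
```

With this sketch, "hin+eng" yields two languages to load, while "hin+~eng" yields hin to load and eng explicitly excluded.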

Now back to the code:

    // Open input image with leptonica library
    Pix *image = pixRead("D:\\open_source\\tesseract-3.04.01\\tesseract\\testing\\phototest.tif");

pixRead is a Leptonica function that reads an image file and stores the result in a Pix structure.

    api->SetImage(image);

SetImage is a Tesseract function that supplies the image to be recognized.

    // Get OCR result
    outText = api->GetUTF8Text();

The GetUTF8Text function recognizes the text in the image and returns it as a char* array.
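Because the returned text is UTF-8 encoded, the number of bytes and the number of characters can differ, which matters once you recognize languages such as chi_sim. The following sketch counts characters (code points) in a UTF-8 string. countCodePoints is a hypothetical helper, not part of Tesseract, and it assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>

// Counts Unicode code points in a NUL-terminated UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx). Assumes valid UTF-8.
// countCodePoints is a hypothetical helper, not part of Tesseract.
std::size_t countCodePoints(const char* utf8) {
    assert(utf8 != nullptr);
    std::size_t count = 0;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(utf8);
    for (; *p != 0; ++p) {
        if ((*p & 0xC0) != 0x80) {
            ++count;  // this byte starts a new code point
        }
    }
    return count;
}
```

For pure ASCII output the result equals strlen, but each Chinese character typically occupies three bytes in UTF-8, so the two values diverge there.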

    // Destroy used object and release memory
    api->End();
    delete [] outText;
    pixDestroy(&image);

The last part is release and destruction.

The comment on the End method in the source reads:

    /**
     * Close down tesseract and free up all memory. End() is equivalent to
     * destructing and reconstructing your TessBaseAPI.
     * Once End() has been used, none of the other API functions may be used
     * other than Init and anything declared above it in the class definition.
     */
    void End();

Finally, the text array and the image are freed. Nothing more needs to be said; it is all reasonable.
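The manual cleanup above works, but it is easy to forget a step on an early return. One possible alternative, shown here only as a sketch, is to let std::unique_ptr do the releasing. So that the example compiles on its own, it uses stand-ins rather than the real libraries: FakeImage plays the role of Leptonica's Pix, destroyFakeImage mirrors the pixDestroy(&image) calling style, and makeText mimics a function that returns a new[]-allocated string as GetUTF8Text does. All of these names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Stand-in for Leptonica's Pix (hypothetical, for illustration only).
struct FakeImage { int width; int height; };

int g_destroyedImages = 0;  // counts destroy calls, for demonstration only

// Mirrors the "destroy through a pointer-to-pointer" style of pixDestroy.
void destroyFakeImage(FakeImage** image) {
    assert(image != nullptr);
    if (*image != nullptr) {
        delete *image;
        *image = nullptr;
        ++g_destroyedImages;
    }
}

// Mimics a function that hands back a new[]-allocated string.
char* makeText() {
    const char* source = "OCR output";
    const std::size_t length = std::strlen(source);
    char* copy = new char[length + 1];
    std::memcpy(copy, source, length + 1);
    return copy;
}

// Deleter that adapts the Destroy(T**) style to std::unique_ptr.
struct FakeImageDeleter {
    void operator()(FakeImage* image) const { destroyFakeImage(&image); }
};

using ImagePtr = std::unique_ptr<FakeImage, FakeImageDeleter>;
using TextPtr = std::unique_ptr<char[]>;  // releases with delete[]

// Returns the text length; both resources are released automatically when
// the function returns, on every path.
std::size_t useResources() {
    ImagePtr image(new FakeImage{640, 480});
    TextPtr text(makeText());
    return std::strlen(text.get());
}
```

With the real libraries, the same idea would presumably wrap the Pix pointer with a deleter that calls pixDestroy and hold the recognized text in a std::unique_ptr<char[]>; treat that mapping as an assumption to verify against your Tesseract and Leptonica versions.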

If you need a complete sample file and CMakeLists.txt, click here to download.
