Making Tesseract recognize only characters in a specified range

Source: Internet
Author: User
Tags: tesseract ocr

Tesseract can be used for OCR. Emgu CV, the C# wrapper for OpenCV, integrates the Tesseract engine.

However, misrecognition often occurs in practice, such as reading "S" as "5", or "1" as "l" or "I". You can set a parameter so that only characters in a specified range are recognized.


The following is the API documentation for this constructor in Emgu CV:

Emgu.CV.OCR.Tesseract.Tesseract(string, string, Emgu.CV.OCR.Tesseract.OcrEngineMode, string)

public Tesseract(string dataPath, string language, Emgu.CV.OCR.Tesseract.OcrEngineMode mode, string whiteList)
Member of Emgu.CV.OCR.Tesseract

Summary:
Create a Tesseract OCR engine.

Parameters:
dataPath: The datapath must be the name of the parent directory of tessdata and must end in /. Any name after the last / will be stripped.
language: The language is (usually) an ISO 639-3 string or null, which defaults to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change language, or just to reset the classifier. The language may be a string of the form [~]<lang>[+[~]<lang>]* indicating that multiple languages are to be loaded. E.g. hin+eng will load Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. E.g. if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages will impact both speed and accuracy, as there is more work to do to decide on the applicable language, and there is more chance of hallucinating incorrect words.
mode: OCR engine mode
whiteList: This can be used to specify a white list for OCR, e.g. specify "1234567890" to recognize digits only. Note that the white list currently seems to only work with OcrEngineMode.OEM_TESSERACT_ONLY.
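The language string format described above, [~]<lang>[+[~]<lang>]*, can be easier to follow once it is broken into parts. The sketch below is only an illustration of the notation, not how Tesseract itself parses the string; the LanguageSpec class and its Parse method are hypothetical names and are not part of Emgu CV or Tesseract.

```csharp
using System;
using System.Collections.Generic;

static class LanguageSpec
{
    // Splits a specifier such as "hin+~eng" into (code, excluded) pairs.
    // A leading '~' marks a language that should NOT be loaded.
    // Hypothetical helper for illustration only.
    public static List<KeyValuePair<string, bool>> Parse(string spec)
    {
        var result = new List<KeyValuePair<string, bool>>();
        foreach (var part in spec.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries))
        {
            bool excluded = part.StartsWith("~", StringComparison.Ordinal);
            string code = excluded ? part.Substring(1) : part;
            result.Add(new KeyValuePair<string, bool>(code, excluded));
        }
        return result;
    }
}
```

With this helper, LanguageSpec.Parse("hin+~eng") yields two entries: hin (to be loaded) and eng (marked as excluded), which matches the "force loading only hin" case in the documentation.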



Tesseract tesseract = new Tesseract();

// path is the language data path, lang is the language
tesseract.Init(path, lang, Tesseract.OcrEngineMode.OEM_TESSERACT_ONLY);

// Restrict recognition to digits
tesseract.SetVariable("tessedit_char_whitelist", "0123456789");

The code above restricts recognition to digits, which can greatly improve accuracy. To recognize only lowercase letters, change "0123456789" to "abcdefghijklmnopqrstuvwxyz".
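A long whitelist typed by hand can easily lose a character. As a minimal sketch, the whitelist string could be generated from character ranges instead. The Whitelist class and its Range method are hypothetical names introduced here for illustration; they are not part of Emgu CV.

```csharp
using System;
using System.Linq;

static class Whitelist
{
    // Returns a string containing every character from first to last, inclusive.
    // Hypothetical helper for illustration only.
    public static string Range(char first, char last)
    {
        if (last < first)
            throw new ArgumentException("last must not precede first");
        return new string(
            Enumerable.Range(first, last - first + 1)
                      .Select(c => (char)c)
                      .ToArray());
    }
}
```

Whitelist.Range('0', '9') gives "0123456789" and Whitelist.Range('a', 'z') gives the full lowercase alphabet. The results can be concatenated and passed as the value of tessedit_char_whitelist.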

Before setting: [image]

After setting: [image]

It seems some mistakes still remain, but never mind those details.

