Official methods to improve the success rate of tesseract recognition

Last Update:2014-11-29 Source: Internet

Author: User

Tags image processing library


Improving the quality of the output

There are a variety of reasons you might not get good quality output from Tesseract. It's important to note that unless you're using a very unusual font or a new language, retraining Tesseract is unlikely to help.

Rescaling
Image processing
    Binarisation
    Noise
    Orientation/skew
    Borders
Segmentation method
Dictionaries, word lists, and patterns
Still having problems?

Rescaling

Tesseract works best with text at a resolution of at least 300 DPI, so it may be beneficial to resize images. For more information see the FAQ.
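As a rough illustration of the resizing arithmetic, the sketch below computes the pixel dimensions an image would need to reach a target resolution. The function name and the default target of 300 DPI are assumptions made for this example (300 DPI is the figure commonly quoted for Tesseract); the actual resampling would be done with whatever image tool you already use.

```python
def rescale_size(width, height, current_dpi, target_dpi=300):
    """Return the (width, height) needed to reach target_dpi.

    Hypothetical helper: only upscales, and leaves images that already
    meet the target resolution unchanged.
    """
    if current_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    scale = target_dpi / current_dpi
    if scale <= 1:
        return width, height
    return round(width * scale), round(height * scale)
```

For example, an 800 x 600 scan made at 150 DPI would need to be resized to 1600 x 1200 under these assumptions.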

Image processing [Note: this means preprocessing. OpenCV is not mentioned, so apparently it is not that well known.]

Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn't good enough, which can result in a significant reduction in accuracy.

You can see how Tesseract has processed the image by setting the configuration variable tessedit_write_images to true when running Tesseract. If the resulting tessinput.tif file looks problematic, try some of these image processing operations before passing the image to Tesseract, whether with a dedicated postprocessing tool like Scan Tailor or Unpaper, a graphics editor like ImageJ or Gimp, a batch image editor like ImageMagick, or your own code using an image processing library like Leptonica.
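One way to set that variable is the -c option of the tesseract command line. The sketch below only builds the argument list; the function name and file names are placeholders invented for this example, and it assumes a tesseract build that accepts -c after the output base name.

```python
def build_debug_command(image_path, output_base):
    """Build a tesseract command that also writes the preprocessed image.

    Hypothetical helper: sets tessedit_write_images so that Tesseract
    saves tessinput.tif alongside the normal OCR output.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        "tessedit_write_images=true",
    ]


# To actually run it (requires tesseract on the PATH):
#   import subprocess
#   subprocess.run(build_debug_command("page.png", "page"), check=True)
```

After a run, compare tessinput.tif with the original scan to see whether the internal preprocessing damaged the text.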

Binarisation [Note: if this kind of thing can be recognised, then business cards and the like should be trivial.]

This is converting an image to black and white. Tesseract does this internally, but it can make mistakes, particularly if the page background is of uneven darkness.
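The text does not name a binarisation method. As one possible approach for uneven backgrounds, the sketch below applies a simple local-mean (adaptive) threshold, where each pixel is compared with the average of its neighbourhood rather than with one global value. The image is assumed to be a list of rows of grayscale values (0 to 255), and the function name, radius and offset are illustrative choices, not Tesseract or Leptonica settings.

```python
def adaptive_threshold(gray, radius=1, offset=10):
    """Binarise a grayscale image using a local mean threshold.

    gray is a list of rows of ints in 0..255. A pixel becomes black (0)
    if it is darker than its neighbourhood mean by more than offset,
    otherwise white (255).
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clamp the window to the image bounds.
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(window) / len(window)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        result.append(row)
    return result
```

A real pipeline would use a larger radius and an optimised library routine; the point here is only that the threshold follows the local background.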

Noise

Noise is random variation of brightness or colour in an image, which can make the text of the image more difficult to read. Certain types of noise cannot be removed by Tesseract in the binarisation step, which can cause accuracy rates to drop.
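A median filter is one common way to remove isolated specks before OCR; the text does not prescribe it, so treat the following as a sketch of the idea only. It assumes the image is a list of rows of grayscale values, and the function name and window radius are made up for this example.

```python
from statistics import median_low


def median_filter(gray, radius=1):
    """Replace each pixel with the median of its neighbourhood.

    gray is a list of rows of ints in 0..255. median_low is used so the
    result is always a value that occurs in the window.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clamp the window to the image bounds.
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(median_low(window))
        result.append(row)
    return result
```

Note that a median filter can also erase very thin strokes, so the radius should stay small relative to the text size.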

Orientation/skew

This is when a page has been scanned when not straight. The quality of Tesseract's line segmentation reduces significantly if a page is too skewed, which severely impacts the quality of the OCR. To address this, rotate the page image so that the text lines are horizontal.
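To decide how far to rotate, the skew angle has to be estimated first. The sketch below assumes you already have a few (x, y) points lying along one text baseline (how they are obtained is outside this example) and fits a straight line through them by least squares. The function name is hypothetical, and the sign of the correction depends on the coordinate convention of the tool that performs the rotation.

```python
import math


def estimate_skew_degrees(baseline_points):
    """Estimate the slope angle of a text baseline in degrees.

    baseline_points is a sequence of (x, y) pairs along one text line.
    Returns the angle of the least-squares line through them; rotating
    the page by the opposite angle should make the line horizontal.
    """
    n = len(baseline_points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in baseline_points) / n
    mean_y = sum(y for _, y in baseline_points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in baseline_points)
    if var_x == 0:
        raise ValueError("points must not share the same x coordinate")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in baseline_points)
    return math.degrees(math.atan(cov_xy / var_x))
```

Dedicated tools such as Scan Tailor or Unpaper estimate skew automatically, so a hand-rolled estimate like this is mainly useful inside a custom pipeline.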

Borders

Scanned pages often have dark borders around them. These can be erroneously picked up as extra characters, especially if they vary in shape and gradation.
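A simple way to drop such borders is to trim rows and columns from each edge for as long as they are almost entirely dark. The sketch below illustrates this on an image stored as a list of rows of grayscale values; the function name and both thresholds are assumptions chosen for the example and would need tuning on real scans.

```python
def crop_dark_borders(gray, dark_below=64, min_dark_fraction=0.9):
    """Trim edge rows and columns that are almost entirely dark.

    gray is a list of rows of ints in 0..255. A row or column counts as
    border if at least min_dark_fraction of its pixels are below
    dark_below.
    """
    def is_dark(line):
        dark = sum(1 for value in line if value < dark_below)
        return dark >= min_dark_fraction * len(line)

    top, bottom = 0, len(gray)
    while top < bottom and is_dark(gray[top]):
        top += 1
    while bottom > top and is_dark(gray[bottom - 1]):
        bottom -= 1
    rows = gray[top:bottom]
    if not rows:
        return []

    left, right = 0, len(rows[0])
    while left < right and is_dark([row[left] for row in rows]):
        left += 1
    while right > left and is_dark([row[right - 1] for row in rows]):
        right -= 1
    return [row[left:right] for row in rows]
```

Borders that vary in shape and gradation will not be caught by a fixed threshold like this, in which case a dedicated tool is the better option.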

Segmentation method

By default Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region, try a different segmentation mode, using the -psm argument. Note that adding a border to the text could also help; see issue 398. [Note: this is a new way to identify a region of interest (ROI).]
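Both suggestions can be sketched briefly. The first function pads a cropped region with a white margin, and the second builds a command line that selects a segmentation mode with -psm. Mode 7 treats the image as a single text line; the function names, the margin size and the file names are assumptions for this example.

```python
def add_white_border(gray, margin=10, fill=255):
    """Surround a grayscale image (list of rows) with a plain margin."""
    if not gray:
        return []
    width = len(gray[0])
    blank = [fill] * (width + 2 * margin)
    padded = [list(blank) for _ in range(margin)]
    for row in gray:
        padded.append([fill] * margin + list(row) + [fill] * margin)
    padded.extend(list(blank) for _ in range(margin))
    return padded


def single_line_command(image_path, output_base, psm=7):
    """Build a tesseract command with an explicit page segmentation mode.

    The default of 7 means "treat the image as a single text line".
    """
    return ["tesseract", image_path, output_base, "-psm", str(psm)]
```

Running `tesseract` without arguments lists the segmentation modes your installed version supports, which is worth checking before settling on one.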

Dictionaries, word lists, and patterns

By default Tesseract is optimised to recognise sentences of words. If you're trying to recognise something else, like receipts, price lists, or codes, there are a few things you can do to improve the accuracy of your results, as well as double-checking that the appropriate segmentation method is selected.

Disabling the dictionaries Tesseract uses should increase recognition if most of your text isn't dictionary words. They can be disabled by setting both of the configuration variables load_system_dawg and load_freq_dawg to false.
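One way to set these two variables is a small config file, which Tesseract reads as one "name value" pair per line. The sketch below only generates the file contents; the function name is hypothetical, and where the file must live and how it is passed on the command line depend on your installation, so check the manual for your version.

```python
def no_dictionary_config():
    """Return config-file text that turns off both word dictionaries.

    Each line is "variable value", the layout used by Tesseract config
    files.
    """
    settings = {
        "load_system_dawg": "false",
        "load_freq_dawg": "false",
    }
    return "".join(f"{name} {value}\n" for name, value in settings.items())
```

Writing the returned text to a file and naming that file as a config on the tesseract command line applies the settings when the engine initialises, which matters because dictionary loading happens at start-up.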

It is also possible to add words to the word list Tesseract uses to help recognition, or to add common character patterns, which can further help to improve accuracy if you have a good idea of the sort of input you expect. This is explained in more detail in the Tesseract manual.

If you know you will only encounter a subset of the characters available in the language, such as only digits, you can use the tessedit_char_whitelist configuration variable. See the FAQ for an example.
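As a small sketch of the digits-only case, the helper below builds the command-line arguments that set tessedit_char_whitelist through the -c option. The function name is an assumption for this example, and whether the whitelist is honoured can depend on the Tesseract version and recognition engine in use.

```python
def whitelist_args(allowed_chars):
    """Return tesseract arguments restricting output to allowed_chars."""
    if not allowed_chars:
        raise ValueError("allowed_chars must not be empty")
    return ["-c", f"tessedit_char_whitelist={allowed_chars}"]


# Example: append to a base command for a digits-only field.
#   ["tesseract", "amount.png", "amount"] + whitelist_args("0123456789")
```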

Still having problems?

If you've tried the above and are still getting low accuracy results, ask on the forum for help, ideally posting an example image.


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
