Official methods to improve the success rate of tesseract recognition

Source: Internet
Author: User
Tags image processing library

Improving the quality of the output

There are a variety of reasons you might not get good quality output from Tesseract. It's important to note that unless you're using a very unusual font or a new language, retraining Tesseract is unlikely to help.

    • Improving the quality of the output
      • Rescaling
      • Image processing
        • Binarisation
        • Noise
        • Orientation/skew
        • Borders
      • Segmentation method
      • Dictionaries, word lists, and patterns
      • Still having problems?

Rescaling

Tesseract works best with text at a resolution of at least 300 dpi, so it may be beneficial to resize images. For more information see the FAQ.
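As a rough illustration of the resizing arithmetic, the sketch below computes the pixel dimensions needed to bring a scan up to a target resolution. The function name and the 300 dpi default are assumptions made for this example; the actual resampling would be done by whatever image tool you use.

```python
def scale_for_target_dpi(width_px, height_px, current_dpi, target_dpi=300):
    """Return (new_width, new_height) that brings a scan up to target_dpi.

    Hypothetical helper: it only does the arithmetic and never shrinks
    an image that already meets the target.
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    factor = target_dpi / current_dpi
    if factor <= 1.0:
        return width_px, height_px
    return round(width_px * factor), round(height_px * factor)
```

For example, a 600 x 400 pixel scan made at 150 dpi would need to be enlarged to 1200 x 800 pixels.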

Image processing [that is, preprocessing; OpenCV is not mentioned, so it seems OpenCV is not that well known]

Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn't good enough, which can result in a significant reduction in accuracy.

You can see how Tesseract has processed the image by setting the configuration variable tessedit_write_images to true when running Tesseract. If the resulting tessinput.tif file looks problematic, try some of these image processing operations before passing the image to Tesseract, whether with a dedicated post-processing tool like Scan Tailor or Unpaper, a graphics editor like ImageJ or Gimp, a batch image editor like ImageMagick, or in code using an image processing library like Leptonica.
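One way to set this variable is the command line's -c VAR=VALUE option. The sketch below only builds the argument list; the function name and the example paths are hypothetical, and nothing is executed.

```python
def debug_image_command(image_path, output_base):
    """Build a tesseract command that also writes the preprocessed image.

    Assumes the tesseract executable is on PATH and that the installed
    version accepts `-c VAR=VALUE` after the output base name.
    """
    return [
        "tesseract", image_path, output_base,
        "-c", "tessedit_write_images=true",
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where Tesseract is installed.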

Binarisation "If this kind of image can be recognised, then business cards and the like are trivially easy."

This is converting an image to black and white. Tesseract does this internally, but it can make mistakes, particularly if the page background is of uneven darkness.
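To make the idea concrete, here is a minimal pure-Python sketch of global thresholding with Otsu's method, one common way to pick the black/white cut-off. It is an illustration chosen for this example, not a description of what any particular tool does, and it assumes the image is a list of rows of 0..255 grey values. A single global threshold is exactly what struggles on unevenly lit pages, where a locally adaptive threshold is usually a better fit.

```python
def otsu_threshold(gray):
    """Return the grey level t that best separates dark from light pixels."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of their grey values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarise(gray):
    """Map pixels above the Otsu threshold to white (255), the rest to black (0)."""
    t = otsu_threshold(gray)
    return [[255 if value > t else 0 for value in row] for row in gray]
```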

Noise

Noise is a random variation of brightness or colour in an image, which can make the text in the image more difficult to read. Certain types of noise cannot be removed by Tesseract in the binarisation step, which can cause accuracy rates to drop.
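A median filter is one standard way to suppress isolated speckles ("salt and pepper" noise) before OCR. The sketch below is a minimal pure-Python version for illustration, assuming a greyscale image stored as a list of rows; in practice an image library would do this far faster.

```python
from statistics import median_low


def median_filter_3x3(gray):
    """Replace each interior pixel with the median of its 3x3 neighbourhood.

    Edge pixels are copied unchanged; the input image is not modified.
    """
    if not gray:
        return []
    height, width = len(gray), len(gray[0])
    out = [row[:] for row in gray]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [
                gray[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = median_low(window)
    return out
```

A larger window removes bigger specks but also starts to erode thin strokes, so small windows are the safer starting point for text.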

Orientation/skew

This is when a page has been scanned while not straight. The quality of Tesseract's line segmentation reduces significantly if a page is too skewed, which severely impacts the quality of the OCR. To address this, rotate the page image so that the text lines are horizontal.
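Before rotating, the skew angle has to be estimated. One simple approach, sketched below under stated assumptions, is a projection-profile search: for each candidate angle, ink pixels are projected onto rows as if the text ran at that angle, and the angle that concentrates the ink into the fewest rows wins. The function name, the angle range, and the step are choices made for this example; the input is assumed to be a binary image as a list of rows where non-zero means ink.

```python
import math
from collections import Counter


def estimate_skew(binary, max_angle=5.0, step=0.5):
    """Estimate the text-line skew in degrees.

    A positive result means lines descend to the right in image
    coordinates (y grows downward). Rotate by the opposite angle to fix.
    """
    ink = [(x, y) for y, row in enumerate(binary) for x, v in enumerate(row) if v]
    if not ink:
        return 0.0
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        slope = math.tan(math.radians(angle))
        # Row each ink pixel would land on if lines had this slope.
        profile = Counter(round(y - x * slope) for x, y in ink)
        # Sharper (more concentrated) profiles give a larger sum of squares.
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

The search is brute force and only as precise as the step size, which is usually adequate for the few degrees of skew a flatbed scan introduces.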

Borders

Scanned pages often have dark borders around them. These can be erroneously picked up as extra characters, especially if they vary in shape and gradation.
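A crude way to remove such borders is to trim rows and columns from each edge for as long as they are almost entirely dark. The sketch below illustrates this on a greyscale image stored as a list of rows; the function name and both thresholds are assumptions for the example and would need tuning on real scans.

```python
def crop_dark_borders(gray, dark_below=64, dark_fraction=0.9):
    """Trim edge rows and columns in which most pixels are darker than dark_below."""

    def mostly_dark(values):
        values = list(values)
        dark = sum(v < dark_below for v in values)
        return dark >= dark_fraction * len(values)

    top, bottom = 0, len(gray)
    while top < bottom and mostly_dark(gray[top]):
        top += 1
    while bottom > top and mostly_dark(gray[bottom - 1]):
        bottom -= 1
    rows = gray[top:bottom]
    if not rows:
        return []

    left, right = 0, len(rows[0])
    while left < right and mostly_dark(r[left] for r in rows):
        left += 1
    while right > left and mostly_dark(r[right - 1] for r in rows):
        right -= 1
    return [r[left:right] for r in rows]
```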

Segmentation method

By default Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region, try a different segmentation mode, using the -psm argument. Note that adding a border to the text could also help, see issue 398. "Here's a new way to identify the ROI"
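The two suggestions can be sketched as follows: pad a tightly cropped region with white, then ask for a single-line segmentation mode. The helper names are hypothetical. The mode number 7 ("treat the image as a single text line") and the --psm spelling used by Tesseract 4 and later are the assumptions here; older 3.x releases use -psm as in the text, so the flag is left as a parameter.

```python
def add_white_border(gray, margin=10, white=255):
    """Return a copy of a greyscale image padded with a white margin."""
    width = (len(gray[0]) if gray else 0) + 2 * margin
    blank = [white] * width
    padded = [blank[:] for _ in range(margin)]
    for row in gray:
        padded.append([white] * margin + list(row) + [white] * margin)
    padded.extend(blank[:] for _ in range(margin))
    return padded


def single_line_command(image_path, output_base, psm=7, psm_flag="--psm"):
    """Build (but do not run) a tesseract command with a page segmentation mode."""
    return ["tesseract", image_path, output_base, psm_flag, str(psm)]
```

Running tesseract --help-psm (or --help-extra on recent versions) lists the modes your installed version supports.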

Dictionaries, word lists, and patterns

By default Tesseract is optimised to recognise sentences of words. If you're trying to recognise something else, like receipts, price lists, or codes, there are a few things you can do to improve the accuracy of your results, as well as double-checking that the appropriate segmentation method is selected.

Disabling the dictionaries Tesseract uses should increase recognition if most of your text isn't dictionary words. They can be disabled by setting both of the configuration variables load_system_dawg and load_freq_dawg to false.

It is also possible to add words to the word list Tesseract uses to help recognition, or to add common character patterns, which can further help to improve accuracy if you have a good idea of the sort of input you expect. This is explained in more detail in the Tesseract manual. [Manual, found here]

If you know you will only encounter a subset of the characters available in the language, such as only digits, you can use the tessedit_char_whitelist configuration variable. See the FAQ for an example.
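As a sketch, both variables can be passed with the command line's -c VAR=VALUE option. The helper names below are hypothetical, and the assumption is that the installed Tesseract accepts these two variables through -c; the code only builds the argument list.

```python
def config_args(variables):
    """Turn a {name: value} mapping into repeated `-c name=value` arguments."""
    args = []
    for name, value in variables.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        args += ["-c", f"{name}={value}"]
    return args


def no_dictionary_command(image_path, output_base):
    """Build a tesseract command with both dictionary DAWGs switched off."""
    return ["tesseract", image_path, output_base] + config_args(
        {"load_system_dawg": False, "load_freq_dawg": False}
    )
```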
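For the digits-only case, a minimal sketch of the command might look like this. The function name is hypothetical, and how strictly the whitelist is honoured can vary between Tesseract versions and recognition engines, so treat it as a starting point to verify against your installation.

```python
import string


def digits_only_command(image_path, output_base, allowed=string.digits):
    """Build a tesseract command restricted to the characters in `allowed`."""
    return [
        "tesseract", image_path, output_base,
        "-c", f"tessedit_char_whitelist={allowed}",
    ]
```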

Still having problems?

If you've tried the above and are still getting low accuracy results, ask on the forum for help, ideally posting an example image.
