Introduction to the Tesseract OCR engine and its installation for use with Python

Source: Internet
Author: User
Tags: tesseract, ocr

1. Introduction to Tesseract

Tesseract is an open source OCR project supported by Google. The project is hosted at https://github.com/tesseract-ocr/tesseract, where the latest source code can be downloaded.

Tesseract OCR can be used in two ways: (1) as a dynamic library, libtesseract; (2) as a command-line program, tesseract.exe.

Because I am still new to Python and do not yet know how to use method 1, I use method 2 here.

 

2. Download the Tesseract installation package

Release versions of Tesseract are listed at https://github.com/tesseract-ocr/tesseract/wiki/downloads. That page contains the following statement:

Currently, there is no official Windows installer for newer versions.

This means that no official Windows installer is provided for the latest versions. The most recent official installer is the older version 3.02.02, available at https://sourceforge.net/projects/tesseract-ocr-alt/files/.

Installers for the newer versions 3.03 and 3.05 are built and maintained by third parties, and there are several publishers:

3rd party Windows exe's/installer
  • Binaries compiled by @egorpugin (ref issue #209) https://www.dropbox.com/s/8t54mz39i58qslh/tesseract-3.05.00dev-win32-vc19.zip?dl=1

    You have to install VC2015 x86 redist from microsoft.com in order to run them. Leptonica is built with all libs except for libjp2k.

  • https://github.com/UB-Mannheim/tesseract/wiki

  • http://domasofan.spdns.eu/tesseract/

 

Summary:

1. The officially released version 3.02: http://downloads.sourceforge.net/project/tesseract-ocr-alt/tesseract-ocr-setup-3.02.02.exe?r=https%3A%2F%2Fsourceforge.net%2Fprojects%2Ftesseract-ocr-alt%2Ffiles%2F&ts=1464880498&use_mirror=jaist

2. Version 3.05, published by the University of Mannheim, Germany: http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.00dev.exe

3. Another version, maintained by Simon Eigeldinger (@DomasoFan): http://3.onj.me/tesseract/. Its advantage is that the site provides a more detailed description.

If a download fails, first try a download manager such as Thunder (Xunlei); if that does not work either, you may need a VPN.

I am using the officially released version 3.02, that is, link 1.

 

3. Using Tesseract OCR

After installation, Tesseract is located in the default directory C:\Program Files (x86)\Tesseract-OCR. Add this directory to your operating system's PATH environment variable; otherwise it will be inconvenient to use later.

In the installation directory C:\Program Files (x86)\Tesseract-OCR, you can find the command-line program tesseract.exe.
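Because method 2 depends on the executable being reachable, it can help to check this from Python before going further. The following is a minimal sketch, not part of the original workflow: the function name `find_tesseract` is hypothetical, and the fallback directory is simply the default install path mentioned above.

```python
import shutil

# Default install directory quoted in the text; adjust if yours differs.
DEFAULT_DIR = r"C:\Program Files (x86)\Tesseract-OCR"


def find_tesseract(extra_dir=DEFAULT_DIR, name="tesseract"):
    """Return the full path of the tesseract executable, or None if not found."""
    # Look on the regular PATH first.
    found = shutil.which(name)
    if found is None and extra_dir:
        # Fall back to searching only the given directory.
        found = shutil.which(name, path=extra_dir)
    return found


if __name__ == "__main__":
    path = find_tesseract()
    print(path or "tesseract not found; add its directory to PATH")
```

If the function returns None, the directory has most likely not been added to PATH yet.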

Running tesseract with no arguments prints its syntax, reproduced after this paragraph. For example, the command tesseract 1.png output -l eng -psm 7 uses the English language data to recognize the image file 1.png, treats the image as a single line of text, and writes the recognition result to output.txt in the current directory.

    D:\python\lnypcg\test> tesseract
    Usage: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

    pagesegmode values are:
    0 = Orientation and script detection (OSD) only.
    1 = Automatic page segmentation with OSD.
    2 = Automatic page segmentation, but no OSD, or OCR
    3 = Fully automatic page segmentation, but no OSD. (Default)
    4 = Assume a single column of text of variable sizes.
    5 = Assume a single uniform block of vertically aligned text.
    6 = Assume a single uniform block of text.
    7 = Treat the image as a single text line.    # -psm 7 means recognizing a single line of text
    8 = Treat the image as a single word.
    9 = Treat the image as a single word in a circle.
    10 = Treat the image as a single character.
    -l lang and/or -psm pagesegmode must occur before anyconfigfile.    # -l eng stands for English recognition

    Single options:
      -v --version: version info
      --list-langs: list available languages for tesseract engine
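The usage text above maps directly onto an argument list that Python can pass to the executable. Here is a rough sketch under a few assumptions: the helper names `build_command` and `run_tesseract` are made up for illustration, and the option is spelled `-psm` as in the 3.0x usage text (newer Tesseract releases spell it `--psm`).

```python
import subprocess


def build_command(image, outputbase, lang="eng", psm=None, exe="tesseract"):
    """Build: tesseract imagename outputbase [-l lang] [-psm pagesegmode]."""
    cmd = [exe, str(image), str(outputbase)]
    if lang:
        cmd += ["-l", lang]
    if psm is not None:
        # The usage text lists page segmentation modes 0 through 10.
        if not 0 <= int(psm) <= 10:
            raise ValueError("psm must be between 0 and 10")
        # Spelled "-psm" in Tesseract 3.0x, as in the usage text above.
        cmd += ["-psm", str(int(psm))]
    return cmd


def run_tesseract(image, outputbase, **options):
    """Run tesseract; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(image, outputbase, **options), check=True)


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(build_command("1.png", "output", lang="eng", psm=7))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, such as the default install directory.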

 

 

4. Tesseract OCR example

 

This example recognizes the image 1.png and writes the result to a plain text file.

(Grayscale preprocessing can be done with the PIL library in Python; I will leave that topic for a future post.)

    D:\python\lnypcg\test> dir
     The volume in drive D has no label.
     The volume serial number is 36D9-CDC7

     Directory of D:\python\lnypcg\test

        <DIR>          .
        <DIR>          ..
                   462 1.png
           1 file(s)            462 bytes
           2 dir(s)  25,733,357,568 bytes free

    D:\python\lnypcg\test> tesseract 1.png output -l eng
    Tesseract Open Source OCR Engine v3.02 with Leptonica

    D:\python\lnypcg\test> type output.txt
    7572


    D:\python\lnypcg\test>
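The same session can be driven from Python, which is what method 2 amounts to in practice. The sketch below is one possible way to do it, assuming tesseract is on PATH; the names `read_result` and `ocr_image` are hypothetical helpers, not part of Tesseract.

```python
import subprocess
from pathlib import Path


def read_result(outputbase):
    """Read <outputbase>.txt and strip the surrounding blank lines."""
    return Path(f"{outputbase}.txt").read_text(encoding="utf-8").strip()


def ocr_image(image, outputbase="output", lang="eng", exe="tesseract"):
    """Run: tesseract <image> <outputbase> -l <lang>, then return the text."""
    subprocess.run([exe, str(image), str(outputbase), "-l", lang], check=True)
    return read_result(outputbase)


# Example call, mirroring the session above (requires tesseract and 1.png):
#     text = ocr_image("1.png")
```

Stripping the text matters because, as the session shows, output.txt ends with several blank lines after the recognized characters.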

 

Conclusion: Tesseract is a good OCR engine. The current problem is that up-to-date Chinese-language material on it is scarce, and much of what exists is outdated or inaccurate. I hope that sharing what I have learned over the past few days helps.

 
