Using Tesseract OCR with Python: an introduction to the engine and its installation

Last Update:2016-06-03 Source: Internet

Author: User

Tags tesseract ocr


1, Introduction to Tesseract

Tesseract is an open source OCR project supported by Google. Its project page is https://github.com/tesseract-ocr/tesseract, where the current source code can be downloaded.

There are two ways to use Tesseract OCR in practice: 1, as a dynamic library (libtesseract); 2, as an executable program (tesseract.exe).

Because I am still a Python novice, I have not yet worked out approach 1 and have had to use approach 2.

2, Downloading the Tesseract installation package

Release versions of Tesseract are listed at https://github.com/tesseract-ocr/tesseract/wiki/Downloads. A few things on that page are worth noting:

Currently, there is no official Windows installer for newer versions.

This means that the project does not provide an official Windows installation package for the latest version, only for the somewhat older version 3.02.02, available at https://sourceforge.net/projects/tesseract-ocr-alt/files/.

The installation packages for the newer versions 3.03 and 3.05 are maintained by third parties. The Downloads page lists several distributors:

3rd party Windows exe's/installer

Binaries compiled by @egorpugin (ref issue #209): https://www.dropbox.com/s/8t54mz39i58qslh/tesseract-3.05.00dev-win32-vc19.zip?dl=1
You have to install the VC2015 x86 redist from microsoft.com in order to run them. Leptonica is built with all libs except for libjp2k.
https://github.com/UB-Mannheim/tesseract/wiki
http://domasofan.spdns.eu/tesseract/

To summarize:

1, The official release of version 3.02: http://downloads.sourceforge.net/project/tesseract-ocr-alt/tesseract-ocr-setup-3.02.02.exe?r=https%3a%2f%2fsourceforge.net%2fprojects%2ftesseract-ocr-alt%2ffiles%2f&ts=1464880498&use_mirror=jaist

2, Version 3.05 from the University of Mannheim, Germany: http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.00dev.exe

3, Another build maintained by Simon Eigeldinger (@DomasoFan): http://3.onj.me/tesseract/. Commendably, this site includes a more detailed description.

If any of these downloads fails, first try the Xunlei (Thunder) download manager; failing that, you may need a proxy or VPN to get past the firewall.

I am using the official 3.02 release, that is, link 1.

3, Tesseract OCR usage instructions

After installation, the default directory is C:\Program Files (x86)\Tesseract-OCR. Add this directory to the operating system's PATH environment variable; otherwise later use will be inconvenient.

The command-line executable tesseract.exe is located in the installation directory C:\Program Files (x86)\Tesseract-OCR.
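Whether the PATH change took effect can be checked from Python before any OCR is attempted. The following is a minimal sketch, not part of the original article: the function name find_tesseract and its parameters are hypothetical, and the fallback directory is simply the default install location mentioned above.

```python
import os
import shutil

# Default install directory mentioned in the text; adjust it if you installed elsewhere.
DEFAULT_INSTALL_DIR = r"C:\Program Files (x86)\Tesseract-OCR"


def find_tesseract(name="tesseract", fallback_dir=DEFAULT_INSTALL_DIR):
    """Return the full path of the executable, or None if it cannot be found."""
    # shutil.which searches the directories listed in the PATH environment variable.
    found = shutil.which(name)
    if found:
        return found
    # Fall back to the default install directory in case PATH was not updated.
    candidate = os.path.join(fallback_dir, name + ".exe")
    return candidate if os.path.isfile(candidate) else None
```

Calling find_tesseract() returns the location of tesseract.exe when it can be found, and None otherwise, in which case the install directory still needs to be added to PATH.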

Running tesseract without arguments prints its syntax:

D:\python\lnypcg\test>tesseract
Usage: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before any configfile.

Single options:
  -v --version: version info
  --list-langs: list available languages for tesseract engine

In this listing, -psm 7 selects single-line text recognition and -l eng selects English as the recognition language.

For example, tesseract 1.png output -l eng -psm 7 treats the image file 1.png as a single line of text, recognizes it using the English language data, and writes the result to the file output.txt in the current directory.
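The command-line syntax above can be assembled in Python as an argument list, ready to be passed to the subprocess module. This is only a sketch under stated assumptions: the helper build_tesseract_command and its parameter names are hypothetical, and the option spellings (-l, -psm) are those printed by the 3.02 usage text.

```python
# Page segmentation modes accepted by -psm in the usage text (0 to 10).
VALID_PSM = range(0, 11)


def build_tesseract_command(image, outputbase, lang=None, psm=None, executable="tesseract"):
    """Return the argument list for one tesseract call, in the order the usage text gives."""
    command = [executable, image, outputbase]
    if lang is not None:
        command += ["-l", lang]
    if psm is not None:
        if psm not in VALID_PSM:
            raise ValueError("psm must be between 0 and 10, got %r" % (psm,))
        # Tesseract 3.x spells this option -psm, as shown in the usage text.
        command += ["-psm", str(psm)]
    return command
```

For the example in the text, build_tesseract_command("1.png", "output", lang="eng", psm=7) returns ['tesseract', '1.png', 'output', '-l', 'eng', '-psm', '7']. Building a list rather than one string avoids quoting problems when a path contains spaces.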

4, Tesseract OCR usage example

Suppose we have an image of a verification code, 1.png, that has already been converted to grayscale. From the command line, we call tesseract.exe with its default recognition settings and have it write the recognized text to the text file output.txt.

(Grayscale conversion can be done in Python with the PIL library; I will leave that for a later post.)

D:\python\lnypcg\test>dir
 Volume in drive D has no label.
 Volume Serial Number is 36D9-CDC7

 Directory of D:\python\lnypcg\test

2016-06-02  23:28    <DIR>          .
2016-06-02  23:28    <DIR>          ..
2016-06-02  22:02               462 1.png
               1 File(s)            462 bytes
               2 Dir(s)  25,733,357,568 bytes free

D:\python\lnypcg\test>tesseract 1.png output -l eng
Tesseract Open Source OCR Engine v3.02 with Leptonica

D:\python\lnypcg\test>type output.txt
7572


D:\python\lnypcg\test>
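The same session can be driven from Python by calling the executable and then reading the text file it writes. The sketch below assumes tesseract is on PATH and that the output file is UTF-8 text; the function names ocr_image and read_ocr_result are hypothetical and not from the original article.

```python
import subprocess


def read_ocr_result(outputbase):
    """Read the text tesseract wrote to <outputbase>.txt, without surrounding blank lines."""
    with open(outputbase + ".txt", encoding="utf-8") as handle:
        return handle.read().strip()


def ocr_image(image, outputbase="output", lang="eng", executable="tesseract"):
    """Run tesseract on one image, as in the session above, and return the recognized text."""
    # check=True raises CalledProcessError if tesseract exits with a non-zero status.
    subprocess.run([executable, image, outputbase, "-l", lang], check=True)
    return read_ocr_result(outputbase)
```

With tesseract installed, ocr_image("1.png") runs the same command as in the session and returns the contents of output.txt as a string.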

In summary, Tesseract is a very good OCR engine. The current problem is that up-to-date Chinese-language material on it is scarce, and much of what exists is outdated or inaccurate. I am sharing the results of my past few days of work in the hope that they help.


This article is an English version of an article originally published in Chinese on aliyun.com.
