Introduction to the Tesseract OCR Engine and Its Installation for Use with Python

Source: Internet
Author: User
Tags tesseract ocr

1, Introduction to Tesseract

Tesseract is an open source OCR project supported by Google. The project is hosted at https://github.com/tesseract-ocr/tesseract, where the current source code can be downloaded.

There are two ways to use Tesseract OCR in practice: (1) as a dynamic library, libtesseract; (2) as an executable program, tesseract.exe.

Because I am still a Python beginner, I have not yet worked out approach 1, so I had to go with approach 2.

2, Downloading the Tesseract installation package

Release versions of Tesseract are listed at https://github.com/tesseract-ocr/tesseract/wiki/Downloads. That page contains a few points worth noting:

Currently, there is no official Windows installer for newer versions.

This means the project does not officially provide a Windows installer for the latest version, only for the somewhat older version 3.02.02, which is available at https://sourceforge.net/projects/tesseract-ocr-alt/files/.

Installers for the newer versions 3.03 and 3.05 are built and maintained by third parties. The download page lists several distributors:

3rd party Windows exe's/installer
    • Binaries compiled by @egorpugin (ref issue #209): https://www.dropbox.com/s/8t54mz39i58qslh/tesseract-3.05.00dev-win32-vc19.zip?dl=1

      You have to install the VC2015 x86 redist from microsoft.com in order to run them. Leptonica is built with all libs except for libjp2k.

    • https://github.com/UB-Mannheim/tesseract/wiki

    • http://domasofan.spdns.eu/tesseract/

To summarize:

1, The official 3.02 release: http://downloads.sourceforge.net/project/tesseract-ocr-alt/tesseract-ocr-setup-3.02.02.exe?r=Https%3a%2f%2fsourceforge.net%2fprojects%2ftesseract-ocr-alt%2ffiles%2f&ts=1464880498&use_mirror=jaist

2, Version 3.05 from the University of Mannheim, Germany: http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.00dev.exe

3, Another build, maintained by Simon Eigeldinger (@DomasoFan): http://3.onj.me/tesseract/. Commendably, this site includes a fairly detailed description.

If any of the versions above fails to download, first try the Xunlei (Thunder) download manager; if that does not work either, you may need a proxy or VPN to get around network blocking.

I am using the official 3.02 release, that is, link 1.

3, Tesseract OCR usage instructions

After installation, the default directory is C:\Program Files (x86)\Tesseract-OCR. Add this directory to your operating system's PATH environment variable; otherwise you will have to type the full path to the program every time you use it.

The command-line executable tesseract.exe is located in the installation directory C:\Program Files (x86)\Tesseract-OCR.
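Whether the executable is reachable through PATH can also be checked from Python. The following is a minimal sketch, assuming the default installation directory quoted above. The helper name prepend_to_path is made up for illustration, and it only changes a copy of the environment mapping it is given, not the system-wide setting.

```python
import os
import shutil

# Default install directory quoted in the text; adjust if you installed elsewhere.
TESSERACT_DIR = r"C:\Program Files (x86)\Tesseract-OCR"


def prepend_to_path(directory, env):
    """Return a copy of env whose PATH starts with directory (hypothetical helper)."""
    new_env = dict(env)
    parts = [p for p in new_env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    new_env["PATH"] = os.pathsep.join(parts)
    return new_env


if __name__ == "__main__":
    # shutil.which returns the full path of the executable, or None if it is not found.
    found = shutil.which("tesseract")
    if found is None:
        env = prepend_to_path(TESSERACT_DIR, os.environ)
        found = shutil.which("tesseract", path=env["PATH"])
    print("tesseract found at:", found)
```

If the script prints None, the directory is most likely different from the default, or the installation did not complete.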

The tesseract syntax is as follows:

For example: tesseract 1.png output -l eng -psm 7 means: treat the image as a single line of text, use the English language data to recognize the image file 1.png, and write the recognition result to the file output.txt in the current directory.

    D:\python\lnypcg\test>tesseract
    Usage: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

    pagesegmode values are:
    0 = Orientation and script detection (OSD) only.
    1 = Automatic page segmentation with OSD.
    2 = Automatic page segmentation, but no OSD, or OCR
    3 = Fully automatic page segmentation, but no OSD. (Default)
    4 = Assume a single column of text of variable sizes.
    5 = Assume a single uniform block of vertically aligned text.
    6 = Assume a single uniform block of text.
    7 = Treat the image as a single text line.    # -psm 7 means single-line text recognition
    8 = Treat the image as a single word.
    9 = Treat the image as a single word in a circle.
    10 = Treat the image as a single character.
    -l lang and/or -psm pagesegmode must occur before anyconfigfile.    # -l eng means recognize using English

    Single options:
      -v --version: version info
      --list-langs: list available languages for tesseract engine
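The usage text can be turned into a small Python helper that assembles the argument list in the required order. This is only a sketch for the 3.02-style syntax shown here (single-dash -psm, values 0 to 10). The function name build_command is an assumption for illustration and is not part of Tesseract.

```python
def build_command(image, outputbase, lang=None, psm=None, configfiles=()):
    """Assemble a tesseract argument list following the 3.02 usage text."""
    cmd = ["tesseract", image, outputbase]
    # -l and -psm must come before any configfile, per the usage message.
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        if not 0 <= psm <= 10:
            raise ValueError("pagesegmode must be between 0 and 10")
        cmd += ["-psm", str(psm)]
    cmd += list(configfiles)
    return cmd


print(build_command("1.png", "output", lang="eng", psm=7))
# → ['tesseract', '1.png', 'output', '-l', 'eng', '-psm', '7']
```

Returning a list rather than a single string means the result can be handed to subprocess.run without shell quoting problems for file names that contain spaces.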

4, Tesseract OCR usage example

Here is a verification-code (CAPTCHA) image file that has already been converted to grayscale. Calling tesseract.exe from the command line with its default settings recognizes the image and writes the recognized text to the text file output.txt.

(As for how to do the grayscale conversion: in Python you can use the PIL library. I will leave that open for now and cover it in the next post.)

    D:\python\lnypcg\test>dir
     Volume in drive D has no label.
     Volume Serial Number is 36D9-CDC7

     Directory of D:\python\lnypcg\test

    2016-06-02  23:28    <DIR>          .
    2016-06-02  23:28    <DIR>          ..
    2016-06-02  22:02               462 1.PNG
                   1 File(s)            462 bytes
                   2 Dir(s)  25,733,357,568 bytes free

    D:\python\lnypcg\test>tesseract 1.PNG output -l eng
    Tesseract Open Source OCR Engine v3.02 with Leptonica

    D:\python\lnypcg\test>type output.txt
    7572

    D:\python\lnypcg\test>
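The same session can be driven from Python by calling the executable and then reading the text file it writes. The following is a minimal sketch, assuming tesseract is on PATH. The function name ocr_image and its runner parameter (which exists only so that the external call can be swapped out, for example in tests) are illustrative and not part of any library.

```python
import subprocess
from pathlib import Path


def ocr_image(image, outputbase="output", lang="eng", runner=subprocess.run):
    """Run tesseract on image and return the recognized text (illustrative helper)."""
    cmd = ["tesseract", str(image), str(outputbase), "-l", lang]
    # check=True makes subprocess.run raise CalledProcessError on a non-zero exit code.
    runner(cmd, check=True)
    # tesseract appends ".txt" to the output base name.
    return Path(str(outputbase) + ".txt").read_text(encoding="utf-8").strip()


if __name__ == "__main__":
    try:
        print(ocr_image("1.PNG"))
    except (OSError, subprocess.CalledProcessError) as exc:
        print("tesseract could not be run:", exc)
```

Run in the directory from the session above, this should print the same text that type output.txt shows there.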

In summary, Tesseract is a very good OCR engine. The current problem is that up-to-date Chinese-language material on it is scarce, and much of what exists is outdated or inaccurate. I am sharing what I worked out over the past few days in the hope that it helps.
