Introduction to the Tesseract OCR engine and its installation for use with Python
1. Introduction to Tesseract
Tesseract is an open source OCR project supported by Google. The project is hosted at https://github.com/tesseract-ocr/tesseract, where the latest source code can be downloaded.
Tesseract OCR can be used in two ways:
1 - as a dynamic library, libtesseract
2 - as an executable program, tesseract.exe
Because I am still a Python beginner, I cannot use method 1 yet, so I use method 2.
2. Download the Tesseract installation package
The release downloads for Tesseract are listed at https://github.com/tesseract-ocr/tesseract/wiki/downloads. That page contains the following statement:
Currently, there is no official Windows installer for newer versions.
This means that no official Windows installation package is provided for the latest versions. The only official installer is for the older version 3.02.02, available at https://sourceforge.net/projects/tesseract-ocr-alt/files/.
The newer versions 3.03 and 3.05 are available only as installation packages maintained by third parties. The page lists several publishers:
3rd party Windows exe's/installer
- Binaries compiled by @egorpugin (ref issue #209): https://www.dropbox.com/s/8t54mz39i58qslh/tesseract-3.05.00dev-win32-vc19.zip?dl=1
  You have to install VC2015 x86 redist from microsoft.com in order to run them. Leptonica is built with all libs except for libjp2k.
- https://github.com/UB-Mannheim/tesseract/wiki
- http://domasofan.spdns.eu/tesseract/
Summary:
1. The officially released version 3.02: http://downloads.sourceforge.net/project/tesseract-ocr-alt/tesseract-ocr-setup-3.02.02.exe?r=https%3A%2F%2Fsourceforge.net%2Fprojects%2Ftesseract-ocr-alt%2Ffiles%2F&ts=1464880498&use_mirror=jaist
2. Version 3.05, published by the University of Mannheim, Germany: http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.00dev.exe
3. Another version, maintained by Simon Eigelinger (@DomasoFan): http://3.onj.me/tesseract/. Its advantage is that the site includes a more detailed description.
If one of the versions above fails to download, first try the Thunder (Xunlei) download manager; if that does not work, you may need a proxy or VPN.
I am using the officially released version 3.02, that is, link 1.
3. Instructions for using Tesseract OCR
The installer uses the default directory C:\Program Files (x86)\Tesseract-OCR. Add this directory to the PATH environment variable of your operating system; otherwise you will have to type the full path every time you run the program.
The installation directory C:\Program Files (x86)\Tesseract-OCR contains the command-line program tesseract.exe.
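To confirm from Python that the executable can be found, something like the following may help. This is only a minimal sketch: the helper name find_tesseract is hypothetical (it is not part of Tesseract), and the fallback directory is assumed to be the default installation directory mentioned above.

```python
import os
import shutil

# Default install directory mentioned in the text (an assumption; adjust it
# if Tesseract was installed somewhere else).
DEFAULT_DIR = r"C:\Program Files (x86)\Tesseract-OCR"


def find_tesseract(name="tesseract", fallback_dir=DEFAULT_DIR):
    """Return the full path of the executable, or None if it is not found.

    find_tesseract is a hypothetical helper written for this example.
    """
    # shutil.which searches the directories listed in the PATH variable.
    found = shutil.which(name)
    if found:
        return found
    # Not on PATH: look in the fallback installation directory instead.
    candidate = os.path.join(fallback_dir, name + ".exe")
    return candidate if os.path.isfile(candidate) else None


if __name__ == "__main__":
    path = find_tesseract()
    print(path or "tesseract not found; check the PATH setting")
```

If the function returns None, the PATH setting is the first thing to check.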
For example, the command tesseract 1.png output -l eng -psm 7 uses the English language data to recognize the image file 1.png as a single line of text, and writes the result to the file output.txt in the current directory.
Running tesseract with no arguments prints the full syntax:
D:\python\lnypcg\test> tesseract
Usage: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.    # -psm 7 selects single-line recognition
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before any configfile.    # -l eng selects English recognition

Single options:
  -v --version: version info
  --list-langs: list available languages for tesseract engine
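The usage line above maps directly onto an argument list that Python can pass to the subprocess module. The sketch below only builds that list; build_command is a hypothetical helper name, and it follows the 3.x syntax shown in the usage text.

```python
def build_command(image, outputbase, lang=None, psm=None):
    """Build the argument list for the tesseract 3.x command line.

    build_command is a hypothetical helper written for this example.
    """
    cmd = ["tesseract", image, outputbase]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        # The usage text lists pagesegmode values 0 through 10.
        if not 0 <= psm <= 10:
            raise ValueError("pagesegmode must be between 0 and 10")
        # Tesseract 3.x spells the option -psm; releases from 4.x on
        # use --psm instead.
        cmd += ["-psm", str(psm)]
    return cmd


if __name__ == "__main__":
    # Prints ['tesseract', '1.png', 'output', '-l', 'eng', '-psm', '7']
    print(build_command("1.png", "output", lang="eng", psm=7))
```

Passing a list rather than a single string avoids quoting problems when the image path contains spaces. The options are placed after outputbase and before any configfile, as the usage text requires.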
4. Tesseract OCR example
The following example recognizes the image 1.png and writes the result to a plain text file.
(Grayscale preprocessing of the image can be done with the PIL library in Python; I will leave that for a future post.)
D:\python\lnypcg\test> dir
 Volume in drive D has no label.
 Volume Serial Number is 36D9-CDC7

 Directory of D:\python\lnypcg\test

<DIR>          .
<DIR>          ..
           462 1.png
       1 File(s)            462 bytes
       2 Dir(s)  25,733,357,568 bytes free

D:\python\lnypcg\test> tesseract 1.png output -l eng
Tesseract Open Source OCR Engine v3.02 with Leptonica

D:\python\lnypcg\test> type output.txt
7572

D:\python\lnypcg\test>
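The same session can be driven from Python by running the executable and then reading the text file it writes. This is a minimal sketch under assumptions: ocr_to_text is a hypothetical helper name, tesseract is assumed to be on the PATH, and the run parameter is added only so the function can be tried without Tesseract installed.

```python
import subprocess


def ocr_to_text(image, outputbase="output", lang="eng", run=subprocess.run):
    """Run tesseract on an image and return the recognized text.

    ocr_to_text is a hypothetical helper written for this example. The run
    parameter defaults to subprocess.run and can be replaced for testing.
    """
    # Same command as in the session above: tesseract 1.png output -l eng
    run(["tesseract", image, outputbase, "-l", lang], check=True)
    # tesseract appends .txt to the output base name.
    with open(outputbase + ".txt", encoding="utf-8") as f:
        # Strip the trailing blank lines that tesseract leaves in the file.
        return f.read().strip()
```

With Tesseract installed, ocr_to_text("1.png") would return whatever type output.txt shows in the session above. Because check=True is passed, a non-zero exit code from tesseract raises subprocess.CalledProcessError.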
Conclusion: Tesseract is a good OCR engine. The current problem is that up-to-date Chinese-language material about it is scarce, while outdated and inaccurate information is plentiful. I hope that sharing what I worked out over the past few days helps.