My company has a simple requirement: recognize the Chinese text in a picture. I am implementing it in Python. Other languages would work just as well, but Python is what I am mainly learning, hence Python. As a beginner I spent a day searching the internet and finally found a few leads, which I am sharing here. Comments from more experienced readers are very welcome.
I am still learning OCR algorithms, but writing one myself looks difficult, and there are not many write-ups on recognizing Chinese text in images with Python, so I intend to record my own learning process. Today I found an open source project that a beginner can use: the OCR engine Tesseract. For a beginner like me it may be the best choice, since I can try the project out and also read its source code. Why not!
There are many open source OCR projects. The following link lists the better-known ones:
https://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software
Tesseract stands out in that comparison, so the rest of this post studies it in earnest: first an introduction, then installation, testing, and its shortcomings.
The Tesseract OCR engine is an open source project whose development was sponsored by Google. Its project home page is https://github.com/tesseract-ocr. It supports Chinese OCR and provides a command-line tool. The corresponding Python package is pytesseract. With these tools we can recognize the text in an image.
1 Installing and testing Tesseract
1.1 Development environment
My development environment is as follows:
- Windows 7
- Python 3.6
- Pycharm
1.2 Download the installation package
First download the Windows installer for Tesseract. (Sites abroad such as Google are not reachable from here, so someone else downloaded it through a proxy; here is the Baidu Netdisk link.)
http://pan.baidu.com/s/1i56Uxlr
Alternatively, following https://github.com/tesseract-ocr/tesseract/wiki, you can find an unofficial installer. There appears to be only a 64-bit build: http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-4.00.00dev.exe. Run it after downloading, and remember the installation directory, because we will need it when configuring environment variables.
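Since the installation directory matters later, here is a minimal sketch of how one might locate the Tesseract executable from Python. The function name `find_tesseract` and the example directory are my own assumptions for illustration, not part of Tesseract or pytesseract.

```python
import os
import shutil


def find_tesseract(install_dirs=()):
    """Return the path of the tesseract executable, or None if not found."""
    # First check whether the executable is reachable through PATH.
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise look in the directories supplied by the caller.
    for directory in install_dirs:
        for name in ("tesseract.exe", "tesseract"):
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None


if __name__ == "__main__":
    # Hypothetical installation directory; replace it with your own.
    print(find_tesseract([r"C:\Program Files\Tesseract-OCR"]))
```

If Tesseract is not on PATH, the path found this way can be assigned to `pytesseract.pytesseract.tesseract_cmd` so that pytesseract knows where the executable lives.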
If you want to recognize text in languages other than English, you also need to download the corresponding language data files from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files.
Simplified Chinese language data: https://raw.githubusercontent.com/tesseract-ocr/tessdata/4.00/chi_sim.traineddata
Traditional Chinese language data: https://github.com/tesseract-ocr/tessdata/raw/4.0/chi_tra.traineddata
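The download can also be scripted. The following is only a sketch: the URL pattern is copied from the chi_sim link above and I am assuming other language codes follow the same pattern; the helper names and the tessdata path are hypothetical.

```python
import os
import urllib.request

# Base URL taken from the chi_sim link above; assumed to hold for other codes.
BASE_URL = "https://raw.githubusercontent.com/tesseract-ocr/tessdata/4.00/"


def traineddata_url(lang):
    """Build the download URL for a language code such as 'chi_sim'."""
    return BASE_URL + lang + ".traineddata"


def download_language(lang, tessdata_dir):
    """Download <lang>.traineddata into tessdata_dir and return the local path."""
    os.makedirs(tessdata_dir, exist_ok=True)
    target = os.path.join(tessdata_dir, lang + ".traineddata")
    urllib.request.urlretrieve(traineddata_url(lang), target)
    return target


if __name__ == "__main__":
    # Hypothetical tessdata folder inside the installation directory.
    download_language("chi_sim", r"C:\Program Files\Tesseract-OCR\tessdata")
```

Tesseract looks for language files in its tessdata folder, so the file has to end up there for the language to be usable.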
Install the corresponding Python package, pytesseract:
pip install pytesseract
1.3 Installing Tesseract
After downloading, run the installer and click Next through every step, then find its console launcher in the Start menu, as shown in the figure:
There are two ways to use Tesseract OCR: as a dynamic library (libtesseract) or as an executable program (tesseract.exe). I use the second way, which is also convenient to call from Python (mainly because I am still a beginner).
1.4 Test English character recognition
The installer ships with trained English (Latin script) recognition data, so let's test English character recognition first. The following image is to be recognized:
1.4.1
Put the image above in the Tesseract installation directory, as shown in the figure:
1.4.2
Open the console window mentioned above, as shown in the figure:
1.4.3
Enter the command "tesseract.exe 01.jpg 1" in the window and press Enter, as shown in the figure:
Here 01.jpg is the source file to be recognized and 1 is the output file name; the default output format is a TXT file. Note that when you specify a language, the option in front of the language code is -l (lowercase letter L), not -1 (the digit one)!
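The same command can be driven from Python with the standard library. This is a rough sketch under my own assumptions: the helper names are made up, and it presumes the tesseract executable is on PATH and writes its result to output_base + ".txt" as described above.

```python
import subprocess


def build_command(image_path, output_base, lang=None, exe="tesseract"):
    """Build the argument list: tesseract <image> <output base> [-l <lang>]."""
    cmd = [exe, image_path, output_base]
    if lang:
        # Lowercase letter L, followed by the language code.
        cmd += ["-l", lang]
    return cmd


def run_tesseract(image_path, output_base, lang=None):
    """Run tesseract and return the text it wrote to <output_base>.txt."""
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    print(run_tesseract("01.jpg", "1"))
```

Passing lang="chi_sim" would add "-l chi_sim" to the command, which only works once the Simplified Chinese data file is installed.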
1.4.4
Let's take a look at the image 01.jpg:
1.4.5
A 1.txt file is generated in the installation directory, and the recognition result is as follows:
Now that the installation works and the test passes, let's put it to practical use and do some image recognition.
2 Image recognition in Python with Tesseract-OCR
Although a single line of code is enough to recognize an image, we first need to import two libraries that others have already written and packaged. Only with those libraries imported can one line of code recognize the text in a picture.
- We need two libraries: pytesseract and PIL
- We also need to install the recognition engine, Tesseract-OCR
2.1 Installing pytesseract and PIL
Both packages can be installed with pip.
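To confirm that both packages are actually importable after installation, a small check like the following may help. The function name `missing_modules` is my own; note that the imaging library is imported under the name PIL.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["pytesseract", "PIL"])
    print("missing:", missing if missing else "none")
```

Anything reported as missing still needs to be installed with pip before the recognition code can run.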
2.2 About the relevant modules
Python-tesseract is a Python wrapper for the Tesseract OCR engine. It can read any common image file (JPG, GIF, PNG, TIFF, etc.) and decode it into readable text. No temporary files are created during OCR processing.
PIL (Python Imaging Library) is the most commonly used image processing library in Python. Its latest version is 1.1.7, and it can be downloaded for study and reference.
The Image class is a very important class in PIL. An instance can be created in three ways: loading an image file directly, reading a processed image, or obtaining an image by grabbing (capturing) it.
A common use of image processing in Python is recognizing CAPTCHAs with pytesseract. To use pytesseract you must first install what it depends on, PIL and Tesseract-OCR: PIL is the image processing library, and Tesseract-OCR is Google's OCR engine that does the actual recognition.
Reference link: http://www.waitalone.cn/python-php-ocr.html. The linked article describes how to configure the environment and gives Python code for CAPTCHA recognition. It comes down to three steps: install PIL.exe; install tesseract-ocr-setup.exe; run pip install pytesseract.
2.3 Solving the problem that the Python 2.x PIL cannot be used in Python 3.x
At present the latest official PIL release is 1.1.7, which supports Python 2.5, 2.6, and 2.7 but not Python 3. After some searching I found that on Python 3.x it is replaced by Pillow. Open a DOS command-line window and type the following command:
pip install pillow
Once it reports a successful installation, the program runs without problems.
2.4 Recognizing the text in a picture (simple and complex English)
2.4.1
Now to the point. We will recognize the following pictures, see the figures (two cases):
2.4.2 python
The code is as follows:
import pytesseract
from PIL import Image

# Open the CAPTCHA picture
image = Image.open('02.jpg')
# Load the image data once to prevent errors; this can be omitted
image.load()
# Call show() to display the picture; useful for debugging, can be omitted
image.show()
# Recognize the text in the picture and print it
text = pytesseract.image_to_string(image)
print(text)
2.4.3 Results
After running the code, the results are as follows:
2.4.5 Summary
From the results above we can see that the recognition rate for simple pictures is acceptable, but for complex text it is not so good, so I will keep studying and keep looking for useful libraries.
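One thing that might be worth trying for harder pictures, and for Chinese text, is to convert the image to grayscale, binarize it, and pass a language code to pytesseract. This is only a sketch under my own assumptions: the helper names and the threshold of 128 are arbitrary choices, it needs the chi_sim data file installed, and whether it improves the result depends on the picture.

```python
def threshold_table(threshold=128):
    """256-entry lookup table: values below threshold become 0, others 255."""
    return [0 if value < threshold else 255 for value in range(256)]


def recognize(path, lang="chi_sim", threshold=128):
    """Binarize the image at `path` and return the text Tesseract recognizes."""
    # Imported here so threshold_table() can be used without these packages.
    import pytesseract
    from PIL import Image

    image = Image.open(path).convert("L")              # convert to grayscale
    image = image.point(threshold_table(threshold))   # pure black and white
    return pytesseract.image_to_string(image, lang=lang)


if __name__ == "__main__":
    # 02.jpg is the test picture used above; it contains English text.
    print(recognize("02.jpg", lang="eng"))
```

The lang argument of image_to_string corresponds to the -l option of the command-line tool, so lang="chi_sim" selects the Simplified Chinese data.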
3 Introduction to using the Image module (repost)
3.1 Introduction
Image processing is a very widely used technology, and Python, with its rich collection of third-party extension libraries, certainly does not miss out on it. PIL (Python Imaging Library) is the most commonly used image processing library in Python. Its latest version is 1.1.7, and it can be downloaded for study and reference.
The Image class is a very important class in PIL. An instance can be created in three ways: loading an image file directly, reading a processed image, or obtaining an image by grabbing (capturing) it.
3.2 Use
First import the Image module. An image file can then be loaded with its open function. If loading the file fails, an IOError is raised; if no error is raised, open returns an Image object. We can then inspect the contents of the file through some of the object's attributes:
>>> from PIL import Image
>>> im = Image.open("j.jpg")
>>> print(im.format, im.size, im.mode)
JPEG (<width>, <height>) RGB
There are three attributes here; let's look at each in turn.
format: identifies the source format of the image; it is None if the image was not read from a file.
size: a two-element tuple giving the width and height in pixels.
mode: the pixel format, for example RGB (true color image), L (luminance), or CMYK (pre-press image).
3.3 Simple geometric transforms
Color space conversion.
convert(): this method converts an image to a different color mode.
Image enhancement.
Filters: the filter method can apply any of the predefined enhancement filters in the ImageFilter module.
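As a small illustration of convert() and the predefined filters, here is a sketch that assumes Pillow is installed. The function name is my own, and ImageFilter.DETAIL is just one of the predefined enhancement filters, picked as an example.

```python
from PIL import Image, ImageFilter


def grayscale_and_enhance(im):
    """Convert an image to mode L, then apply the predefined DETAIL filter."""
    gray = im.convert("L")                   # color space conversion
    return gray.filter(ImageFilter.DETAIL)   # image enhancement


if __name__ == "__main__":
    # j.jpg is the example file name used in section 3.2.
    out = grayscale_and_enhance(Image.open("j.jpg"))
    print(out.mode, out.size)
```

Both calls return a new Image object and leave the original image unchanged.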
The simple geometric transforms are used as follows:
>>> out = im.resize((128, 128))                 # resize; (128, 128) is only an example size
>>> out = im.rotate(45)                         # rotate 45 degrees counter-clockwise
>>> out = im.transpose(Image.FLIP_LEFT_RIGHT)   # flip left to right
>>> out = im.transpose(Image.FLIP_TOP_BOTTOM)   # flip top to bottom
>>> out = im.transpose(Image.ROTATE_90)         # rotate 90 degrees
>>> out = im.transpose(Image.ROTATE_180)        # rotate 180 degrees
>>> out = im.transpose(Image.ROTATE_270)        # rotate 270 degrees