An in-depth look at using OCR algorithms to recognize text in pictures

Last Update:2018-08-08 Source: Internet

Author: User

Tags image processing library tesseract ocr


My company has a requirement: put simply, we need to recognize Chinese text in a picture. I am implementing it in Python. Other languages would work just as well, as long as the job gets done, but Python is what I am mainly learning, hence the choice. As a beginner I spent a day searching the internet and finally found a few leads, which I am sharing here. I hope more experienced readers will offer their comments.

I am still learning about OCR algorithms. Writing one myself looks difficult, and there are not many write-ups on recognizing Chinese text in images with Python, so I intend to record my own learning process. Today I came across an open-source OCR project that a beginner can use: Tesseract. For a beginner like me it may be the best option, because I can try the project out and also read its source code.

There are many open-source OCR projects. The following link lists the better-known OCR software, including the open-source projects:

https://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software

Tesseract stands out in that list, so the rest of this article takes a serious look at it: first an introduction, then installation and testing, and finally its shortcomings.

The Tesseract OCR engine is an open-source project whose development has been sponsored by Google. Its project home page is https://github.com/tesseract-ocr. It supports Chinese OCR and provides a command-line tool. The corresponding Python package is pytesseract. With these tools we can recognize the text in an image.

1 Installing and testing Tesseract

1.1 Development environment

My development environment is as follows:

Windows 7
Python 3.6
Pycharm

1.2 Download the installation package

First download the Windows installer for Tesseract. (The download site can be hard to reach from mainland China, so here is a Baidu Pan link to a copy that someone else downloaded.)

http://pan.baidu.com/s/1i56Uxlr

Alternatively, following https://github.com/tesseract-ocr/tesseract/wiki, you can find an unofficial installer. Only a 64-bit installer seems to be available: http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-4.00.00dev.exe. Run it after downloading, and remember the installation directory, because we will need it when configuring environment variables.

If you need to recognize text in languages other than English, you also need to download the corresponding language data files from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files.

Simplified Chinese language data: https://raw.githubusercontent.com/tesseract-ocr/tessdata/4.00/chi_sim.traineddata

Traditional Chinese language data: https://github.com/tesseract-ocr/tessdata/raw/4.0/chi_tra.traineddata
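Tesseract normally looks for these .traineddata files in its tessdata directory, so it can be useful to check that the files you downloaded are where the engine expects them. The following is a minimal sketch of such a check. The function name missing_languages and the example path are assumptions made for illustration; they are not part of Tesseract or pytesseract.

```python
import os


def missing_languages(tessdata_dir, languages):
    """Return the language codes whose <code>.traineddata file is absent.

    tessdata_dir is the folder that holds the language data; its location
    depends on where Tesseract was installed (an assumption to verify).
    """
    return [
        lang
        for lang in languages
        if not os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))
    ]
```

For example, missing_languages(r"C:\Program Files\Tesseract-OCR\tessdata", ["eng", "chi_sim"]) would return ["chi_sim"] if only the English data is present. The path is a hypothetical installation directory, so substitute your own.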

Install the corresponding Python package, pytesseract:

pip install pytesseract

1.3 Installing Tesseract

After downloading, run the installer and click Next through each step. Then find its console launcher in the Start menu.

There are two ways to use Tesseract OCR: as a dynamic library (libtesseract) or as an executable program (tesseract.exe). I use the second way, which is also convenient to call from Python (mainly because I am still a beginner).

1.4 Test English character recognition

The installer above ships with trained recognition data for English (Latin script), so let's test English character recognition first. The image to be recognized is 01.jpg.

1.4.1 Put the image in the Tesseract installation directory.

1.4.2 Open the console window mentioned above.

1.4.3 Enter the following command in the window and press Enter: "tesseract.exe 01.jpg 1"

Here 01.jpg is the source file to be recognized and 1 is the output file name. The default output format is a TXT file. Note that when you specify a language, the option that precedes the language name is -l (lowercase letter L), not -1 (the digit one)!

1.4.4 Let's take a look at the 01.jpg image.

1.4.5 A 1.txt file containing the recognition results is generated in the installation directory.
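The command-line test above can also be driven from Python with the standard library. The following is a minimal sketch, assuming the tesseract executable is on the PATH. The helper names build_tesseract_command and run_tesseract are made up for illustration. The argument order (image, output base name, then -l and a language code) follows the command described above.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang=None, executable="tesseract"):
    """Build the argument list for a command such as: tesseract 01.jpg 1 -l eng"""
    command = [executable, image_path, output_base]
    if lang:
        # The option is a lowercase letter L, not the digit one.
        command += ["-l", lang]
    return command


def run_tesseract(image_path, output_base, lang=None, executable="tesseract"):
    """Run Tesseract and return the name of the text file it is expected to write."""
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang, executable),
        check=True,  # raise CalledProcessError if Tesseract reports a failure
    )
    return output_base + ".txt"
```

Calling run_tesseract("01.jpg", "1", lang="eng") from the directory that contains the image should leave the recognized text in 1.txt, just like the manual test.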

Now that the installation works and the test passes, let's put it to practical use and do some image recognition.

2 Using Tesseract-OCR from Python for image recognition

Although a single line of code is enough to recognize an image, we first need to import two libraries that other people have written and packaged. Only by importing them can we recognize the text in a picture with one line of code.

Here we need two libraries: pytesseract and PIL.
We also need to install the recognition engine, Tesseract-OCR.

2.1 Installing pytesseract and PIL

Both packages can be installed with pip.

2.2 About the relevant modules

Python-tesseract is a Python wrapper for the Tesseract OCR (optical character recognition) engine. It can read any common image file (JPG, GIF, PNG, TIFF, etc.) and decode it into readable text. No temporary files are created during OCR processing.

PIL (Python Imaging Library) is the most commonly used image processing library in Python. Its current version is 1.1.7, which we can download to study and to look up documentation.

The Image class is a very important class in the PIL library. An instance can be created in three ways: by loading an image file directly, by reading a processed image, or by grabbing (capturing) an image.

A common image-processing task in Python is using pytesseract to recognize captchas. To install the pytesseract library, you must first install its dependencies, PIL and Tesseract-OCR. PIL is the image processing library, and Tesseract-OCR is Google's OCR engine.

Download link: http://www.waitalone.cn/python-php-ocr.html

The linked document describes how to configure the environment and gives Python code for recognizing captchas. It boils down to three steps:

Install PIL.exe
Install Tesseract-ocr-setup.exe
Run pip install pytesseract

2.3 Solving the problem that the Python 2.x PIL cannot be used in Python 3.x

At present, the latest official version of PIL is 1.1.7. It supports Python 2.5, 2.6, and 2.7, but not Python 3. After some searching I found that Python 3.x uses Pillow as a replacement. Open a DOS command-line window and type the following command:

pip install Pillow

Once it reports that the installation succeeded, the program runs without problems.

2.4 Recognizing the text in a picture (simple and complex English)

2.4.1 Now to the point: we will recognize the following images (two cases).

2.4.2 The Python code is as follows:

import pytesseract
from PIL import Image

# Open the captcha image
image = Image.open('02.jpg')
# Load the image data up front to prevent errors; this step can be omitted
image.load()
# Call show() to display the image; useful for debugging, can be omitted
image.show()
# Recognize the text in the image
text = pytesseract.image_to_string(image)
print(text)

2.4.3 After running the code, the results are as follows:

2.4.5 Summary

The results show that the recognition rate for simple pictures is acceptable, but for complex text ..., so I will keep studying and keep looking for useful libraries.
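Since the original goal is Chinese text, it may help to see how the recognition call changes when a language is specified. The following is a minimal sketch, assuming pytesseract and Pillow are installed and the chi_sim language data has been downloaded. The wrapper ocr_image and its parameters are made up for illustration, while image_to_string, its lang argument, and the tesseract_cmd setting come from pytesseract.

```python
def ocr_image(path, lang="chi_sim", tesseract_cmd=None):
    """Recognize the text in the image at `path` and return it as a string."""
    # Imported inside the function so that this sketch can be loaded even
    # on a machine where the two libraries are not installed yet.
    import pytesseract
    from PIL import Image

    if tesseract_cmd is not None:
        # Tell pytesseract where the executable is if it is not on the PATH.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

    with Image.open(path) as image:
        # lang selects the language data, e.g. "eng" or "chi_sim".
        return pytesseract.image_to_string(image, lang=lang)
```

A call such as ocr_image("02.jpg", lang="eng") corresponds to the code in 2.4.2. Passing lang="chi_sim" asks Tesseract to use the Simplified Chinese data instead. How well it recognizes a given picture is something to test rather than assume.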

3 Introduction to using the Image module (extension)

3.1 Introduction

Image processing is a widely used technology, and Python, with its rich ecosystem of third-party libraries, certainly does not miss out. PIL (Python Imaging Library) is the most commonly used image processing library in Python. Its current version is 1.1.7, which we can download to study and to look up documentation.

3.2 Usage

Import the Image module. An image file can then be loaded with the open function in the Image module. If loading fails, an IOError is raised. Otherwise, open returns an Image object. We can then examine the file's contents through some of the object's attributes:

>>> from PIL import Image
>>> im = Image.open("j.jpg")
>>> print(im.format, im.size, im.mode)
JPEG (a) RGB

There are three attributes here. Let's look at each in turn.

format: identifies the source format of the image. It is set to None if the image was not read from a file.

size: a tuple of two elements, the width and height in pixels.

mode: the image mode. Common values are RGB (true color image), L (luminance), and CMYK (pre-press image).
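The three attributes can be seen on an image created in memory, which also shows the case where format is None because nothing was read from a file. The following is a minimal sketch, assuming Pillow is installed. The helper names image_summary and summarize_new_image are made up for illustration.

```python
def image_summary(im):
    """Return the (format, size, mode) attributes of a PIL/Pillow Image object."""
    return im.format, im.size, im.mode


def summarize_new_image(mode="L", size=(120, 40)):
    """Create a blank in-memory image and summarize it (needs Pillow)."""
    # Imported here so that this sketch can be loaded even where Pillow is missing.
    from PIL import Image

    return image_summary(Image.new(mode, size))
```

With Pillow installed, summarize_new_image() returns (None, (120, 40), 'L'). The format is None because the image was created rather than loaded, while size and mode reflect the arguments passed to Image.new.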

3.3 Simple geometric transformations

Color space conversion.

convert(): this method converts an image to a different color mode.

Image enhancement.

Filters: the filter() method applies one of the predefined enhancement filters provided by the ImageFilter module.

>>> out = im.resize((128, 128))                  # resize to a (width, height) tuple; 128x128 is an example size
>>> out = im.rotate(45)                          # rotate 45 degrees counterclockwise
>>> out = im.transpose(Image.FLIP_LEFT_RIGHT)    # flip left to right
>>> out = im.transpose(Image.FLIP_TOP_BOTTOM)    # flip top to bottom
>>> out = im.transpose(Image.ROTATE_90)          # rotate 90 degrees
>>> out = im.transpose(Image.ROTATE_180)         # rotate 180 degrees
>>> out = im.transpose(Image.ROTATE_270)         # rotate 270 degrees
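The operations in this section (color conversion, filtering, and transposition) can be tried together on a blank test image. The following is a minimal sketch, assuming Pillow is installed. The function name transform_demo and the choice of the SHARPEN filter are illustrative assumptions, not something the text above prescribes.

```python
def transform_demo(size=(128, 64)):
    """Apply convert, filter and transpose to a blank image and report the results."""
    # Imported here so that this sketch can be loaded even where Pillow is missing.
    from PIL import Image, ImageFilter

    im = Image.new("RGB", size)                  # blank in-memory test image
    gray = im.convert("L")                       # color space conversion: RGB to luminance
    sharpened = im.filter(ImageFilter.SHARPEN)   # one of the predefined enhancement filters
    rotated = im.transpose(Image.ROTATE_90)      # a 90 degree turn swaps width and height
    return {
        "gray_mode": gray.mode,
        "filtered_size": sharpened.size,
        "rotated_size": rotated.size,
    }
```

For the default 128x64 image this returns a gray_mode of 'L', a filtered_size of (128, 64), and a rotated_size of (64, 128). The conversion changes only the mode, the filter leaves the dimensions alone, and the 90 degree transpose swaps them.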


