Crawler-Text Recognition and crawler Recognition

Last Update:2017-07-25 Source: Internet

Author: User


Machine Vision

From Google's self-driving cars to vending machines that can detect counterfeit money, machine vision is a widely applied field with far-reaching ambitions.

Here we will focus on a branch of Machine Vision: Text Recognition. This section describes how to use Python libraries to recognize and use text in online images.

People can easily read the text in images, but it is very difficult for machines to do the same. The verification code (CAPTCHA) exploits this gap: it is an image that a human user can normally read but most programs cannot. The difficulty of reading a CAPTCHA varies greatly from one to another.

Translating an image into text is generally called Optical Character Recognition (OCR). There are not many low-level libraries that implement OCR. At present, many libraries share the same underlying OCR engine or are customized versions of it.

OCR library Overview

Python has always been an excellent language for reading and processing images, for image-based machine learning tasks, and for creating images. Although there are many libraries for image processing, here we will only introduce the Tesseract library.

Tesseract

Tesseract is an open-source OCR library currently sponsored by Google. It is widely regarded as one of the most accurate open-source OCR systems. It is also highly flexible: with training, it can recognize any font and any Unicode character.

Install Tesseract: Windows

Download the Windows installer (an executable file) and run it.

Install pytesseract

Tesseract is a command-line tool, not a Python library that you load with an import statement. After installation, you run the tesseract command outside Python. To call it from Python, install the pytesseract wrapper with pip:

pip install pytesseract
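
Because pytesseract only wraps the tesseract command, it can help to confirm that the executable is reachable before running any OCR. The following is a minimal sketch using only the standard library; the helper names find_tesseract and tesseract_version are made up for this illustration and are not part of pytesseract.

```python
import shutil
import subprocess


def find_tesseract(command="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(command)


def tesseract_version(command="tesseract"):
    """Return the first line of `<command> --version`, or None if it is missing."""
    path = find_tesseract(command)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some builds write the version banner to stderr rather than stdout.
    banner = (result.stdout or result.stderr).strip()
    return banner.splitlines()[0] if banner else None
```

If find_tesseract() returns None on Windows, the installation directory is probably not on PATH. In that case, either add it to PATH or point pytesseract at the executable through pytesseract.pytesseract.tesseract_cmd.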

Process Standard Text

Most of the text you want to process is relatively clean and well formatted. Well-formatted text usually has the following features:

Uses a single standard font (no handwriting, cursive, or highly decorative fonts)
Even if photocopied or photographed, has sharp characters with no stray marks or stains
Is neatly aligned, with no slanted characters
Stays within the image, with no cut-off text and no text pressed against the image edge

Some text format problems can be solved during image preprocessing. For example, you can convert an image to a grayscale image, adjust the brightness and contrast, and crop and rotate the image as needed.
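
The preprocessing steps just mentioned could be sketched with Pillow as follows. The function name preprocess and its default parameter values are assumptions chosen for illustration; suitable values depend on the image.

```python
from PIL import Image, ImageEnhance, ImageOps


def preprocess(image, contrast=2.0, brightness=1.0, angle=0.0, crop_box=None):
    """Return a grayscale, adjusted copy of `image` intended for OCR."""
    # Drop color information: the result is a single-channel "L" image.
    result = ImageOps.grayscale(image)
    # A factor of 1.0 leaves brightness or contrast unchanged.
    result = ImageEnhance.Brightness(result).enhance(brightness)
    result = ImageEnhance.Contrast(result).enhance(contrast)
    if angle:
        # expand=True grows the canvas so rotated corners are not cut off;
        # fillcolor=255 pads the new areas with white instead of black.
        result = result.rotate(angle, expand=True, fillcolor=255)
    if crop_box is not None:
        # crop_box is (left, upper, right, lower) in pixels.
        result = result.crop(crop_box)
    return result
```

The returned image can be passed to pytesseract.image_to_string in place of the original, for example preprocess(Image.open('english.jpg'), contrast=1.5).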

Example:

English:

F:\DE209_F>tesseract english.jpg text
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

F:\DE209_F>type text.txt
This is some text, written in Arial, that will be read by
Tesseract. Here are some symbols: !@#$%"&*()
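
The same command-line call can be driven from Python without pytesseract. This is a minimal sketch, assuming the tesseract executable is on PATH; the helper names build_tesseract_command and ocr_to_text are hypothetical. It relies on two behaviors of the command-line tool: the -l option selects the language, and the output is written to a file named after the output base with .txt appended.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for: tesseract <image> <output_base> [-l <lang>]."""
    command = ["tesseract", image_path, output_base]
    if lang:
        command += ["-l", lang]
    return command


def ocr_to_text(image_path, output_base="text", lang=None):
    """Run tesseract on an image file and return the recognized text."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # tesseract writes its result to <output_base>.txt
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

For example, ocr_to_text('english.jpg') mirrors the session above: it runs the command and then reads text.txt.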

The accuracy of the recognition results is still quite high.

Implementation in Python

The script below runs OCR on an English image (english.jpg) and a Chinese image (china.png):

#!/usr/bin/python3
# -*- coding: UTF-8 -*-
__author__ = 'mayi'

import pytesseract
from PIL import Image

# Open the English image
image = Image.open('english.jpg')
# OCR: lang defaults to English
text = pytesseract.image_to_string(image)
# Print the recognized text
print(text)

# Separator line
print("*" * 30)

# Open the Chinese image
image = Image.open('china.png')
# OCR: lang selects Simplified Chinese
text = pytesseract.image_to_string(image, lang='chi_sim')
# Print the recognized text
print(text)

Running result

This is some text, written in Arial, that will be read by
Tesseract. Here are some symbols: !@#$%"&*()
******************************
China, China, the people, and the country

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
