Web Crawlers: Text Recognition

Source: Internet
Author: User

Machine Vision

From Google's self-driving cars to vending machines that can detect counterfeit money, machine vision is a widely used field with far-reaching goals.

Here we will focus on a branch of Machine Vision: Text Recognition. This section describes how to use Python libraries to recognize and use text in online images.

Humans can easily read the text in an image, but it is very difficult for machines to do so. The verification code (CAPTCHA) exploits this gap: it is an image that a human user can normally read but most programs cannot. The difficulty of reading CAPTCHAs varies greatly.

Translating an image into text is generally called Optical Character Recognition (OCR). Few underlying libraries implement OCR; at present, many libraries use the same underlying OCR library or are customized versions of it.

OCR library Overview

Python has always been an excellent language for reading and processing images, for image-based machine learning tasks, and for creating images. Although there are many libraries for image processing, here we will only introduce the Tesseract library.

Tesseract

Tesseract is an OCR library currently sponsored by Google. It is widely regarded as the best and most accurate open-source OCR system. In addition to high accuracy, Tesseract is also highly flexible: it can be trained to recognize any font or any Unicode character.

Install Tesseract: Windows

Download the executable installer and run it.

Install pytesseract

Tesseract is a command-line tool, not a Python library imported through the import statement. After installation, it is run with the tesseract command outside Python. However, we can install pytesseract, a Python wrapper for Tesseract, through pip:

pip install pytesseract
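Because pytesseract only wraps the command-line tool, it helps to confirm that the tesseract executable can be found before running any OCR. The following is a minimal sketch using only the standard library. The function name `find_executable` is hypothetical, and the Windows path is an assumed install location that may differ on your machine.

```python
import os
import shutil


def find_executable(name="tesseract", extra_candidates=()):
    """Return the path of an executable, or None if it cannot be found.

    The PATH is searched first; `extra_candidates` is an optional list of
    full paths to try afterwards (for example, an installer's target folder).
    """
    found = shutil.which(name)
    if found:
        return found
    for candidate in extra_candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Assumed install location on Windows; adjust it to your own setup.
    guess = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    path = find_executable("tesseract", [guess])
    print(path if path else "tesseract was not found")
```

If the executable is found outside the PATH, pytesseract can be pointed at it by assigning the path to `pytesseract.pytesseract.tesseract_cmd` before calling any recognition function.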

Process Standard Text

Most of the text you want to process is relatively clean and well formatted. Well-formatted text usually has the following features:

  • Uses a single standard font (not handwritten, cursive, or very decorative); if photocopied or photographed, the characters are still crisp, with no stray marks or stains
  • Neatly aligned, with no skewed characters
  • Stays within the image bounds, with no text cut off or pressed against the image edge

Some text format problems can be solved during image preprocessing. For example, you can convert an image to a grayscale image, adjust the brightness and contrast, and crop and rotate the image as needed.
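The preprocessing steps mentioned above can be sketched with Pillow, the same imaging library imported as PIL elsewhere in this article. This is only an illustration: the function name `preprocess_for_ocr`, its parameters, and the default values are assumptions, and suitable settings depend on the image.

```python
from PIL import Image, ImageEnhance


def preprocess_for_ocr(image, brightness=1.0, contrast=1.5,
                       angle=0.0, crop_box=None):
    """Return a grayscale, adjusted copy of `image` for OCR.

    brightness, contrast: enhancement factors (1.0 leaves the image as is).
    angle: counter-clockwise rotation in degrees, to straighten skewed text.
    crop_box: optional (left, upper, right, lower) box, applied after rotation.
    """
    result = image.convert("L")  # "L" is Pillow's 8-bit grayscale mode
    if brightness != 1.0:
        result = ImageEnhance.Brightness(result).enhance(brightness)
    if contrast != 1.0:
        result = ImageEnhance.Contrast(result).enhance(contrast)
    if angle:
        # expand=True enlarges the canvas so corners are not cut off;
        # fillcolor=255 pads the new area with white instead of black.
        result = result.rotate(angle, expand=True, fillcolor=255)
    if crop_box is not None:
        result = result.crop(crop_box)
    return result


if __name__ == "__main__":
    # A blank in-memory image stands in for a real scan here.
    sample = Image.new("RGB", (200, 80), (230, 230, 230))
    cleaned = preprocess_for_ocr(sample, crop_box=(10, 10, 190, 70))
    print(cleaned.mode, cleaned.size)
```

The returned image can be passed to `pytesseract.image_to_string` in place of the image loaded straight from disk.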

Example:

English:

    F:\DE209_F>tesseract english.jpg text
    Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

    F:\DE209_F>type text.txt
    This is some text, written in Arial, that will be read by
    Tesseract. Here are some symbols: !@#$%"&*()

The accuracy of the recognition results is still quite high.
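The same command can also be started from a Python script. The following is a rough sketch, assuming the tesseract executable is on the PATH; the helper names `build_tesseract_command` and `run_tesseract` are hypothetical. Tesseract takes an image name and an output base name, and writes the recognized text to the output base name with ".txt" appended.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE [-l LANG]."""
    command = ["tesseract", str(image_path), str(output_base)]
    if lang:
        command += ["-l", lang]  # e.g. "chi_sim" for Simplified Chinese
    return command


def run_tesseract(image_path, output_base, lang=None):
    """Run tesseract and return the text it wrote to OUTPUTBASE.txt."""
    command = build_tesseract_command(image_path, output_base, lang)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(command, check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Only prints the command; call run_tesseract() once the image exists.
    print(build_tesseract_command("english.jpg", "text"))
```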

Implemented Using Python code

Input images: english.jpg (English) and china.png (Chinese).

    #!/usr/bin/python3
    # -*- coding: UTF-8 -*-
    __author__ = 'mayi'

    import pytesseract
    from PIL import Image

    # Open the English image
    image = Image.open('english.jpg')
    # OCR recognition: lang defaults to English
    text = pytesseract.image_to_string(image)
    # Print the recognized text
    print(text)

    # Separator line
    print("*" * 30)

    # Open the Chinese image
    image = Image.open('china.png')
    # OCR recognition: lang specifies Simplified Chinese
    text = pytesseract.image_to_string(image, lang='chi_sim')
    # Print the recognized text
    print(text)

Running result

    This is some text, written in Arial, that will be read by
    Tesseract. Here are some symbols: !@#$%"&*()
    ******************************
    China, China, the people, and the country
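To put a number on how close a recognition result is to the text actually in the image, one option is a character-level similarity score. The following is a minimal sketch using difflib from the standard library. The function name `ocr_similarity` and the whitespace normalization are assumptions made for illustration, not part of pytesseract.

```python
from difflib import SequenceMatcher


def ocr_similarity(recognized, expected):
    """Return a similarity ratio between 0.0 and 1.0 for two strings.

    Runs of whitespace (including line breaks) are collapsed to single
    spaces first, so differences in line wrapping do not lower the score.
    """
    a = " ".join(recognized.split())
    b = " ".join(expected.split())
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    expected = "This is some text, written in Arial"
    recognized = "This is some text,\nwritten in Arial"
    print(ocr_similarity(recognized, expected))
```

A score close to 1.0 means the two strings match almost character for character; lower scores may suggest that the image needs more preprocessing.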
