Web Crawlers: Text Recognition in Images
Machine Vision
From Google's self-driving cars to vending machines that can identify counterfeit money, machine vision is a broad, widely applied field with far-reaching ambitions.
Here we will focus on a branch of Machine Vision: Text Recognition. This section describes how to use Python libraries to recognize and use text in online images.
Humans can easily read the text in an image, but this is very difficult for machines. The verification code (CAPTCHA) exists because of this gap: it is an image that human users can normally read but most bots cannot. The difficulty of reading a CAPTCHA varies greatly.
Converting an image into text is generally called Optical Character Recognition (OCR). Few low-level libraries implement OCR; most OCR libraries either use the same handful of underlying engines or are customized versions of them.
OCR Library Overview
Python is an excellent language for reading and processing images, for image-based machine learning, and even for creating images. Although there are many libraries for image processing, here we will only introduce the Tesseract library.
Tesseract
Tesseract is an open-source OCR engine whose development has been sponsored by Google. It is widely regarded as one of the most accurate open-source OCR systems. Besides its accuracy, Tesseract is also highly flexible: with training, it can learn to recognize additional fonts and, in principle, any Unicode character.
Install Tesseract: Windows
Download the Windows installer executable and run it.
Install pytesseract
Tesseract is a command-line tool, not a Python library that you load with an import statement. After installation, the tesseract command is run outside Python. To call it from Python, we can install pytesseract, a Python wrapper around Tesseract, with pip:
pip install pytesseract
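Because pytesseract only wraps the command-line tool, it can be worth confirming that the tesseract executable is reachable before calling it from Python. The following is a minimal sketch using only the standard library. The helper names find_tesseract, tesseract_version, and describe_install are hypothetical and are not part of pytesseract.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable: str) -> str:
    """Return the first line printed by `tesseract --version` (empty if nothing is printed)."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True
    )
    # Some builds write the version banner to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


def describe_install() -> str:
    """Summarize whether tesseract was found and, if so, which version it reports."""
    path = find_tesseract()
    if path is None:
        return "tesseract not found on PATH"
    return f"{path}: {tesseract_version(path)}"


print(describe_install())
```

If the executable is installed but not on PATH, which is common on Windows, pytesseract can be pointed at it by assigning the full path to pytesseract.pytesseract.tesseract_cmd.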
Process Standard Text
Most of the text you will want to process is relatively clean and well formatted. Well-formatted text usually has the following features:
- Uses a single standard font (no handwriting, cursive, or very fancy fonts)
- Even if photocopied or photographed, the characters are sharp, with no stray marks or stains
- Neatly aligned, with no skewed words
- Stays within the image, with no text cut off or pressed tightly against the image edge
Some text format problems can be solved during image preprocessing. For example, you can convert an image to a grayscale image, adjust the brightness and contrast, and crop and rotate the image as needed.
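The preprocessing steps just listed could be sketched with Pillow as follows. This is only an illustration under stated assumptions: the function name preprocess and its default values are hypothetical, and suitable brightness, contrast, rotation, and crop settings depend on the image at hand.

```python
from PIL import Image, ImageEnhance, ImageOps


def preprocess(image, brightness=1.0, contrast=2.0, angle=0.0, box=None):
    """Return a cleaned-up grayscale copy of `image` for OCR.

    brightness, contrast: enhancement factors (1.0 leaves the image unchanged).
    angle: counter-clockwise rotation in degrees, to undo a known skew.
    box: optional (left, upper, right, lower) crop rectangle in pixels,
         applied after the rotation.
    """
    gray = ImageOps.grayscale(image)  # convert to single-channel mode "L"
    gray = ImageEnhance.Brightness(gray).enhance(brightness)
    gray = ImageEnhance.Contrast(gray).enhance(contrast)
    if angle:
        # expand=True grows the canvas so no corner is lost;
        # fillcolor=255 pads the new area with white.
        gray = gray.rotate(angle, expand=True, fillcolor=255)
    if box is not None:
        gray = gray.crop(box)
    return gray


# Demo on a synthetic image so the sketch runs without any input file.
sample = Image.new("RGB", (200, 80), color=(200, 200, 200))
cleaned = preprocess(sample, contrast=1.5, box=(10, 10, 190, 70))
print(cleaned.mode, cleaned.size)
```

In practice the returned image would be passed to the OCR step in place of the original one.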
Example:
English:
F:\DE209_F>tesseract english.jpg text
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

F:\DE209_F>type text.txt
This is some text, written in Arial, that will be read by
Tesseract. Here are some symbols: !@#$%"&*()
The recognition result is quite accurate.
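To put a rough number on how accurate a result is, the recognized text can be compared with a known transcription of the image. The sketch below uses difflib from the standard library. The function name character_accuracy is hypothetical, and the similarity ratio it returns is only an approximation, not a formal character error rate.

```python
from difflib import SequenceMatcher


def character_accuracy(expected: str, recognized: str) -> float:
    """Return a similarity ratio in [0, 1] between ground truth and OCR output."""
    # Collapse runs of whitespace so line-wrapping differences are not counted as errors.
    norm_expected = " ".join(expected.split())
    norm_recognized = " ".join(recognized.split())
    matcher = SequenceMatcher(None, norm_expected, norm_recognized, autojunk=False)
    return matcher.ratio()


# Ground truth typed by hand, and OCR output as it might come back with a line break.
truth = "This is some text, written in Arial, that will be read by Tesseract."
ocr_output = "This is some text, written in Arial, that will be read by\nTesseract."
print(f"{character_accuracy(truth, ocr_output):.3f}")
```

A ratio close to 1.0 means the two strings are nearly identical, and lower values indicate more recognition errors.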
Implementation in Python
[Input images: one English sample and one Chinese sample]
#!/usr/bin/python3
# -*- coding: UTF-8 -*-

__author__ = 'mayi'

import pytesseract
from PIL import Image

# Open the image: English
image = Image.open('english.jpg')
# OCR: lang defaults to English
text = pytesseract.image_to_string(image)
# Print the recognized text
print(text)

# Separator line
print("*" * 30)

# Open the image: Chinese
image = Image.open('china.png')
# OCR: lang selects Simplified Chinese
text = pytesseract.image_to_string(image, lang='chi_sim')
# Print the recognized text
print(text)
Running result
This is some text, written in Arial, that will be read by
Tesseract. Here are some symbols: !@#$%"&*()
******************************
China, China, the people, and the country
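The Chinese example only works if the chi_sim language data is installed alongside Tesseract. One way to check is to run tesseract --list-langs and inspect its output, as in the sketch below. The helper names parse_language_list and installed_languages are hypothetical, and the sample output string is illustrative rather than captured from a real run.

```python
import subprocess
from typing import List


def parse_language_list(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, which starts with "List of available languages".
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_languages(executable: str = "tesseract") -> List[str]:
    """Run `tesseract --list-langs` and return the language codes it reports."""
    result = subprocess.run(
        [executable, "--list-langs"], capture_output=True, text=True
    )
    # Depending on the version, the list is written to stdout or stderr.
    return parse_language_list(result.stdout or result.stderr)


# Illustrative output only; call installed_languages() to query a real installation.
sample_output = "List of available languages (2):\nchi_sim\neng\n"
print(parse_language_list(sample_output))
```

If chi_sim is missing from the list, the corresponding trained data file needs to be added to the Tesseract installation before lang='chi_sim' can be used.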