Python Crawler Development, Part 1: Machine Vision and Tesseract

Source: Internet
Author: User

OCR Library Overview

Python is well suited to tasks such as reading and processing images, image-related machine learning, and creating images. Although many libraries can be used for image processing, here we focus on just one: Tesseract.

1. Tesseract

Tesseract is an OCR library currently sponsored by Google (a company well known for its OCR and machine learning technologies). It is widely regarded as one of the best and most accurate open-source OCR systems. Besides its high accuracy, Tesseract is also highly flexible: it can be trained to recognize any font and can recognize any Unicode character.

2. Installing Tesseract on Windows

Download the executable installer from https://code.google.com/p/tesseract-ocr/downloads/list and run it.

To use Tesseract, you need to set a new system environment variable, $TESSDATA_PREFIX, so that Tesseract knows where the trained data files are stored. Then make a copy of the tessdata data files and put it in the Tesseract directory.

    • On Windows, you can set the environment variable with the following command (the quotes are needed because the path contains spaces):

          setx TESSDATA_PREFIX "C:\Program Files\Tesseract OCR\Tesseract"
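Before running any OCR code, it can help to confirm that the setup described above took effect. The following is a minimal sketch using only the Python standard library; the function name `check_tesseract_setup` is hypothetical and is not part of Tesseract or pytesseract.

```python
import os
import shutil


def check_tesseract_setup(env=None):
    """Report whether the tesseract executable and TESSDATA_PREFIX are visible.

    `env` defaults to the current process environment; passing a dict makes
    the function easy to try out without changing system settings.
    """
    if env is None:
        env = os.environ
    return {
        # Full path of the tesseract executable, or None if it is not on PATH
        "executable": shutil.which("tesseract", path=env.get("PATH")),
        # Value of TESSDATA_PREFIX, or None if the variable is not set
        "tessdata_prefix": env.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    print(check_tesseract_setup())
```

If either value comes back as None, the installer or the environment variable probably needs another look. Note that a variable set with setx is only visible to command prompts opened afterwards.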

3. Installing pytesseract

Tesseract is a command-line tool, not a Python library that you load with an import statement. Once installed, it is run outside Python with the tesseract command. A Python wrapper for Tesseract, pytesseract, can be installed via pip:

    pip install pytesseract

Run Tesseract with the following command to read an image file and write the result to a text file (text.txt):

    tesseract test.jpg text

Python code:

    import pytesseract
    from PIL import Image

    image = Image.open('test.jpg')
    text = pytesseract.image_to_string(image)
    print(text)

Run result:

    This is some text, written in Arial, that would be read by Tesseract. Here is some symbols: !@#$%"&*()

Threshold filtering and noise reduction for images

When an image is difficult to recognize, you can clean it up with a Python script. Using the Pillow library, you can create a threshold filter that removes a gradient background color and leaves only the text, making the image clearer and easier for Tesseract to read:

    from PIL import Image
    import subprocess


    def cleanFile(filePath, newFilePath):
        image = Image.open(filePath)

        # Apply a threshold filter to the image and save the result
        image = image.point(lambda x: 0 if x < 143 else 255)
        image.save(newFilePath)

        # Call the system's tesseract command to OCR the cleaned image
        subprocess.call(["tesseract", newFilePath, "output"])

        # Open the output file and print the result
        with open("output.txt", "r") as outputFile:
            print(outputFile.read())


    cleanFile("text2.jpg", "text2clean.png")
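To see what the threshold filter does to individual pixels, here is a minimal sketch that applies the same rule (values below 143 become 0, everything else becomes 255) to a plain list of grayscale values. It assumes 8-bit grayscale input; the helper name `apply_threshold` and the sample values are illustrative only.

```python
def apply_threshold(pixels, cutoff=143):
    """Map each grayscale value to 0 (black) below the cutoff, else 255 (white)."""
    return [0 if value < cutoff else 255 for value in pixels]


# A hypothetical row of pixels: dark text on a lighter gradient background
row = [12, 90, 142, 143, 180, 255]
print(apply_threshold(row))  # prints [0, 0, 0, 255, 255, 255]
```

The cutoff of 143 comes from the script above and will likely need tuning for other images. On a color image, Pillow's `point` applies the function to each band separately, so converting to grayscale first with `image.convert("L")` may give a cleaner black-and-white result.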
Grabbing text from images on a website

Tesseract can read text from images stored on the hard drive, but combined with a web crawler it becomes a powerful tool.

The steps for grabbing text from images on a website are:

1. Open the reader,

2. Collect the URLs of the images,

3. Download the images,

4. Recognize the text in each image,

5. Finally, print the text for each image.

    import time
    from urllib.request import urlretrieve
    import subprocess
    from selenium import webdriver

    # Create a new Selenium driver.
    # Note: PhantomJS and the find_element_by_* methods used below belong to
    # older Selenium releases and are no longer available in current Selenium 4.
    driver = webdriver.PhantomJS()
    # To try the Firefox browser with Selenium instead:
    # driver = webdriver.Firefox()

    driver.get("http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200")

    # Click the book preview button
    driver.find_element_by_id("sitbLogoImg").click()
    imageList = set()

    # Wait for the page to finish loading
    time.sleep(5)

    # While the right arrow can be clicked, keep turning pages
    while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
        driver.find_element_by_id("sitbReaderRightPageTurner").click()
        time.sleep(2)
        # Get the newly loaded pages (several pages can load at once,
        # but a set never stores duplicates)
        pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
        for page in pages:
            image = page.get_attribute("src")
            imageList.add(image)

    driver.quit()

    # Process the image URLs we collected with Tesseract
    for image in sorted(imageList):
        # Save the image
        urlretrieve(image, "page.jpg")
        p = subprocess.Popen(["tesseract", "page.jpg", "page"],
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        # Wait for Tesseract to finish before reading its output file
        p.wait()
        with open("page.txt", "r") as f:
            print(f.read())
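The script above relies on two properties of Python collections: a set silently ignores duplicate URLs, and sorted() orders strings lexicographically. The sketch below illustrates both with made-up URLs (the example.com addresses and file names are hypothetical and are not what the site actually serves).

```python
# Hypothetical image URLs; the second batch repeats one page from the first
first_batch = [
    "https://example.com/img/page2.jpg",
    "https://example.com/img/page1.jpg",
]
second_batch = [
    "https://example.com/img/page2.jpg",
    "https://example.com/img/page10.jpg",
]

image_urls = set()
for batch in (first_batch, second_batch):
    for url in batch:
        # Adding a URL that is already in the set has no effect
        image_urls.add(url)

# sorted() compares the URLs as strings, character by character
ordered = sorted(image_urls)
print(len(image_urls))
for url in ordered:
    print(url)
```

With these sample values the set holds three URLs, and page10.jpg sorts before page2.jpg. If the real image URLs embed page numbers without zero padding, the printed text may therefore not follow the book's page order.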

CAPTCHA processing example:

A site-generated CAPTCHA image typically has the following properties:

    • It is an image generated dynamically by a server-side program. The src attribute of a CAPTCHA image may look different from that of a normal image, but the image can be downloaded and processed like any other.
    • The answer to the image is stored in the server-side database.
    • Many CAPTCHAs have a time limit: if you take too long to solve one, it expires and the attempt fails.

CAPTCHA processing method:

1. First, download the CAPTCHA image to the hard disk and clean it up,

2. Then use Tesseract to process the image,

3. Finally, return a recognition result that meets the site's requirements.

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-

    import requests
    import time
    import pytesseract
    from PIL import Image
    from bs4 import BeautifulSoup


    def captcha(data):
        # Save the CAPTCHA image bytes to disk
        with open('captcha.jpg', 'wb') as fp:
            fp.write(data)
        time.sleep(1)
        image = Image.open("captcha.jpg")
        text = pytesseract.image_to_string(image)
        print("The verification code after machine recognition is: " + text)
        command = input("Enter Y to accept it, or press any other key to type the code yourself: ")
        if command == "y" or command == "Y":
            return text
        else:
            return input('Input code: ')


    def zhihuLogin(username, password):
        # Build a Session object that keeps the cookie values
        sessiona = requests.Session()
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}

        # First fetch the page to find the data that must be POSTed
        # (the cookies of the current page are recorded as well)
        html = sessiona.get('https://www.zhihu.com/#signin', headers=headers).content

        # Find the input tag whose name attribute is _xsrf and take its value
        _xsrf = BeautifulSoup(html, 'lxml').find('input', attrs={'name': '_xsrf'}).get('value')

        # Fetch the CAPTCHA; the value after r is a Unix timestamp from time.time().
        # The multiplier is garbled in the source; 1000 (milliseconds) is assumed here.
        captcha_url = 'https://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000)
        response = sessiona.get(captcha_url, headers=headers)

        data = {
            "_xsrf": _xsrf,
            "email": username,
            "password": password,
            "remember_me": True,
            "captcha": captcha(response.content)
        }

        response = sessiona.post('https://www.zhihu.com/login/email', data=data, headers=headers)
        print(response.text)

        response = sessiona.get('https://www.zhihu.com/people/maozhaojun/activities', headers=headers)
        print(response.text)


    if __name__ == "__main__":
        # username = input("username")
        # password = input("password")
        zhihuLogin('[email protected]', 'alaxxxxime')

There are two situations in which this program fails.

In the first, Tesseract's result for the CAPTCHA image is not exactly four characters long. Because every valid answer to the verification code in the training samples must be four characters, the result is not submitted and the program fails.

In the second, the recognized result is four characters long and is submitted to the form, but the server rejects it, so the program still fails.
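The first failure case can be caught before anything is submitted. The sketch below shows one possible check, assuming valid answers consist of exactly four letters or digits; the helper name `clean_captcha_text` is hypothetical, and the alphanumeric restriction is an assumption rather than something the site documents.

```python
def clean_captcha_text(raw_text, expected_length=4):
    """Return the recognized text without whitespace, or None if it is unusable.

    OCR output often contains stray spaces or a trailing newline, so all
    whitespace is removed before the length is checked.
    """
    text = "".join(raw_text.split())
    # Assumption: a valid answer has exactly `expected_length` letters or digits
    if len(text) != expected_length or not text.isalnum():
        return None
    return text


print(clean_captcha_text("a B3d\n"))  # whitespace is stripped, leaving 4 characters
print(clean_captcha_text("ab3"))      # too short, so the result is None
```

A caller could request a fresh CAPTCHA or fall back to manual input whenever the helper returns None, instead of submitting an answer that is certain to be rejected.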

In practice, the first failure occurs about 50% of the time: the program does not submit the form, ends immediately, and reports a verification code recognition error.

The second failure occurs about 20% of the time, which leaves a probability of roughly 30% that every character is correct (the recognition accuracy for each character is approximately 80%, so if five characters are recognized, the overall probability of a correct result is 0.8^5 ≈ 32.8%).
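The 32.8% figure can be checked with a short calculation. This sketch assumes each character is recognized independently with the same accuracy, which is a simplification; the function name `overall_accuracy` is illustrative.

```python
def overall_accuracy(per_character_accuracy, length):
    """Probability that every character in a string of `length` is recognized correctly."""
    return per_character_accuracy ** length


# 80% accuracy per character over five characters
print(round(overall_accuracy(0.8, 5) * 100, 1))  # prints 32.8
```

Because the per-character accuracy is raised to the power of the string length, even a small improvement in single-character recognition noticeably raises the overall success rate.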

Training Tesseract

The popular PHP content management system Drupal has a well-known CAPTCHA module (https://www.drupal.org/project/captcha), which generates verification codes of varying difficulty.

To train Tesseract to recognize a particular kind of text, you need to give it samples of each character in different forms.

Tesseract documentation: https://github.com/tesseract-ocr/tesseract/wiki
