Python Crawler Development [Part 1] [Machine Vision and Tesseract]

Last updated: 2018-08-12. Source: Internet


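OCR Library Overview

Python has long been an excellent language for reading and processing images, for image-related machine learning, and for generating images. Many libraries handle image processing, but this article focuses on just one: Tesseract.

1. Tesseract

Tesseract is an OCR library currently sponsored by Google (a company well known for its OCR and machine learning technology). It is widely regarded as one of the best and most accurate open-source OCR systems. Besides its accuracy, Tesseract is also very flexible: it can be trained to recognize additional fonts, and its trained data can cover any Unicode character.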

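2. Installing Tesseract on Windows

Download the installer from https://code.google.com/p/tesseract-ocr/downloads/list and run it.

To use Tesseract, first set a new environment variable, TESSDATA_PREFIX, so that Tesseract knows where its trained data files are stored. Then obtain a copy of the tessdata data files and place it in the Tesseract directory.

On Windows, set the environment variable with the following command. The path must be quoted because it contains spaces, and the new value only takes effect in command prompts opened afterwards:

    setx TESSDATA_PREFIX "C:\Program Files\Tesseract OCR\Tesseract"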
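
A quick way to confirm the setup from Python is sketched below. It only checks that a tesseract executable is reachable through PATH and that TESSDATA_PREFIX names an existing directory; the function name check_tesseract_setup and its injectable env and which parameters are illustrative assumptions, not part of Tesseract or pytesseract.

```python
import os
import shutil


def check_tesseract_setup(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means the
    basic setup looks fine. `env` and `which` are injectable for testing."""
    env = os.environ if env is None else env
    problems = []

    # The tesseract executable must be reachable through PATH.
    if which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")

    # TESSDATA_PREFIX should be set and point at an existing directory.
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(prefix):
        problems.append("TESSDATA_PREFIX does not point to a directory: " + prefix)

    return problems


if __name__ == "__main__":
    for problem in check_tesseract_setup():
        print("WARNING:", problem)
```

Running the script prints one warning per problem found and nothing when both checks pass.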

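3. Installing pytesseract

Tesseract is a command-line tool, not a library that Python loads with an import statement. After installation, the tesseract command runs outside Python. A Python wrapper for it can be installed with pip:

    pip install pytesseract

The following command runs Tesseract on an image file and writes the result to a text file (here, text.txt):

    tesseract test.jpg text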
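
The same command can be driven from Python with the standard library. The sketch below is one possible wrapper, not the way pytesseract works internally: the helper names tesseract_command and run_tesseract are hypothetical, and the code relies only on the CLI form `tesseract imagename outputbase [-l lang]` and on Tesseract appending ".txt" to the output base.

```python
import shutil
import subprocess


def tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for the tesseract CLI and the path of the
    text file it will write (tesseract appends ".txt" to the output base)."""
    command = ["tesseract", image_path, output_base]
    if lang is not None:
        # -l selects the language data to use, for example "eng"
        command += ["-l", lang]
    return command, output_base + ".txt"


def run_tesseract(image_path, output_base, lang=None):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    command, output_path = tesseract_command(image_path, output_base, lang)
    # check=True raises CalledProcessError if tesseract exits with an error
    subprocess.run(command, check=True)
    with open(output_path, "r", encoding="utf-8") as f:
        return f.read()
```

With Tesseract installed, `print(run_tesseract("test.jpg", "text"))` would mirror the command line shown above.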

Python code

    import pytesseract
    from PIL import Image

    image = Image.open('test.jpg')
    text = pytesseract.image_to_string(image)
    print(text)

Output:

    This is some text, written in Arial, that will be read by
    Tesseract. Here are some symbols: [email protected]#$%"&*()

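Thresholding and denoising images

When an image is hard to recognize, a Python script can clean it up first. With the Pillow library you can build a threshold filter that removes a gradient background and keeps only the text, which makes the image clearer and easier for Tesseract to read: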

    import subprocess

    from PIL import Image


    def cleanFile(filePath, newFilePath):
        image = Image.open(filePath)

        # Apply a threshold filter to the image, then save it
        image = image.point(lambda x: 0 if x < 143 else 255)
        image.save(newFilePath)

        # Call the system tesseract command to run OCR on the cleaned image
        subprocess.call(["tesseract", newFilePath, "output"])

        # Open the output file and print the result
        with open("output.txt", "r") as file:
            print(file.read())


    cleanFile("text2.jpg", "text2clean.png")
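
To see what the threshold filter does to individual values, here is a small sketch in plain Python that applies the same rule to a made-up row of pixel values. The sample values are hypothetical, and 143 is simply the cut-off used in the example; other images may need a different one.

```python
THRESHOLD = 143  # same cut-off as the cleanFile example; tune per image


def threshold(value, cutoff=THRESHOLD):
    """Map one 0-255 channel value to pure black (0) or pure white (255)."""
    return 0 if value < cutoff else 255


# Hypothetical row of pixel values: dark text on a light gradient background
row = [12, 90, 142, 143, 180, 230]
print([threshold(v) for v in row])  # prints [0, 0, 0, 255, 255, 255]
```

Everything darker than the cut-off becomes black and everything else becomes white, which is how the gradient background disappears while the text strokes remain.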

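Extracting text from images on a website

Reading text from images on your hard drive with Tesseract may not seem remarkable, but combined with a web crawler it becomes a powerful tool.

Steps for extracting text from images on a website:

1. Open the book preview reader,

2. Collect the image URLs,

3. Download the images,

4. Run OCR on the images,

5. Finally, print the text of each image.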

    import subprocess
    import time
    from urllib.request import urlretrieve

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Create a new Selenium driver. The original example used
    # webdriver.PhantomJS(), which Selenium 4 no longer supports.
    driver = webdriver.Firefox()
    driver.get("http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200")

    # Click the book preview button
    # (the element IDs reflect the page as it was when this was written)
    driver.find_element(By.ID, "sitbLogoImg").click()
    imageList = set()

    # Wait for the page to finish loading
    time.sleep(5)

    # Keep turning pages while the right arrow is clickable
    while "pointer" in driver.find_element(By.ID, "sitbReaderRightPageTurner").get_attribute("style"):
        driver.find_element(By.ID, "sitbReaderRightPageTurner").click()
        time.sleep(2)
        # Collect the newly loaded pages (several may load at once;
        # the set drops duplicates)
        pages = driver.find_elements(By.XPATH, "//div[@class='pageImage']/div/img")
        for page in pages:
            image = page.get_attribute("src")
            imageList.add(image)

    driver.quit()

    # Run Tesseract on the image URLs we collected
    for image in sorted(imageList):
        # Save the image
        urlretrieve(image, "page.jpg")
        p = subprocess.Popen(["tesseract", "page.jpg", "page"],
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        # Wait for Tesseract to finish before reading its output file
        p.communicate()
        with open("page.txt", "r") as f:
            print(f.read())
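
The loop above reuses page.jpg and page.txt for every image, so only the last page's files remain on disk. If the intermediate files are worth keeping, one option is to give each URL its own file names. The sketch below shows only that naming step; plan_downloads, the "page" prefix, and the example.com URLs are hypothetical and stand in for the src values collected by the crawler.

```python
def plan_downloads(image_urls, prefix="page"):
    """Return (url, image_file, output_base) tuples, one per unique URL,
    in sorted order, so that each page gets its own files."""
    plan = []
    for index, url in enumerate(sorted(set(image_urls)), start=1):
        base = f"{prefix}{index:03d}"
        plan.append((url, base + ".jpg", base))
    return plan


# Hypothetical URLs standing in for the collected src attributes
urls = [
    "http://example.com/img/b.jpg",
    "http://example.com/img/a.jpg",
    "http://example.com/img/b.jpg",  # duplicate, dropped by the set
]
for url, image_file, output_base in plan_downloads(urls):
    print(url, image_file, output_base + ".txt")
```

Each tuple can then be fed to urlretrieve(url, image_file) and to the tesseract command with output_base as its output name.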

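Case study: handling Zhihu CAPTCHAs

CAPTCHA images generated by websites usually have the following properties:

They are images generated dynamically by a server-side program. The src attribute of a CAPTCHA image may look different from that of an ordinary image, for example <img src="WebForm.aspx?id=8AP85CQKE9TJ">, but the image can be downloaded and processed like any other.
The answer to the image is stored in a server-side database.
Many CAPTCHAs have a time limit and expire if they are not solved in time.

How to handle a CAPTCHA:

1. First download the CAPTCHA image to disk and clean it up,

2. Then process the image with Tesseract,

3. Finally, return the recognized result in the form the website expects.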

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-

    import time

    import pytesseract
    import requests
    from bs4 import BeautifulSoup
    from PIL import Image


    def captcha(data):
        # Save the CAPTCHA image to disk so that Pillow can open it
        with open('captcha.jpg', 'wb') as fp:
            fp.write(data)
        time.sleep(1)
        image = Image.open("captcha.jpg")
        text = pytesseract.image_to_string(image)
        print("CAPTCHA recognized by OCR: " + text)
        command = input("Enter Y to accept it, or anything else to type the CAPTCHA yourself: ")
        if command == "Y" or command == "y":
            return text
        else:
            return input('Enter the CAPTCHA: ')


    def zhihuLogin(username, password):
        # Build a session object that stores cookie values
        sessiona = requests.Session()
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}

        # Fetch the page first to find the data that must be POSTed
        # (this also records the cookies of the current page)
        html = sessiona.get('https://www.zhihu.com/#signin', headers=headers).content

        # Find the input tag whose name attribute is _xsrf and take its value
        _xsrf = BeautifulSoup(html, 'lxml').find('input', attrs={'name': '_xsrf'}).get('value')

        # Fetch the CAPTCHA; r is the Unix timestamp in milliseconds, from time.time()
        captcha_url = 'https://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000)
        response = sessiona.get(captcha_url, headers=headers)

        data = {
            "_xsrf": _xsrf,
            "email": username,
            "password": password,
            "remember_me": True,
            "captcha": captcha(response.content)
        }

        response = sessiona.post('https://www.zhihu.com/login/email', data=data, headers=headers)
        print(response.text)

        response = sessiona.get('https://www.zhihu.com/people/maozhaojun/activities', headers=headers)
        print(response.text)


    if __name__ == "__main__":
        # username = input("username")
        # password = input("password")
        zhihuLogin('[email protected]', 'ALAxxxxIME')

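Two failure cases can stop a program like this from succeeding.

The first is when the result Tesseract reads from the CAPTCHA image is not exactly four characters long (every valid answer in the training samples has four characters). The result is not submitted and the program fails.

The second is when the result has four characters and is submitted to the form, but the server rejects it. The program still fails.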
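
The length check behind the first failure case can be sketched as a small filter applied to the OCR output before anything is submitted. The name clean_captcha_text is hypothetical, and the rules that whitespace is discarded and that answers are purely alphanumeric are assumptions for illustration; adjust them to the CAPTCHA at hand.

```python
def clean_captcha_text(raw_text, expected_length=4):
    """Remove all whitespace from OCR output and return the result only if
    it has the expected length and is purely alphanumeric; else None."""
    text = "".join(raw_text.split())
    if len(text) != expected_length or not text.isalnum():
        return None
    return text


for raw in ["a3Kd\n", " x9 pQ ", "abc", "ab#d"]:
    print(repr(raw), "->", clean_captcha_text(raw))
```

A None result corresponds to the first failure case: the program can stop, or fetch a fresh CAPTCHA, without sending a guess that is certain to be rejected.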

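In practice:

The first case occurs about 50% of the time. When it does, nothing is submitted to the form; the program exits immediately and reports a CAPTCHA recognition error.

The second case occurs about 20% of the time, and the answer is fully correct about 30% of the time (per-character accuracy is roughly 80%; when five characters must all be recognized, the overall probability of a correct answer is 0.8^5 ≈ 32.8%).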
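
The arithmetic behind the last figure is easy to check. The sketch below assumes, as the estimate does, that each character is recognized independently with probability 0.8; the function name is illustrative.

```python
def all_correct_probability(per_char_accuracy, length):
    """Probability that all `length` characters are recognized correctly,
    assuming independent errors (an assumption, not a measured property)."""
    return per_char_accuracy ** length


for length in (4, 5):
    p = all_correct_probability(0.8, length)
    print(f"{length} characters: {p:.1%}")
# prints:
# 4 characters: 41.0%
# 5 characters: 32.8%
```

Under the same assumption, a four-character answer would come out fully correct about 41% of the time, so the roughly 30% figure is closer to the five-character case.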

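Training Tesseract

The popular PHP content management system Drupal has a well-known CAPTCHA module (https://www.drupal.org/project/captcha) that can generate CAPTCHAs of varying difficulty.

To train Tesseract to recognize a given kind of text, you need to provide it with samples of each character in different forms.

Tesseract documentation: https://github.com/tesseract-ocr/tesseract/wiki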


This article was originally written in Chinese and published on aliyun.com.
