Recognizing verification codes in Python with the Tesseract library (pytesseract)

Source: Internet
Author: User


I. Introduction to Tesseract

Tesseract is an OCR library (OCR is short for Optical Character Recognition). It analyzes and processes scanned documents and image files to extract their text and layout information. Tesseract is widely regarded as one of the most accurate open-source OCR libraries.

II. Use of Tesseract

1. Download and install Tesseract.

2. Set environment variables in Windows:

# Configure the environment variable (adjust the path to match your installation directory)
set TESSDATA_PREFIX=F:\Tesseract-OCR\
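TESSDATA_PREFIX can also be set for the current process from Python, which may be convenient when the system settings cannot be changed. The following is a minimal sketch, not part of the original article: the helper name `ensure_tessdata_prefix` is hypothetical, and the path is the example path used in this article.

```python
import os


def ensure_tessdata_prefix(path, environ=os.environ):
    """Set TESSDATA_PREFIX to `path` unless it is already defined.

    `environ` is injectable so the helper can be tried out without
    touching the real process environment.
    """
    # setdefault keeps a value that has already been configured.
    return environ.setdefault("TESSDATA_PREFIX", path)


if __name__ == "__main__":
    # Example path from this article; adjust it to your installation.
    print(ensure_tessdata_prefix("F:\\Tesseract-OCR\\"))
```

Call the helper before the first OCR call so that the Tesseract process started by pytesseract inherits the variable.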

3. Install the pytesseract module:

pip install pytesseract

4. Point pytesseract at the tesseract.exe executable in your Python script:

pytesseract.pytesseract.tesseract_cmd = r'F:\Tesseract-OCR\tesseract.exe'
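A hard-coded path breaks as soon as the script runs on a machine where Tesseract is installed elsewhere. As a hedged sketch (the helper `find_tesseract` is hypothetical and not part of pytesseract), the path can be resolved from a list of candidates, falling back to a lookup on PATH:

```python
import os
import shutil


def find_tesseract(candidates=()):
    """Return the first usable tesseract executable path, or None."""
    # Explicit candidate paths take priority over the PATH lookup.
    for path in candidates:
        if os.path.isfile(path):
            return path
    # shutil.which returns None when no "tesseract" is found on PATH.
    return shutil.which("tesseract")


# Usage (assumes pytesseract is installed):
#   import pytesseract
#   cmd = find_tesseract([r"F:\Tesseract-OCR\tesseract.exe"])
#   if cmd is not None:
#       pytesseract.pytesseract.tesseract_cmd = cmd
```

If the function returns None, Tesseract is probably not installed or not on PATH, and the OCR call would fail anyway.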

5. Example

Recognize the text in an image:

import pytesseract
from PIL import Image

# 1. Specify the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'F:\Tesseract-OCR\tesseract.exe'

# 2. Open the image with the open() function of the Image module
image = Image.open('6.jpg', mode='r')
print(image)

# 3. Recognize the text in the image
code = pytesseract.image_to_string(image)
print(code)

Output:

<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=611x210 at 0x1A5DFDCB4A8>
Google
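The string returned by image_to_string may carry trailing newlines or other stray characters, which matters when the result is submitted as a verification code. The sketch below shows one possible way to normalize it. The helper name `clean_ocr_text` is hypothetical, and it assumes the code consists only of ASCII letters and digits.

```python
import re


def clean_ocr_text(raw):
    """Keep only ASCII letters and digits from raw OCR output."""
    # Assumption: the verification code is purely alphanumeric, so
    # whitespace, control characters and punctuation are OCR noise.
    return re.sub(r"[^0-9A-Za-z]", "", raw)


if __name__ == "__main__":
    print(clean_ocr_text("Google\n"))
```

If the codes you deal with can contain other characters, widen the character class accordingly.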

Note: The Tesseract OCR engine cannot recognize every verification code. For example, it fails to read the verification codes generated by Douban, so if you need to crawl data from Douban, you have to enter the verification code manually.

III. Source code for simulating a Zhihu login

import time

import pytesseract
import requests
from bs4 import BeautifulSoup
from PIL import Image


def captcha(data):
    # Save the verification code image, then run OCR on it
    with open('captcha.jpg', 'wb') as fp:
        fp.write(data)
    time.sleep(1)
    image = Image.open('captcha.jpg')
    text = pytesseract.image_to_string(image)
    print('The verification code after machine identification is: ' + text)
    command = input('Enter Y to agree, and press another key to re-enter: ')
    if command == 'Y' or command == 'y':
        return text
    else:
        return input('Enter verification code: ')


def zhihuLogin(username, password):
    # Construct a session object that saves the Cookie value
    sessiona = requests.session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}

    # First obtain the page information and find the data to POST
    # (the Cookie of the current page is recorded by the session)
    html = sessiona.get('https://www.zhihu.com/?signin', headers=headers).content

    # Locate the input tag whose name attribute is _xsrf and retrieve its value
    _xsrf = BeautifulSoup(html, 'lxml').find('input', attrs={'name': '_xsrf'}).get('value')

    # Obtain the verification code; the value after r is a Unix timestamp from time.time()
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000)
    response = sessiona.get(captcha_url, headers=headers)

    data = {
        '_xsrf': _xsrf,
        'email': username,
        'password': password,
        'remember_me': True,
        'captcha': captcha(response.content)
    }

    response = sessiona.post('https://www.zhihu.com/login/email', data=data, headers=headers)
    print(response.text)

    response = sessiona.get('https://www.zhihu.com/people/maozhaojun/activities', headers=headers)
    print(response.text)


if __name__ == '__main__':
    # username = input('username')
    # password = input('password')
    zhihuLogin('xxxx@qq.com', 'axxxxime')
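Two steps of the login flow can be isolated and tried without any network access: building the verification code URL with a millisecond timestamp, and letting the user accept or retype the machine-recognized code. The following is a minimal sketch under stated assumptions. The names `build_captcha_url` and `confirm_captcha` are hypothetical helpers, and the `now` and `ask` parameters exist only so the functions can be exercised without a clock or a keyboard.

```python
import time

# URL template taken from the login code in this article.
CAPTCHA_URL = "https://www.zhihu.com/captcha.gif?r=%d&type=login"


def build_captcha_url(now=None):
    """Return the verification code URL with a millisecond timestamp in r."""
    if now is None:
        now = time.time()
    # time.time() returns seconds; multiply by 1000 for milliseconds.
    return CAPTCHA_URL % int(now * 1000)


def confirm_captcha(guess, ask=input):
    """Return `guess` if the user accepts it, otherwise a typed-in code."""
    answer = ask("Enter Y to accept %r, or anything else to retype: " % guess)
    if answer.strip().lower() == "y":
        return guess
    return ask("Enter verification code: ")


if __name__ == "__main__":
    print(build_captcha_url(1.5))
```

Passing `ask` as a parameter keeps the prompt logic separate from the OCR and HTTP code, so the manual fallback can be checked on its own.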

That is all for this article. I hope it is helpful for your learning.
