Using the Tesseract library in Python for text recognition and CAPTCHA verification

Last Update: 2018-03-22 Source: Internet

Author: User


I. Introduction to Tesseract

Tesseract is an OCR library (OCR is short for Optical Character Recognition). It scans text documents, analyzes and processes the image files, and extracts the text and layout information. Tesseract is generally regarded as one of the best OCR libraries, with relatively accurate recognition.

II. Use of Tesseract

1. Download and install Tesseract.

2. Set environment variables in Windows:

Set TESSDATA_PREFIX to the Tesseract installation directory (adjust the path to match where you installed it):

set TESSDATA_PREFIX=F:\Tesseract-OCR\
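The same variable can also be set from inside a Python script, which may be convenient when the system settings cannot be changed. The following is a minimal sketch, not part of the original article: the helper name `set_tessdata_prefix` is hypothetical, and which directory `TESSDATA_PREFIX` should point to depends on the installed Tesseract version, so treat the path as an assumption.

```python
import os


def set_tessdata_prefix(install_dir):
    """Set TESSDATA_PREFIX for this process and any child process it starts.

    `install_dir` is an assumption: pass the directory that your Tesseract
    version expects TESSDATA_PREFIX to point to.
    """
    # Normalise the path and keep a trailing separator, as in the example above
    prefix = os.path.join(os.path.normpath(install_dir), "")
    os.environ["TESSDATA_PREFIX"] = prefix
    return prefix


# Example call using the installation path from this article
print(set_tessdata_prefix(r"F:\Tesseract-OCR"))
```

The change only lasts for the lifetime of the running Python process, so it has to be made before the first recognition call.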

3. Install the pytesseract Module

pip install pytesseract

4. Point pytesseract at the tesseract.exe application in the Python script:

pytesseract.pytesseract.tesseract_cmd = r'F:\Tesseract-OCR\tesseract.exe'
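A hard-coded path stops working as soon as the script runs on a machine where Tesseract is installed elsewhere. The sketch below shows one possible way to look for the executable first. The helper name `find_tesseract` is hypothetical, and the candidate path is simply the one used in this article.

```python
import shutil
from pathlib import Path


def find_tesseract(candidates=()):
    """Return the path of a tesseract executable, or None if none is found.

    The PATH is searched first; `candidates` are extra locations to try.
    """
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


# The candidate below is the install path used in this article; adjust as needed.
tesseract_path = find_tesseract([r"F:\Tesseract-OCR\tesseract.exe"])
if tesseract_path is None:
    print("tesseract executable not found")
else:
    print("using", tesseract_path)
```

When a path is found, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` in place of the literal string.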

5. Case Study

Recognize text in an image:

import pytesseract
from PIL import Image

# 1. Point pytesseract at the Tesseract program
pytesseract.pytesseract.tesseract_cmd = r'F:\Tesseract-OCR\tesseract.exe'

# 2. Open the image with the open() function of the Image module
image = Image.open('6.jpg', mode='r')
print(image)

# 3. Recognize the text in the image
code = pytesseract.image_to_string(image)
print(code)

Result demonstration:

<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=611x210 at 0x1A5DFDCB4A8>
Google

Note: The Tesseract-OCR engine cannot recognize every verification code (CAPTCHA). For example, it cannot read the verification codes generated by Douban, so if you need to crawl data from Douban, you have to enter the verification code manually.
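For simple verification codes, cleaning the image before recognition sometimes helps, although it offers no guarantee for codes such as Douban's. The following sketch is an illustration, not part of the original article. It uses Pillow to convert an image to grayscale and then to pure black and white. The helper name `binarize` and the threshold of 128 are assumptions to be tuned per site.

```python
from PIL import Image


def binarize(image, threshold=128):
    """Return a grayscale copy in which every pixel is either 0 or 255.

    `threshold` is an assumed cut-off: pixels brighter than it become white,
    all others become black.
    """
    gray = image.convert("L")
    return gray.point(lambda p: 255 if p > threshold else 0)


# Tiny demonstration on a generated two-pixel image (dark pixel, light pixel)
demo = Image.new("RGB", (2, 1))
demo.putpixel((0, 0), (10, 10, 10))
demo.putpixel((1, 0), (240, 240, 240))
print(list(binarize(demo).getdata()))
```

The cleaned image can then be passed to `pytesseract.image_to_string` in place of the original one.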

III. Source code for simulating a Zhihu login

import time

import pytesseract
import requests
from bs4 import BeautifulSoup
from PIL import Image


def captcha(data):
    # Save the verification code image to disk so that PIL can open it
    with open('captcha.jpg', 'wb') as fp:
        fp.write(data)
    time.sleep(1)
    image = Image.open("captcha.jpg")
    text = pytesseract.image_to_string(image)
    print("The verification code after machine recognition is: " + text)
    command = input("Enter Y to accept it, or press any other key to type it yourself: ")
    if command == "Y" or command == "y":
        return text
    else:
        return input('Enter verification code: ')


def zhihuLogin(username, password):
    # Create a session object that keeps the cookie values
    sessiona = requests.session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}

    # First fetch the page and find the data to POST (the session records the cookies of this page)
    html = sessiona.get('https://www.zhihu.com/?signin', headers=headers).content

    # Locate the input tag whose name attribute is _xsrf and read its value
    _xsrf = BeautifulSoup(html, 'lxml').find('input', attrs={'name': '_xsrf'}).get('value')

    # Fetch the verification code; the value after r is a Unix timestamp in milliseconds, from time.time()
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000)
    response = sessiona.get(captcha_url, headers=headers)

    data = {
        "_xsrf": _xsrf,
        "email": username,
        "password": password,
        "remember_me": True,
        "captcha": captcha(response.content),
    }

    response = sessiona.post('https://www.zhihu.com/login/email', data=data, headers=headers)
    print(response.text)

    response = sessiona.get('https://www.zhihu.com/people/maozhaojun/activities', headers=headers)
    print(response.text)


if __name__ == "__main__":
    # username = input("username")
    # password = input("password")
    zhihuLogin('xxxx@qq.com', 'axxxxime')
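The text returned by the OCR step often carries spaces or line breaks around the recognized characters, and these would be sent along in the "captcha" field. The following sketch shows one way to clean the string first. The helper name `clean_captcha_text` is hypothetical, and it assumes the verification code consists only of ASCII letters and digits, which may not hold for every site.

```python
import string

# Assumption: the verification code contains only ASCII letters and digits
ALLOWED = set(string.ascii_letters + string.digits)


def clean_captcha_text(raw):
    """Drop whitespace and any other character outside ALLOWED."""
    return "".join(ch for ch in raw if ch in ALLOWED)


print(clean_captcha_text(" aB3 d\n\x0c"))  # prints: aB3d
```

In the login code, such a helper could be applied to the recognized text before it is shown to the user for confirmation.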

That is all for this article. I hope it is helpful for your learning.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
