Recognizing CAPTCHAs in Python with the Tesseract library (pytesseract)
I. Introduction to Tesseract
Tesseract is an OCR library (OCR stands for Optical Character Recognition). It is used to process scanned documents and image files and to extract their text and layout information. Tesseract is widely regarded as one of the most accurate OCR libraries available.
II. Using Tesseract
1. Download and install Tesseract.
2. Set the TESSDATA_PREFIX environment variable on Windows (adjust the path to match where Tesseract was installed):
set TESSDATA_PREFIX=F:\Tesseract-OCR\
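The same variable can also be set from inside a Python script before any OCR call, because pytesseract runs Tesseract as a child process that inherits the script's environment. The following is a minimal sketch, not part of the original setup: the helper name configure_tessdata_prefix is hypothetical, and the path is the example install location used in this article.

```python
import os

# Example install location used in this article; adjust to your machine.
TESSERACT_DIR = r"F:\Tesseract-OCR"


def configure_tessdata_prefix(install_dir, env=None):
    """Set TESSDATA_PREFIX in an environment mapping and return the value.

    Hypothetical helper: by default it writes to os.environ, so child
    processes started afterwards see the variable.
    """
    env = os.environ if env is None else env
    # Normalize to exactly one trailing backslash, as in the step above.
    prefix = install_dir.rstrip("\\/") + "\\"
    env["TESSDATA_PREFIX"] = prefix
    return prefix
```

Calling configure_tessdata_prefix(TESSERACT_DIR) once at the top of a script affects only that process, so it does not change the system-wide settings.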
3. Install the pytesseract module:
pip install pytesseract
4. Point pytesseract at the tesseract.exe executable in the Python script:
pytesseract.pytesseract.tesseract_cmd = r'F:\Tesseract-OCR\tesseract.exe'
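A hard-coded path fails on machines where Tesseract is installed elsewhere. As a hedged sketch, the executable could be looked up on PATH first, falling back to the path above only if nothing is found. The function name find_tesseract is an assumption made for illustration; only standard-library calls are used.

```python
import os
import shutil


def find_tesseract(fallback=r"F:\Tesseract-OCR\tesseract.exe"):
    """Return a path to the Tesseract executable, or None if not found.

    Hypothetical helper: checks PATH first, then the fallback location
    (the example path used in this article).
    """
    found = shutil.which("tesseract")
    if found:
        return found
    if os.path.isfile(fallback):
        return fallback
    return None
```

If the result is not None, it can be assigned to pytesseract.pytesseract.tesseract_cmd exactly as in the line above.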
5. Example
Recognize the text in an image:
import pytesseract
from PIL import Image

# 1. Point pytesseract at the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'F:\Tesseract-OCR\tesseract.exe'

# 2. Open the image with the open() function of the Image module
image = Image.open('6.jpg', mode='r')
print(image)

# 3. Recognize the text in the image
code = pytesseract.image_to_string(image)
print(code)
Output:
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=611x210 at 0x1A5DFDCB4A8>
Google
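The string returned by image_to_string may carry trailing newlines or a form-feed character, which matters when the text is later submitted as a verification code. Below is a minimal sketch of a cleanup step; clean_ocr_text is a hypothetical helper, not part of pytesseract.

```python
def clean_ocr_text(raw):
    """Collapse all whitespace runs (spaces, newlines, form feeds) to single spaces.

    Hypothetical helper for tidying OCR output before using it.
    """
    # str.split() with no arguments splits on any whitespace and drops
    # leading/trailing whitespace, so joining yields a tidy single line.
    return " ".join(raw.split())
```

It would be applied to the OCR result, for example clean_ocr_text(code) in the script above.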
Note: The Tesseract OCR engine cannot reliably recognize CAPTCHAs. For example, it cannot read the verification codes generated by Douban, so if you need to crawl data from Douban, you have to enter the verification code manually.
III. Source code for a simulated Zhihu login
import time

import pytesseract
import requests
from bs4 import BeautifulSoup
from PIL import Image


def captcha(data):
    # Save the CAPTCHA image to disk so it can be opened with PIL
    with open('captcha.jpg', 'wb') as fp:
        fp.write(data)
    time.sleep(1)
    image = Image.open("captcha.jpg")
    text = pytesseract.image_to_string(image)
    print("The verification code after machine recognition is: " + text)
    command = input("Enter Y to accept it, or press any other key to enter it manually: ")
    if command == "Y" or command == "y":
        return text
    else:
        return input('Enter verification code: ')


def zhihuLogin(username, password):
    # Build a session object that saves cookie values
    sessiona = requests.Session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}

    # First fetch the page to find the data to POST (the session also records the cookies of the page)
    html = sessiona.get('https://www.zhihu.com/?signin', headers=headers).content

    # Find the input tag whose name attribute is _xsrf and read its value
    _xsrf = BeautifulSoup(html, 'lxml').find('input', attrs={'name': '_xsrf'}).get('value')

    # Fetch the CAPTCHA image. The r parameter is a Unix timestamp in milliseconds, built from time.time()
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000)
    response = sessiona.get(captcha_url, headers=headers)

    data = {
        "_xsrf": _xsrf,
        "email": username,
        "password": password,
        "remember_me": True,
        "captcha": captcha(response.content),
    }

    # Submit the login form, then request a profile page to check the session
    response = sessiona.post('https://www.zhihu.com/login/email', data=data, headers=headers)
    print(response.text)

    response = sessiona.get('https://www.zhihu.com/people/maozhaojun/activities', headers=headers)
    print(response.text)


if __name__ == "__main__":
    # username = input("username")
    # password = input("password")
    zhihuLogin('xxxx@qq.com', 'axxxxime')
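Two small pieces of the login script can be tried without any network access: building the CAPTCHA URL from a millisecond timestamp, and choosing between the machine-recognized code and a manually typed one. The sketch below isolates them. The function names build_captcha_url and confirm_captcha and the ask parameter are assumptions introduced for illustration; the URL pattern is the one used in the script above.

```python
import time


def build_captcha_url(now=None):
    """Build the CAPTCHA URL; r is a Unix timestamp in milliseconds."""
    now = time.time() if now is None else now
    return 'https://www.zhihu.com/captcha.gif?r=%d&type=login' % (now * 1000)


def confirm_captcha(ocr_text, ask=input):
    """Return the OCR result if the user accepts it, else a typed code.

    `ask` is a hypothetical hook that defaults to input(), so the
    decision logic can be exercised without a terminal.
    """
    answer = ask("Enter Y to accept '%s', or any other key to type it: " % ocr_text)
    if answer in ("Y", "y"):
        return ocr_text
    return ask("Enter verification code: ")
```

Passing a stand-in function for ask, such as a lambda that returns a fixed reply, makes it possible to check both branches before running the script against the real site.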
That is all for this article. I hope it helps with your learning.