Calling pytesseract from Python to recognize a website's verification code

Source: Internet
Author: User
The following example shows how to call pytesseract from Python to recognize a website's verification code. I think it is quite good, so I am sharing it here as a reference. Let's start with an introduction to pytesseract.

1. pytesseract description

The latest pytesseract version is 0.1.6. URL: https://pypi.python.org/pypi/pytesseract

Python-tesseract is a wrapper for google's Tesseract-OCR
(http://code.google.com/p/tesseract-ocr/). It is also useful as a
stand-alone invocation script to tesseract, as it can read all image types
supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff,
and others, whereas tesseract-ocr by default only supports tiff and bmp.
Additionally, if used as a script, Python-tesseract will print the recognized
text instead of writing it to a file. Support for confidence estimates and
bounding box data is planned for future releases.

In other words:

A. Python-tesseract is a standalone wrapper around Google's Tesseract-OCR;

B. Python-tesseract recognizes the text in an image file and returns the recognition result as the function's return value;

C. By default, tesseract-ocr itself only supports tiff and bmp images; through PIL, Python-tesseract can also read other image formats such as jpeg, gif, and png.

2. install pytesseract

INSTALLATION:

Prerequisites:
* Python-tesseract requires python 2.5 or later or python 3.
* You will need the Python Imaging Library (PIL). Under Debian/Ubuntu, this is
  the package "python-imaging" or "python3-imaging" for python3.
* Install google tesseract-ocr from http://code.google.com/p/tesseract-ocr/ .
  You must be able to invoke the tesseract command as "tesseract". If this
  isn't the case, for example because tesseract isn't in your PATH, you will
  have to change the "tesseract_cmd" variable at the top of 'tesseract.py'.
  Under Debian/Ubuntu you can use the package "tesseract-ocr".

Installing via pip:

See the [pytesseract package page](https://pypi.python.org/pypi/pytesseract)

$> sudo pip install pytesseract

In summary:

A. Python-tesseract supports python 2.5 and later versions, as well as python 3;

B. Python-tesseract requires PIL (Python Imaging Library) to be installed in order to support more image formats;

C. Python-tesseract requires tesseract-ocr to be installed.
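Before installing anything else, it can help to confirm that the tesseract command really is reachable, since the prerequisites above depend on it. The following is a minimal sketch, assuming Python 3.3 or later (for shutil.which); the helper name find_tesseract is hypothetical and not part of pytesseract.

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which(command)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        # Matches the README advice: fix PATH or change the tesseract_cmd variable.
        print("tesseract not found on PATH; install tesseract-ocr or set tesseract_cmd")
    else:
        print("tesseract found at", path)
```

If the function returns None, either add the Tesseract installation directory to PATH or point the tesseract_cmd variable at the executable, as the prerequisites describe.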

To sum up, pytesseract works as follows:

1. As mentioned in the previous blog post, running the command line tesseract.exe 1.png output -l eng recognizes the characters in 1.png and writes the recognition result to output.txt;

2. pytesseract wraps the preceding procedure: it automatically calls tesseract.exe and reads the content of output.txt as the return value of the function.
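The two steps above can be sketched with the standard library alone. This is an illustration of the principle under stated assumptions, not pytesseract's actual code: the names build_command and ocr_file are hypothetical, the code is written for Python 3, and it assumes a tesseract executable on PATH that writes its result to the given output base plus ".txt".

```python
import os
import shutil
import subprocess
import tempfile


def build_command(image_path, output_base, lang=None, tesseract_cmd="tesseract"):
    """Build the argument list for: tesseract <image> <output_base> [-l <lang>]."""
    command = [tesseract_cmd, image_path, output_base]
    if lang is not None:
        command += ["-l", lang]
    return command


def ocr_file(image_path, lang=None):
    """Run tesseract on image_path and return the text it wrote to <output_base>.txt."""
    tmp_dir = tempfile.mkdtemp(prefix="tess_")
    output_base = os.path.join(tmp_dir, "output")
    try:
        proc = subprocess.Popen(build_command(image_path, output_base, lang),
                                stderr=subprocess.PIPE)
        _, stderr = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(stderr.decode("utf-8", "replace"))
        # Step 2: read the output file back and use it as the return value.
        with open(output_base + ".txt") as f:
            return f.read().strip()
    finally:
        # Remove the temporary directory whether or not recognition succeeded.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

With Tesseract installed, ocr_file("1.png", lang="eng") would correspond to the command line in step 1.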

II. use pytesseract

USAGE:

> try:
>     import Image
> except ImportError:
>     from PIL import Image
> import pytesseract
> print(pytesseract.image_to_string(Image.open('test.png')))
> print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

You can see:

1. The core code is the image_to_string function. This function also supports the -l eng parameter and the -psm parameter.

Usage:

image_to_string(Image.open('test.png'), lang="eng", config="-psm 7")

2. pytesseract works on Image objects, which is why PIL is required. In fact, tesseract.exe itself supports jpeg, png, and other image formats.
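To make point 1 concrete, here is a small sketch of how the lang and config arguments could turn into extra command-line options. It mirrors the logic of run_tesseract in the pytesseract.py listing at the end of this article; the function name tesseract_options is hypothetical.

```python
import shlex


def tesseract_options(lang=None, config=None):
    """Return the extra command-line arguments implied by lang and config."""
    options = []
    if lang is not None:
        options += ["-l", lang]
    if config:
        # shlex.split breaks the config string into separate arguments.
        options += shlex.split(config)
    return options


print(tesseract_options(lang="eng", config="-psm 7"))  # ['-l', 'eng', '-psm', '7']
```

Newer Tesseract releases spell the page segmentation option --psm, so it is worth checking tesseract --help for the installed version.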

Example code that recognizes the verification code of a public website (please think twice and do not do anything bad with it; the website's domain name is hidden, so try it on another website):

# -*- coding: utf-8 -*-
__author__ = 'zhongtang'

import urllib
import urllib2
import cookielib
import math
import random
import time
import os
import re
import socket
import gzip
import StringIO
import htmltool  # the author's own helper module (not shown here)
from pytesseract import *
from PIL import Image
from PIL import ImageEnhance


class orclnypcg:
    def __init__(self):
        self.baseUrl = 'http://jbywcg.****.com.cn'
        self.ht = htmltool.htmltool()
        self.curPath = self.ht.getPyFileDir()
        self.authCode = ''

    def initUrllib2(self):
        try:
            cookie = cookielib.CookieJar()
            cookieHandLer = urllib2.HTTPCookieProcessor(cookie)
            httpHandLer = urllib2.HTTPHandler(debuglevel=0)
            httpsHandLer = urllib2.HTTPSHandler(debuglevel=0)
        except:
            raise
        else:
            opener = urllib2.build_opener(cookieHandLer, httpHandLer, httpsHandLer)
            opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')]
            urllib2.install_opener(opener)

    def urllib2Navigate(self, url, data={}):
        # Connection function with retry
        tryTimes = 0
        bodydata, headerdata = '', None  # returned unchanged if every attempt fails
        while True:
            if (tryTimes > 20):
                print u"Cannot connect to the network after multiple attempts; the program terminates"
                break
            tryTimes = tryTimes + 1  # count every attempt so that failures cannot loop forever
            try:
                if (data == {}):
                    req = urllib2.Request(url)
                else:
                    req = urllib2.Request(url, urllib.urlencode(data))
                response = urllib2.urlopen(req)
                bodydata = response.read()
                headerdata = response.info()
                if headerdata.get('Content-Encoding') == 'gzip':
                    rdata = StringIO.StringIO(bodydata)
                    gz = gzip.GzipFile(fileobj=rdata)
                    bodydata = gz.read()
                    gz.close()
            except urllib2.HTTPError, e:
                print 'HTTPError[%s]\n' % e.code
            except urllib2.URLError, e:
                print 'URLError[%s]\n' % e.reason
            except socket.error:
                print u"Connection failed, trying to reconnect"
            else:
                break
        return bodydata, headerdata

    def randomCodeOcr(self, filename):
        image = Image.open(filename)
        # Use ImageEnhance to increase the image recognition rate
        # enhancer = ImageEnhance.Contrast(image)
        # enhancer = enhancer.enhance(4)
        image = image.convert('L')
        ltext = ''
        ltext = image_to_string(image)
        # Remove invalid characters and retain only letters and numbers
        ltext = re.sub(r"\W", "", ltext)
        print u'[%s] recognized verification code: [%s]!' % (filename, ltext)
        image.save(filename)
        # print ltext
        return ltext

    def getRandomCode(self):
        # Start to get the verification code
        # http://jbywcg.***.com.cn/CommonPage/Code.aspx?0.9409255818463862
        i = 0
        while (i <= 100):
            i += 1
            # Build the verification code URL
            randomUrlNew = '%s/CommonPage/Code.aspx?%s' % (self.baseUrl, random.random())
            # Build the local file name for the verification code
            filename = '%s.png' % (i)
            filename = os.path.join(self.curPath, filename)
            jpgdata, jpgheader = self.urllib2Navigate(randomUrlNew)
            if len(jpgdata) <= 0:
                print u'An error occurred while obtaining the verification code!\n'
                return False
            f = open(filename, 'wb')
            f.write(jpgdata)
            # print u"Save Image:", fileName
            f.close()
            self.authCode = self.randomCodeOcr(filename)


# The main program starts
orcln = orclnypcg()
orcln.initUrllib2()
orcln.getRandomCode()
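One detail of randomCodeOcr deserves a closer look: the "\W" pattern removes non-word characters, but the underscore counts as a word character, so it survives the cleanup. A minimal sketch of a stricter filter follows, assuming the verification code consists only of ASCII letters and digits (an assumption, since the hidden site's code alphabet is not stated); the function name clean_code is hypothetical.

```python
import re


def clean_code(text):
    """Keep only ASCII letters and digits from raw OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", text)


print(clean_code(" a_B 1\n2! "))  # prints aB12
```

Whitespace, punctuation, and underscores that OCR tends to produce from noise lines are all dropped before the code is used.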

III. pytesseract code optimization

When the program above runs on the Windows platform, a black console window flashes every time tesseract is called, which is unfriendly.

A slight modification to pytesseract.py (in the C:\Python27\Lib\site-packages\pytesseract directory) hides this window.

# Modified by zhongtang: hide console window
# New code
IS_WIN32 = 'win32' in str(sys.platform).lower()
startupinfo = None  # stays None on non-Windows platforms
if IS_WIN32:
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    startupinfo.wShowWindow = subprocess.SW_HIDE
proc = subprocess.Popen(command,
        stderr=subprocess.PIPE, startupinfo=startupinfo)
'''
# Old code
proc = subprocess.Popen(command,
        stderr=subprocess.PIPE)
'''
# Modified end
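The same idea can be tried outside pytesseract. The sketch below is written for Python 3 and uses only the standard subprocess module; the helper names hidden_startupinfo and run_quietly are hypothetical. On non-Windows platforms the STARTUPINFO object is simply not created, so the code runs everywhere.

```python
import subprocess
import sys


def hidden_startupinfo():
    """Return a STARTUPINFO that hides the console window on Windows, else None."""
    if not sys.platform.startswith("win"):
        return None
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    startupinfo.wShowWindow = subprocess.SW_HIDE
    return startupinfo


def run_quietly(command):
    """Run command without a console window; return (exit status, stdout, stderr)."""
    proc = subprocess.Popen(command,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            startupinfo=hidden_startupinfo())
    # communicate() drains both pipes, which avoids blocking on a full pipe buffer.
    stdout, stderr = proc.communicate()
    return proc.returncode, stdout, stderr


status, out, err = run_quietly([sys.executable, "-c", "print('ok')"])
print(status, out.strip())
```

Using communicate() rather than wait() followed by read() is a small design choice here: it keeps the child process from stalling if it writes a lot of output.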

For the convenience of beginners, the complete modified pytesseract.py is pasted below.

#!/usr/bin/env python
'''
Python-tesseract is an optical character recognition (OCR) tool for python.
That is, it will recognize and "read" the text embedded in images.

Python-tesseract is a wrapper for google's Tesseract-OCR
( http://code.google.com/p/tesseract-ocr/ ). It is also useful as a
stand-alone invocation script to tesseract, as it can read all image types
supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff,
and others, whereas tesseract-ocr by default only supports tiff and bmp.
Additionally, if used as a script, Python-tesseract will print the recognized
text in stead of writing it to a file. Support for confidence estimates and
bounding box data is planned for future releases.


USAGE:
 > try:
 >     import Image
 > except ImportError:
 >     from PIL import Image
 > import pytesseract
 > print(pytesseract.image_to_string(Image.open('test.png')))
 > print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))


INSTALLATION:

Prerequisites:
* Python-tesseract requires python 2.5 or later or python 3.
* You will need the Python Imaging Library (PIL). Under Debian/Ubuntu, this is
  the package "python-imaging" or "python3-imaging" for python3.
* Install google tesseract-ocr from http://code.google.com/p/tesseract-ocr/ .
  You must be able to invoke the tesseract command as "tesseract". If this
  isn't the case, for example because tesseract isn't in your PATH, you will
  have to change the "tesseract_cmd" variable at the top of 'tesseract.py'.
  Under Debian/Ubuntu you can use the package "tesseract-ocr".

Installing via pip:
See the [pytesseract package page](https://pypi.python.org/pypi/pytesseract)
$> sudo pip install pytesseract

Installing from source:
$> git clone git@github.com:madmaze/pytesseract.git
$> sudo python setup.py install


LICENSE:
Python-tesseract is released under the GPL v3.

CONTRIBUTERS:
- Originally written by [Samuel Hoffstaetter](https://github.com/hoffstaetter)
- [Juarez Bochi](https://github.com/jbochi)
- [Matthias Lee](https://github.com/madmaze)
- [Lars Kistner](https://github.com/Sr4l)
'''

# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
tesseract_cmd = 'tesseract'

try:
    import Image
except ImportError:
    from PIL import Image
import subprocess
import sys
import tempfile
import os
import shlex

__all__ = ['image_to_string']


def run_tesseract(input_filename, output_filename_base, lang=None, boxes=False, config=None):
    '''
    runs the command:
        `tesseract_cmd` `input_filename` `output_filename_base`

    returns the exit status of tesseract, as well as tesseract's stderr output
    '''
    command = [tesseract_cmd, input_filename, output_filename_base]

    if lang is not None:
        command += ['-l', lang]

    if boxes:
        command += ['batch.nochop', 'makebox']

    if config:
        command += shlex.split(config)

    # modified by zhongtang hide console window
    # new code
    IS_WIN32 = 'win32' in str(sys.platform).lower()
    startupinfo = None  # stays None on non-Windows platforms
    if IS_WIN32:
        startupinfo = subprocess.STARTUPINFO()
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
        startupinfo.wShowWindow = subprocess.SW_HIDE
    proc = subprocess.Popen(command,
            stderr=subprocess.PIPE, startupinfo=startupinfo)
    '''
    # old code
    proc = subprocess.Popen(command,
            stderr=subprocess.PIPE)
    '''
    # modified end

    return (proc.wait(), proc.stderr.read())


def cleanup(filename):
    ''' tries to remove the given filename. Ignores non-existent files '''
    try:
        os.remove(filename)
    except OSError:
        pass


def get_errors(error_string):
    '''
    returns all lines in the error_string that start with the string "error"
    '''
    lines = error_string.splitlines()
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
    if len(error_lines) > 0:
        return '\n'.join(error_lines)
    else:
        return error_string.strip()


def tempnam():
    ''' returns a temporary file-name '''
    tmpfile = tempfile.NamedTemporaryFile(prefix="tess_")
    return tmpfile.name


class TesseractError(Exception):
    def __init__(self, status, message):
        self.status = status
        self.message = message
        self.args = (status, message)


def image_to_string(image, lang=None, boxes=False, config=None):
    '''
    Runs tesseract on the specified image. First, the image is written to disk,
    and then the tesseract command is run on the image. Tesseract's result is
    read, and the temporary files are erased.

    also supports boxes and config.

    if boxes=True
        "batch.nochop makebox" gets added to the tesseract call
    if config is set, the config gets appended to the command.
        ex: config="-psm 6"
    '''
    if len(image.split()) == 4:
        # In case we have 4 channels, lets discard the Alpha.
        # Kind of a hack, should fix in the future some time.
        r, g, b, a = image.split()
        image = Image.merge("RGB", (r, g, b))

    input_file_name = '%s.bmp' % tempnam()
    output_file_name_base = tempnam()
    if not boxes:
        output_file_name = '%s.txt' % output_file_name_base
    else:
        output_file_name = '%s.box' % output_file_name_base
    try:
        image.save(input_file_name)
        status, error_string = run_tesseract(input_file_name,
                                             output_file_name_base,
                                             lang=lang,
                                             boxes=boxes,
                                             config=config)
        if status:
            #print 'test' , status,error_string
            errors = get_errors(error_string)
            raise TesseractError(status, errors)
        f = open(output_file_name)
        try:
            return f.read().strip()
        finally:
            f.close()
    finally:
        cleanup(input_file_name)
        cleanup(output_file_name)


def main():
    if len(sys.argv) == 2:
        filename = sys.argv[1]
        try:
            image = Image.open(filename)
            if len(image.split()) == 4:
                # In case we have 4 channels, lets discard the Alpha.
                # Kind of a hack, should fix in the future some time.
                r, g, b, a = image.split()
                image = Image.merge("RGB", (r, g, b))
        except IOError:
            sys.stderr.write('ERROR: Could not open file "%s"\n' % filename)
            exit(1)
        print(image_to_string(image))
    elif len(sys.argv) == 4 and sys.argv[1] == '-l':
        lang = sys.argv[2]
        filename = sys.argv[3]
        try:
            image = Image.open(filename)
        except IOError:
            sys.stderr.write('ERROR: Could not open file "%s"\n' % filename)
            exit(1)
        print(image_to_string(image, lang=lang))
    else:
        sys.stderr.write('Usage: python pytesseract.py [-l language] input_file\n')
        exit(2)


if __name__ == '__main__':
    main()

That is all for this example of calling pytesseract from Python to recognize a website's verification code. I hope it serves as a useful reference.
