Automatic identification of Web site verification code

Source: Internet
Author: User

Automatic Verification Code identification

Many websites require you to enter a verification code when you log in. Python provides libraries (such as the commonly used OCR libraries) that can recognize the text in such online images and make use of it.

Converting images into text is generally referred to as optical character recognition (OCR). There are not many low-level libraries that implement OCR; most libraries today either use one of a small number of common underlying OCR libraries or are customized on top of them.

1.1 OCR Library Overview

Python has always been a very good language for tasks such as reading and processing images, image-related machine learning, and creating images. Although there are many libraries that can be used for image processing, here we focus only on Tesseract.

1.2 Tesseract

Tesseract is an OCR library that is currently sponsored by Google (a company also known for its OCR and machine learning technologies). Tesseract is widely regarded as one of the best and most accurate open source OCR systems. In addition to its high accuracy, Tesseract is also highly flexible: it can be trained to recognize any font and any Unicode character.

1.3 Tesseract-OCR 4.0 Installation

The installation steps for Tesseract-OCR 4.0 are as follows:

1. Download the software.

Download URL: https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-for-windows

Select 4.0.0-alpha for Windows under Windows Installer made with mingw-w64 from UB Mannheim, click UB Mannheim to go to another webpage.

The entry URL is https://github.com/UB-Mannheim/tesseract/wiki.

Click Tesseract-ocr-w64-setup-v4.0.0-beta.1.20180608.exe to download the 4.0 version of the software.

Note: Download the installer that matches your Windows system version.

2. Double-click the installer and follow the prompts of the installation wizard.

Note: During installation you can choose which language packs to install. English is installed by default; Chinese, mathematical formulas, and others can be selected as needed.

After installation, open the directory for the Software installation.

Note:

If you need to recognize text in languages other than English, you also need to download the corresponding language data files from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files.

Simplified Chinese language data: https://raw.githubusercontent.com/tesseract-ocr/tessdata/4.00/chi_sim.traineddata
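As a rough sketch, the language data file can also be fetched from Python. The helper names `traineddata_url` and `download_traineddata` are made up for illustration, and the sketch assumes that other language files follow the same URL pattern as the chi_sim link above.

```python
import os
import urllib.request

# Base URL taken from the chi_sim link above; assumed to hold for other languages too.
TESSDATA_BASE_URL = "https://raw.githubusercontent.com/tesseract-ocr/tessdata/4.00"


def traineddata_url(lang):
    """Build the download URL for one language code, e.g. 'chi_sim'."""
    return "{}/{}.traineddata".format(TESSDATA_BASE_URL, lang)


def download_traineddata(lang, target_dir):
    """Download <lang>.traineddata into target_dir and return the saved path."""
    target_path = os.path.join(target_dir, lang + ".traineddata")
    urllib.request.urlretrieve(traineddata_url(lang), target_path)
    return target_path


# Example call (needs network access and a writable target directory):
# download_traineddata("chi_sim", ".")
```

Keeping the URL construction separate from the download makes it easy to check the link before anything is fetched.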

1.4 Environment Variable Configuration for Tesseract-OCR

After the installation is complete, the software can only be used from the directory in which it is installed. To be able to call it from any directory in CMD, you need to add the installation directory C:\Tesseract-OCR to the system Path environment variable.

Right-click My Computer and choose Properties, then click Advanced system settings and click "Environment Variables". Under "System variables", select Path and click Edit to open the Edit environment variable window. Add C:\Tesseract-OCR to the variable and click OK.

When the configuration is complete, open a command terminal and enter tesseract -v to see the version information.

If an error occurs, the most likely cause is that the environment variable is not configured properly.
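The same check can be sketched in Python using only the standard library. The function names `find_tesseract` and `tesseract_version` are hypothetical helpers, not part of any Tesseract package.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version():
    """Return the output of `tesseract -v`, or None if tesseract is not on PATH."""
    executable = find_tesseract()
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "-v"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr in case the version text is printed there
        universal_newlines=True,
    )
    return result.stdout
```

If `tesseract_version()` returns None, the installation directory is probably missing from Path, or the terminal was opened before the variable was changed.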

At this point the installation is complete, but the system still cannot recognize Chinese. Download the Simplified Chinese and Traditional Chinese language data files (links provided above) and place them in the tessdata directory under the installation directory.

Note: So that Tesseract can find the language data when it is called from another directory or drive, add one more entry to the environment variables.

Under System variables, click "New", enter the variable name TESSDATA_PREFIX and the variable value C:\Tesseract-OCR\tessdata, then click OK in each window to finish the setup.
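To confirm that the variable is visible and that the language files are where Tesseract expects them, something like the following sketch may help. Both function names are illustrative, and the sketch assumes TESSDATA_PREFIX points directly at the directory that holds the .traineddata files, as configured above.

```python
import glob
import os


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file in tessdata_dir."""
    pattern = os.path.join(tessdata_dir, "*.traineddata")
    return sorted(
        os.path.splitext(os.path.basename(path))[0] for path in glob.glob(pattern)
    )


def check_tessdata_prefix():
    """Print and return the languages found under TESSDATA_PREFIX, if it is set."""
    prefix = os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        print("TESSDATA_PREFIX is not set")
        return []
    languages = installed_languages(prefix)
    print("Languages in {}: {}".format(prefix, languages))
    return languages
```

A new environment variable is only picked up by terminals and programs started after it was saved, so run the check from a freshly opened window.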

1.5 Using Tesseract-OCR

Tesseract-OCR does not have a graphical interface and can only be called from the command line, which requires CMD or PowerShell.

CMD can be opened via Start -> Windows System -> Command Prompt, or by pressing the shortcut Win+R, typing cmd, and pressing Enter.

1. First prepare two pictures: T_jpg1 and T_png1.

2. In the cmd command line, enter a command in the following format:

Format: tesseract <image file> <output file base name> -l <language>

Example:

    tesseract C:\image\T_jpg1.jpg C:\image\T_jpg -l chi_sim

Then press Enter.

"C:\image\T_jpg1.jpg" is the T_jpg1 picture in the C:\image directory.

"C:\image\T_jpg" is the base name of the output file. Tesseract appends .txt automatically, so the result is written to the C:\image\T_jpg.txt text file.

-l specifies the language package to use.

"chi_sim" is the Simplified Chinese language package.

After pressing Enter, wait for the command to finish, then open the T_jpg.txt file in the C:\image directory to see the result.
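The same command can be issued from a Python script with the standard subprocess module. This is only a sketch: `build_tesseract_command` and `run_tesseract` are hypothetical helper names, and it assumes that tesseract is already on Path.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tesseract(image_path, output_base, lang="eng"):
    """Run tesseract and return the path of the text file it writes."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # tesseract appends ".txt" to the output base name by itself
    return output_base + ".txt"


# Example call, using the paths from this section:
# run_tesseract(r"C:\image\T_jpg1.jpg", r"C:\image\T_jpg", "chi_sim")
```

Passing the arguments as a list avoids quoting problems when a path contains spaces.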

1.6 Installing pytesseract

To recognize images from Python, you need to install the pytesseract library. Before installing pytesseract, install the Pillow library with the command: pip install Pillow.

Then run pip install pytesseract to complete the installation.

If the Pillow library is not installed, an error is reported when the program runs.
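A quick way to see whether both packages are available is to ask Python's import machinery, as in the sketch below. The helper name `missing_packages` is an assumption made for this example; note that Pillow is imported under the module name PIL.

```python
import importlib.util


def missing_packages(module_names=("PIL", "pytesseract")):
    """Return the module names from module_names that cannot be imported."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("Pillow and pytesseract are both installed")
```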

1.7 Example: Processing Well-Formatted Text

Create a new test.png image and save it in the current program directory.

Example:

    import pytesseract
    from PIL import Image

    # If you do not modify the pytesseract.py file, you can specify the path of the tesseract executable in the program
    # pytesseract.pytesseract.tesseract_cmd = 'C:/Tesseract-OCR/tesseract.exe'

    image = Image.open('test.png')
    text = pytesseract.image_to_string(image)
    print(text)

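If the program reports an error because the tesseract executable cannot be found, you need to modify the pytesseract.py file under C:\Python35\Lib\site-packages\pytesseract.

Modify the tesseract_cmd value in the pytesseract.py file (use a raw string so that \t in the path is not read as a tab character):

    tesseract_cmd = r'C:\Tesseract-OCR\tesseract.exe'

After the modification, run the program again.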

1.8 Identifying a Simple Login Verification Code

Most verification code (CAPTCHA) images generated by websites have the following properties:

1. They are images that are dynamically generated by a server-side program. The src attribute of the verification code image may look different from that of an ordinary picture.

2. The answer to the image is stored in the server-side database.

3. Many verification codes have a time limit; if you take too long to solve them, they expire.

4. The common approach is to first download the verification code image to the hard disk, clean it up, then process the picture with Tesseract, and finally return a recognition result in the form the site requires.
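The clean-up and recognition steps in point 4 can be sketched as follows. This is a minimal illustration, not a robust solver: the function names are hypothetical, the threshold of 140 is an arbitrary assumption that must be tuned per site, and simple binarization only helps with lightly distorted images.

```python
def threshold_pixel(value, threshold=140):
    """Map one grayscale value (0-255) to pure white (255) or pure black (0)."""
    return 255 if value > threshold else 0


def clean_captcha(path, threshold=140):
    """Open an image, convert it to grayscale, and binarize it."""
    from PIL import Image  # imported here so the helper above works without Pillow

    gray = Image.open(path).convert("L")
    return gray.point(lambda value: threshold_pixel(value, threshold))


def recognize_captcha(path, threshold=140):
    """Clean the image, run Tesseract on it, and strip surrounding whitespace."""
    import pytesseract

    return pytesseract.image_to_string(clean_captcha(path, threshold)).strip()


# Example call on an already downloaded image:
# print(recognize_captcha("yzm1.png"))
```

Converting to grayscale first means a single threshold is enough to separate dark characters from a lighter background.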

Example: Login Verification Code Picture

Program:

    import pytesseract
    from PIL import Image

    # If you do not modify the pytesseract.py file, you can specify the path of the tesseract executable in the program
    # pytesseract.pytesseract.tesseract_cmd = 'C:/Tesseract-OCR/tesseract.exe'

    image = Image.open('yzm1.png')
    text = pytesseract.image_to_string(image)
    print(text)

Operation Result:

The example shows that the recognition result is wrong. To recognize this kind of image correctly, Tesseract needs to be trained.
