At present, many websites use verification codes (CAPTCHAs) to prevent crawlers from simulating a browser login at will. Verification codes come in many forms; the most common is the image verification code, but there are also audio verification codes, slider verification codes, and so on. Image verification codes have become more and more sophisticated, and the difficulty of recognizing them has increased greatly; even humans often mistype them. This article mainly explains how to recognize weak image verification codes.
1 Image Verification Code Strength
Image verification codes mainly use interference lines, character adhesion, and character distortion to increase the difficulty of recognition.
Interference lines come in two kinds: lines in the same color as the characters, and lines in many different colors.
Character adhesion means the spacing between characters is small and the characters stick to each other, which makes them hard to segment.
Character distortion means the characters are rotated by a certain angle relative to their normal upright position.
The weakest verification codes have none of the above characteristics and contain few interfering elements, as shown below:
2 Recognition Approach
First, binarize and denoise the image to remove noise points, interference lines, and so on. Then segment the image into individual characters. Finally, recognize each character.
For image processing I use PIL, Python's standard image-processing library. For character segmentation I rely on Google's open-source library Tesseract-OCR for now, and for character recognition I call it through the pytesseract library.
3 Installation
The Python version I am using is 3.6, and the original PIL library does not support Python 3.x, so you need to use Pillow instead. Pillow is a fork of PIL that is compatible with the 3.x versions. Installing Pillow with the pip package manager is the most convenient and quickest way.
pip install pillow
# If the installation fails because of download errors, try installing through a proxy:
# pip --proxy http://<proxy IP>:<port> install pillow
Tesseract is an open-source OCR engine. The original Tesseract engine was developed by HP Labs, later contributed to the open-source community, and then improved by Google, which fixed bugs, optimized it, and re-released it as the version we have today.
The library can be found and downloaded on GitHub; I downloaded the latest version, 4.0.
The GitHub address is: https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-for-windows
pytesseract is a wrapper library that provides a Python interface to Tesseract-OCR. It can also be installed with pip.
pip install pytesseract
# If the installation fails because of download errors, try installing through a proxy:
# pip --proxy http://<proxy IP>:<port> install pytesseract
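Once both libraries are installed, a quick sanity check can confirm that pytesseract can talk to the Tesseract engine. This is just a minimal sketch using the sample image 16.jpg that appears later in this article:

from PIL import Image
import pytesseract

# Minimal smoke test: run OCR on the sample image and print the raw result.
# (On Windows you may first need to point pytesseract at tesseract.exe; see section 4.3.)
print(pytesseract.image_to_string(Image.open('16.jpg'), lang='eng'))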
4 Code Implementation
4.1 Get and Open the Image
The image verification code can be downloaded with a web request library. For convenience, I downloaded the image ahead of time and placed it in the project directory.
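For reference, downloading the image with the requests library might look like the sketch below. The URL is a placeholder, not the actual site this article targets:

import requests

# Placeholder URL of the verification-code image; replace it with the real address.
CAPTCHA_URL = 'http://example.com/captcha.jpg'

def download_captcha(fileName='16.jpg'):
    # Fetch the image bytes and save them into the project directory.
    response = requests.get(CAPTCHA_URL, timeout=10)
    response.raise_for_status()
    with open(fileName, 'wb') as f:
        f.write(response.content)
    return fileName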
from PIL import Image

def getImage(fileName='16.jpg'):
    """Get the picture."""
    img = Image.open(fileName)
    # Print the current picture's mode and format
    print('Before conversion:', img.mode, img.format)
    # Uncomment to open the picture with the system default viewer
    # img.show()
    return img
4.2 Preprocessing
This step mainly denoises the image. First convert the picture from "RGB" mode to "L" mode, that is, turn the color image into a grayscale one. Then remove the background noise so that the characters and the background form a clear black-and-white contrast.
"1) The image is de-noising, by two value to remove the back of the background color and deepen the text contrast" Def convert_image (IMG, standard=127.5): " " grayscale Conversion "" Image = Img.convert (' L ') ' "binary" based on threshold standard, all pixels are set to 0 (black) or 255 (white), for the next split "' pixels = im Age.load () for x in range (image.width): for Y in range (image.height): if pixels[x, Y] > Standard: Pixels[x, y] = 255 Else: pixels[x, Y] = 0 return image
When a color picture is opened, PIL decodes it as a three-channel "RGB" image. Calling convert('L') converts the image to grayscale. Mode "L" is a grayscale image in which each pixel is represented by 8 bits: 0 is black, 255 is white, and the values in between represent different shades of gray.
In PIL, the conversion from mode "RGB" to "L" mode is based on the following formula:
L = R × 299/1000 + G × 587/1000 + B × 114/1000
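As a quick sanity check, the formula can be compared against Pillow's own conversion. The pixel value below is made up purely for illustration:

from PIL import Image

# A 1x1 RGB image with an arbitrary pixel value (R=200, G=100, B=50)
rgb = Image.new('RGB', (1, 1), (200, 100, 50))
gray = rgb.convert('L')
print(gray.getpixel((0, 0)))                      # 124, computed by Pillow
print((200 * 299 + 100 * 587 + 50 * 114) / 1000)  # 124.2, computed by the formula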
Binarization polarizes the grayscale value of every pixel on the image (setting it to either 0 or 255; 0 is black, 255 is white), so that the whole image shows only a black-and-white visual effect. The aim is to deepen the contrast between characters and background, making it easier for Tesseract to segment and recognize them. For the threshold I took a rather crude approach and simply used the average of 0 and 255, namely 127.5.
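As a side note on this design choice, the same binarization can be written more compactly with Pillow's Image.point() method, and the fixed 127.5 threshold could instead be derived from the image's mean brightness. The following is only a sketch of that alternative; convert_image_point is a name made up here, not part of the code above:

def convert_image_point(img, standard=None):
    """Alternative binarization; with standard=127.5 it behaves like convert_image above."""
    image = img.convert('L')
    if standard is None:
        # Derive the threshold from the average brightness instead of the fixed midpoint.
        data = list(image.getdata())
        standard = sum(data) / len(data)
    # point() applies the function to every possible pixel value in one pass.
    return image.point(lambda p: 255 if p > standard else 0)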
4.3 Recognition
After the above processing, the characters in the image verification code have become very clear.
The final step is to recognize the characters directly with the pytesseract library.
import re
import pytesseract

def change_image_to_text(img):
    """Use the pytesseract library to recognize the characters in the picture."""
    # If pytesseract cannot find the trained data, point it at the tessdata directory manually.
    # Syntax: tessdata_dir_config = '--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
    testdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
    textCode = pytesseract.image_to_string(img, lang='eng', config=testdata_dir_config)
    # Remove illegal characters, keeping only letters and digits
    textCode = re.sub(r'\W', '', textCode)
    return textCode
pytesseract does not know the Tesseract-OCR installation path by default, so we need to point it at the local tesseract executable manually. Otherwise the following error is reported:
FileNotFoundError: [WinError 2] The system cannot find the file specified
The specific fix is as follows:
Open the pytesseract.py file of the pytesseract library in a text editor; its path is usually something like:
C:\Program Files (x86)\python35-32\lib\site-packages\pytesseract\pytesseract.py
Change tesseract_cmd to the installation path of Tesseract-OCR on your own machine.
# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'
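Alternatively, instead of editing the library's source file, the path can be set at runtime from your own script through pytesseract's tesseract_cmd attribute; adjust the path to match your installation:

import pytesseract

# Point pytesseract at the local Tesseract executable without modifying pytesseract.py
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'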
Finally, the example code that performs the character recognition:
def main():
    fileName = '16.jpg'
    img = convert_image(getImage(fileName))
    print('Recognized result:', change_image_to_text(img))

if __name__ == '__main__':
    main()
The output of running it is as follows:
Before conversion: RGB JPEG
Recognized result: 9834
5 Summary
Tesseract-OCR's recognition rate on this kind of weak verification code is acceptable: most characters are recognized correctly, although the digit 8 is sometimes recognized as 0. Once the image verification code becomes even slightly more complex, the recognition rate drops sharply and characters are frequently misrecognized. I also tried collecting 500 images to train Tesseract-OCR myself; the recognition rate improved, but it remained quite low.
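To put a concrete number on the recognition rate, one simple approach is to run the pipeline over a folder of labelled samples and count the matches. The sketch below assumes each sample file is named after its correct answer (for example 9834.jpg); this is only an illustration, not how the 500 training images mentioned above were organised.

import os

def measure_accuracy(sample_dir='samples'):
    """Rough accuracy check; assumes the file name (without extension) is the true answer."""
    total = correct = 0
    for name in os.listdir(sample_dir):
        expected = os.path.splitext(name)[0]
        img = convert_image(getImage(os.path.join(sample_dir, name)))
        if change_image_to_text(img) == expected:
            correct += 1
        total += 1
    if total:
        print('Recognition rate: %.1f%% (%d/%d)' % (100 * correct / total, correct, total))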
If you want a high recognition rate, you need to train your own recognition model with a CNN (convolutional neural network) or an RNN (recurrent neural network). Machine learning happens to be very popular right now, so there is no harm in learning it.
Reposted from: https://cloud.tencent.com/developer/article/1187805