At present, many websites use verification codes (CAPTCHAs) to prevent crawlers from simulating a browser login at will. Verification codes come in many forms; the most common is the image verification code, but there are also audio verification codes, slider verification codes, and so on. Image verification codes have become more and more sophisticated, and the difficulty of recognizing them has increased greatly; even humans often mistype them. This article mainly explains how to recognize weak image verification codes.
1 Image Verification Code Strength
Image verification codes mainly use interference lines, character adhesion, and character distortion to increase the difficulty of recognition.
Interference lines come in two kinds: lines in the same color as the characters, and lines in many different colors.
Character adhesion means the spacing between characters is small and the characters stick to each other, which makes them hard to segment.
Character distortion means the characters are rotated by a certain angle relative to their normal upright position.
The weakest verification codes have none of the above characteristics and contain few interfering elements, as shown below:
2 Recognition Approach
First, binarize and denoise the image to remove noise points, interference lines, and so on. Then segment the image into individual characters. Finally, recognize each character.
For image processing I use PIL, Python's standard image-processing library. For character segmentation I rely on Google's open-source library Tesseract-OCR for now, and for character recognition I call it through the pytesseract library.
3 Installation
The Python version I am using is 3.6, and the original PIL library does not support Python 3.x, so you need to use Pillow instead. Pillow is a fork of PIL that is compatible with the 3.x versions. Installing Pillow with the pip package manager is the most convenient and quickest way.
pip install pillow
# If the installation fails because of download errors, try installing through a proxy:
# pip --proxy http://<proxy IP>:<port> install pillow
Tesseract is an open-source OCR engine. The original Tesseract engine was developed by HP Labs, later contributed to the open-source community, and then improved by Google, which fixed bugs, optimized it, and re-released it as the version we have today.
The library can be found and downloaded on GitHub; I downloaded the latest version, 4.0.
The GitHub address is: https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-for-windows
pytesseract is a wrapper library that provides a Python interface to Tesseract-OCR. It can also be installed with pip.
pip install pytesseract
# If the installation fails because of download errors, try installing through a proxy:
# pip --proxy http://<proxy IP>:<port> install pytesseract
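Once both libraries are installed, a quick sanity check can confirm that pytesseract can talk to the Tesseract engine. This is just a minimal sketch using the sample image 16.jpg that appears later in this article:

from PIL import Image
import pytesseract

# Minimal smoke test: run OCR on the sample image and print the raw result.
# (On Windows you may first need to point pytesseract at tesseract.exe; see section 4.3.)
print(pytesseract.image_to_string(Image.open('16.jpg'), lang='eng'))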
4 Code Implementation
4.1 Get and Open the Image
The image verification code can be downloaded with a web request library. For convenience, I downloaded the image ahead of time and placed it in the project directory.
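For reference, downloading the image with the requests library might look like the sketch below. The URL is a placeholder, not the actual site this article targets:

import requests

# Placeholder URL of the verification-code image; replace it with the real address.
CAPTCHA_URL = 'http://example.com/captcha.jpg'

def download_captcha(fileName='16.jpg'):
    # Fetch the image bytes and save them into the project directory.
    response = requests.get(CAPTCHA_URL, timeout=10)
    response.raise_for_status()
    with open(fileName, 'wb') as f:
        f.write(response.content)
    return fileName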
from PIL import Image

def getImage(fileName='16.jpg'):
    """Get the picture."""
    img = Image.open(fileName)
    # Print the current picture's mode and format
    print('Before conversion:', img.mode, img.format)
    # Uncomment to open the picture with the system default viewer
    # img.show()
    return img
4.2 Preprocessing
This step mainly denoises the image. First convert the picture from "RGB" mode to "L" mode, that is, turn the color image into a grayscale one. Then remove the background noise so that the characters and the background form a clear black-and-white contrast.
"1) The image is de-noising, by two value to remove the back of the background color and deepen the text contrast" Def convert_image (IMG, standard=127.5): " " grayscale Conversion "" Image = Img.convert (' L ') ' "binary" based on threshold standard, all pixels are set to 0 (black) or 255 (white), for the next split "' pixels = im Age.load () for x in range (image.width): for Y in range (image.height): if pixels[x, Y] > Standard: Pixels[x, y] = 255 Else: pixels[x, Y] = 0 return image
When a color picture is opened, PIL decodes it as a three-channel "RGB" image. Calling convert('L') converts the image to grayscale. Mode "L" is a grayscale image in which each pixel is represented by 8 bits: 0 is black, 255 is white, and the values in between represent different shades of gray.
In PIL, the conversion from mode "RGB" to "L" mode is based on the following formula:
L = R × 299/1000 + G × 587/1000 + B × 114/1000
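As a quick sanity check, the formula can be compared against Pillow's own conversion. The pixel value below is made up purely for illustration:

from PIL import Image

# A 1x1 RGB image with an arbitrary pixel value (R=200, G=100, B=50)
rgb = Image.new('RGB', (1, 1), (200, 100, 50))
gray = rgb.convert('L')
print(gray.getpixel((0, 0)))                      # 124, computed by Pillow
print((200 * 299 + 100 * 587 + 50 * 114) / 1000)  # 124.2, computed by the formula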
Binarization polarizes the grayscale value of every pixel on the image (setting it to either 0 or 255; 0 is black, 255 is white), so that the whole image shows only a black-and-white visual effect. The aim is to deepen the contrast between characters and background, making it easier for Tesseract to segment and recognize them. For the threshold I took a rather crude approach and simply used the average of 0 and 255, namely 127.5.
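As a side note on this design choice, the same binarization can be written more compactly with Pillow's Image.point() method, and the fixed 127.5 threshold could instead be derived from the image's mean brightness. The following is only a sketch of that alternative; convert_image_point is a name made up here, not part of the code above:

def convert_image_point(img, standard=None):
    """Alternative binarization; with standard=127.5 it behaves like convert_image above."""
    image = img.convert('L')
    if standard is None:
        # Derive the threshold from the average brightness instead of the fixed midpoint.
        data = list(image.getdata())
        standard = sum(data) / len(data)
    # point() applies the function to every possible pixel value in one pass.
    return image.point(lambda p: 255 if p > standard else 0)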
4.3 Recognition
After the above processing, the characters in the image verification code have become very clear.
The final step is to recognize the characters directly with the pytesseract library.
import re
import pytesseract

def change_image_to_text(img):
    """Use the pytesseract library to recognize the characters in the picture."""
    # If pytesseract cannot find the trained data, point it at the tessdata directory manually.
    # Syntax: tessdata_dir_config = '--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
    testdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
    textCode = pytesseract.image_to_string(img, lang='eng', config=testdata_dir_config)
    # Remove illegal characters, keeping only letters and digits
    textCode = re.sub(r'\W', '', textCode)
    return textCode
pytesseract does not know the Tesseract-OCR installation path by default, so we need to point it at the local tesseract executable manually. Otherwise the following error is reported:
FileNotFoundError: [WinError 2] The system cannot find the file specified
The specific fix is as follows:
Open the pytesseract.py file of the pytesseract library in a text editor; its path is usually something like:
C:\Program Files (x86)\python35-32\lib\site-packages\pytesseract\pytesseract.py
Change tesseract_cmd to the installation path of Tesseract-OCR on your own machine.
# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'
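Alternatively, instead of editing the library's source file, the path can be set at runtime from your own script through pytesseract's tesseract_cmd attribute; adjust the path to match your installation:

import pytesseract

# Point pytesseract at the local Tesseract executable without modifying pytesseract.py
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'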
Finally, the example code that performs the character recognition:
def main():
    fileName = '16.jpg'
    img = convert_image(getImage(fileName))
    print('Recognized result:', change_image_to_text(img))

if __name__ == '__main__':
    main()
The output of running it is as follows:
Before conversion: RGB JPEG
Recognized result: 9834
5 Summary
Tesseract-OCR's recognition rate on this kind of weak verification code is acceptable: most characters are recognized correctly, although the digit 8 is sometimes recognized as 0. Once the image verification code becomes even slightly more complex, the recognition rate drops sharply and characters are frequently misrecognized. I also tried collecting 500 images to train Tesseract-OCR myself; the recognition rate improved, but it remained quite low.
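To put a concrete number on the recognition rate, one simple approach is to run the pipeline over a folder of labelled samples and count the matches. The sketch below assumes each sample file is named after its correct answer (for example 9834.jpg); this is only an illustration, not how the 500 training images mentioned above were organised.

import os

def measure_accuracy(sample_dir='samples'):
    """Rough accuracy check; assumes the file name (without extension) is the true answer."""
    total = correct = 0
    for name in os.listdir(sample_dir):
        expected = os.path.splitext(name)[0]
        img = convert_image(getImage(os.path.join(sample_dir, name)))
        if change_image_to_text(img) == expected:
            correct += 1
        total += 1
    if total:
        print('Recognition rate: %.1f%% (%d/%d)' % (100 * correct / total, correct, total))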
If you want a high recognition rate, you need to train your own recognition model with a CNN (convolutional neural network) or an RNN (recurrent neural network). Machine learning happens to be very popular right now, so there is no harm in learning it.
Reposted from: https://cloud.tencent.com/developer/article/1187805