This article describes how to use PHP to crack website verification codes. For more information, see
This article describes how to use PHP to crack website verification codes. For more information, see
The verification code function is generally set to prevent malicious registration, brute force cracking, or batch posting. The so-called verification code is to generate an image with a string of randomly generated numbers or symbols, and add some interference pixels to the image (preventing OCR). The user can identify the verification code information with the naked eye, enter a form to submit the website for verification. A function can be used only after the verification is successful. Learning the Verification Code cracking/recognition technology can not only understand the principles of the verification code, but also let you know how to prevent the verification code from being cracked.
The most common verification codes are as follows:
1. Four digits, a random string, and the original verification code. The verification function is almost zero.
2. Random image verification code. The characters in the image are relatively regular, some may be added with some random interferon, and some may be random character colors. The verification effect is better than the previous one. People who do not have the knowledge of basic graphics and Images cannot break through!
3. Random Numbers in various image formats + random uppercase English letters + random interference pixels + random positions.
4. Chinese characters are currently the latest verification codes registered. They are randomly generated, making them more difficult to create and affecting user experience. Therefore, there are usually few applications.
For the sake of simplicity, the attack description mainly targets 2nd types. Let's take a look at the common online verification code pictures:
Verification code recognition is generally divided into the following steps:
1. The pattern recognition verification code is taken out. After all, it is not a professional OCR recognition. In addition, because the verification codes of various websites are different, the most common method is to create a signature library for this verification code. When downloading the dashboard, We need to download several more images to make these images contain all the characters. The letters here are only images. Therefore, we only need to collect images including 0-9.
2. binarization: each pixel in the verification number on the image is represented by 1 in a number, and the other part is represented by 0. In this way, you can calculate each digital model, record these fonts, and use them as keys.
3. computation features binarization the images to be recognized to obtain image features.
4. Compare the image pattern of step 3 with the verification code to obtain the number on the verification image.
Currently, the verification code can be identified as 100%.
After completing the above steps, you may have said that you have not discovered how to retrieve interferon! In fact, the method to retrieve interferon is very simple. An important feature of interferon is that it does not affect the Display Effect of the Verification Code. Therefore, the RGB value of interferon may be lower than or higher than a specific value, for example, in the image I gave, the RGB values of interferon will not exceed 125, so we can easily remove interferon.
A simple verification code consists of only numbers and letters. The format is uniform and the position is fixed each time. Next, we will continue to study the verification code in depth. The purpose of this recognition is: The Verification Code consists of characters and numbers. The verification code is rotated (either left or right) and its position is not fixed, there is a adhesion between characters, and the verification code has stronger interferon.
We will explain it as an example.
Step 1: binarization. The verification code part is represented by 1, and the background part is represented by 0. The recognition method is very simple. We can print the RGB color of the entire verification code image, and then analyze its pattern. Through the RGB code, we can easily tell that the R value of the above image is greater than 120, and the G and B values are less than 80. Therefore, we can easily bind the above image according to this rule.
Let's take a look at the third Verification Code picture above.
It seems complicated. The background color of each verification code image is different, and the color of each verification code number is different. It seems difficult to binarization. In fact, we can easily print out the RGB values. Regardless of how the color of the verified digit changes, the RGB value of the digit is always less than 125, therefore, we can easily identify $ rgbarray ['red'] <125 | $ rgbarray ['green'] <125 | $ rgbarray ['blue'] <125 where is the number, where is the background.
The reason we can find these rules is that, in order to make the interferon of the Verification Code do not affect the Display Effect of digits, the RGB and RGB values of the interferon must be independent of each other and do not interfere with each other. As long as we understand this rule, we can easily achieve binarization.
The 120, 80,125, and other thresholds we found may differ from the actual RGB values. Therefore, sometimes, after binarization, 1 may appear in some places, and a number is displayed at a fixed position on the verification code, this interference is of little significance. However, for images with uncertain verification code positions, when we cut characters, it is likely to cause interference. Therefore, noise reduction is required after binarization.
Step 2: remove noise. The principle of noise removal is to remove isolated valid values. If the noise is high and the required efficiency is high, there is a lot of work to be done here. Fortunately, we do not need to be so advanced. We can use the simplest method, if the value of a vertex is 1, it determines whether the number in the top, bottom, top, top, bottom, and right of the vertex is 1. If the value is not 1, it is considered a dry point, set it to 1 directly.
As shown in, we use this method to easily find that 1 in the red box is dry, and set it to 1 directly. We used a technique to judge. Sometimes the noise may be two consecutive ones, so we calculate the sum of the values in the eight directions of the point, finally, we determine whether their sum is smaller than the specific threshold.
Step 3: Cut characters. There are many ways to cut characters. Here we use the simplest one. First we cut the vertical direction into characters, and then remove more than 0000 in the horizontal direction, as shown in
Step 1 cut the red line and Step 2 cut the blue line to get independent characters. But in the following case:
In the above method, the dw character is cut into one character. This is a wrong cut, so here we are involved in the cutting of the adhesion character.
Step 4: Stick character cutting. When creating a verification code, the sticking of the Rule characters is easy to split. If the characters themselves are scaled, deformation is very difficult to handle. After analysis, we can find that, the above character adhesion is a very simple method, but only the rule character adhesion, so we also use a very simple processing method to deal with this situation. After the split operation is completed, we cannot immediately determine that the split part is a character. The key factor for verification is whether the width of the cut characters exceeds the threshold, the trade-off criterion for this threshold value is that no matter how a character is rotated or deformed, it will not be greater than this threshold value. Therefore, if the cut block is greater than this threshold value, it can be considered as a sticking character; if it is greater than the sum of the two thresholds, it is considered to be three-character adhesion, and so on. After knowing this rule, it is easy to cut the adhesion characters. If we find that it is a sticking character block, we can directly divide this block into two or more new blocks. Of course, in order to better restore characters, I usually use the "equally divided" + 1,-1 to supplement the part of the character block.