Many websites now use verification codes (CAPTCHAs) to improve security and to stop programs from operating the site automatically. That makes life harder for webmasters who want to promote their own sites, so I decided to write this article about verification code recognition. There are bound to be shortcomings! I hardly ever write anything, so please forgive the rough edges.
Webmasters often post advertisements to promote their websites. Doing it by hand is slow and expensive, so the ideal solution is mass-posting software. But many websites now have verification codes, which has become the main technical obstacle for such software, and recognizing the codes is the hard part. Enough talk, let's get down to business!
The examples I use are verification codes that are relatively hard to recognize, not the easy kind with no deformation, font changes, size changes, or rotation. I will not post my program here; I will just present the idea behind it. The program I wrote based on this idea has a much higher recognition rate than the programs sold on the market. (If you are interested you can ask me; I don't want to advertise for anyone else here ~~)
Let's start with numeric verification codes. Letters are a little harder than digits, but numeric codes are not difficult to recognize.
Verification codes are usually images, and usually contain 4 digits. The processing is: split the image into 4 parts and recognize them one by one. Since the splitting is simple, I will not cover it here; I will only talk about how to recognize each digit.
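For readers who want to see the splitting step anyway, here is a minimal sketch. It assumes the image has already been turned into rows of 0/1 pixels and that the 4 digits are equally wide and evenly spaced, which is only true for the simplest codes. The name `split_digits` is made up for this illustration.

```python
def split_digits(image, count=4):
    """Cut a binary image (a list of equal-length pixel rows) into
    `count` slices of (nearly) equal width, left to right."""
    width = len(image[0])
    # Integer boundaries, so no floating-point arithmetic is involved.
    bounds = [i * width // count for i in range(count + 1)]
    return [
        [row[bounds[i]:bounds[i + 1]] for row in image]
        for i in range(count)
    ]
```

Each returned slice is itself a small image that can be passed on to the recognition step.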
My method is to divide the image to be recognized into 5 rows and 3 columns, 15 blocks in total. Why 15 blocks? First look at the pictures!
Number 0:
○ ■ ○
■ ○ ■
■ ○ ■
■ ○ ■
○ ■ ○

Number 1:
○ ■ ○
■ ■ ○
○ ■ ○
○ ■ ○
■ ■ ■

Number 2:
■ ■ ■
○ ○ ■
■ ■ ■
■ ○ ○
■ ■ ■

Number 3:
■ ■ ■
○ ○ ■
■ ■ ■
○ ○ ■
■ ■ ■
I will give just these four examples for now; you can draw the others yourself. If you have worked on verification code recognition before, you will quickly see why I use 15 blocks: this division is a reasonable one, and it improves the recognition rate.
My method is to divide the image to be recognized into 5 rows and 3 columns (15 blocks) and then evaluate each block: when the percentage of valid pixels in a block exceeds a chosen threshold, the block is marked ■; otherwise it is marked ○. (I use ■ and ○ here for display; you can mark them as 1 and 0.) The threshold can be 67%, 50%, 33%, or 20%, or it can be chosen based on the font width. Why these numbers? They correspond to the simple fractions 2/3, 1/2, 1/3, and 1/5, so the comparison can be done on whole-number pixel counts instead of floating-point division, which is faster and avoids rounding errors when a large number of blocks are processed. Of course, you should pick the percentage that suits your own verification code images!
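The block marking described above can be sketched as follows. This is only an illustration, not my actual program: it assumes the digit image is already a list of rows of 0/1 pixels (1 = valid pixel), and the name `block_signature` and the `(numerator, denominator)` form of the threshold are choices made for this example. Writing the threshold as a fraction lets the test "ink / total > num / den" be done as the integer comparison "ink * den > total * num".

```python
def block_signature(image, rows=5, cols=3, threshold=(1, 2)):
    """Return a string of rows*cols characters, '1' for a block whose
    share of valid pixels exceeds the threshold and '0' otherwise,
    read row by row from the top-left block."""
    num, den = threshold
    height, width = len(image), len(image[0])
    bits = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            total = (bottom - top) * (right - left)
            ink = sum(image[y][x]
                      for y in range(top, bottom)
                      for x in range(left, right))
            # Integer form of: ink / total > num / den
            bits.append("1" if ink * den > total * num else "0")
    return "".join(bits)


# A 10x6 test image of "0": every cell of the 5x3 pattern is blown up to 2x2 pixels.
ZERO = ["010", "101", "101", "101", "010"]
zero_image = [[int(ch) for ch in row for _ in range(2)]
              for row in ZERO for _ in range(2)]
print(block_signature(zero_image))  # 010101101101010
```

Passing `threshold=(2, 3)`, `(1, 3)`, or `(1, 5)` gives the other percentages mentioned above.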
If the verification code has no deformation, font changes, size changes, or rotation, the recognition process is basically finished at this point, because you get a clean block pattern. That is enough to handle most forums. ^_^
If the verification code is heavily deformed, uses many fonts, varies in size, and is rotated, then after dividing and thresholding we may get a pattern like this:
○ ■ ○
○ ■
○ ■ ○
■ ○
■
So what is this number? We use exclusion! Rule out everything that is impossible. Among the digits 0 to 9, this pattern cannot be 0, 1, 3, 4, 5, 6, 7, 8, or 9, so it can only be 2.
If you have written verification code recognition before, you probably already see where this is going! Yes, we need to build a database of such patterns, a recognition library, that records which digit each observed pattern belongs to.
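A recognition library of this kind can be as simple as a lookup table from pattern to digit. The sketch below is an assumption about how one might store it, not my real database: the names `pattern`, `LIBRARY`, and `recognize` are invented for the example, and the entries for 2 and 3 are my reading of the clean patterns for those digits. Several different patterns may map to the same digit, which is how deformed variants get covered.

```python
def pattern(*rows):
    """Turn rows drawn with ■/○ into a '1'/'0' string (spaces are ignored)."""
    return "".join("1" if ch == "■" else "0"
                   for row in rows for ch in row if ch in "■○")


# Hypothetical library: pattern string -> digit.
LIBRARY = {
    pattern("○■○", "■○■", "■○■", "■○■", "○■○"): 0,
    pattern("■■■", "○○■", "■■■", "■○○", "■■■"): 2,  # assumed clean "2"
    pattern("■■■", "○○■", "■■■", "○○■", "■■■"): 3,  # assumed clean "3"
}


def recognize(signature, library=LIBRARY):
    """Return the digit stored for this pattern, or None if it is unknown."""
    return library.get(signature)
```

Whenever a new deformed variant is confirmed by eye, it is added with `LIBRARY[new_pattern] = digit`, and the library grows as more samples are seen.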
Another example:
○ ■ ○
■ ○
■
■ ○ ■
○ ■ ○
Which number is this? It's 6. That's right.
Here I should explain why we use 5 rows and 3 columns, 15 blocks. With too many blocks, your recognition library becomes very large; with too few, many patterns become ambiguous.
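To put rough numbers on this trade-off: every block is either ■ or ○, so a grid with r rows and c columns can produce at most 2^(r*c) different patterns. That is only an upper bound on the library size (real digits produce far fewer patterns), and the grid sizes other than 5x3 below are arbitrary comparison points. The name `pattern_space` is made up for this sketch.

```python
def pattern_space(rows, cols):
    """Upper bound on the number of distinct block patterns for a grid."""
    return 2 ** (rows * cols)


for rows, cols in [(3, 2), (5, 3), (7, 5)]:
    print(f"{rows}x{cols}: {pattern_space(rows, cols)}")
# 3x2: 64
# 5x3: 32768
# 7x5: 34359738368
```

With 3x2 there are so few patterns that ten digits in several fonts will inevitably share them; with 7x5 almost every deformed sample gives a new pattern that has to be stored.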
Also, pay attention to the percentage you choose. It must be neither too large nor too small.
Once you have built your own database, you can recognize most digits.
The last problem is collisions. For example, the digit in the image is clearly a 5, but it is in an unusual font and rotated, so we end up with the following pattern:
■
■ ○
■
■ ○ ■
■
In my database this block pattern is a 6, so the digit is recognized incorrectly. What should I do?
My solution is to delete this entry from the database, because it gives wrong answers.
Such cases then need a second pass. My method is to lower the percentage threshold, and then we get:
■ ○
■ ○
■ ○
○ ■
■ ○
OK. After the percentage is lowered, the pattern changes from "6" to "5" ~~~ Because the percentage is different, we need to build a second recognition database to store the patterns produced at this threshold.
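The two-pass idea can be sketched like this. It is a simplified illustration under stated assumptions: the image is a list of rows of 0/1 pixels, each database is a plain dict from pattern string to digit, and the thresholds 1/2 and 1/3 as well as the names `signature` and `recognize_two_pass` are picked for the example only.

```python
def signature(image, threshold, rows=5, cols=3):
    """Block pattern of a 0/1 image as a '1'/'0' string, read row by row."""
    num, den = threshold
    height, width = len(image), len(image[0])
    bits = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            total = (bottom - top) * (right - left)
            ink = sum(image[y][x]
                      for y in range(top, bottom)
                      for x in range(left, right))
            bits.append("1" if ink * den > total * num else "0")
    return "".join(bits)


def recognize_two_pass(image, primary, secondary,
                       first=(1, 2), second=(1, 3)):
    """Look the image up in the primary database at the first threshold.
    If the pattern is not there (for example because the ambiguous entry
    was deleted), recompute the pattern at the second threshold and look
    it up in the secondary database. Returns None if both lookups fail."""
    digit = primary.get(signature(image, first))
    if digit is None:
        digit = secondary.get(signature(image, second))
    return digit
```

Each database must only contain patterns produced at its own threshold; mixing them would bring the collisions back.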
This is just an idea, offered as food for thought.