PHP verification code recognition (intermediate)

Source: Internet
Author: User

In the previous article, "PHP verification code recognition (preliminary)", I explained how to recognize simple verification codes: codes made of digits and letters, in a uniform format and at a fixed position every time. This article goes deeper. The target this time is a verification code that consists of letters and digits, whose characters are rotated (to the left or to the right), whose position is not fixed, whose characters stick together, and which carries stronger interference.

The method described here is not a universal solution, and the code provided will not solve your problem directly. It is only one approach, and readers will have to adapt it to their own case. Verification code recognition has nothing to do with a specific programming language. I use PHP here, but the method can be implemented in any language.

This article describes the steps involved in the verification code recognition process.


 

The explanations that follow are all based on the example image above.
I. Binarization. When we get a verification code, the first thing to do is binarize it: the pixels that belong to the verification code are represented by 1 and the background by 0. Telling them apart is simple. Print the RGB value of every pixel of the image and look for a pattern. For the image above, the character pixels have an R value greater than 120 and G and B values less than 80, so the image is easy to binarize with this rule. Now look at the two images recognized in the first article.


These look complicated. The background color of every image is different, and so is the color of every digit, so binarization seems difficult. In fact, once we print out the RGB values, the rule is easy to see: no matter how the color of a digit changes, at least one of its R, G, and B values is always less than 125.

$rgbarray['red'] < 125 || $rgbarray['green'] < 125 || $rgbarray['blue'] < 125

With this condition we can easily tell where the digits are and where the background is.

The reason such rules exist is that, for the interference in a verification code not to spoil the legibility of the digits, the RGB values of the interference and the RGB values of the digits must be kept apart so that they do not overlap. Once we understand this, binarization is easy.

Thresholds such as 120, 80, and 125 are found by inspection and may not match the actual RGB values exactly, so after binarization a stray 1 sometimes appears where there is no character. For verification codes whose characters sit at fixed positions, this interference matters little. For images where the position is not fixed, however, it is likely to disturb the character cutting. Therefore noise must be removed after binarization.
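The two binarization rules above can be sketched as follows. This is only a minimal sketch: the function names (is_char_red, is_char_dark, binarize, read_pixels) are hypothetical and do not come from the downloadable source, and read_pixels assumes that the GD extension is installed. The thresholds are the ones quoted in the text and have to be re-derived for every new kind of image.

```php
<?php
// Hypothetical helper names; the thresholds are the ones quoted in the text.

/** Rule for the example image: character pixels are strongly red. */
function is_char_red(array $rgb): bool
{
    return $rgb['red'] > 120 && $rgb['green'] < 80 && $rgb['blue'] < 80;
}

/** Rule for the images of the first article: some channel is below 125. */
function is_char_dark(array $rgb): bool
{
    return $rgb['red'] < 125 || $rgb['green'] < 125 || $rgb['blue'] < 125;
}

/** Turn a grid of RGB arrays ($pixels[$row][$col]) into a 0/1 matrix. */
function binarize(array $pixels, callable $isChar): array
{
    $data = [];
    foreach ($pixels as $i => $row) {
        foreach ($row as $j => $rgb) {
            $data[$i][$j] = $isChar($rgb) ? 1 : 0;
        }
    }
    return $data;
}

/** Read an image file into the grid expected by binarize(). Needs GD. */
function read_pixels(string $path): array
{
    $img = imagecreatefromstring(file_get_contents($path));
    if ($img === false) {
        throw new RuntimeException("Cannot read image: $path");
    }
    $pixels = [];
    for ($y = 0; $y < imagesy($img); $y++) {
        for ($x = 0; $x < imagesx($img); $x++) {
            // imagecolorsforindex() returns the keys red, green, blue, alpha.
            $pixels[$y][$x] = imagecolorsforindex($img, imagecolorat($img, $x, $y));
        }
    }
    return $pixels;
}
```

Printing the result of read_pixels() for a sample image is one way to find the thresholds in the first place; passing the rule as a callable keeps binarize() independent of any one image style.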

II. Noise removal. The next step is to remove the noise. The principle is simple: remove isolated 1 values. If the image is very noisy and high efficiency is required, there is a lot of work to do here. Fortunately we do not need anything that advanced and can use the simplest method. If the value of a point is 1, check whether the points above, below, to the left, to the right, and on the four diagonals of it are 1. If none of them is 1, the point is considered noise and is set to 0 directly.




 

As shown in the figure, this method easily finds that the 1 in the red box is noise, and it is set to 0 directly.

The check uses a trick, because sometimes the noise is two adjacent 1s rather than a single point:

// $data is the binarized matrix; this runs for every point ($i, $j).
$num = 0;
if ($data[$i][$j] == 1) {
    // Up
    if (isset($data[$i - 1][$j])) {
        $num = $num + $data[$i - 1][$j];
    }
    // Down
    if (isset($data[$i + 1][$j])) {
        $num = $num + $data[$i + 1][$j];
    }
    // Left
    if (isset($data[$i][$j - 1])) {
        $num = $num + $data[$i][$j - 1];
    }
    // Right
    if (isset($data[$i][$j + 1])) {
        $num = $num + $data[$i][$j + 1];
    }
    // Upper left
    if (isset($data[$i - 1][$j - 1])) {
        $num = $num + $data[$i - 1][$j - 1];
    }
    // Upper right
    if (isset($data[$i - 1][$j + 1])) {
        $num = $num + $data[$i - 1][$j + 1];
    }
    // Lower left
    if (isset($data[$i + 1][$j - 1])) {
        $num = $num + $data[$i + 1][$j - 1];
    }
    // Lower right
    if (isset($data[$i + 1][$j + 1])) {
        $num = $num + $data[$i + 1][$j + 1];
    }
}
// No neighbor is set: treat the point as noise.
if ($num == 0) {
    $data[$i][$j] = 0;
}


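That is, we calculate the sum of the values of the eight neighbors of the point and then check whether the sum is below a specific threshold. With the comparison against 0 used above, only fully isolated points are removed; comparing against 1 also removes noise made of two adjacent 1s.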
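The neighbor-sum check can be wrapped into a complete function, sketched below. The function name denoise and the $threshold parameter are hypothetical. Unlike the fragment above, this sketch reads from the original matrix and writes to a copy, so a point that has already been cleared does not influence the points checked after it; that is a design choice of the sketch, not something the article prescribes.

```php
<?php
/**
 * Remove isolated points from a 0/1 matrix.
 * $threshold is a hypothetical parameter: a point is cleared when the sum
 * of its eight neighbors is <= $threshold (0 removes single points only,
 * 1 also removes pairs of adjacent points).
 */
function denoise(array $data, int $threshold = 0): array
{
    $clean = $data; // read from $data, write to $clean
    foreach ($data as $i => $row) {
        foreach ($row as $j => $value) {
            if ($value != 1) {
                continue;
            }
            $num = 0;
            for ($di = -1; $di <= 1; $di++) {
                for ($dj = -1; $dj <= 1; $dj++) {
                    if ($di === 0 && $dj === 0) {
                        continue; // skip the point itself
                    }
                    // Neighbors outside the matrix count as 0.
                    $num += $data[$i + $di][$j + $dj] ?? 0;
                }
            }
            if ($num <= $threshold) {
                $clean[$i][$j] = 0;
            }
        }
    }
    return $clean;
}
```

A higher threshold removes more noise but also starts to eat thin strokes, so it is worth checking the result on a few real samples before settling on a value.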
III. Character cutting. After noise removal we have clean binarized data, and the next step is to cut out the characters. There are many ways to do this. Here I use the simplest one: first cut the image vertically into characters, then trim the surplus all-0 rows in the horizontal direction, as in the following figure:




 

Step 1 cuts along the red lines and step 2 cuts along the blue lines, which yields independent characters. But consider the following case:



With the method above, the characters d and w are cut out as a single character. This is a wrong cut, so we now have to deal with cutting characters that stick together.
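The simple two-step cut (columns first, then rows) can be sketched as follows. The function names are hypothetical, and the sketch assumes a rectangular matrix with rows and columns indexed from 0. As described above, it cannot separate characters that touch: d and w in the example would come back as one block.

```php
<?php
/**
 * Cut a 0/1 matrix into character blocks.
 * Step 1: split at columns that contain no 1 (the red lines).
 * Step 2: trim all-zero rows above and below each block (the blue lines).
 */
function cut_characters(array $data): array
{
    $height = count($data);
    $width = $height > 0 ? count($data[0]) : 0;
    $blocks = [];
    $start = null;
    // Run one column past the end so that a trailing block is closed.
    for ($j = 0; $j <= $width; $j++) {
        $filled = false;
        for ($i = 0; $j < $width && $i < $height; $i++) {
            if ($data[$i][$j] == 1) {
                $filled = true;
                break;
            }
        }
        if ($filled && $start === null) {
            $start = $j;
        } elseif (!$filled && $start !== null) {
            $blocks[] = trim_rows(slice_columns($data, $start, $j - 1));
            $start = null;
        }
    }
    return $blocks;
}

/** Keep columns $from..$to (inclusive) of every row. */
function slice_columns(array $data, int $from, int $to): array
{
    $out = [];
    foreach ($data as $row) {
        $out[] = array_slice($row, $from, $to - $from + 1);
    }
    return $out;
}

/** Drop all-zero rows at the top and bottom of a block. */
function trim_rows(array $block): array
{
    $rows = array_filter($block, function ($row) {
        return array_sum($row) > 0;
    });
    if (!$rows) {
        return [];
    }
    $keys = array_keys($rows);
    return array_slice($block, $keys[0], end($keys) - $keys[0] + 1);
}
```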
IV. Cutting stuck characters. When a verification code is generated from regular characters, stuck characters are easy to split; if the characters themselves are scaled or deformed, it is much harder. Analysis shows that the adhesion above is the simple kind, between regular characters only, so a very simple method is enough. After the cut in the previous step, we cannot assume right away that each block is one character. The deciding factor is whether the width of a block exceeds a threshold. The threshold is chosen so that a single character, however it is rotated or deformed, is never wider than it. A block wider than the threshold is therefore treated as two stuck characters, a block wider than twice the threshold as three stuck characters, and so on. Once a block is known to hold stuck characters, it is simply divided into two or more new blocks. To restore the characters better, I usually divide the block equally and then pad each part by +1 / -1 at the cut.
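A minimal sketch of the width rule follows. The function name split_stuck and its parameters are hypothetical, and the way the "+1, -1" padding is applied (each part is extended by $pad columns across every inner cut, so neighboring parts overlap slightly) is my reading of the description, not something taken from the downloadable source.

```php
<?php
/**
 * Split a block that is too wide to be a single character.
 * $maxWidth: width that no single character exceeds (chosen by hand).
 * $pad: hypothetical overlap in columns added at every inner cut.
 */
function split_stuck(array $block, int $maxWidth, int $pad = 1): array
{
    $width = count($block[0]);
    // Wider than T means 2 characters, wider than 2T means 3, and so on.
    $parts = (int) ceil($width / $maxWidth);
    if ($parts <= 1) {
        return [$block];
    }
    $result = [];
    for ($k = 0; $k < $parts; $k++) {
        // Equal division of the columns.
        $from = intdiv($k * $width, $parts);
        $to = intdiv(($k + 1) * $width, $parts) - 1;
        // Extend across inner cuts only, never past the block edges.
        $from = max(0, $from - ($k > 0 ? $pad : 0));
        $to = min($width - 1, $to + ($k < $parts - 1 ? $pad : 0));
        $piece = [];
        foreach ($block as $row) {
            $piece[] = array_slice($row, $from, $to - $from + 1);
        }
        $result[] = $piece;
    }
    return $result;
}
```

Equal division works only because the characters are assumed to be regular. If they are scaled by different amounts, the cut lands in the wrong place, which is exactly why the suggestions at the end of the article recommend adhesion combined with irregular characters.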
V. Character matching. After the four steps above we have clean character blocks, and the next step is to match them against known characters. There are many ways to build patterns for rotated characters, and I will not go into them here. The simplest method, which I use, is to build a matching library that covers every character, which is why the code I provide includes a study operation. The idea is that a person first identifies the verification code in an image by hand and then writes it into the signature library through the study method. The more image data is written this way, the higher the recognition accuracy.
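The study-then-match idea can be sketched as a small in-memory library. Everything here is hypothetical: the class name SignatureLibrary, the way a block is flattened into a signature string, and the use of PHP's similar_text() as the similarity measure are assumptions made for illustration. The downloadable source may store and compare signatures differently.

```php
<?php
/**
 * Hypothetical signature library: study() stores labelled blocks,
 * recognize() returns the label of the most similar stored signature.
 */
class SignatureLibrary
{
    /** @var array list of [signature, character] pairs */
    private $entries = [];

    /** Flatten a 0/1 block into a string such as "111101111". */
    public static function signature(array $block): string
    {
        $rows = array_map(function ($row) {
            return implode('', $row);
        }, $block);
        return implode('', $rows);
    }

    /** Store a block that a person has identified by hand. */
    public function study(array $block, string $char): void
    {
        $this->entries[] = [self::signature($block), $char];
    }

    /** Return the best-matching character, or null for an empty library. */
    public function recognize(array $block): ?string
    {
        $sig = self::signature($block);
        $best = null;
        $bestPercent = -1.0;
        foreach ($this->entries as [$known, $char]) {
            // similar_text() writes the similarity percentage into $percent.
            similar_text($sig, $known, $percent);
            if ($percent > $bestPercent) {
                $bestPercent = $percent;
                $best = $char;
            }
        }
        return $best;
    }
}
```

Because rotated characters produce different signatures, each character usually has to be studied several times, once per variant seen in real images. A real library would also be saved to a file or database between runs.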
With the steps above we can recognize most of the verification codes on the internet. Everything here uses the simplest possible methods and no OCR knowledge; this is probably about as far as non-OCR methods can go. Recognizing more complex verification codes requires more OCR knowledge. If I have the opportunity, I will cover it in an advanced article.
The following are some verification codes that are easy to recognize; website administrators should take note.
 

 



 

Suggestions for creating verification codes
For a verification code recognition program, the hardest parts are cutting the characters and building the signatures. When making verification codes, many programmers in China like to add a lot of interference points and interference lines. This hurts legibility without making the code any harder to recognize. To make your own verification code hard to recognize, the following two points are enough.
1: Character adhesion. It is recommended that all characters stick to their neighbors;
2: Do not use regular characters. Scale or rotate each part of the verification code by a different amount.
As long as these two points, or variations of them, are achieved, a recognition program will have a hard time. Take a look at the Yahoo and Google verification codes, which are hard to recognize.

Google:




Yahoo:


 


Source File: Click to download http://up.2cto.com/2012/0316/20120316111107739.rar


From the ugg column



 


