Recently I needed a program to recognize a website's verification code (CAPTCHA), so I spent some time studying the problem. I learned quite a bit along the way and am sharing it here.
CAPTCHA recognition is not a job for impatient people; it requires solid skills and patience. Because of the nature of this technique, any captcha whose recognition method has been published quickly becomes useless, since the website concerned will promptly change its captcha. This article therefore only introduces the principles of recognition and walks through recognizing the simplest kind of verification code.
---------------------------
First, I picked one of the simplest verification codes I could find: the one used by the challenge net. Open any article on the site, find the verification code in the comment form, and view its properties to get the address that generates it: "http://tiaozhan.com/checkcode.php".
Obviously, this is the simplest kind of verification code: the background color, character color, font, and character coordinates are all fixed. For this type of verification code, we only need to sample each digit to build a standard library and then compare each character against the library one by one; 100% recognition is easy to achieve.
Use the imagecreatefrompng function to load the image, then use the imagecolorat function to obtain the color value at each coordinate, and take the color of the first point as the background color. Then draw a table with the same dimensions as the image: if the color at a cell's coordinate equals the background color, the cell is left empty; otherwise, it is drawn as a black block. This gives the following decomposition diagram:
From the diagram, the digits occupy y coordinates 6-15, and the four digits occupy x coordinates 3-10, 12-19, 21-28 and 30-37, respectively.
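The background test and the block diagram described above can be sketched as follows. The article's own program uses PHP's GD functions; this sketch is written in Python over a plain list of color values so that it runs without an image library. The function names `binarize` and `render` and the tiny sample image are assumptions made for illustration, not part of the attached program.

```python
def binarize(pixels):
    """Return a 0/1 grid: 0 where a pixel equals the background color, else 1."""
    background = pixels[0][0]  # as in the article, the first point is the background
    return [[0 if p == background else 1 for p in row] for row in pixels]


def render(grid):
    """Draw the grid as text: '#' for a character pixel, '.' for background."""
    return "\n".join("".join("#" if v else "." for v in row) for row in grid)


# Hypothetical 4x5 image given as color values (not the real captcha).
WHITE, BLUE = 0xFFFFFF, 0x0000FF
sample = [
    [WHITE, WHITE, WHITE, WHITE, WHITE],
    [WHITE, BLUE, BLUE, BLUE, WHITE],
    [WHITE, BLUE, WHITE, BLUE, WHITE],
    [WHITE, BLUE, BLUE, BLUE, WHITE],
]
print(render(binarize(sample)))
```

Printing the grid as text plays the same role as the table of black blocks: it makes the rows and columns occupied by each digit easy to read off.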
So we take samples of the digits 0-9 and create 10 two-dimensional arrays ($arr_eg[0] - $arr_eg[9]). Each element of an array corresponds to one coordinate in the digit's area: if the color value at that coordinate is the same as the background, the element is 0; otherwise it is 1. This is our standard library.
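A minimal sketch of building such a library, in Python rather than the article's PHP. The coordinate ranges are the ones given in the article; the 40x20 image size, the function names, and the dictionary layout are assumptions for illustration and may differ from the attached arr.php.

```python
Y_RANGE = (6, 15)                                    # inclusive rows, from the article
X_RANGES = [(3, 10), (12, 19), (21, 28), (30, 37)]   # inclusive columns, from the article


def crop_digits(grid):
    """Cut a 0/1 grid into four digit blocks, each 10 rows by 8 columns."""
    y0, y1 = Y_RANGE
    return [[row[x0:x1 + 1] for row in grid[y0:y1 + 1]] for x0, x1 in X_RANGES]


def add_samples(library, grid, digits):
    """Store each cropped block under the digit a person read from the image."""
    for block, digit in zip(crop_digits(grid), digits):
        library[digit] = block
    return library


# Hypothetical binarized image, assumed to be 40 wide and 20 high.
grid = [[0] * 40 for _ in range(20)]
grid[6][3] = 1     # top-left corner of the first digit area
grid[15][37] = 1   # bottom-right corner of the fourth digit area

library = add_samples({}, grid, "1234")
print(len(library["1"]), len(library["1"][0]))
```

Repeating `add_samples` over a handful of captchas until all ten digits have been seen would fill the library.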
To recognize a captcha, build four arrays from it in the same way and compare each one against the standard arrays; the four digits can then be identified accurately.
The recognition program for this verification code is attached for study (demo.php is the program; arr.php is the standard library).
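The comparison step can be sketched like this. It is an illustration in Python with made-up 3x3 templates, not the attached demo.php; the name `recognize` and the "?" placeholder for an unmatched block are assumptions.

```python
def recognize(blocks, library):
    """Return one character per block: the library key whose template is
    identical to the block, or '?' when nothing matches exactly."""
    result = []
    for block in blocks:
        match = next((d for d, tpl in library.items() if tpl == block), "?")
        result.append(match)
    return "".join(result)


# Hypothetical 3x3 templates standing in for the real digit arrays.
library = {
    "1": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "7": [[1, 1, 1], [0, 0, 1], [0, 0, 1]],
}
blocks = [library["7"], library["1"], [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
print(recognize(blocks, library))
```

Exact equality is enough here only because every pixel of this captcha type is fixed; noisier types need a similarity score instead.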
Attachment: secode.rar (1688 bytes)
-----------------------
Although the above example is simple, it covers the basic principle clearly, namely: sampling -> building the standard library -> applying it -> comparing against the standard library -> identification.
In practice, however, captchas are rarely this simple. For example, here is a slightly more complex type of verification code: neither its background nor its characters are solid colors, and there are many interference points, but the character coordinates are fixed.
First we denoise it. Segment out each character, find its main color value (the character color) by frequency of occurrence, and discard the coordinates whose color differs from it by more than a certain threshold. The coordinates that remain form the target array, which is then compared against the standard library as before. In this case, however, an exact match is not possible, so we take the entry with the highest coincidence rate as the result. In practice, the recognition rate can reach 99%.
A little harder is the following type, which uses color changes, interference points, interference lines, displacement and other means of interference.
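A rough sketch of this denoising and best-match step, in Python. The color distance (sum of absolute RGB differences), the threshold of 60, the function names, and the sample data are all assumptions; the article does not say how the difference is measured.

```python
from collections import Counter


def color_distance(a, b):
    """Sum of absolute differences between two (r, g, b) tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def extract_character(block, threshold=60):
    """Keep pixels close to the most frequent color, assumed to be the character."""
    counts = Counter(p for row in block for p in row)
    main = counts.most_common(1)[0][0]
    return [[1 if color_distance(p, main) <= threshold else 0 for p in row]
            for row in block]


def coincidence(a, b):
    """Fraction of positions where two equally sized 0/1 grids agree."""
    same = sum(1 for ra, rb in zip(a, b) for x, y in zip(ra, rb) if x == y)
    return same / (len(a) * len(a[0]))


def best_match(grid, library):
    """Return the library key with the highest coincidence rate."""
    return max(library, key=lambda d: coincidence(grid, library[d]))


# Hypothetical 3x3 character cell: a red stroke on a noisy background.
cell = [
    [(10, 10, 10), (200, 0, 0), (30, 40, 50)],
    [(60, 70, 80), (200, 0, 0), (90, 90, 90)],
    [(20, 30, 40), (190, 5, 0), (15, 25, 35)],
]
library = {
    "1": [[0, 1, 0], [0, 1, 0], [0, 1, 1]],
    "7": [[1, 1, 1], [0, 0, 1], [0, 0, 1]],
}
grid = extract_character(cell)
print(grid, best_match(grid, library))
```

Taking the most frequent color as the character color only works when the background is varied enough that no single background color outnumbers the stroke pixels, which is the situation the article describes.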
Unlike the previous one, the position of each character is not fixed, so we have to locate each character ourselves and cut it out as a small block of fixed size. First remove all noise and interference lines (the characters get "accidentally injured" by the removal, usually losing 1-3 pixels) to get a cleaner image. Then sweep it with a horizontal line and a vertical line (this is a figurative description; how to implement it is left for you to work out) and discard every row and column in which the character color never appears, which narrows the analysis to a smaller area. Next, scan with a vertical line again and split wherever the color appears or disappears to get 5 small areas; scan each small area horizontally and remove the blank rows to get the target area. The target area is sometimes smaller than the standard area, so we find a way to pad it before comparing, and take the entry with the highest coincidence rate as the result. The final recognition rate is 90%.
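One way the line sweep could be implemented is sketched below in Python, starting from an image that has already been cleaned and binarized. The function names and the choice to pad on the right and bottom are assumptions; the article leaves both the implementation and the padding method open.

```python
def column_runs(grid):
    """Return (start, end) column ranges, end exclusive, that contain set pixels."""
    runs, start = [], None
    width = len(grid[0])
    for x in range(width):
        filled = any(row[x] for row in grid)
        if filled and start is None:
            start = x
        elif not filled and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, width))
    return runs


def trim_rows(block):
    """Drop blank rows above and below the character."""
    rows = [i for i, row in enumerate(block) if any(row)]
    return block[rows[0]:rows[-1] + 1]


def pad(block, height, width):
    """Pad with background on the right and bottom up to the standard size.
    Assumes the block is not larger than the standard size."""
    padded = [row + [0] * (width - len(row)) for row in block]
    padded += [[0] * width for _ in range(height - len(block))]
    return padded


def segment(grid, height, width):
    """Cut the image into one padded block per run of non-blank columns."""
    return [pad(trim_rows([row[x0:x1] for row in grid]), height, width)
            for x0, x1 in column_runs(grid)]


# Hypothetical cleaned image containing two small shapes.
clean = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
print(column_runs(clean))
print(segment(clean, height=3, width=2))
```

Each padded block can then be scored against the standard library by coincidence rate. Because denoising may shave pixels off a character, trying a few alignments when padding may match better than a fixed corner.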
Harder still is the following, the hardest kind I have studied. As shown, in addition to the interference background, the position, size, and even font of each character are uncertain. Fortunately, the characters do not stick to one another, and characters that are not stuck together are easier to cut apart (though of course harder than in the cases above). After cutting, the block size is not fixed, so it is hard to build a standard library. The only approach I can think of is to scan each cut block horizontally or vertically and determine the result from the pattern of color changes along the coordinates. This is still experimental: characters still go unrecognized, and the recognition rate is not ideal.
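The article gives no details of this scanning rule, so the following Python sketch is only one possible reading of it, not the author's method: count how many separate strokes each horizontal scan line crosses, then collapse consecutive repeats so the result does not depend on the block's size. The names `crossings` and `signature` are hypothetical.

```python
def crossings(line):
    """Count the separate strokes (runs of 1s) that a scan line passes through."""
    return sum(1 for prev, cur in zip([0] + line, line) if prev == 0 and cur == 1)


def signature(block):
    """Stroke counts per row, top to bottom, with consecutive repeats collapsed."""
    sig = []
    for row in block:
        n = crossings(row)
        if not sig or sig[-1] != n:
            sig.append(n)
    return tuple(sig)


# Hypothetical blocks: the same ring shape at two sizes, and a vertical bar.
small_ring = [[1, 1, 1], [1, 0, 1], [1, 0, 1], [1, 1, 1]]
large_ring = [[1, 1, 1, 1, 1]] + [[1, 0, 0, 0, 1]] * 4 + [[1, 1, 1, 1, 1]]
bar = [[0, 1, 0]] * 4

print(signature(small_ring), signature(large_ring), signature(bar))
```

A signature like this is the same for both ring sizes, but many different characters share the same signature, which is consistent with the poor recognition rate reported above. Combining horizontal and vertical scans would narrow the candidates further.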
---------------------
CAPTCHA recognition is a problem in the fields of artificial intelligence and computer vision. The cracker is always at a disadvantage, and because the technique is somewhat disreputable, there is little research exchange, so doing it well is very difficult. Personally, my understanding of OCR techniques is very limited, so I will not venture to say more here; I only offer my limited knowledge as a starting point for discussion.