Principle of simple OCR (www.team509.com)

Last Update:2018-12-05 Source: Internet

Author: User


Many websites use verification codes to prevent automated abuse, such as DoS attacks, when accepting HTTP input. Simply put, before or together with the form, the server sends the client a small image, usually made up of random digits or letters (in rare cases, other characters), and requires that the code be entered correctly in the HTTP form; invalid input causes the submission to fail. This mechanism can effectively prevent or reduce the negative effect of automated login, posting, and voting robots on the website.

Author: doublelee
Date: 2005-5-5

Overview:

Early verification images were stored on the server as a picture library, and each time a random image was selected and sent to the client. An easy attack on this mechanism is to download all the images with an automated tool, identify them manually, and store the results in a database. The attacker can then look up the code for any image received, or even just for its checksum. For 4-digit codes the attacker only needs to store 10,000 images, which is actually a small workload.
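
The checksum lookup described above can be sketched in a few lines of C. This is only an illustration: the struct, the function names, and the choice of FNV-1a as the "checksum" are assumptions made for the example, not something taken from a real attack tool.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of the attacker's table: image checksum -> 4-digit code.
   (Hypothetical layout, for illustration only.) */
struct known_image {
    uint32_t checksum;
    char code[5];
};

/* FNV-1a, a simple 32-bit hash, standing in for "checksum". */
uint32_t image_checksum(const unsigned char *data, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

/* Linear search over the table. Returns the stored code,
   or NULL if the image has never been seen before. */
const char *lookup_code(const struct known_image *table, size_t n,
                        const unsigned char *data, size_t len)
{
    uint32_t sum = image_checksum(data, len);
    for (size_t i = 0; i < n; i++)
        if (table[i].checksum == sum)
            return table[i].code;
    return NULL;
}
```

With only 10,000 entries even a linear search is fast enough, which is why a static library offers so little protection.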

In response to this attack, many JSP, PHP, and CGI scripts generate the verification images dynamically: the code is generated randomly first, and the image is then generated from the code. In this way, the images generated for the same code are completely different each time, which defeats the stored image library attack. However, because of technical limitations, the font deformation in dynamically generated images is usually weak, and most of them have no deformation mechanism at all. In that case, writing a program to recognize such images becomes possible.

This article takes a website in China as an example to discuss the idea of writing such an automatic image recognition tool.

Body:

First, the image format. Such images are usually BMP, JPG, or GIF, and open-source programs can convert them all into one uniform format. Here only BMP is targeted, because it is the simplest format and the easiest to process.

Now let's analyze the composition of the image. After sending 50 requests, we get 50 sample images, such as one showing the number 7208.

Analysis shows that each image contains 4 digits, with no letters or other symbols, and that the format is BMP. All images have the same width and height. According to the BMP file format documentation, the color palettes are identical and each pixel is represented by 2 bytes. In addition, all the digits have a fixed position, a fixed size, and a fixed font; only the colors and some random dark spots in the background change. Possibly out of consideration for color-blind users, no very bright colors are used: the pixels inside the digit outlines are basically dark gray with little variation, and the same is true of the spots in the background. There are about 40 spot pixels (out of 17*60 = 1020 pixels in total).
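
The fields examined in this step can be read directly from the start of the file. The sketch below assumes the standard BMP layout (a 14-byte file header followed by a BITMAPINFOHEADER, all values little-endian); the struct and function names are invented for the example and do not appear in the program later in this article.

```c
#include <stdint.h>

/* The header fields that matter for this analysis. */
struct bmp_info {
    uint32_t file_size;      /* bfSize,     file offset 2  */
    uint32_t data_offset;    /* bfOffBits,  file offset 10 */
    int32_t  width;          /* biWidth,    file offset 18 */
    int32_t  height;         /* biHeight,   file offset 22 */
    uint16_t bits_per_pixel; /* biBitCount, file offset 28 */
};

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* hdr must hold at least the first 30 bytes of the file.
   Returns 1 and fills *out if the "BM" signature is present, else 0. */
int parse_bmp_header(const unsigned char *hdr, struct bmp_info *out)
{
    if (hdr[0] != 'B' || hdr[1] != 'M')
        return 0;
    out->file_size = le32(hdr + 2);
    out->data_offset = le32(hdr + 10);
    out->width = (int32_t)le32(hdr + 18);
    out->height = (int32_t)le32(hdr + 22);
    out->bits_per_pixel = (uint16_t)(hdr[28] | (hdr[29] << 8));
    return 1;
}
```

For the images discussed here, one would expect a width of 60, a height of 17, and 16 bits per pixel.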

Further analysis determines the position and pixels of each digit. Because the BMP file format is very simple, this work can be done very quickly. The data obtained is as follows:

// The image height is 17 and the width is 60; X and Y below are first-quadrant coordinates.
// Each digit has a height of 13 and a width of 10. The gap between digits is 3.
// The starting coordinates are in sequence (x, y) =)
// Each pixel occupies 2 bytes

Based on the above data, the program can easily locate each digit and extract the data of all the pixels at that position. This data fits in an array of 2*13*10 = 260 bytes.
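
As a sketch of that extraction, the helper below computes where each of the 260 bytes lives in the file. The header length 0x42, the left margin of 7 pixels, the digit pitch of 13 pixels, and the one-row bottom margin are read off the readmap function in the program further down; the constant and function names themselves are illustrative.

```c
#include <stddef.h>

enum {
    HEADER_LEN = 0x42,   /* BMP header length used by the program  */
    IMG_WIDTH = 60,      /* pixels per row                          */
    BYTES_PER_PIXEL = 2,
    DIGIT_W = 10,        /* digit width in pixels                   */
    DIGIT_H = 13,        /* digit height in pixels                  */
    DIGIT_PITCH = 13,    /* digit width 10 + gap 3                  */
    FIRST_X = 7,         /* left margin, taken from readmap         */
    FIRST_Y = 1          /* bottom margin, taken from readmap       */
};

/* File offset of byte b (0..19) in row y (0..12) of digit i (0..3). */
size_t digit_byte_offset(int i, int b, int y)
{
    size_t pixel = (size_t)(FIRST_X + DIGIT_PITCH * i)
                 + (size_t)IMG_WIDTH * (size_t)(y + FIRST_Y);
    return HEADER_LEN + pixel * BYTES_PER_PIXEL + (size_t)b;
}

/* Copy the 260 bytes of digit i out of a buffer holding the whole file. */
void extract_digit(const unsigned char *file, int i, unsigned char out[260])
{
    for (int y = 0; y < DIGIT_H; y++)
        for (int b = 0; b < DIGIT_W * BYTES_PER_PIXEL; b++)
            out[b + y * DIGIT_W * BYTES_PER_PIXEL] =
                file[digit_byte_offset(i, b, y)];
}
```

The largest offset, for the last byte of the top row of the fourth digit, is 1737, safely inside the 0x083a-byte file.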

The next question is how to accurately determine which digit those 260 bytes represent, which is a fuzzy matching problem. Fortunately, in this example there are only 10 matching targets, and the interference has little impact. After some experimentation, I concluded that generating 10 standard matching targets is enough. Each matching target is called a digit template (module). The 10 templates represent the typical gray distribution of the digits 0 ~ 9, and each template also occupies 260 bytes. The 260 bytes extracted each time are compared with every template, and the template with the smallest difference is the matching result.

There are two problems: 1. Where do the templates come from? 2. What formula should the comparison function use?

For the first problem, I did not want to look up and fill in 260 bytes per template by hand, one byte at a time. That is 2600 entries, far too tiring! So I wrote a program to gather statistics instead. At first, I averaged all the samples of the same digit. Take 0 as an example: suppose k images of 0 have been counted so far, so that each byte of the template M0(k) is the average of the corresponding byte over those k images. After seeing the (k+1)-th image of 0, P0(k+1), the template is corrected to
M0(k+1) = (M0(k) * k + P0(k+1)) / (k+1)
However, a problem showed up immediately. P0(k) varies greatly between colors, so the average is not representative. More seriously, in integer arithmetic the averaging formula, which contains a division, does not work well and the error is very large. After a few rounds of calculation the template data becomes a mess, and new samples hardly improve it.
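
The truncation error is easy to reproduce. The sketch below applies the update formula with C integer division; the function names are made up for the example.

```c
/* One update step of the running average in integer arithmetic:
   M(k+1) = (M(k) * k + P(k+1)) / (k+1), where the division truncates. */
unsigned running_average_step(unsigned m, unsigned k, unsigned p)
{
    return (m * k + p) / (k + 1);
}

/* Feed n byte samples through the integer update, one after another. */
unsigned integer_average(const unsigned char *samples, unsigned n)
{
    unsigned m = 0;
    for (unsigned k = 0; k < n; k++)
        m = running_average_step(m, k, samples[k]);
    return m;
}
```

Given one sample of 200 followed by nine samples of 201, integer_average returns 200 although the true mean is 200.9: every step computes (200 * k + 201) / (k + 1), which truncates back to 200, so the later samples never register.
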
Closer inspection of the data shows that at dark pixels the high byte is generally less than 0xc0, although its exact value depends on the color and varies widely, while at light pixels the high byte is greater than 0xc0. The visual impression of the image is determined only by how dark each pixel is; the specific color does not matter. So I made a correction: take 0xc0 as the threshold, count 1 for a value greater than the threshold and 0 otherwise, and add that to the corresponding byte of the template. A module_weight is kept for each template to record how many source images it has accumulated. Once this count reaches 255, no more samples are counted, which ensures that all templates carry the same weight. This later proved effective.
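
A minimal sketch of this counting scheme, with illustrative names, is shown below. One consequence worth noting (my reading, not stated by the author) is that after exactly 255 samples a byte that is always light ends up at 255 and a byte that is always dark stays at 0, so the finished template is on the same 0 ~ 255 scale as the raw image bytes.

```c
enum { DIGIT_BYTES = 260, THRESHOLD = 0xc0, MAX_WEIGHT = 0xff };

/* Add one labelled sample to a template.
   Returns 1 if the sample was counted, or 0 if the template
   already holds MAX_WEIGHT samples and is left unchanged. */
int train_template(unsigned char tmpl[DIGIT_BYTES], unsigned char *weight,
                   const unsigned char sample[DIGIT_BYTES])
{
    if (*weight == MAX_WEIGHT)
        return 0;
    for (int j = 0; j < DIGIT_BYTES; j++)
        if (sample[j] > THRESHOLD)
            tmpl[j]++;          /* count "light" votes for this byte */
    (*weight)++;
    return 1;
}
```

Because each byte gains at most 1 per sample and counting stops at 255 samples, a template byte can never wrap around.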

For the second problem, I first wanted to use the sum of the absolute values of the byte differences, that is
SIGMA (ABS (X (I) - M (I))), I = 0, ..., 259

Here, X (I) indicates the ith byte of the image to be judged, and M (I) indicates the ith byte value in the template.
Later I found that this value is strongly affected by color, so the effect of each individual pixel difference on the overall comparison function has to be increased. To achieve this, in addition to modifying the template file, the comparison formula must also be adjusted. If you know some basic probability theory, a natural choice is to calculate the second-order central moment, that is, the variance.

Now the formula becomes
SIGMA ((X (I) - M (I)) ^ 2), I = 0, ..., 259
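
To see how the two formulas can disagree, here is a small sketch with both distance functions and a nearest-template search. The function names and the toy data in the note that follows are invented for illustration.

```c
#include <stddef.h>

/* SIGMA (ABS (X (I) - M (I))): sum of absolute byte differences. */
unsigned long sum_abs_diff(const unsigned char *x, const unsigned char *m,
                           size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += (unsigned long)(x[i] > m[i] ? x[i] - m[i] : m[i] - x[i]);
    return total;
}

/* SIGMA ((X (I) - M (I)) ^ 2): sum of squared byte differences. */
unsigned long sum_sq_diff(const unsigned char *x, const unsigned char *m,
                          size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long d = (unsigned long)(x[i] > m[i] ? x[i] - m[i]
                                                      : m[i] - x[i]);
        total += d * d;
    }
    return total;
}

/* Index of the template closest to x under the given distance.
   The templates are `count` blocks of n bytes stored back to back. */
size_t best_match(const unsigned char *x, const unsigned char *templates,
                  size_t count, size_t n,
                  unsigned long (*distance)(const unsigned char *,
                                            const unsigned char *, size_t))
{
    size_t best = 0;
    unsigned long best_d = (unsigned long)-1;   /* largest possible value */
    for (size_t j = 0; j < count; j++) {
        unsigned long d = distance(x, templates + j * n, n);
        if (d < best_d) {
            best_d = d;
            best = j;
        }
    }
    return best;
}
```

Take x = {100, 100, 100, 100} and two templates, A = {110, 110, 110, 110} and B = {100, 100, 100, 135}. The absolute-difference sum is 40 for A and 35 for B, so it picks B; the squared sum is 400 for A and 1225 for B, so it picks A. Squaring makes one badly wrong byte count for more than several slightly wrong ones.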

All right, the basic problems have been solved, so let's get the program working. Note that each pixel is 2 bytes, that is, one short. The square of the difference between two shorts has to be stored in an int, and the sum of all those ints can overflow. To keep the program simple and avoid big-integer arithmetic, I treat the two bytes of each pixel separately, so that the square of each byte difference fits in an unsigned short. The sum of 260 unsigned short values, haha, does not overflow an unsigned int.
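
The bound is easy to check with a few lines of arithmetic. The helpers below are illustrative and assume a 32-bit unsigned int, as on the author's platform.

```c
#include <stdint.h>

/* Largest possible value of a sum of `count` squared differences,
   when each difference is at most `max_diff`. Computed in 64 bits
   so that the check itself cannot overflow for these inputs. */
uint64_t worst_case_sum(uint64_t count, uint64_t max_diff)
{
    return count * max_diff * max_diff;
}

/* Nonzero if v can be stored in a 32-bit unsigned int. */
int fits_in_uint32(uint64_t v)
{
    return v <= UINT32_MAX;
}
```

Working per byte, the worst case is 260 * 255^2 = 16,906,500, far below the 32-bit limit of 4,294,967,295. Working per 16-bit pixel, a single squared difference can reach 65535^2 = 4,294,836,225, so a sum of just two of them can already exceed that limit.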

Done!

Program:

Below is the C source code, fewer than 200 lines. It handles the BMP file format and includes the template generation function, that is, the study function.

//----------------------------------------------------------------------
// File name: myocr.c
// Specially crafted for xxxx.com's image.
// By doublelee@etang.com

#include <stdio.h>
#include <string.h>

#define MOD_FILE "digit.mod"

unsigned char map[4][260]; // 4 digits in each image, each represented by 260 bytes

unsigned char module[10][260]; // 10 templates, representing 0 ~ 9, also 260 bytes each

unsigned char buf[0x083a]; // the whole file; the file size is fixed

unsigned char module_weight[10]; // number of training samples per template; stops increasing at 255

unsigned int diff, tmpdiff; // diff is the distance between a digit and a template, sigma ((X - M) ^ 2)
// tmpdiff is the smallest distance found so far; smaller means more similar

char value[4] = {0}; // the result; each value must be between 0 and 9

void readmap(void) // copy the digit data out of buf
{
    int i;    // digit index
    int x, y; // x is a byte offset within a digit row, y is the row
    for (i = 0; i < 4; i++)
        for (x = 0; x < 20; x++)
            for (y = 0; y < 13; y++)
            {
                map[i][x + y * 20] = buf[0x42 + (7 + 13 * i + 60 * (y + 1)) * 2 + x];
                // The image height is 17 and the width is 60; x and y are first-quadrant coordinates.
                // Each digit has a height of 13 and a width of 10. The gap between digits is 3.
                // Each pixel occupies 2 bytes, hence the "* 2". 0x42 is the length of the BMP header.
                // This could be improved: looking only at the high byte of each pixel
                // (which is closely related to the gray level) should improve accuracy.
                // The current test results are satisfactory.
            }
}

void outputvalue(void) // output the result
{
    int i;
    for (i = 0; i < 4; i++)
        fprintf(stdout, "%d", value[i]);
}

int read_mod(void) // load the templates; start from zero if the file is missing
{
    FILE *fp;
    fp = fopen(MOD_FILE, "rb");
    if (!fp)
    {
        memset(module, 0, 10 * 260);
        memset(module_weight, 0, 10);
        return 0;
    }
    fread(module, 10, 260, fp);
    fread(module_weight, 10, 1, fp);
    fclose(fp);
    return 1;
}

int write_mod(void) // save the templates
{
    FILE *fp;
    fp = fopen(MOD_FILE, "wb");
    if (!fp)
        return 0;
    fwrite(module, 10, 260, fp);
    fwrite(module_weight, 10, 1, fp);
    fclose(fp);
    return 1;
}

int checkbmp(FILE *fd)
{
    int x = 0;
    if (!fd)
        return 0;
    // "BM" followed by the low 16 bits of the file size
    // (assumes a 32-bit int on a little-endian machine)
    if (fread(&x, 1, sizeof(int), fd) != sizeof(int))
        return 0;
    if (x != 0x083a4d42)
        return 0;

    fseek(fd, -1, SEEK_END);
    x = ftell(fd);
    if (x + 1 != 0x083a)
        return 0;
    // The above is only a basic check. More BMP header fields could be checked.
    rewind(fd);
    if (fread(buf, 1, 0x083a, fd) != 0x083a)
        return 0;

    readmap();
    return 1;
}

void checkdigit(void) // recognition function
{
    int i, j, k;
    unsigned char x;

    for (i = 0; i < 4; i++)
    {
        tmpdiff = 0xffffffff;

        for (j = 0; j < 10; j++)
        {
            diff = 0;
            for (k = 0; k < 260; k++)
            {
                x = module[j][k] > map[i][k] ?
                    (module[j][k] - map[i][k]) :
                    (map[i][k] - module[j][k]);
                diff += x * x;
            } // calculate sigma ((X - M) ^ 2)
            if (diff < tmpdiff)
            {
                value[i] = j;
                tmpdiff = diff;
            }
        } // compare against all 10 templates
    } // solve all four digits
}

int study(void) // training function
{
    int i, j;

    for (i = 0; i < 4; i++)
    {
        if (module_weight[value[i]] != 0xff)
        {
            for (j = 0; j < 260; j++)
            {
                module[value[i]][j] += (map[i][j] > 0xc0 ? 1 : 0);
            }
            module_weight[value[i]]++; // one more training sample
        }
    }

    return 0;
}

int main(int argc, char **argv)
{
    FILE *fd = NULL;
    int i, ret;

    read_mod();

    if (argc == 1)
        fd = stdin;

    if (argc > 1)
    {
        fd = fopen(argv[1], "rb");
        if (!fd)
        {
            fprintf(stderr, "Open file %s failed", argv[1]);
            return 0;
        }
    }

    if (argc > 2) // followed by a training number
    {
        for (i = 0; i < 4; i++)
        {
            value[i] = argv[2][i] - 0x30;
            if (value[i] > 9 || value[i] < 0)
            {
                fprintf(stderr, "Wrong number %s", argv[2]);
                return 0;
            }
        }
        if (!checkbmp(fd)) // do not train on an image that failed the check
            return 0;
        ret = study();
        write_mod();
        fclose(fd);
        return ret;
    }

    if (checkbmp(fd))
    {
        checkdigit();
        outputvalue();
    }
    if (fd) fclose(fd);
    return 1;
}
//----------------------------------------------------------------------

This program maintains and relies on a template file, digit.mod. With one argument, the argument is taken as a BMP file name, and the number it shows is determined using the template file, for example

myocr 2110.bmp

With two arguments, learning mode is enabled: the template file digit.mod is updated from the BMP file named by argv[1] and the number given by argv[2]. For example

myocr 2110.bmp 2110

I wrote a BAT file to run all the training. After enough training sessions, all the module_weight values in digit.mod become 0xff.

The above is an image recognition program for one specific website. With slightly adjusted parameters, it can recognize many similar images.

Conclusion:

Looking at verification code technology as a whole, it can basically be divided into static image libraries and dynamic generation.

Static image libraries can indeed defeat program recognition by using strong deformation, but a small set of images is easy to enumerate exhaustively, and generating such a library requires manual work, since otherwise it is hard to guarantee that the images stay readable. The cost is therefore high.

With dynamic generation, the complexity of the program and the need to keep the images readable mean that good deformation is often not achieved, or there is no deformation at all, which gives programs a welcome opportunity to attack. Attacks on such dynamic images can be summed up in one sentence: "What is generated by a program is easily identified by a program."

In fact, verification code technology does play a role in preventing Web robots. However, because of its implementation limitations, a website cannot rely on this technology alone to ensure its security.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
