The Complete Process and a Python Implementation of Character-Based CAPTCHA Image Recognition

Source: Internet
Author: User
Tags: image processing library, svm, ticket

Main development environment:

    • Python 3.5

      The Python version used

    • PIL

      Image processing library

    • LIBSVM

      Open-source SVM machine learning library

Setting up this environment is not the focus of this article, so it is omitted here.

6 Basic Flow

In general, the recognition process for a character CAPTCHA is as follows:

    1. Prepare raw image material
    2. Preprocess the images
    3. Cut the images into individual characters
    4. Normalize the character image size
    5. Label the character images
    6. Extract features from the character images
    7. Build a training data set from the features and labels
    8. Train a recognition model on the labeled feature data
    9. Use the recognition model to predict new, unknown images
    10. Achieve the goal: map a picture to the correct character string
7 Material Preparation

7.1 Material Selection

Since this article is for introductory learning and research purposes, the requirement was "representative, but not too difficult", so I looked directly on the Internet for a fairly representative, simple character CAPTCHA (which felt a bit like hunting for a vulnerability).

I finally found this CAPTCHA image on a rather old website (the site framework looks ancient).

Original image:

Enlarged for clarity:

This image meets the requirements. Careful observation reveals the following characteristics.

Characteristics favorable to recognition:

    1. Composed of pure Arabic numerals
    2. The character count is 4
    3. The characters are arranged regularly
    4. A single, uniform font is used

These are the main reasons this CAPTCHA is simple, and the subsequent implementation relies on them.

Characteristics unfavorable to recognition:

    1. Noise in the background of the image.

Although this is a drawback, the bar is low: the noise can be removed in a simple way.

7.2 Material Acquisition

Because training requires a lot of material, saving images manually from the browser is not feasible; it is better to write an automated download program.

The main steps are as follows:

    1. Capture the random CAPTCHA image generation endpoint via the browser's network-capture tools
    2. Request the endpoint in bulk to fetch images
    3. Save each image to a local disk directory

These are basic IT skills, so this article does not go into detail.

The code for the network request and file saving is as follows:

    def downloads_pic(**kwargs):
        pic_name = kwargs.get('pic_name', None)
        url = 'http://xxxx/rand_code_captcha/'
        res = requests.get(url, stream=True)
        with open(pic_path + pic_name + '.bmp', 'wb') as f:
            for chunk in res.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
                    f.flush()

Run this in a loop n times to save n pieces of CAPTCHA material.
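The n-times loop can be sketched as follows. `batch_download` is a hypothetical driver of mine, not the article's code; it assumes the single-image downloader above and names each file with a random hex string so repeated runs never collide:

```python
import uuid

def batch_download(n, download_one):
    # Hypothetical driver (not from the article): call the single-image
    # downloader n times, giving each file a collision-free random name.
    names = []
    for _ in range(n):
        name = uuid.uuid4().hex
        download_one(pic_name=name)
        names.append(name)
    return names
```

Usage would be `batch_download(3000, downloads_pic)` to fetch the 3000 images used later in training.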

The following shows some of the dozens of images saved to a local directory:

8 Image preprocessing

Although today's machine-learning algorithms are quite advanced, preprocessing the images is still necessary to reduce training complexity and increase the recognition rate, making the images friendlier to machine recognition.

The processing steps for the raw material are as follows:

    1. Read the raw image material
    2. Binarize the color image into a black-and-white image
    3. Remove background noise
8.1 Binarization

The main steps are as follows:

    1. Convert the RGB color image to grayscale
    2. Convert the grayscale image to a binary image according to a set threshold

    image = Image.open(img_path)
    imgry = image.convert('L')  # convert to grayscale
    table = get_bin_table()
    out = imgry.point(table, '1')
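The `get_bin_table` helper used above is not shown in the article. A minimal sketch, assuming a fixed grayscale threshold (the value 140 is an assumption of mine, not the article's):

```python
def get_bin_table(threshold=140):
    # Build a 256-entry lookup table for Image.point: grayscale values
    # below the threshold map to 0 (black), the rest to 1 (white).
    # The threshold of 140 is an assumption; the article gives no value.
    table = []
    for i in range(256):
        table.append(0 if i < threshold else 1)
    return table
```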

PIL's conversion yields a binary image: 0 for black, 1 for white. After binarization, the pixels of the image 6937 (with its noise points) print as follows:

1111000111111000111111100001111100000011111011101111011101111101111011110011011110011100111101111010110110101011011101111 1011111111101101011111101011111111011111101000111110111001111110011111111101111110011101111100000111111100101111101111111 0111000111111110101101011011111101111111011110111111111011110111101111110111111101111011110111001111011110111111011100111 0000111111000011101100001110111011111

If you squint, or step back from the screen, you can vaguely make out the skeleton of 6937.

8.2 Removing noise points

After converting to a binary image, the noise needs to be cleared. The material chosen for this article is simple: most of the noise is of the simplest kind, isolated outlier points, so a great deal of it can be removed just by detecting those outliers.

For removing more complex noise, and even interference lines and color blocks, there is a relatively mature algorithm, flood fill; those interested can study it further later.

To keep things simple, this article solves the problem in a basic way:

    • For each black point, count the black points inside the surrounding 3x3 box.
    • If the black count (including the point itself) is no more than 2, the point is an outlier; collect all such outliers.
    • Remove all the outliers in one batch.

The specific algorithm is introduced in detail below.

Divide all pixels into three main categories:

    • Corner points (A)
    • Non-corner boundary points (B)
    • Interior points (C)

The three kinds of points are shown below:

where:
    • An A-type point counts its 3 adjacent points (as shown in the red box)
    • A B-type point counts its 5 adjacent points (as shown in the red box)
    • A C-type point counts its 8 adjacent points (as shown in the red box)

Of course, because the reference point sits in a different direction of its counting area in each case, the A and B points are further subdivided:

    • A-type points are subdivided into: top-left, bottom-left, top-right, bottom-right
    • B-type points are subdivided into: top, bottom, left, right
    • C-type points are not subdivided

These subdivisions then become the guide for the coordinate arithmetic that follows.

The Python implementation of the main algorithm is as follows:

    def sum_9_region(img, x, y):
        """Count the black points in the neighborhood box centered on the
        current point.
        :param x:
        :param y:
        :return:
        """
        # TODO: validate x and y against the image width and height
        cur_pixel = img.getpixel((x, y))  # value of the current pixel
        width = img.width
        height = img.height

        if cur_pixel == 1:  # the current point is white, skip its neighborhood
            return 0

        if y == 0:  # first row
            if x == 0:  # top-left vertex, 4-neighborhood
                # 3 points adjacent to the center point
                total = cur_pixel \
                        + img.getpixel((x, y + 1)) \
                        + img.getpixel((x + 1, y)) \
                        + img.getpixel((x + 1, y + 1))
                return 4 - total
            elif x == width - 1:  # top-right vertex
                total = cur_pixel \
                        + img.getpixel((x, y + 1)) \
                        + img.getpixel((x - 1, y)) \
                        + img.getpixel((x - 1, y + 1))
                return 4 - total
            else:  # top non-vertex, 6-neighborhood
                total = img.getpixel((x - 1, y)) \
                        + img.getpixel((x - 1, y + 1)) \
                        + cur_pixel \
                        + img.getpixel((x, y + 1)) \
                        + img.getpixel((x + 1, y)) \
                        + img.getpixel((x + 1, y + 1))
                return 6 - total
        elif y == height - 1:  # bottom row
            if x == 0:  # bottom-left vertex
                # 3 points adjacent to the center point
                total = cur_pixel \
                        + img.getpixel((x + 1, y)) \
                        + img.getpixel((x + 1, y - 1)) \
                        + img.getpixel((x, y - 1))
                return 4 - total
            elif x == width - 1:  # bottom-right vertex
                total = cur_pixel \
                        + img.getpixel((x, y - 1)) \
                        + img.getpixel((x - 1, y)) \
                        + img.getpixel((x - 1, y - 1))
                return 4 - total
            else:  # bottom non-vertex, 6-neighborhood
                total = cur_pixel \
                        + img.getpixel((x - 1, y)) \
                        + img.getpixel((x + 1, y)) \
                        + img.getpixel((x, y - 1)) \
                        + img.getpixel((x - 1, y - 1)) \
                        + img.getpixel((x + 1, y - 1))
                return 6 - total
        else:  # y is not on a boundary
            if x == 0:  # left non-vertex, 6-neighborhood
                total = img.getpixel((x, y - 1)) \
                        + cur_pixel \
                        + img.getpixel((x, y + 1)) \
                        + img.getpixel((x + 1, y - 1)) \
                        + img.getpixel((x + 1, y)) \
                        + img.getpixel((x + 1, y + 1))
                return 6 - total
            elif x == width - 1:  # right non-vertex, 6-neighborhood
                total = img.getpixel((x, y - 1)) \
                        + cur_pixel \
                        + img.getpixel((x, y + 1)) \
                        + img.getpixel((x - 1, y - 1)) \
                        + img.getpixel((x - 1, y)) \
                        + img.getpixel((x - 1, y + 1))
                return 6 - total
            else:  # interior point, full 9-neighborhood
                total = img.getpixel((x - 1, y - 1)) \
                        + img.getpixel((x - 1, y)) \
                        + img.getpixel((x - 1, y + 1)) \
                        + img.getpixel((x, y - 1)) \
                        + cur_pixel \
                        + img.getpixel((x, y + 1)) \
                        + img.getpixel((x + 1, y - 1)) \
                        + img.getpixel((x + 1, y)) \
                        + img.getpixel((x + 1, y + 1))
                return 9 - total

Tip: this part really tests one's care and patience. The workload here is considerable; it took half a night to complete.

Compute the number of surrounding black points for every pixel (note: after PIL's conversion, a black pixel has value 0); the coordinates whose count is 1 or 2 are the outliers. This judgment is not perfectly accurate, but it basically satisfies the requirements of this article.

The preprocessed picture looks like this:

Compared with the original image at the beginning of the article, the outliers have been removed and a relatively clean CAPTCHA image has been generated.
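The batch removal itself is not shown in the article. A self-contained sketch of the whole denoising pass, using a border-clipped loop in place of the explicit corner/edge/interior case analysis above (`count_black_in_box` and `remove_noise` are my hypothetical names; the image object only needs `width`, `height`, `getpixel`, and `putpixel`):

```python
def count_black_in_box(img, x, y):
    # Count black (0) pixels in the 3x3 box around (x, y), including the
    # point itself; the box is clipped at the image borders, which folds
    # the A/B/C corner/edge/interior cases into one loop.
    cnt = 0
    for nx in range(max(0, x - 1), min(img.width, x + 2)):
        for ny in range(max(0, y - 1), min(img.height, y + 2)):
            if img.getpixel((nx, ny)) == 0:
                cnt += 1
    return cnt

def remove_noise(img):
    # Hypothetical driver: collect every black point whose box count is
    # 1 or 2 (isolated or nearly so), then whiten them in one batch so
    # earlier removals do not skew later counts.
    noise = [(x, y)
             for x in range(img.width)
             for y in range(img.height)
             if img.getpixel((x, y)) == 0
             and count_black_in_box(img, x, y) <= 2]
    for x, y in noise:
        img.putpixel((x, y), 1)  # 1 = white in the binarized image
    return img
```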

9 Picture Character Cutting

Since a character CAPTCHA image is essentially a concatenation of a series of single-character images, we can simplify the object of study by decomposing the image to the atomic level: images that each contain only a single character.

Our research object thus changes from "a combined object of n-character strings" to "10 kinds of Arabic numerals", which greatly simplifies and shrinks the set of objects to handle.

9.1 Segmentation algorithm

Real-world character CAPTCHAs come with strange and varied distortions and variants, so there is no fully general character-segmentation algorithm; the developer has to study the character images to be recognized carefully and design one for them.

Of course, the subject chosen for this article is as simple as possible, which simplifies this step; it is introduced gradually below.

Open the CAPTCHA image with an image editor (Photoshop or similar), zoom in to the pixel level, and observe its parameter features:

The following parameters can be obtained:

    • The entire image is 40*10 pixels
    • A single character is 6*10 pixels
    • The leftmost and rightmost characters are 2 pixels from the left and right edges
    • Vertically, the characters touch the top and bottom edges (0 pixels of margin)

This makes it easy to locate the pixel region each character occupies in the whole image, and the image can then be split with the following code:

    def get_crop_imgs(img):
        """Cut the picture according to its layout; this depends on the
        specific CAPTCHA. See the schematic.
        :param img:
        :return:
        """
        child_img_list = []
        for i in range(4):
            x = 2 + i * (6 + 4)  # see the schematic
            y = 0
            child_img = img.crop((x, y, x + 6, y + 10))
            child_img_list.append(child_img)
        return child_img_list

This yields the atomic-level image elements produced by the cut:

9.2 Summary of Contents

Based on this section's discussion, you will have realized that if a CAPTCHA's interference (distortion, noise, interference color blocks, interference lines...) is not strong enough, the following two conclusions hold:

    • A 4-character CAPTCHA is not much harder than a 40,000-character one

    • A pure-digit CAPTCHA is not much harder than one mixing digits and letters
      • Pure digits: the number of classes is 10

      • Pure letters
        • Case-insensitive: the number of classes is 26
        • Case-sensitive: the number of classes is 52
      • Digits combined with case-sensitive letters: the number of classes is 62

Raising the difficulty this way does not increase the complexity by an order of magnitude or geometrically; it only increases the computation linearly, by a finite factor.

10 Size Normalization

The research objects selected in this article already have a uniform size, 6*10, so this section requires no extra processing. For CAPTCHAs with distortion and scaling, however, normalization would be one of the difficult parts of the image processing.

11 Model Training Steps

The previous stages completed the processing and segmentation of single images. The training of the recognition model starts next.

The entire training process is as follows:

    1. Prepare a large amount of material, preprocessed and cut to the atomic level
    2. Classify the material images manually, i.e. label them
    3. Define the recognition features of a single image
    4. Train an SVM model on the labeled feature files to obtain a model file
12 Material Preparation

For the training phase, this article downloaded 3000 more CAPTCHA images of the same 4-digit pattern. Processing and cutting these 3000 images yielded 12000 atomic images.

From these 12000 images, some strongly interfering material that would hurt training and recognition was removed. After cutting, the material looks like this:

13 Material Labeling

With the recognition approach used in this article, the machine starts out with no concept of digits whatsoever. So the material must be labeled manually, telling the machine which image content is a 1, and so on.

This process is called "labeling".

The specific way to label is:

    1. Create a directory for each digit 0~9, with the directory name being the digit itself (which serves as the label)

    2. Manually determine the content of each image and drag it into the matching digit directory

    3. Put about 100 pieces of material in each directory

      In general, the more material is labeled, the better the discrimination and predictive power of the trained model. For example, in this article, when only about 10 pieces were labeled, the recognition rate on new test images was basically zero; at about 100 pieces, a recognition rate of nearly 100% was achieved.
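With this directory layout, the labels can later be read straight from the directory names. A sketch of such a loader (`iter_labeled_samples` is a hypothetical helper of mine, not the article's code):

```python
import os

def iter_labeled_samples(root):
    # Hypothetical loader for the labeling scheme above: each
    # subdirectory of `root` is named after the digit it contains,
    # so the directory name doubles as the label.
    for name in sorted(os.listdir(root)):
        sub = os.path.join(root, name)
        if not os.path.isdir(sub) or not name.isdigit():
            continue
        for fname in sorted(os.listdir(sub)):
            yield int(name), os.path.join(sub, fname)
```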

14 Feature Selection

A single character image after cutting, magnified to the pixel level, looks like this:

From a macro point of view, different digit images are in essence black fills of particular pixel positions according to certain rules, so the features ultimately revolve around the pixels.

The character image is 6 pixels wide and 10 pixels high, so in principle one could crudely define 60 features: the values of the 60 pixels. However, such a high dimensionality obviously leads to excessive computation, and it can be reduced appropriately.

Referring to the relevant literature [2], another simple and crude feature definition is used instead:

    1. The number of black pixels in each row, giving 10 features
    2. The number of black pixels in each column, giving 6 features

In the end we get a 16-dimensional feature vector; the implementation code is as follows:

    def get_feature(img):
        """Get the feature vector of the given image.
        1. Count black pixels per row (height 10, giving 10 dimensions),
           then per column (width 6, giving 6 more), 16 dimensions total.
        :param img:
        :return: a list of length 16
        """
        width, height = img.size
        pixel_cnt_list = []
        for y in range(height):
            pix_cnt_x = 0
            for x in range(width):
                if img.getpixel((x, y)) == 0:  # black point
                    pix_cnt_x += 1
            pixel_cnt_list.append(pix_cnt_x)
        for x in range(width):
            pix_cnt_y = 0
            for y in range(height):
                if img.getpixel((x, y)) == 0:  # black point
                    pix_cnt_y += 1
            pixel_cnt_list.append(pix_cnt_y)
        return pixel_cnt_list
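As a self-contained check of the row/column counting, the same feature definition can be stated on a nested list of 0 (black) / 1 (white) values instead of a PIL image (`get_feature_from_matrix` is my hypothetical variant, not the article's function):

```python
def get_feature_from_matrix(pixels):
    # pixels: a list of rows, each a list of 0 (black) / 1 (white).
    # Returns black counts per row followed by black counts per column;
    # for a 6x10 glyph that is 10 + 6 = 16 dimensions.
    row_counts = [row.count(0) for row in pixels]
    col_counts = [col.count(0) for col in zip(*pixels)]
    return row_counts + col_counts
```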

The image material is then run through feature extraction, and a vector file with feature values and label values is generated in the format specified by LIBSVM. An example of the content:


The description is as follows:

    1. The first column is the label column, i.e. the value a person labeled this image with; other records carry the other digit labels 0~9
    2. It is followed by the 16 feature values, each preceded by an index number and followed by the value
    3. If there are 1000 training images, 1000 rows of records are generated

Students interested in this file format can search the LIBSVM website for more information.

15 Model Training

This stage is relatively simple because this article directly uses the open-source LIBSVM scheme; it is purely application-level work. You only need to input the feature file, and the model file is output.

Plenty of relevant material (much of it in Chinese) can be found by searching [1].

The main code is as follows:

    def train_svm_model():
        """Train and generate the model file.
        :return:
        """
        y, x = svm_read_problem(svm_root + '/train_pix_feature_xy.txt')
        model = svm_train(y, x)
        svm_save_model(model_path, model)

Note: the resulting model file is named svm_model_file.

16 Model Testing

After training generates the model, it needs to be tested with new labeled images outside the training set, used as a test set.

The test experiments in this article are as follows:

    • The model was tested with a set of 21 images, all labeled 8
    • The labeled feature file generated for the test images is named last_test_pix_xy_new.txt

In the early stage, with only a dozen or so training images per character, the training samples themselves were well separated, but the model had basically no discriminative power on the new test set; recognition was basically wrong. Things improved as the number of training samples labeled 8 gradually increased:

    1. At about 60 samples, the accuracy was about 80%
    2. At 185 samples, the accuracy was basically 100%

Using this model-hardening method for the digit 8, we continued to strengthen the model training for the other digits in 0~9, and finally achieved a recognition rate of nearly 100% on all digit images. In this example, the training set for each digit is around 100 images, which is enough to reach a 100% recognition rate.

The model test code is as follows:

    def svm_model_test():
        """Test the model using the test set.
        :return:
        """
        yt, xt = svm_read_problem(svm_root + '/last_test_pix_xy_new.txt')
        model = svm_load_model(model_path)
        p_label, p_acc, p_val = svm_predict(yt, xt, model)  # p_label holds the predicted labels

        cnt = 0
        for item in p_label:
            print('%d' % item, end=',')
            cnt += 1
            if cnt % 8 == 0:
                print('')

At this point, the CAPTCHA recognition work proper is complete.

17 Complete Identification process

In the previous sections, the toolset for CAPTCHA recognition was prepared. To continuously recognize the dynamic CAPTCHA on the target site, a little more glue code is needed to organize the process into a stable, black-box CAPTCHA recognition interface.

The main steps are as follows:

    1. Pass in a group of CAPTCHA images
    2. Preprocess the images: denoising, binarization, etc.
    3. Cut each image into 4 ordered single-character images
    4. Recognize the 4 images separately using the model file
    5. Stitch the recognition results together
    6. Return the recognition result
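The steps above can be sketched as one glue function. `recognize_captcha` and its parameter names are mine, not the article's; each stage is injected as a function so the sketch stays independent of PIL and LIBSVM:

```python
def recognize_captcha(img, preprocess, cut, classify):
    # Hypothetical glue mirroring the six steps above: preprocess the
    # image, cut it into ordered character images, classify each one,
    # then stitch the predicted digits into the result string.
    clean = preprocess(img)                 # denoise + binarize
    chars = cut(clean)                      # 4 ordered character images
    digits = [classify(c) for c in chars]   # per-character prediction
    return ''.join(str(d) for d in digits)  # stitch and return
```

In the article's setting, `preprocess` would wrap the binarization and denoising code, `cut` would be `get_crop_imgs`, and `classify` would call the trained LIBSVM model on `get_feature` output.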

Then the HTTP interface of a site's CAPTCHA was requested, the CAPTCHA image obtained and recognized, and the image saved with the recognized result as its file name. The effect is as follows:

Apparently, the recognition rate of almost 100% has been reached.

Without any optimization of the algorithm, this program running on a mainstream PC configuration can recognize one CAPTCHA in about 200ms (the overwhelming share of that time is spent blocked on the network request).

18 Efficiency optimization

Better efficiency can be achieved at a later stage through an optimized approach.

Software-level optimizations:

    1. Make the network requests for image resources asynchronous and non-blocking
    2. Run multiple processes in parallel on a multi-core CPU
    3. Carefully select and experiment with the image features to reduce the dimensionality
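Optimization idea 1 can be sketched with a thread pool, which overlaps the network-bound downloads (`fetch_all` is a hypothetical helper of mine, not the article's code):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, workers=8):
    # Overlap network-bound captcha downloads with a thread pool
    # instead of fetching one image at a time; results keep the
    # input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

Here `fetch` would be a function that downloads and recognizes one URL.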

The expectation is to reach 10 to 100 CAPTCHAs recognized per second.

Hardware-level optimizations:

    1. Crudely increase CPU performance
    2. Crudely add more machines

With roughly 10 four-core machines issuing requests simultaneously, a conservative estimate puts the efficiency at 10,000 CAPTCHAs recognized per second.

19 Internet Security Alert

What are the security implications if the verification code is identified?

Having developed a sense of the recognition efficiency in the last section, you will see these scenarios from a new perspective:

    • On the 12306 train-ticket site, 500 tickets for a certain train go on sale at 8:00 during the Spring Festival rush and are all gone within 1s; ordinary people with a real need cannot get a ticket, yet the scalpers hold plenty
    • On some mobile-phone vendor's site, a flash sale starts at 10:00; countless users who waited a long time come away empty-handed, while the same scalpers hold plenty of stock

Leaving aside any behind-the-scenes dealings, even with every procedure fully legal, as long as the CAPTCHA can be recognized by technical means, then with a computer's computational power and automation it is entirely feasible, technically, to sweep large amounts of a resource into the hands of a few scalpers.

So in the future, when we fail to get tickets at the worst times, you may keep scolding 12306; but do not scold them for running a racket. Scold them because their technology is not good enough.

Once a CAPTCHA is broken, the system is effectively one with no CAPTCHA at all; if there is no other risk-control strategy, the system is completely at the mercy of the scripting programs.

For details, please refer to:

The small security vulnerability of Web application system and the corresponding attack method

http://www.cnblogs.com/beer/p/4814587.html

From the above example, you can see:

    1. Some web application systems have no CAPTCHA at all; they can only be trampled
    2. Even if a web application system has a CAPTCHA, if it is not hard enough, it can still only be trampled

So although the CAPTCHA is a small piece, its security cannot be neglected.

20 Positive Application Scenarios

In fact, this article describes a simple OCR technology implementation. The same approach has positive, constructive applications as well:

    • Bank card number recognition
    • ID card number recognition
    • License plate number recognition

These scenarios have features that are similar to the material studied in this article:

    1. A single font
    2. Characters are simple combinations of digits or letters
    3. The character arrangement is standardized and uniform

So as long as the raw-data acquisition is reasonably standardized, recognition should not be difficult.

21 Summary

This concludes the complete process and Python implementation of character-based CAPTCHA image recognition.
