Sample Code for Python Verification Code (CAPTCHA) Recognition

Source: Internet
Author: User

One problem that crawlers cannot avoid is the verification code. Currently, there are roughly four types of verification codes:

  • Image
  • Slider
  • Click
  • Audio

Today, let's look at the image type. These verification codes are mostly combinations of digits and letters, and Chinese characters are also used in China. On top of that, noise, interference lines, deformation, overlapping characters, and varying font colors are added to make recognition harder.

Accordingly, verification code recognition can be divided into the following steps:

  • Grayscale Processing
  • Increase contrast (optional)
  • Binarization
  • Noise Reduction
  • Skew correction and character segmentation
  • Build a training set
  • Recognition
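
The optional contrast step is not demonstrated in this article. As a minimal sketch, assuming a simple linear min-max stretch on a grayscale pixel grid, it could look like the following. The function name and the nested-list representation are hypothetical stand-ins for real image data; with Pillow, the `ImageEnhance.Contrast` class is one ready-made alternative.

```python
def stretch_contrast(gray):
    """Linearly rescale a grayscale grid so its values span 0..255.

    `gray` is a list of rows, each a list of ints in 0..255
    (a hypothetical stand-in for image pixel data).
    """
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:
        # Flat image: nothing to stretch, return an unchanged copy.
        return [row[:] for row in gray]
    # Integer arithmetic keeps the result exact and inside 0..255.
    return [[(v - lo) * 255 // (hi - lo) for v in row] for row in gray]


# Dull pixels in the 100..150 range are spread over the full 0..255 range.
sample = [[100, 125], [150, 100]]
stretched = stretch_contrast(sample)
print(stretched)
```

A wider spread between dark and light pixels makes the later choice of a binarization threshold less sensitive.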

Because this is an experiment, the verification codes used in this article are generated by a program rather than downloaded in batches from real websites. The advantage is that a large dataset with known correct answers is available.

When you need data from a real environment, you can use a CAPTCHA-labeling (manual solving) platform to build a dataset for training.

Here I use the Claptcha library to generate the verification code. Of course, the Captcha library is also a good choice.

To generate the simplest digits-only verification code without interference, you must first modify the _drawLine function at line 285 of claptcha.py. I simply made this function return None, and then generated the verification code:

    from claptcha import Claptcha

    c = Claptcha("8069", "/usr/share/fonts/truetype/freefont/FreeMono.ttf")
    t, _ = c.write('1.png')

Note that the font path above is for Ubuntu. You can also download other fonts from the Internet. The verification code is as follows:

 

As you can see, the verification code is deformed. For the simplest verification codes like this one, tesserocr, a Python wrapper for Google's open-source Tesseract OCR engine, can be used for recognition.

First install:

    apt-get install tesseract-ocr libtesseract-dev libleptonica-dev
    pip install tesserocr

Then start to identify:

    >>> from PIL import Image
    >>> import tesserocr
    >>> p1 = Image.open('1.png')
    >>> tesserocr.image_to_text(p1)
    '8069\n\n'

As you can see, for this simple verification code the recognition rate is already very high without any preprocessing. If you are interested, you can test with more data. I will not go into that here.
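
Testing with more data comes down to comparing recognized text against known labels. The following is a minimal sketch of such a measurement, not the author's code: `recognition_rate` and its parameters are hypothetical names, and a dictionary lookup stands in for the OCR call so that the example runs without Tesseract installed. In practice the recognizer would be something like `tesserocr.image_to_text` applied to opened images.

```python
def recognition_rate(samples, recognize, case_sensitive=True):
    """Return the fraction of samples whose recognized text matches the label.

    samples   -- iterable of (image, expected_text) pairs
    recognize -- callable taking an image and returning OCR text
    """
    total = 0
    correct = 0
    for image, expected in samples:
        # OCR output such as '8069\n\n' carries whitespace; remove all of it.
        text = "".join(recognize(image).split())
        if not case_sensitive:
            text, expected = text.lower(), expected.lower()
        total += 1
        if text == expected:
            correct += 1
    return correct / total if total else 0.0


# Stand-in recognizer keyed by file name (hypothetical data).
fake_ocr = {"a.png": "8069\n\n", "b.png": "1234\n\n", "c.png": "AMOOZW\n\n"}
samples = [("a.png", "8069"), ("b.png", "1284"), ("c.png", "A4oO0zZ2")]
rate = recognition_rate(samples, fake_ocr.get)
print(f"{rate:.2f}")
```

The `case_sensitive` flag is there because many sites accept answers regardless of letter case.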

Next, add noise to the background of the verification code:

    c = Claptcha("8069", "/usr/share/fonts/truetype/freefont/FreeMono.ttf", noise=0.4)
    t, _ = c.write('2.png')

The verification code is as follows:

 

Recognition:

    >>> p2 = Image.open('2.png')
    >>> tesserocr.image_to_text(p2)
    '8069\n\n'

The result is acceptable. Next, we generate a combination of letters and digits:

    c2 = Claptcha("A4oO0zZ2", "/usr/share/fonts/truetype/freefont/FreeMono.ttf")
    t, _ = c2.write('3.png')

The verification code is as follows:

 

The 3rd character is a lowercase letter o, the 4th is an uppercase letter O, the 5th is the digit 0, the 6th is a lowercase letter z, the 7th is an uppercase letter Z, and the last is the digit 2. Even human eyes struggle here! In practice, however, most verification codes today are not strictly case-sensitive. Let's see how automatic recognition does:

    >>> p3 = Image.open('3.png')
    >>> tesserocr.image_to_text(p3)
    'AMOOZW\n\n'

Of course, where human eyes fail, the computer does no better. Still, tesserocr is simple and convenient for scenarios where the interference is light and the deformation is not severe. Next, restore the modified _drawLine function in claptcha.py to see what happens when an interference line is added.

 

    >>> p4 = Image.open('4.png')
    >>> tesserocr.image_to_text(p4)

Once an interference line is added, the code cannot be recognized at all. Is there any way to remove the interference line?

Although the image looks black and white, it still needs grayscale conversion. Otherwise, the object returned by load() yields an RGB tuple for each pixel rather than a single value. The process is as follows:

    def binarizing(img, threshold):
        """Grayscale and binarize the input image object."""
        img = img.convert("L")  # convert to grayscale
        pixdata = img.load()
        w, h = img.size
        # Traverse all pixels: below the threshold becomes black, the rest white
        for y in range(h):
            for x in range(w):
                if pixdata[x, y] < threshold:
                    pixdata[x, y] = 0
                else:
                    pixdata[x, y] = 255
        return img
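
As background for the convert("L") call in the function above: Pillow's documentation gives the grayscale conversion as the ITU-R 601-2 luma transform, L = R * 299/1000 + G * 587/1000 + B * 114/1000. The sketch below applies that formula to single pixels in plain Python to show why a threshold comparison needs one value per pixel. The helper name `to_gray` is hypothetical, and the integer division used here may differ from Pillow's own rounding by one level.

```python
def to_gray(rgb_pixel):
    """Collapse an (R, G, B) tuple into a single 0..255 brightness value."""
    r, g, b = rgb_pixel
    # Integer form of L = R*299/1000 + G*587/1000 + B*114/1000
    return (r * 299 + g * 587 + b * 114) // 1000


threshold = 128
for pixel in [(0, 0, 0), (255, 255, 255), (200, 30, 30)]:
    gray = to_gray(pixel)
    # A single number can be compared with the threshold; comparing the
    # RGB tuple itself with an int raises TypeError in Python 3.
    print(pixel, gray, 0 if gray < threshold else 255)
```

Note that a fairly bright red such as (200, 30, 30) maps to a dark gray, so colored characters usually end up on the black side of the threshold.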

The processed image is as follows:

 

As you can see, the image is much sharper after processing. Next, let's try to remove the interference line with the common 4-neighbor and 8-neighbor algorithms. To picture an X-neighbor algorithm, think of the nine-key keypad of a mobile phone: the pixel being examined sits at key 5, the 4-neighbor version checks the pixels above, below, left, and right of it, and the 8-neighbor version checks all eight surrounding pixels. If the number of these 4 or 8 neighbors whose value is 255 (white) exceeds a certain threshold, the pixel is judged to be noise. The threshold can be adjusted to suit the actual situation.
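
To make the difference between the two neighborhoods concrete, here is a minimal sketch on a plain nested list instead of a Pillow image. The names `NEIGHBORS_4`, `NEIGHBORS_8`, and `denoise` are hypothetical, and the 4-neighbor threshold of 3 is an assumption for this demo. Unlike the in-place Pillow function that follows, this sketch writes to a copy, so pixels whitened earlier in the scan do not influence later counts.

```python
# Offsets (dx, dy) relative to the center pixel, "key 5" on the keypad.
NEIGHBORS_4 = [(0, -1), (0, 1), (-1, 0), (1, 0)]
NEIGHBORS_8 = NEIGHBORS_4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def denoise(grid, offsets, min_white):
    """Whiten interior pixels that have more than `min_white` white neighbors.

    `grid` is a binarized image as a list of rows holding 0 (black)
    or 255 (white). Border pixels are left untouched.
    """
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            white = sum(1 for dx, dy in offsets if grid[y + dy][x + dx] == 255)
            if white > min_white:
                out[y][x] = 255
    return out


W, B = 255, 0
grid = [
    [W, W, W, W, W],
    [W, W, B, W, W],  # isolated black pixel: noise
    [W, W, W, W, W],
    [B, B, B, B, B],  # two-pixel-thick stroke: real data
    [B, B, B, B, B],
    [W, W, W, W, W],
    [W, W, W, W, W],
]
cleaned8 = denoise(grid, NEIGHBORS_8, 4)
cleaned4 = denoise(grid, NEIGHBORS_4, 3)
```

In this small grid both variants remove the isolated pixel and keep the thick stroke. The two differ mainly on diagonal structures, which only the 8-neighbor version takes into account.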

    def depoint(img):
        """Noise reduction of the image after binarization."""
        pixdata = img.load()
        w, h = img.size
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                count = 0
                if pixdata[x, y - 1] > 245:  # above
                    count = count + 1
                if pixdata[x, y + 1] > 245:  # below
                    count = count + 1
                if pixdata[x - 1, y] > 245:  # left
                    count = count + 1
                if pixdata[x + 1, y] > 245:  # right
                    count = count + 1
                if pixdata[x - 1, y - 1] > 245:  # top left
                    count = count + 1
                if pixdata[x - 1, y + 1] > 245:  # bottom left
                    count = count + 1
                if pixdata[x + 1, y - 1] > 245:  # top right
                    count = count + 1
                if pixdata[x + 1, y + 1] > 245:  # bottom right
                    count = count + 1
                if count > 4:
                    pixdata[x, y] = 255
        return img

The processed image is as follows:

 

It looks like... it did nothing?! That is indeed the case, because the interference line in this example is as wide as the strokes of the digits. The method works better when the interference lines and the characters have different stroke widths, for example in a verification code generated by Captcha:

 

From left to right: the source image, the binarized image, and the image after interference line removal. The overall noise reduction effect is quite obvious. In addition, noise reduction can be applied multiple times. For example, running it again on the result above gives the following:

 

Then recognize the result:

    >>> p7 = Image.open('7.png')
    >>> tesserocr.image_to_text(p7)
    '8069 ,,\n\n'
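
The digits are correct here, but leftover specks are read as stray punctuation. When the character set of the verification code is known, one simple post-processing step is to drop everything outside it. This is a minimal sketch with a hypothetical helper name, not part of tesserocr:

```python
def clean_ocr_text(raw, alphabet):
    """Keep only characters that can occur in the verification code."""
    return "".join(ch for ch in raw if ch in alphabet)


print(clean_ocr_text("8069 ,,\n\n", "0123456789"))  # prints 8069
```

This only hides harmless extras. It cannot repair characters that were misread as other valid characters.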

In addition, the picture shows that the color of the actual characters is clearly different from that of the noise and interference lines. Based on this, all the noise could be removed directly by color. I will not go into that here.
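
The color-based idea can be sketched as follows, under the assumption that the text color is already known; in practice it might be estimated, for example, from the most frequent non-background color. The function name, the colors, and the tolerance are all hypothetical, and a nested list of RGB tuples stands in for a real image.

```python
def keep_text_color(pixels, text_color, tolerance):
    """Return a binarized grid: 0 where a pixel is close to text_color, else 255.

    `pixels` is a list of rows of (R, G, B) tuples. A pixel counts as
    text when every channel is within `tolerance` of `text_color`.
    """
    def is_text(pixel):
        return all(abs(c - t) <= tolerance for c, t in zip(pixel, text_color))

    return [[0 if is_text(p) else 255 for p in row] for row in pixels]


# Hypothetical colors: dark blue characters, gray interference line.
TEXT = (20, 20, 120)
LINE = (128, 128, 128)
BG = (255, 255, 255)
image = [
    [BG, TEXT, BG],
    [LINE, (25, 18, 125), LINE],
    [BG, TEXT, BG],
]
mask = keep_text_color(image, TEXT, tolerance=10)
print(mask)
```

Because this filter works on the original colors, it has to run before grayscale conversion, which discards the color difference.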

This first article covers grayscale processing, binarization, and noise reduction for images, combined with tesserocr to recognize simple verification codes. The remaining steps will be shared in the next article.

That is all for this article. I hope it is helpful for your learning.
