Operating system: Windows 10 1709 x64
Python version: 3.6.5
Dependent modules: PIL, tesserocr
Note that before you install the CAPTCHA recognition module on Windows by running pip3 install tesserocr in PowerShell, you first need to install the Tesseract executable. Tesseract is an open source OCR (Optical Character Recognition) engine originally developed at HP Labs and now maintained by Google. Compared with Microsoft Office Document Imaging (MODI), its library can be trained continually, so its ability to convert images to text keeps improving.
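Before running pip3 install tesserocr, it can help to confirm that the Tesseract executable is actually reachable. The following is only a minimal sketch: it assumes the executable is named tesseract and is on PATH, and the helper name find_tesseract is hypothetical, not part of tesserocr.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    # Assumption: the installer was expected to add Tesseract to PATH.
    print("tesseract not found on PATH; install it before installing tesserocr")
else:
    print("tesseract found at", path)
```

If the check reports that nothing was found even though Tesseract is installed, the installation directory is probably missing from the PATH environment variable.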
For example, we are often asked to enter a CAPTCHA of this type: a few simple letters on a background that contains many stray lines, as shown in:
We save the CAPTCHA image locally, in the same directory as the code, and name it Test.png.
The following code example recognizes it directly with the corresponding module:
import tesserocr
from PIL import Image

image = Image.open('Test.png')
image.show()  # opens the picture for previewing
print(tesserocr.image_to_text(image))
The original image is small. In the rare cases where it cannot be recognized properly, you can use the image processing library PIL to enlarge the picture and save it. In this example, running the code above directly gives the result "Vhihi". Even when a CAPTCHA looks fairly clear to the naked eye, the recognition rate can be very low if the picture is passed to tesserocr without any processing.
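Enlarging the picture with PIL could look roughly like the sketch below. The helper name enlarge and the default factor of 2 are assumptions made for illustration. The function only relies on the size attribute and the resize() method of a PIL image.

```python
def enlarge(image, factor=2):
    """Return a copy of a PIL image scaled up by an integer factor."""
    width, height = image.size
    return image.resize((width * factor, height * factor))


# Usage (assumes Pillow is installed and Test.png exists; Test_big.png is a
# hypothetical output name):
# from PIL import Image
# big = enlarge(Image.open('Test.png'), factor=3)
# big.save('Test_big.png')
```

Whether enlarging helps, and by how much, depends on the picture, so it is worth comparing the recognition result before and after.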
In general, we also need to do some additional image processing, such as converting to grayscale and binarizing.
The image can be converted to grayscale by calling the convert() method of the Image object with the parameter 'L':
image = image.convert('L')
image.show()
Passing in '1' binarizes the image, as follows:
image = image.convert('1')
image.show()
Of course, we need to choose the threshold according to the actual picture. For example, we set the threshold to 80, convert to grayscale first, and then binarize. The code is as follows:
import tesserocr
from PIL import Image

image = Image.open('Test.png')
image = image.convert('L')
threshold = 80
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
image = image.point(table, '1')
image.show()
print(tesserocr.image_to_text(image))
The processed image looks like this:
Although the image has been converted to grayscale and binarized, and most of the stray lines have been filtered out, key pixels of the characters are badly lost, so the recognition result is naturally unsatisfactory: "VH."
At this point, based on the actual picture, we manually adjust the preset threshold in the program to 130 and observe again: this time the conversion works much better. Looking at the recognition result again, we get "Vhru", which matches what the naked eye reads and meets the requirement.
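Because the right threshold has to be found by trial, it can be convenient to compare several values in one run. The sketch below is one way to do that. The helper names threshold_table and recognize_with_thresholds are hypothetical, and the recognition function is passed in as a parameter so that tesserocr.image_to_text can be supplied by the caller.

```python
def threshold_table(threshold):
    """Lookup table for Image.point(): gray levels below threshold map to 0, all others to 1."""
    return [0 if level < threshold else 1 for level in range(256)]


def recognize_with_thresholds(gray_image, thresholds, recognize):
    """Binarize gray_image at each threshold and return {threshold: recognized text}."""
    results = {}
    for threshold in thresholds:
        binary = gray_image.point(threshold_table(threshold), '1')
        results[threshold] = recognize(binary).strip()
    return results


# Usage (assumes Pillow and tesserocr are installed and Test.png exists):
# import tesserocr
# from PIL import Image
# gray = Image.open('Test.png').convert('L')
# print(recognize_with_thresholds(gray, [80, 130], tesserocr.image_to_text))
```

Printing the results side by side makes it easier to see which threshold keeps the characters intact while still removing the stray lines.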
As you can see, CAPTCHA recognition requires more than a good recognition module. We also need PIL (the image processing module) to preprocess the picture, and choosing the preprocessing threshold and other settings takes some skill: different parameter settings can strongly affect the final recognition rate.
Many real-world website CAPTCHAs are far more complex than this example, especially the one on the 12306 ticket purchase site. Behavior-based CAPTCHAs have also begun to develop rapidly, and some are extremely hard to make out even with the naked eye. This requires us to keep improving our CAPTCHA recognition techniques in order to get past the gradually upgraded anti-crawler mechanisms of websites.
Python tesserocr module usage example