Using open source programs (ImageMagick + Tesseract-OCR) to recognize image verification codes

Source: Internet
Author: User

--------------------------------------------------low-key split-line---------------------------------------------------

Linux has two important programming principles, arguably design philosophies: the modularity principle (simple parts connected by simple interfaces) and the composition principle (design programs so that they can be combined with others). Linux has countless small programs, each small in size and simple in function, but combined in the right way they can do almost anything. One of the great advantages of the command line is how easily it lets you combine programs. Imagine having to make the same replacement across 10,000 text files: with a graphical tool such as Word, hardly anyone could get through it.
Today we are going to use two pieces of open source software: ImageMagick and Tesseract-OCR.

--------------------------------------------------ImageMagick---------------------------------------------------

First, a brief introduction (translated from the English on the official website):

ImageMagick is a software suite for creating, editing, and composing bitmap images. It can read, write, and convert images in more than a hundred formats.

In addition, ImageMagick has interfaces for mainstream programming languages, including G2F (Ada), MagickCore (C), MagickWand (C), ChMagick (Ch), ImageMagickObject (COM+), Magick++ (C++), JMagick (Java), L-Magick (Lisp), NMagick (Neko/haXe), MagickNet (.NET), PascalMagick (Pascal), PerlMagick (Perl), MagickWand for PHP (PHP), IMagick (PHP), PythonMagick (Python), RMagick (Ruby), and TclMagick (Tcl/Tk). Of course, you can also combine it with other programs on the command line.

ImageMagick is open source software distributed in two forms: ready-to-run binaries and source code. You can freely use, copy, modify, and distribute it in both public and private programs. It is released under an Apache 2.0-style license.

Second, ImageMagick's official website appears to be blocked by the Great Firewall (even though it is a purely technical site!), so we cannot get the program from it directly and have to use a domestic download mirror instead.

Finally, installation. There is not much to say: clicking Next all the way through works, and you can change the installation directory if you like. Rest assured, no Baidu toolbar is bundled.

--------------------------------------------------tesseract-OCR---------------------------------------------------

First, an introduction to Tesseract-OCR, again translated from the English on the official website:

Tesseract-OCR is an OCR (optical character recognition) engine originally developed at HP Labs between 1985 and 1995. It is now maintained by Google.

The Tesseract-OCR engine was one of the top three engines in the 1995 UNLV accuracy test. It changed little between 1995 and 2006, but it is probably still one of the most accurate open source OCR engines available. It reads binary, grayscale, or color images and outputs text (the original English says "the source code" does this, which is presumably a slip). A built-in TIFF reader lets it read uncompressed TIFF images; reading compressed TIFF images requires the additional libtiff library.

Since the official site is not blocked, the files can be downloaded from it directly. We need tesseract-2.04.exe.tar.gz and tesseract-2.00.eng.tar.gz. tesseract-2.04.exe.tar.gz is the main program. tesseract-2.00.eng.tar.gz is the language data needed to recognize English text and digits, somewhat like the virus definitions of antivirus software. Tesseract-OCR can also recognize Dutch, Spanish, German, and other languages, but we do not need those here.

Finally, there is nothing to install; unpacking is enough. Extract tesseract-2.04.exe.tar.gz first, then extract the contents of tesseract-2.00.eng.tar.gz into the Tesseract root directory. If the language data ends up in the wrong place, running Tesseract fails with: Unable to load unicharset file ./tessdata/eng.unicharset.
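A quick way to catch the misplaced-data problem before running Tesseract is to check for the file named in the error message. The following is a minimal Python sketch, not part of the original article; the function names are hypothetical, and the only assumption about the layout is the tessdata/eng.unicharset path quoted in the error above.

```python
from pathlib import Path


def unicharset_path(tesseract_root, lang="eng"):
    """Path where Tesseract looks for the language's unicharset file."""
    return Path(tesseract_root) / "tessdata" / f"{lang}.unicharset"


def language_data_installed(tesseract_root, lang="eng"):
    """Return True if the language data was extracted to the expected place."""
    return unicharset_path(tesseract_root, lang).is_file()
```

Calling language_data_installed with the directory that contains tesseract.exe should return True once the language archive has been extracted correctly.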

---------------------------------------------------Verification Code Identification----------------------------------------------------

How the two programs relate:

Tesseract is image-illiterate: by default it only understands uncompressed TIFF images. If you feed it a picture in another format directly, it reports errors like the following:
Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:code.jpg
IMAGE::read_header:Error:Can't read this image type:code.jpg
tesseract:Error:Read of file failed:code.jpg

So we need ImageMagick to convert the image format first; ImageMagick has many other uses as well.

Assuming the verification code image to be recognized is code.jpg, only two steps are needed:

Command line: convert.exe -compress none -depth 8 -alpha off code.jpg code.tif
Command line: tesseract.exe code.tif result

That's it. The result is in the text file result.txt; Tesseract automatically appends the .txt suffix to result.
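The two steps above can also be driven from a script. The following is a minimal Python sketch, assuming the executables are reachable on PATH as convert and tesseract (on Windows you may need the full path to ImageMagick's convert.exe, since a system tool has the same name). The function names are hypothetical and not part of either program.

```python
import subprocess
from pathlib import Path


def convert_command(source, tiff):
    """Build the ImageMagick command from the article: uncompressed, 8-bit, no alpha."""
    return ["convert", "-compress", "none", "-depth", "8",
            "-alpha", "off", str(source), str(tiff)]


def tesseract_command(tiff, output_base):
    """Build the Tesseract command; Tesseract appends .txt to output_base."""
    return ["tesseract", str(tiff), str(output_base)]


def recognize(source, workdir="."):
    """Run both steps and return the recognized text (requires both programs)."""
    workdir = Path(workdir)
    tiff = workdir / "code.tif"
    output_base = workdir / "result"
    subprocess.run(convert_command(source, tiff), check=True)
    subprocess.run(tesseract_command(tiff, output_base), check=True)
    return Path(str(output_base) + ".txt").read_text().strip()
```

With both programs installed, recognize("code.jpg") would return the text that Tesseract wrote to result.txt.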

Now a few explanations of the two commands.

convert.exe: part of the ImageMagick suite, responsible for image format conversion. Its parameters mean the following:
-compress none: do not compress the converted image. Without this option, the later Tesseract step fails with: read_tif_image:Error:Illegal image format:compression
-depth 8: set the color depth of the converted image to 8 bits, that is, 8 BPP. Without this option, the result is as follows:
Tesseract Open Source OCR Engine
check_legal_image_size:Error:Only 1,2,4,5,6,8 BPP is supported:16
Segmentation fault
-alpha off: do not include an alpha channel in the converted image. Without this option, the result is the same as above.
These options are followed by the file name of the image to convert and, last, the file name of the converted image.
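To see whether a converted file actually meets the two conditions behind these errors, you can read the relevant tags from the TIFF header. The sketch below is an illustration only, not something from the original article: it parses the first image directory of a TIFF with Python's struct module and returns the Compression tag (259, where 1 means uncompressed) and the BitsPerSample tag (258). The function name is hypothetical, and the parser handles only the SHORT-typed values these two tags use.

```python
import struct


def tiff_compression_and_depth(data):
    """Return (compression, bits_per_sample list) from the first IFD of a TIFF."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or struct.unpack(order + "H", data[2:4])[0] != 42:
        raise ValueError("not a TIFF file")
    (ifd_offset,) = struct.unpack(order + "I", data[4:8])
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    compression, bits = 1, [1]  # TIFF defaults when a tag is absent
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i
        tag, field_type, n = struct.unpack(order + "HHI", data[start:start + 8])
        if tag not in (258, 259) or field_type != 3:  # 3 = SHORT
            continue
        if n <= 2:
            raw = data[start + 8:start + 8 + 2 * n]  # values stored inline
        else:
            (offset,) = struct.unpack(order + "I", data[start + 8:start + 12])
            raw = data[offset:offset + 2 * n]  # values stored elsewhere
        values = list(struct.unpack(order + f"{n}H", raw))
        if tag == 259:
            compression = values[0]
        else:
            bits = values
    return compression, bits
```

For a file produced with the options above, you would expect a compression value of 1 and 8 bits for each sample, for example by calling the function on open("code.tif", "rb").read().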

tesseract.exe: the OCR program, which we are "misusing" here to recognize verification codes.
code.tif: the image to be recognized.
result: the base name of the file that stores the result; Tesseract automatically appends the .txt suffix to it.

It is that simple: with just two commands, the content of the verification code is waiting for us in the result file.

----------------------------------------------------Optimization tricks-----------------------------------------------------

On Master Huang's blog I saw some possible optimizations (I have not verified them), recorded below:

To improve the recognition rate, first convert the image to black and white or grayscale by adding the parameter -monochrome (pure black and white, nothing in between) or -colorspace Gray (grayscale, where the shades of black vary; this should work better).

Enlarge the image (150% as an example): convert in.tif -scale 150% in2.tif
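The grayscale and enlargement options can be combined into a single preprocessing command. The following Python sketch only assembles the argument list; the function name and its defaults are hypothetical, and, like the tips themselves, the effect on recognition accuracy is unverified.

```python
def preprocess_command(source, target, grayscale=True, scale_percent=150):
    """Build a convert command that optionally grayscales and enlarges an image."""
    command = ["convert", str(source)]
    if grayscale:
        command += ["-colorspace", "Gray"]
    if scale_percent:
        command += ["-scale", f"{scale_percent}%"]
    command.append(str(target))
    return command
```

The returned list can be passed to subprocess.run, or joined with spaces and typed on the command line.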

To crop an image, use the -crop parameter, which extracts a sub-image of a specified region from a picture. The format is -crop widthxheight{+-}x{+-}y{%}, where width and height are the width and height of the sub-image. When x is positive, it is the x-coordinate of the region's upper-left corner; when x is negative, the upper-left x-coordinate is 0 and the width of the extracted sub-image is reduced by that many pixels. Likewise, when y is positive, it is the y-coordinate of the region's upper-left corner; when y is negative, the upper-left y-coordinate is 0 and the height of the extracted sub-image is reduced by that many pixels.
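The offset rules are easier to follow with numbers. The sketch below, in Python, formats a -crop geometry string and computes which part of the image the region covers, following the description above (a negative offset shrinks the sub-image rather than moving it). Both function names are hypothetical, and the clipping logic reflects this article's description rather than a tested guarantee about every ImageMagick version.

```python
def crop_geometry(width, height, x=0, y=0):
    """Format a geometry string such as 60x20+10+5 for convert's -crop option."""
    return f"{width}x{height}{x:+d}{y:+d}"


def visible_region(image_w, image_h, width, height, x=0, y=0):
    """Return (left, top, width, height) of the crop region clipped to the image."""
    left, top = max(x, 0), max(y, 0)
    right = min(x + width, image_w)
    bottom = min(y + height, image_h)
    if right <= left or bottom <= top:
        return None  # the region lies entirely outside the image
    return (left, top, right - left, bottom - top)
```

For a 100x40 image, the geometry 60x20-10+5 would start at the left edge and keep a sub-image 50 pixels wide, because 10 pixels of the requested region fall outside the picture.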

---------------------------------------------------recognize Chinese characters-----------------------------------------------------

Out of the box, Chinese recognition is poor, so download the Chinese language pack: http://code.google.com/p/tesseract-ocr/downloads/detail?name=chi_sim.traineddata.gz&can=2&q=

Then find the tessdata directory and replace eng.traineddata with chi_sim.traineddata, that is, rename chi_sim.traineddata to eng.traineddata.
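The file swap can be scripted so that the English data is not lost. The following Python sketch copies chi_sim.traineddata over eng.traineddata after saving a backup. Copying instead of renaming and the .bak file name are choices made for this illustration, not part of the original steps, and the function name is hypothetical.

```python
import shutil
from pathlib import Path


def use_chinese_as_default(tessdata_dir):
    """Make chi_sim.traineddata the data that Tesseract loads as eng.traineddata."""
    tessdata = Path(tessdata_dir)
    eng = tessdata / "eng.traineddata"
    chi = tessdata / "chi_sim.traineddata"
    backup = tessdata / "eng.traineddata.bak"
    if not chi.is_file():
        raise FileNotFoundError(chi)
    if eng.is_file() and not backup.exists():
        shutil.copy2(eng, backup)  # keep the English data for later
    shutil.copy2(chi, eng)
    return eng
```

Tesseract's command line also accepts a -l language option (for example, -l chi_sim), which avoids overwriting the English data; whether it suits the version used here has not been checked against this setup.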

With that, the Chinese recognition rate basically reaches more than 90%.
