I believe that in the development of some programs will have to identify the image of the text (known as OCR) needs, such as the identification of the license plate, the identification of the image format of the commodity price, the identification of image format email address and so on, of course, the most demand or identification verification code. If you want to complete the work of these OCR, you need to master the image processing, image recognition knowledge, need to use graphics morphology, Fourier transform, matrix transformation, Bayesian decision-making and many other complex theories, which makes most people will be deterred.
Tesseract the emergence of this open source project allows us to get involved in the development of OCR. Tesseract can be from the picture to identify the text content, but do not think Tesseract can intelligently identify a variety of grotesque, complex picture text, tesseract by default can only recognize very standard fonts, clear and non-intrusive picture text, Just contact Tesseract A lot of people will be issued such a review "tesseract blowing very bad, but the recognition rate is very low, not good use." In fact, we want to identify the content of strange, tesseract is the need to train to compare high accuracy rate of identification, we need to put a batch of sample pictures let Tesseract to try to identify, and then to his identified error results corrected, told him "This picture you identify the wrong, should be recognized as a certain" , so that tesseract slowly "learned" how to identify. This is a training process:
See there is a "preprocessing", what does that mean? We know that many of the verification code is added some interference processing, for example, some verification code added noise point, some verification code added interference line, some verification code with interference background, and some verification code to do the text distortion. Such as:
These images can be very difficult to identify if they are handed to tesseract directly. Developers should be in the picture before the tesseract to compare the pre-processing of the image, such as removing the interference line, remove background noise, character correction, and so on, some complex preprocessing operations may involve graphic morphology of the more in-depth theory, this is not an article can be introduced, The following is a simple picture preprocessing basic knowledge, in-depth study please refer to graphics related materials.
Basic explanation:
1,. NET Picture object class is the image class, using Image.FromFile (file) to load a picture, the general picture is a bitmap, the bitmap class is a subclass of the image class, so we generally put the image. FromFile () return value converted to bitmap type using bitmap bitmap = (bitmap) image.fromfile (file)
2, Bitmap. Save () is used to store the in-memory picture object in the output. The second argument is a picture format.
3, because the bitmap is associated to the unmanaged resources of GDI, implements the IDisposable interface, so it is necessary to use using the resource management of the object to avoid the problem of program memory leakage.
4, if you want to do an efficient picture operation, you need to work with the pointer to bitmap, of course, in order to avoid the C # pointer operation is not familiar to readers, this article I will use a slightly less efficient but more understandable getpixel, SetPixel method to do the picture operation. GetPixel and SetPixel are the two methods provided by bitmap, which can be used to read and set the color of the specified coordinate pixel points, respectively, for the picture.
Let's start by explaining the use of tesseract:
First of all, we need to collect a number of representative sample verification code samples, because the more complex verification code training process will be longer, and this time Preach Intelligence podcast. NET Academy has a limited number of free public classes for verification code recognition, so I picked a relatively simple verification code to identify. The identification process of complex verification code is very similar. I tested 100 CAPTCHA images in the article at the end of the "public lesson software, picture library and code. Zip" Compression package.
These pictures have some obvious noise background and interference lines, but the noise background and the interference line color is that several, so I use the color picker to pick up the colors of these points, using the following code to replace those colors with white, and save the picture in TIF format:
The post-conversion effect is as follows:
You can see that the background color and the interference line are all removed.
Next Run Jtessboxeditor (Jtessboxeditor is written in Java, so first need to install the configuration Java Runtime Environment, the Java Runtime Environment installation configuration is not familiar with friends, please look for information), Double-click Jtessboxeditor.jar to start the run. The second-step TIFF uses the main Menu "Tool→merge tiff" image to be merged into a single picture, such as saved to F:\aa\ under Haijia.tif file.
Download Install Tesseract-ocr-setup-3.01-1.exe (I use 3.02 have a problem, do not know whether it is my problem or I will not use, in short, we recommend that you use the 3.01 version), this setup version will automatically add the installation directory to the PATH environment variable, recommended to use. If you download the portable version, you will need to add the Tesseract decompression path to your PATH environment variable by editing your environment.
Before the next step, you need to give the training results a name, for example, I will take the name Haijia. If you take a different name, just change the "Haijia" in all the actions in the back to your name.
Start the Windows command-line window, and enter the directory where the Haijia.tif file is located, and then perform Tesseract.exe haijia.tif Haijia batch.nochop makebox. Where Haijia.tif is the file name of the merged TIFF file generated in the third step, and Haijia is the training name we take. This will generate the Haijia.box file for the initial recognition result
Ensure that the Haijia.box files and haijia.tif file filenames are exactly the same and placed under the same folder. Use Jtessboxeditor to open the Haijia.tif file, correct the text one after the other and save it. Note in the jtessboxeditor every time the characters are changed to enter, pay attention to save in time. If you find that multiple identifiers, two letters are recognized as one letter, and one letter is recognized as two letters, you need to fine-tune the functions such as merge, Split, delete, and modify the automatically recognized areas by modifying X, Y, W, H.
All automatic recognition results are corrected after the command line executes Tesseract.exe haijia.tif Haijia nobatch box.train
Command line execution Unicharset_extractor.exe Haijia.box
Create a new file named "Font_properties" in the directory, and enter text (where Haijia is the name of the training, save it using advanced text editor such as EditPlus to remove the BOM header or save in ANSI format): Haijia 1 0 0 1 0
Command line execution Cntraining.exe haijia.tr
Command line execution mftraining.exe-f font_properties-u unicharset haijia.tr
After the completion of the 12th step, a number of files should be generated under the directory, Unicharset, inttemp, Normproto, pfftable the four files plus the training name prefix "Haijia."
The command line executes "Combine_tessdata Haijia." To merge the generated haijia.traineddata training files. After performing this step, folder should be generated a haijia.traineddata file, which is the identification of training data files, only need this one haijia.traineddata file, the previously used TIF picture files and other intermediate files are not required.
Next, call Tesseract's API to identify the CAPTCHA in the program. Tesseract supports a wide range of languages, with the corresponding API packages for C + +, Java,. Net, PHP, and Python, just look for the API libraries that you use in your language. The following is still a. NET as an example.
Use the Tesseract.dll in Tesseractdotnet_v301_r590.zip to add to the appropriate reference. Note Tesseractdotnet only supports. NET 2.0 by default, so you need to modify the target of the project to. NET 2.0, and if you need to use it in more than 2.0 versions, you will need to download the source code for your tesseractdotnet to compile.
Use the same method as the second step to preprocess the identified image, and then execute the following code to identify the image object after preprocessing:
C # Identification Verification code technology-tesseract