Sooner or later you will probably need to write a program that recognizes text in images (so-called OCR): reading license plates, reading product prices or email addresses that are published as images and, most importantly, recognizing verification codes. Doing OCR from scratch requires solid knowledge of image processing and image recognition, including a good deal of complex theory such as image morphology, Fourier transforms, matrix transformations, and Bayesian decision theory, which discourages the vast majority of people.
The emergence of the open-source project Tesseract allows ordinary developers to get involved in OCR development. Tesseract can recognize text in images, but do not expect it to intelligently recognize every strange and complex kind of image text. By default, Tesseract only recognizes very standard fonts in clear images without interference. Many people who have just started with Tesseract comment: "Tesseract is very powerful, but the recognition rate is very low." The real reason is that the content we want to recognize comes in all sorts of odd forms, and Tesseract needs training to achieve high accuracy. We let Tesseract recognize a batch of sample images, then correct the wrong results and tell it: "You recognized this image incorrectly. It should be recognized as such-and-such." In this way Tesseract gradually "learns" how to recognize the text. The training process is roughly as follows:
What does "preprocessing" mean? Many verification codes are deliberately generated with interference: some add noise points, some add interference lines, some add a noisy background, and some distort the characters. For example:
If these images are handed directly to Tesseract, recognition will be very difficult. Developers should preprocess the images first, for example by removing interference lines, removing background noise, and straightening the characters. Some complex preprocessing operations involve deeper theory from image morphology, which cannot be covered in a single article. The following lists only the basics of relatively simple image preprocessing; for more, see the relevant image processing literature.
1. The class that represents an image in .NET is Image, and Image.FromFile(file) loads an image from a file. The image is usually a bitmap, and Bitmap is a subclass of Image, so the return value of FromFile() can be cast to Bitmap: Bitmap bitmap = (Bitmap)Image.FromFile(file)
2. bitmap.Save() saves the in-memory image object to an output. The second parameter is the image format.
3. Because Bitmap holds unmanaged GDI+ resources and implements the IDisposable interface, you should manage it with a using statement to avoid leaking memory. The basics of using and IDisposable in C#/.NET are not covered here. If they are unclear, see the free .NET video tutorials published by the Chuanzhi Podcast (itcast) .NET Training Institute: http://net.itcast.cn/
4. Efficient image operations require accessing the bitmap through pointers. However, for the benefit of readers unfamiliar with pointer operations in C#, this article uses the GetPixel and SetPixel methods, which are less efficient but easy to understand. GetPixel and SetPixel are methods of Bitmap that read and set the color of the pixel at the given coordinates.
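The four points above can be combined into one small sketch. The black-and-white conversion used here is only an illustration and is not the preprocessing used later in this article; the class name PixelBasics, its method names, and the threshold parameter are assumptions made for the example.

```csharp
using System.Drawing;
using System.Drawing.Imaging;

static class PixelBasics
{
    // Returns a new bitmap in which every pixel whose average brightness is
    // below the threshold becomes black and every other pixel becomes white.
    // (Illustrative operation only; the threshold is chosen by the caller.)
    public static Bitmap ToBlackAndWhite(Bitmap source, int threshold)
    {
        Bitmap result = new Bitmap(source.Width, source.Height);
        for (int x = 0; x < source.Width; x++)
        {
            for (int y = 0; y < source.Height; y++)
            {
                Color c = source.GetPixel(x, y);            // read one pixel
                int brightness = (c.R + c.G + c.B) / 3;
                result.SetPixel(x, y,                       // write one pixel
                    brightness < threshold ? Color.Black : Color.White);
            }
        }
        return result;
    }

    // Loads an image, converts it, and saves the result as TIFF.
    // Both bitmaps are disposed by the using statements.
    public static void ConvertFile(string inputPath, string outputPath, int threshold)
    {
        using (Bitmap source = (Bitmap)Image.FromFile(inputPath))
        using (Bitmap result = ToBlackAndWhite(source, threshold))
        {
            result.Save(outputPath, ImageFormat.Tiff);
        }
    }
}
```

A call such as PixelBasics.ConvertFile(inputPath, outputPath, 128) would process a single file; 128 is just a mid-range guess, and a suitable value depends on the images.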
The following describes how to use tesseract:
1. First, collect a number of representative verification code sample images. Training for complicated verification codes takes a long time, and the time available in this free open class on verification code recognition held by the Chuanzhi Podcast .NET College is limited, so I chose a relatively simple verification code. The process for complex verification codes is similar. The verification code images I tested with, the software used in the open class, and the code are in the zip package at the end of the article.
2. These pictures have obvious background noise and interference lines, but the noise and the lines use only a few fixed colors. I therefore used a color picker to read the colors of those points, and used the following code to replace them with white and save the images in TIF format:
    // Requires: using System.Drawing; using System.Drawing.Imaging; using System.IO;
    string[] files = Directory.GetFiles(@"D:\kuaipan\Chuanzhi materials\class materials\open classes\verification code recognition in February\haijia", "*.gif");
    for (int i = 0; i < files.Length; i++)
    {
        string file = files[i];
        using (Bitmap bitmap = (Bitmap)Image.FromFile(file))
        using (Bitmap newBitmap = Process(bitmap))
        {
            newBitmap.Save(@"F:\AA\" + i + ".tif", ImageFormat.Tiff);
        }
    }

    private static Bitmap Process(Bitmap bitmap)
    {
        Bitmap newBitmap = new Bitmap(bitmap.Width, bitmap.Height);
        for (int x = 0; x < bitmap.Width; x++)
        {
            for (int y = 0; y < bitmap.Height; y++)
            {
                // Remove the border
                if (x == 0 || y == 0 || x == bitmap.Width - 1 || y == bitmap.Height - 1)
                {
                    newBitmap.SetPixel(x, y, Color.White);
                }
                else
                {
                    Color color = bitmap.GetPixel(x, y);
                    // If the pixel has one of the background interference colors, make it white
                    if (color.Equals(Color.FromArgb(204, 204, 51))
                        || color.Equals(Color.FromArgb(153, 204, 51))
                        || color.Equals(Color.FromArgb(204, 255, 102))
                        || color.Equals(Color.FromArgb(204, 204, 204))
                        || color.Equals(Color.FromArgb(204, 255, 51)))
                    {
                        newBitmap.SetPixel(x, y, Color.White);
                    }
                    else
                    {
                        newBitmap.SetPixel(x, y, color);
                    }
                }
            }
        }
        return newBitmap;
    }
The F:\AA\ folder will then contain the 100 converted images. The effect after conversion is as follows:
We can see that the background color and the interference lines have all been removed.
3. Run jTessBoxEditor. It is written in Java, so you need to install and configure the Java Runtime Environment first (if you are not familiar with that, please look up the relevant material yourself). Double-click jTessBoxEditor.jar to start it. Use the main menu "Tools → Merge TIFF" to merge the TIFF images produced in step 2 into a single image, for example saving it as haijia.tif under F:\AA.
4. Download and install tesseract-ocr-setup-3.01-1.exe. (I ran into problems with 3.02, and I do not know whether the fault lies with that version or with how I used it. In short, I recommend starting with version 3.01.) This setup version automatically adds the installation directory to the PATH environment variable, which is why I recommend it. If you download the portable version instead, you must edit the environment variables yourself and add the directory where you extracted Tesseract to PATH.
5. Before proceeding, give the training result a name, for example haijia. If you choose another name, simply replace "haijia" with your name in all subsequent operations.
6. Open a Windows command prompt and change to the directory that contains the haijia.tif file. Then run: tesseract.exe haijia.tif haijia batch.nochop makebox. Here haijia.tif is the file generated in step 3 and haijia is the training name. This generates the haijia.box file containing the initial recognition results.
7. Make sure that haijia.box and haijia.tif have the same base name and are in the same folder. Open haijia.tif in jTessBoxEditor, correct the characters one by one, and save. Note: in jTessBoxEditor you must press Enter after each character you modify, and you should save your changes promptly. If an extra letter has been recognized, two letters have been recognized as one, or one letter has been recognized as two, use functions such as Merge, Split, and Delete to fix it. You can also adjust an automatically detected region by changing its X, Y, W, and H values.
8. Once all the automatic recognition results have been corrected, run: tesseract.exe haijia.tif haijia nobatch box.train (this generates haijia.tr).
9. Run on the command line: unicharset_extractor.exe haijia.box
10. Create a file named "font_properties" in the directory with the following content (haijia is the training name; use an advanced text editor such as EditPlus to save the file without a BOM header, or save it in ANSI format): haijia 1 0 0 1 0
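If you would rather create this file from code than from a text editor, the following is a minimal sketch. The class name FontProperties and its parameters are assumptions for illustration; the line content is the one given in step 10. Passing false to the UTF8Encoding constructor is what suppresses the BOM.

```csharp
using System.IO;
using System.Text;

static class FontProperties
{
    // Writes "<trainingName> 1 0 0 1 0" to a file named font_properties
    // in the given directory, encoded as UTF-8 without a BOM.
    public static void Write(string directory, string trainingName)
    {
        string path = Path.Combine(directory, "font_properties");
        string line = trainingName + " 1 0 0 1 0\n";
        // new UTF8Encoding(false): do not emit the byte order mark
        File.WriteAllText(path, line, new UTF8Encoding(false));
    }
}
```

For example, FontProperties.Write(@"F:\AA", "haijia") would create the file next to haijia.tif, assuming that is the training directory.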
11. Run on the command line: cntraining.exe haijia.tr
12. Run on the command line: mftraining.exe -F font_properties -U unicharset haijia.tr
13. After the preceding steps are completed, several files should have been generated in the directory. Rename the four files unicharset, inttemp, normproto, and pffmtable so that each is prefixed with the training name followed by a dot, "haijia." (for example, haijia.unicharset).
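The renaming in step 13 can also be scripted. This is only a sketch: the class name TrainingFiles and its parameters are hypothetical, and it assumes the four files sit directly in the training directory under the names listed above.

```csharp
using System.IO;

static class TrainingFiles
{
    // The four intermediate files that need the training-name prefix.
    static readonly string[] Names = { "unicharset", "inttemp", "normproto", "pffmtable" };

    // Renames e.g. "unicharset" to "<trainingName>.unicharset".
    // Files that are missing or already renamed are skipped.
    public static void AddPrefix(string directory, string trainingName)
    {
        foreach (string name in Names)
        {
            string source = Path.Combine(directory, name);
            string target = Path.Combine(directory, trainingName + "." + name);
            if (File.Exists(source) && !File.Exists(target))
            {
                File.Move(source, target);
            }
        }
    }
}
```

Skipping files that are missing keeps the helper safe to run twice, at the cost of not reporting a training step that silently failed to produce its output.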
14. Run "combine_tessdata haijia." on the command line (note the trailing dot) to merge the generated files into the training file. After this step a haijia.traineddata file should exist in the folder. This is the training data file used for recognition. It is the only file you need; the TIF image and the other intermediate files can be deleted.
15. Call the Tesseract API from your program to recognize the verification code. Tesseract can be used from many languages: C/C++, Java, .NET, PHP, and Python all have corresponding API wrappers, so you only need to find the library for your language. The following uses .NET as an example.
a) Extract tesseractdotnet_v3020.r590.zip and add tesseract.dll to the project references. Note that the prebuilt tesseractdotnet supports only .NET 2.0 by default, so you need to change the project's target framework to .NET 2.0. To use it with a later version, download the tesseractdotnet source code and compile it yourself.
b) Preprocess the image to be recognized in the same way as in step 2, then run the following code to recognize the preprocessed image object:
    using (Bitmap bitmap = (Bitmap)Image.FromFile(file))
    using (Bitmap newBitmap = Process(bitmap)) // image preprocessing
    {
        TesseractProcessor processor = new TesseractProcessor();
        processor.SetPageSegMode(ePageSegMode.PSM_SINGLE_LINE);
        // F:\AA\ is the folder that contains haijia.traineddata.
        // Note that the path must end with "\" and must use "\" rather than "/"
        // as the separator; otherwise an AccessViolationException is thrown.
        processor.Init(@"F:\AA\", "haijia", (int)eOcrEngineMode.OEM_DEFAULT);
        string result = processor.Recognize(newBitmap); // the return value is the recognition result
        MessageBox.Show(result);
    }
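Because a wrongly formatted path leads to an AccessViolationException rather than a readable error, it may help to normalize the folder path before passing it to Init. The helper below is a hypothetical sketch (the class and method names are not part of tesseractdotnet); it only enforces the two rules stated in the comment above.

```csharp
using System;

static class TessdataPath
{
    // Converts "/" separators to "\" and makes sure the path ends with "\".
    public static string Normalize(string path)
    {
        if (string.IsNullOrEmpty(path))
        {
            throw new ArgumentException("path must not be empty", "path");
        }
        string normalized = path.Replace('/', '\\');
        if (!normalized.EndsWith("\\", StringComparison.Ordinal))
        {
            normalized += "\\";
        }
        return normalized;
    }
}
```

With this helper, the call could be written as processor.Init(TessdataPath.Normalize("F:/AA"), "haijia", (int)eOcrEngineMode.OEM_DEFAULT), so that a path typed with forward slashes or without the trailing separator still reaches Init in the required form.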
Cool! Neat!
After all, what text alone can convey is limited, so I am publishing for free the classroom videos from this Chuanzhi Podcast open class. Learning from the videos is more convenient, and they also cover some things that are not discussed in this article. Video tutorial: http://dl.vmall.com/c0kvta13ex