First of all, I have to admit that my interest in Tesseract-OCR was sparked by the gimmick in the following article: hacking website verification codes with 26 lines of Groovy code.
http://www.kellyrob99.com/blog/2010/03/14/breaking-weak-captcha-in-slightly-more-than-26-lines-of-groovy-code/
Of course, after reading it I found out that it actually calls a third-party library, Tesseract-OCR ...
http://code.google.com/p/tesseract-ocr/
Nevertheless, in the spirit of Grandpa Deng's principle that "it doesn't matter whether the cat is white or black, as long as it catches mice it is a good cat", I used the holiday to start some preliminary research on text recognition.
Tesseract, originally from HP, is now supported by Google. It recognizes English letters and digits, and its recognition accuracy is reportedly ranked third in the world. Even more commendably, it offers multi-language packs for download (including Chinese, though the accuracy is really ...) and comes with its own training tool.
After installing it and running through the examples, the first application that came to mind was naturally verification code analysis.
According to the documentation, the quality of the images fed into Tesseract directly affects the recognition result, so some simple preprocessing is essential.
1. First, convert to grayscale, with gray value = 0.3R + 0.59G + 0.11B:
Java code
    // needs: import java.awt.Color;
    for (int y = minY; y < height; y++) {
        for (int x = minX; x < width; x++) {
            int rgb = srcImg.getRGB(x, y);
            Color color = new Color(rgb); // split the packed int into its R, G, B components
            // the weighted sum of the three channels gives the gray level
            int gray = (int) (0.3 * color.getRed() + 0.59
                    * color.getGreen() + 0.11 * color.getBlue());
            Color newColor = new Color(gray, gray, gray);
            srcImg.setRGB(x, y, newColor.getRGB());
        }
    }
Results
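As a side note on step 1, the same weighted sum can be applied directly to the packed RGB integer, which avoids creating a Color object for every pixel. This is only a minimal sketch: the class name GrayPixel and its method names are made up for illustration and do not appear in the original code.

```java
// Hypothetical helper; the names GrayPixel, grayLevel and toGrayRgb are illustrative only.
final class GrayPixel {

    /** Returns the weighted gray level (0-255) of one packed RGB value. */
    static int grayLevel(int rgb) {
        int r = (rgb >> 16) & 0xFF; // red channel
        int g = (rgb >> 8) & 0xFF;  // green channel
        int b = rgb & 0xFF;         // blue channel
        return (int) (0.3 * r + 0.59 * g + 0.11 * b);
    }

    /** Returns an opaque packed RGB value with R = G = B = gray level. */
    static int toGrayRgb(int rgb) {
        int gray = grayLevel(rgb);
        return 0xFF000000 | (gray << 16) | (gray << 8) | gray;
    }
}
```

Inside the loop above, `srcImg.setRGB(x, y, GrayPixel.toGrayRgb(srcImg.getRGB(x, y)))` would then replace the Color-based lines.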
2. Next, invert the grayscale:
Java code
    // needs: import java.awt.Color;
    for (int y = minY; y < height; y++) {
        for (int x = minX; x < width; x++) {
            int rgb = buffImg.getRGB(x, y);
            Color color = new Color(rgb); // split the packed int into its R, G, B components
            // subtract each channel from 255 to invert it
            Color newColor = new Color(255 - color.getRed(), 255 - color
                    .getGreen(), 255 - color.getBlue());
            buffImg.setRGB(x, y, newColor.getRGB());
        }
    }
Results
3. Then binarize: take the average gray level of the image as the threshold; everything below it becomes 0 and everything above it becomes 255:
Java code
    // needs: import java.awt.Color;
    for (int y = minY; y < height; y++) {
        for (int x = minX; x < width; x++) {
            int rgb = buffImg.getRGB(x, y);
            Color color = new Color(rgb); // split the packed int into its R, G, B components
            int value = 255 - color.getBlue(); // the image is gray, so one channel is enough
            if (value > average) { // average: the mean gray level of the image
                Color newColor = new Color(0, 0, 0);
                buffImg.setRGB(x, y, newColor.getRGB());
            } else {
                Color newColor = new Color(255, 255, 255);
                buffImg.setRGB(x, y, newColor.getRGB());
            }
        }
    }
Results
The result looks good enough, so the resizing, median filtering, and noise removal steps were skipped.
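The binarization snippet uses a variable named average without showing how it is computed. A minimal sketch of one way to do it follows, assuming the image is already grayscale (R = G = B), so that reading the blue channel is enough. The class and method names are hypothetical. Unlike the snippet above, which works on 255 minus the blue value, this version follows the prose description directly: pixels above the threshold become white, all others black.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper; MeanThreshold and its methods are not from the original code.
final class MeanThreshold {

    /** Mean gray level of the image, read from the blue channel of every pixel. */
    static int averageGray(BufferedImage img) {
        long sum = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                sum += img.getRGB(x, y) & 0xFF; // low byte is the blue channel
            }
        }
        return (int) (sum / ((long) img.getWidth() * img.getHeight()));
    }

    /** Sets pixels brighter than the threshold to white and all others to black. */
    static void binarize(BufferedImage img, int threshold) {
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int gray = img.getRGB(x, y) & 0xFF;
                img.setRGB(x, y, gray > threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
    }
}
```

A typical call would be `MeanThreshold.binarize(img, MeanThreshold.averageGray(img))`.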
That completes the image preprocessing. Tesseract does not provide a Java API, so it is invoked purely through the command line:
Java code
    // needs: import java.util.ArrayList; import java.util.List;
    List<String> cmd = new ArrayList<String>(); // holds the command-line arguments
    cmd.add(tessPath + "\\tesseract");           // path to the tesseract executable
    cmd.add("");                                 // placeholder for the image file name, filled in below
    cmd.add(outputFile.getName());               // output file location
    cmd.add(LANG_OPTION);                        // language option flag
    cmd.add("eng");                              // English; the matching data file is looked up in tessdata
    ProcessBuilder pb = new ProcessBuilder();
    pb.directory(imageFile.getParentFile());     // run in the directory that contains the image
    cmd.set(1, tempImage.getName());             // put the image file name at index 1, right after the executable
    pb.command(cmd);                             // set the command to run
    pb.redirectErrorStream(true);                // merge standard error into standard output
    Process process = pb.start();                // start the process
    int w = process.waitFor();                   // block until the process exits; w is its exit code
The output shows that everything works.
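For reference, the invocation can be wrapped in a small self-contained helper. This is a sketch under a few assumptions: the language option is taken to be Tesseract's `-l` flag, the recognized text is read back from the file that the command-line tool writes as the output base name plus `.txt`, and the class and method names are made up for illustration. It uses `inheritIO()` instead of merging the error stream, so that the child process cannot block on an unread output buffer.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper; TesseractCommand and its methods are illustrative names.
final class TesseractCommand {

    /** Builds the argument list: tesseract <image> <outputBase> -l <lang>. */
    static List<String> build(File tessDir, String imageName, String outputBase, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add(new File(tessDir, "tesseract").getPath()); // executable
        cmd.add(imageName);                                // input image
        cmd.add(outputBase);                               // output base name, without extension
        cmd.add("-l");                                     // language flag (assumed value of the language option)
        cmd.add(lang);                                     // e.g. "eng"
        return cmd;
    }

    /** Runs tesseract in the image's directory and returns the recognized text. */
    static String run(File tessDir, File image, String outputBase, String lang)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(build(tessDir, image.getName(), outputBase, lang));
        pb.directory(image.getAbsoluteFile().getParentFile());
        pb.inheritIO(); // let tesseract write to this process's console
        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        File result = new File(pb.directory(), outputBase + ".txt");
        return new String(Files.readAllBytes(result.toPath()), StandardCharsets.UTF_8).trim();
    }
}
```

Checking the exit code before reading the result file avoids picking up a stale output file from an earlier run when the current run fails.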
Of course, to really make good use of Tesseract-OCR you also need its powerful training tools, but that is a story for another time ...
In addition, apart from serving as a countermeasure for cracking verification codes, do we have any other relevant applications for text recognition?
Use TESSERACT-OCR to hack website verification code