Background
I have recently been working on a floating-population management project that needs to photograph an ID card from an H5 web page and recognize it. My first reaction was confusion: isn't this an app feature? To iterate quickly, the product team keeps piling app features onto H5, and there was no way around it.
I looked up some information and found that, apart from paid OCR services (Baidu, Yunmai, etc.), the only decent option that supports Chinese is Tesseract. I did not test the paid OCR services.
For now I decided to use Tesseract.
Approach
My approach is as follows:
The H5 page calls the camera --> the photo is uploaded to the server --> the server recognizes the identity information --> the server returns the identity information.
Key point: improving the recognition rate
Because the recognition rate for the whole ID card is too disappointing, the ID card image must be processed first.
If we crop out only the key information on the ID card, can we improve the OCR recognition rate?
I started testing based on this hypothesis.
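The cropping idea can be sketched in Java roughly as follows. This is only a minimal sketch, not the processing used in this post: the class name IdNumberCropper and the four layout ratios are hypothetical values chosen for illustration, and they would need to be measured against real, well-aligned card photos.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Crops the strip of an ID card photo that is expected to hold the ID number. */
final class IdNumberCropper {

    // Hypothetical layout ratios (assumptions, not measured from a real card).
    static final double LEFT_RATIO = 0.30;
    static final double TOP_RATIO = 0.78;
    static final double WIDTH_RATIO = 0.65;
    static final double HEIGHT_RATIO = 0.16;

    static BufferedImage cropNumberStrip(BufferedImage card) {
        int x = (int) Math.round(card.getWidth() * LEFT_RATIO);
        int y = (int) Math.round(card.getHeight() * TOP_RATIO);
        int w = (int) Math.round(card.getWidth() * WIDTH_RATIO);
        int h = (int) Math.round(card.getHeight() * HEIGHT_RATIO);
        // Clamp so the rectangle always stays inside the image.
        x = Math.min(x, card.getWidth() - 1);
        y = Math.min(y, card.getHeight() - 1);
        w = Math.max(1, Math.min(w, card.getWidth() - x));
        h = Math.max(1, Math.min(h, card.getHeight() - y));
        return card.getSubimage(x, y, w, h);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: IdNumberCropper <card image> <output png>");
            return;
        }
        BufferedImage card = ImageIO.read(new File(args[0]));
        if (card == null) {
            throw new IOException("Unsupported image: " + args[0]);
        }
        ImageIO.write(cropNumberStrip(card), "png", new File(args[1]));
    }
}
```

The cropped strip, rather than the full card photo, would then be the image handed to the OCR step.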
Installing Tesseract-OCR
I tested under Windows. For installation on Linux, you can refer to this blog post: Installing Tesseract-OCR on Linux.
Windows is relatively simple: download the installer and run it. Any language packs you need can be selected in the installer options; select the Chinese language pack there if you want Chinese, since only English is included by default. The version I installed is tesseract-ocr 3.01-1.
Tesseract
Because of the firewall, this URL cannot be opened directly, so download it with Xunlei (Thunder): copy the link address into a new Xunlei task and it will work.
Click Agree, then Next, and select the language pack. I chose Simplified Chinese, but because of the firewall it did not install!!!
Everyone knows the recognition rate for Chinese is too low. After some back-and-forth with the product team, we agreed to first ship a version that recognizes only the ID number. That is why my title says recognizing the ID number rather than recognizing the identity information.
After installation, find an image of digits to test with.
Go to the image directory and run the following in cmd:
tesseract 1.png result
tesseract is the command.
1.png is the image.
result is the name of the txt file that will hold the results; any name will do.
The results were worrying.
Fortunately, there is a way to improve the recognition rate.
Improving the digit recognition rate by restricting the recognized character set
Locate tessdata\configs in the installation directory and open the digits file with any text editor.
I installed it in this directory:
D:\Program Files (x86)\Tesseract-OCR\tessdata\configs\digits
You will see the following line. Since we only need to recognize the ID number, keep only the digits and X.
tessedit_char_whitelist 0123456789-.
Change it to:
tessedit_char_whitelist 0123456789X
Save the file.
Now test again, but the command needs to change: append digits at the end.
tesseract 1.png result digits
This time the results are much more satisfactory.
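The two command lines used so far differ only in the trailing config name, so the argument list can be assembled by a small helper before it is handed to a process launcher. This is a minimal sketch; the class TesseractCommand and its build method are hypothetical names for illustration, and it assumes that tesseract is on the PATH and that the digits config has been edited as described.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds the argument list for a tesseract command-line call. */
final class TesseractCommand {

    /**
     * @param imageName  image file name, e.g. "1.png"
     * @param outputBase output name without extension, e.g. "result"
     * @param language   language code passed with -l, or null to use the default
     * @param digitsOnly whether to append the "digits" config file name
     */
    static List<String> build(String imageName, String outputBase,
                              String language, boolean digitsOnly) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imageName);
        cmd.add(outputBase);
        if (language != null) {
            // Each token must be its own list element, never "-l eng" as one string.
            cmd.add("-l");
            cmd.add(language);
        }
        if (digitsOnly) {
            // Config file names go last on the command line.
            cmd.add("digits");
        }
        return cmd;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", build("1.png", "result", null, true)));
    }
}
```

Keeping every token as a separate element matters once the list is passed to ProcessBuilder, which does not split arguments on spaces.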
The next step is to invoke the recognition command from Java, which really just means running a cmd command from Java.
Using Java to invoke the recognition command
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

/**
 * Created by Gavin Wang on 16-3-3.
 */
public class Tesseract {
    private final String LANG_OPTION = "-l";
    private final String EOL = System.getProperty("line.separator");
    /**
     * Location of the tesseract files; I put them in the same path as the project.
     */
    private String tessPath = new File("tesseract").getAbsolutePath();

    /**
     * @param imageFile the incoming image file
     * @return the recognized String
     */
    public String recognizeText(File imageFile) throws Exception {
        // The output file is written to the same directory as the image.
        File outputFile = new File(imageFile.getParentFile(), "output");
        StringBuffer strB = new StringBuffer();

        List<String> cmd = new ArrayList<String>();
        // The command name is the same on Windows and on other systems.
        cmd.add("tesseract");
        // cmd.add(tessPath + "\\tesseract");
        cmd.add(imageFile.getName());
        cmd.add(outputFile.getName());
        // cmd.add(LANG_OPTION);
        // cmd.add("chi_sim");
        cmd.add("digits");
        // cmd.add("eng");
        // cmd.add("-psm"); cmd.add("7");

        ProcessBuilder pb = new ProcessBuilder();
        // Set this process builder's working directory.
        pb.directory(imageFile.getParentFile());
        // cmd.set(1, imageFile.getName());
        pb.command(cmd);
        pb.redirectErrorStream(true);
        Process process = pb.start();
        // Process process = pb.command("ipconfig").start();
        // System.out.println(System.getenv().get("Path"));
        // Process process = pb.command("D:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe", imageFile.getName(), outputFile.getName(), LANG_OPTION, "eng").start();
        // Runtime.getRuntime().exec("tesseract.exe 1.jpg 1 -l chi_sim");
        // System.out.println(cmd.toString());

        // The exit value of the process. By convention, 0 indicates normal termination.
        int w = process.waitFor();
        File resultFile = new File(outputFile.getAbsolutePath() + ".txt");
        if (w == 0) { // 0 means normal exit
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(resultFile), "UTF-8"));
            String str;
            while ((str = in.readLine()) != null) {
                strB.append(str).append(EOL);
            }
            in.close();
        } else {
            String msg;
            switch (w) {
                case 1:
                    msg = "Errors accessing files. There may be spaces in your image's filename.";
                    break;
                case 29:
                    msg = "Cannot recognize the image or its selected region.";
                    break;
                case 31:
                    msg = "Unsupported image format.";
                    break;
                default:
                    msg = "Errors occurred.";
            }
            throw new RuntimeException(msg);
        }
        resultFile.delete();
        // Strip all whitespace so only the recognized characters remain.
        return strB.toString().replaceAll("\\s*", "");
    }
}
The commented-out content can be deleted; it was left over from my testing.
Here is the main method:
import java.io.File;

/**
 * Created by Gavin Wang on 16-3-3.
 */
public class Start {
    public static void main(String[] args) throws Exception {
        tesseract("/1.png");
        tesseract("/2.png");
        tesseract("/3.png");
        tesseract("/4.png");
        tesseract("/5.png");
        tesseract("/6.png");
    }

    private static void tesseract(String fileString) throws Exception {
        // Resolve the test image from the classpath.
        String filePath = Start.class.getResource(fileString).getFile();
        // processImg(filePath);
        File file = new File(filePath);
        String recognizeText = new Tesseract().recognizeText(file);
        System.out.println(recognizeText);
    }
}
You may get an error saying that tesseract could not be executed in the xx directory ....
Restarting the computer at this point should fix it, because the environment variable has not yet taken effect in IDEA.
The test results were still fairly satisfactory.
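To check whether the running JVM can actually see tesseract, one option is to scan the PATH value that the process inherited. This is a minimal sketch; PathLookup and find are hypothetical names for illustration, and the executable names it searches for are assumptions about a typical install.

```java
import java.io.File;
import java.util.Optional;
import java.util.regex.Pattern;

/** Looks for an executable in the directories listed in a PATH-style string. */
final class PathLookup {

    static Optional<File> find(String pathValue, String... candidateNames) {
        if (pathValue == null) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : candidateNames) {
                File candidate = new File(dir, name);
                if (candidate.isFile()) {
                    return Optional.of(candidate);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The environment is captured when the JVM (or the IDE that launched it) started.
        Optional<File> exe = find(System.getenv("PATH"), "tesseract.exe", "tesseract");
        System.out.println(exe.map(File::getAbsolutePath)
                .orElse("tesseract not found on the PATH seen by this JVM"));
    }
}
```

If it reports that nothing was found while the same command works in a fresh cmd window, the IDE is most likely still running with the old environment.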
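One way to sanity-check a recognized string before trusting it is the check character of the 18-character ID number, which is also why X stays in the whitelist. This check is not part of the original code in this post. It is a minimal sketch of the weighted mod-11 rule from GB 11643-1999, with IdNumberCheck as a hypothetical class name; it validates only the format and the check character, not the region or birth-date fields.

```java
/** Validates the check character of an 18-character Chinese ID number. */
final class IdNumberCheck {

    // Weights for the first 17 digits and the check-character table (GB 11643-1999).
    private static final int[] WEIGHTS = {7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2};
    private static final String CHECK_CODES = "10X98765432";

    static boolean looksValid(String text) {
        if (text == null || !text.matches("\\d{17}[\\dX]")) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < 17; i++) {
            sum += (text.charAt(i) - '0') * WEIGHTS[i];
        }
        return CHECK_CODES.charAt(sum % 11) == text.charAt(17);
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(arg + " -> " + looksValid(arg));
        }
    }
}
```

A string that fails this check was almost certainly misread somewhere, so the photo can be rejected and retaken instead of returning a wrong number.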
Postscript
That wraps up recognition for now. In the next post we will continue with how to use H5 to recognize the ID card number.
Other thoughts on improving the recognition rate
In fact, I also wanted to preprocess the image before recognizing the number, for example by removing the ID card background and leaving only the ID number, so that it looks like the following.
But I found that the recognition rate actually dropped for some images after processing. I have not studied the principles of Tesseract-OCR, so I do not know why. If you have looked into how to improve the recognition rate, I hope you will leave a message below and tell me. Thank you.
The second article, on how to use H5 to recognize the ID number, can be found on my blog: http://blog.csdn.net/hiredme
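For readers who want to experiment with this kind of background removal, one common approach is fixed-threshold binarization. Whether this matches the processing I tried above is not something the sketch claims. The class name Binarizer and whatever threshold you choose are assumptions, and it says nothing about why the recognition rate dropped.

```java
import java.awt.image.BufferedImage;

/** Turns an image into pure black and white using a fixed luminance threshold. */
final class Binarizer {

    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness (0..255).
                int luma = (r * 299 + g * 587 + b * 114) / 1000;
                // Dark pixels (likely text) become black, everything else white.
                out.setRGB(x, y, luma < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return out;
    }
}
```

The threshold has to be tuned per image set; a value that is too aggressive can erase thin strokes, so it is worth comparing OCR output before and after.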
(Reposted from http://blog.csdn.net/hiredme/article/details/50894814)
Top Java identification ID number, H5 identification ID number, TESSERACT-OCR identification (i) (EXT)