Android OCR: Tesseract text recognition and training
Recently I have been working on ID card number recognition. After some searching online, I found that Tesseract-OCR is a powerful open-source OCR engine. It was developed by HP between 1985 and 1995 and was later taken over by Google; with Google's further development, Tesseract-OCR has improved significantly.
Tesseract-OCR works with the Leptonica image library, can read many image formats, and converts them to text in more than 60 languages. It runs on Linux, Windows, Mac OS X, and other systems, and can be compiled for the Android and iPhone platforms.
At present, the Android version lives at https://code.google.com/p/tesseract-android-tools/. That version requires downloading many dependent library files, and I ran into a lot of problems while compiling it, so in the end I turned to the tess-two project instead; its introduction on GitHub is very detailed, so the build process is not described here. I did hit a permission problem during compilation, which I resolved by making the files executable with a chmod 777 command. The .so files compiled into libs are the library files we need for development.
Tess-two ships with an example program showing how to use Tesseract-OCR on Android, but it is fairly bare-bones. There is also a well-made open-source recognition project, https://github.com/rmtheis/android-ocr, which I used as a reference for my own development. After trying it, though, I found that it does not handle ID card recognition well: the recognition rate is not very high, and recognition often fails outright.
To improve the recognition rate, I trained my own recognition data. There are many OCR training guides online, for example http://my.oschina.net/lixinspace/blog/60124.
First, we need the following tools:
Tesseract-OCR 3.01 (the latest version, 3.02, had problems on my machine)
jTessBoxEditor, a box-file editor written in Java
1. Create a new trainocr folder, copy the two downloads above into it, extract them both, and create a new temp folder inside the Tesseract-ocr folder.
2. Next we prepare the material for training, as shown in the figure.
To improve the recognition rate, we need to provide many sample images like the one above; I used more than 50 images to train the ID card number recognition library, and my eyes were strained by the time I finished. The images must be in TIFF format; you can save them as TIFF with the built-in Windows Paint tool. Once the TIFF images are ready, open jTessBoxEditor.jar.
Before that, we need to create a custom.tif file in the temp folder created in step 1: in jTessBoxEditor choose Tools -> Merge TIFF, then select the prepared TIFF images (make sure all of them are selected), click Open, choose the newly created custom.tif as the output file, and click Save. This merges the multiple TIFF images into a single file.
3. Now we generate the box file. Open a cmd prompt, change into the temp folder, and enter the following command:
D:\Trainocr\Tesseract-ocr\temp> ..\tesseract.exe custom.tif custom batch.nochop makebox
After it runs, a custom.box file appears in the temp folder; it records the recognized characters and their corresponding coordinates.
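Each line of a box file has the form `<char> <left> <bottom> <right> <top> <page>`, with pixel coordinates measured from the bottom-left corner of the image. A minimal Python sketch for inspecting such a file (the sample lines here are hypothetical, not taken from a real custom.box):

```python
# Parse Tesseract box-file lines: "<char> <left> <bottom> <right> <top> <page>".
# Coordinates are in pixels, origin at the image's bottom-left corner.

def parse_box_lines(text):
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip malformed lines
        ch = parts[0]
        left, bottom, right, top, page = (int(p) for p in parts[1:])
        boxes.append({"char": ch, "left": left, "bottom": bottom,
                      "right": right, "top": top, "page": page})
    return boxes

# Hypothetical sample lines from a box file for an ID-number image.
sample = "3 25 12 40 38 0\n4 45 12 60 38 0"
for box in parse_box_lines(sample):
    # print each character with its box width and height
    print(box["char"], box["right"] - box["left"], box["top"] - box["bottom"])
```

Dumping the widths and heights this way is a quick sanity check that the boxes roughly match your character sizes before you start correcting them by hand.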
4. Next we correct the recognition results. Again using the jTessBoxEditor tool, switch to the Box Editor tab and open custom.tif, then adjust each mis-recognized character's box using the X, Y, W, and H fields in the upper right corner. Remember to save the changes.
5. Run the following command to extract the character set:
D:\Trainocr\Tesseract-ocr\temp> ..\unicharset_extractor.exe custom.box
6. Next, we need to create a font_properties file in the temp folder. Version 3.01 requires this file to provide the font style information reported with the recognition output. The file format is
<Fontname> <italic> <bold> <fixed> <serif> <fraktur>
timesitalic 1 0 0 1 0
We can create font_properties based on the actual situation. What I wrote is
custom 0 0 0 0 0
It indicates an ordinary font with no special styling.
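The file is easy enough to write by hand, but since each line is just a name plus five 0/1 style flags, it can also be generated. A sketch (the helper name `font_properties_line` is mine; only the file format comes from the article):

```python
# Build one line of a Tesseract font_properties file, in the format
# <fontname> <italic> <bold> <fixed> <serif> <fraktur>.

def font_properties_line(fontname, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    flags = (italic, bold, fixed, serif, fraktur)
    assert all(f in (0, 1) for f in flags), "style flags must be 0 or 1"
    return "%s %d %d %d %d %d" % ((fontname,) + flags)

# The article's plain, style-free font:
print(font_properties_line("custom"))
# A serif italic font, like the timesitalic example above:
print(font_properties_line("timesitalic", italic=1, serif=1))
```

Writing the returned line(s) to temp\font_properties reproduces exactly the file described above.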
Run the following command. (Note: custom.tr is the feature file consumed here; in the standard 3.0x workflow it is produced by running ..\tesseract.exe custom.tif custom nobatch box.train after the box file has been corrected.)
D:\Trainocr\Tesseract-ocr\temp> ..\mftraining.exe -F font_properties -U unicharset custom.tr
7. Clustering. Enter the command:
D:\Trainocr\Tesseract-ocr\temp> ..\cntraining.exe custom.tr
8. The temp folder now contains many generated files. Add the prefix custom. (note the trailing dot) to the files inttemp, Microfeat, normproto, pffmtable, and unicharset, then enter the following command:
D:\Trainocr\Tesseract-ocr\temp> ..\combine_tessdata.exe custom.
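The renaming in step 8 is easy to get wrong by hand. A sketch that adds the custom. prefix automatically (demonstrated here on a throwaway directory with empty stand-in files; in practice you would point it at the temp folder):

```python
import os
import tempfile

# The training output files that step 8 says must gain the "custom." prefix.
TRAINING_FILES = ["inttemp", "Microfeat", "normproto", "pffmtable", "unicharset"]

def add_prefix(directory, prefix="custom."):
    """Rename each training output file to <prefix><name>; return what was renamed."""
    renamed = []
    for name in TRAINING_FILES:
        src = os.path.join(directory, name)
        if os.path.exists(src):
            dst = os.path.join(directory, prefix + name)
            os.rename(src, dst)
            renamed.append(prefix + name)
    return renamed

# Demonstrate on a temporary directory with empty stand-in files.
workdir = tempfile.mkdtemp()
for name in TRAINING_FILES:
    open(os.path.join(workdir, name), "w").close()
print(add_prefix(workdir))
```

After this, combine_tessdata.exe custom. can pick up all five prefixed files in one go.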
In the output, we need to check that the offsets reported after type 1, type 3, type 4, and type 5 are not -1; if so, we can use the new data for recognition. Copy the generated custom.traineddata file into the tessdata folder, and then run

tesseract test.jpg result -l custom
to recognize with the new training data. My tests show the recognition rate is indeed improved. In practical applications, we need to run many sample images through the steps above to generate the recognition library we need; only then does the recognition rate become acceptable.
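To recap, the whole walkthrough can be sketched as an ordered list of commands ready to hand to subprocess.run. This is a sketch only: the .exe names and the "custom" prefix follow the article's Windows setup, and the tesseract ... nobatch box.train step (which produces the custom.tr file consumed by mftraining, and which the article leaves implicit) is the standard Tesseract 3.0x training step.

```python
# Sketch of the full training pipeline from the article, as argument lists
# suitable for subprocess.run. Names follow the article's Windows setup.
def training_pipeline(prefix="custom"):
    tif = prefix + ".tif"
    return [
        # step 3: generate the initial box file
        ["tesseract.exe", tif, prefix, "batch.nochop", "makebox"],
        # (correct the box file in jTessBoxEditor before continuing)
        # standard 3.0x step, implicit in the article: produce the .tr feature file
        ["tesseract.exe", tif, prefix, "nobatch", "box.train"],
        # step 5: extract the character set
        ["unicharset_extractor.exe", prefix + ".box"],
        # step 6: feature training (font_properties must already exist)
        ["mftraining.exe", "-F", "font_properties", "-U", "unicharset", prefix + ".tr"],
        # step 7: clustering
        ["cntraining.exe", prefix + ".tr"],
        # step 8: combine the prefixed outputs into <prefix>.traineddata
        ["combine_tessdata.exe", prefix + "."],
    ]

for cmd in training_pipeline():
    print(" ".join(cmd))
```

On a machine with the Tesseract tools on PATH, each list could be passed to subprocess.run(cmd, check=True) in order, pausing after the first command for the manual box correction.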