Tesseract
https://code.google.com/p/tesseract-ocr/
The latest version at the time of writing is 3.02.
After downloading and extracting the Windows version, open a command prompt, change to the extracted directory, and run tesseract.exe from there.
Command format:
Usage:tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before any configfile.
Single options:
-v --version: version info
--list-langs: list available languages for tesseract engine
Example command:
F:\tesseract-ocr>tesseract.exe 2013-09-05_154628.jpg eng -l eng -psm 6
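When the same command has to be run on many images, it can help to assemble the argument list in a script. The following is a minimal Python sketch, not part of Tesseract itself: the function names build_tesseract_command and run_tesseract are made up for illustration, and it assumes tesseract.exe can be found from the current directory or the PATH. It uses the lowercase -l and -psm flags and places them after the image name and output base, as in the usage text.

```python
import subprocess


def build_tesseract_command(image, outputbase, lang=None, psm=None,
                            exe="tesseract.exe"):
    """Return the argument list for one Tesseract 3.02 command-line run.

    The helper name and its parameters are illustrative assumptions;
    only the flags themselves come from the usage text.
    """
    if psm is not None and not 0 <= psm <= 10:
        raise ValueError("pagesegmode must be between 0 and 10")
    cmd = [exe, image, outputbase]
    if lang is not None:
        cmd += ["-l", lang]        # recognition language, e.g. "eng"
    if psm is not None:
        cmd += ["-psm", str(psm)]  # page segmentation mode, 0 to 10
    return cmd


def run_tesseract(image, outputbase, **options):
    """Run Tesseract and raise CalledProcessError if it exits with an error."""
    subprocess.check_call(build_tesseract_command(image, outputbase, **options))
```

Calling run_tesseract("2013-09-05_154628.jpg", "eng", lang="eng", psm=6) would issue the same command as the example above.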
List of related commands:
Command | Function
ambiguous_words.exe |
classifier_tester.exe |
cntraining.exe |
combine_tessdata.exe | Combines the training files
dawg2wordlist.exe |
mftraining.exe |
shapeclustering.exe |
tesseract.exe | Recognition program
unicharset_extractor.exe |
wordlist2dawg.exe |
For the files that a trained data set must contain, see the source code:
tesseract-ocr\ccutil\tessdatamanager.h
Format requirements for the text configuration files used in training:
ASCII or UTF-8 encoding without BOM
Unix end-of-line marker ('\n')
The last character must be an end-of-line marker ('\n'). Some text editors show this as an empty line at the end of the file. If you omit it, you will get an error message containing "last_char == '\n':Error:Assert failed..."
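The three format requirements above are easy to check before training starts. The following is a minimal Python sketch, not a Tesseract tool: the function name check_training_text_file and the wording of the messages are assumptions made for illustration. It works on the raw bytes of a file, so nothing is silently converted while reading.

```python
import codecs


def check_training_text_file(data):
    """Return a list of format problems found in the raw bytes of a text file.

    An empty list means the bytes satisfy all three requirements.
    """
    problems = []
    if data.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
    try:
        data.decode("utf-8")  # plain ASCII is also valid UTF-8
    except UnicodeDecodeError:
        problems.append("file is not valid ASCII or UTF-8")
    if b"\r" in data:
        problems.append("file contains carriage returns; use Unix '\\n' line endings")
    if not data.endswith(b"\n"):
        problems.append("last character is not '\\n'")
    return problems
```

To use it, read the file in binary mode, for example open(path, "rb").read(), and pass the result to the function.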
Steps:
1. Create the training images
Some guidelines:
Make sure each character appears often enough: about 10 samples in general, at least 20 for frequently used characters, and 5 is acceptable for rare ones;
Do not group all the special characters together; arrange them as they would appear in real text;
Keep enough spacing between characters and between lines, because text that is set too tightly may cause training to fail. (This may have been fixed in versions after 3.0.)
Group the training data by font: all text in the same font goes into the same TIFF file (multi-page TIFF files are supported);
Training at several font sizes is unnecessary unless the text is very small (character height below about 15 px);
Never mix multiple fonts in the same image file.
(See the boxtiff sample files on the download page.)
Next, print and scan the training page (or use some electronic rendering method) to create an image of it. Multiple training files can be used, each of which may contain multiple pages. It is best to create a mix of fonts and styles, including italic and bold, but in separate files.
Generate a TIFF file
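The sample-count guideline for training text can be checked before the page is printed or rendered. The following is a minimal Python sketch under stated assumptions: the function name undersampled_characters is hypothetical, whitespace is ignored, and the threshold is left to the caller (for example 5 for rare characters, 10 in general, 20 for frequent ones).

```python
from collections import Counter


def undersampled_characters(text, min_count=10):
    """Return {character: count} for every non-space character that occurs
    fewer than min_count times in the training text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if n < min_count}
```

Characters that are reported can be given more samples by extending the training text until the result is empty for the chosen threshold.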
2. Make the box files
Command to generate a box file:
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
Example:
tesseract eng.timesitalic.exp0.tif eng.timesitalic.exp0 batch.nochop makebox
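Because the image name and the output base follow the same [lang].[fontname].exp[num] pattern, the box file command can be generated for each training image instead of typed by hand. The following is a minimal Python sketch: the function name makebox_command is hypothetical, and it assumes the tesseract executable is on the PATH when the returned list is passed to subprocess.check_call.

```python
def makebox_command(lang, fontname, num, exe="tesseract"):
    """Return the argument list that generates the box file for one
    training image named [lang].[fontname].exp[num].tif."""
    base = "{}.{}.exp{}".format(lang, fontname, num)
    # batch.nochop and makebox are the config names used in the command above.
    return [exe, base + ".tif", base, "batch.nochop", "makebox"]
```

For example, makebox_command("eng", "timesitalic", 0) produces the arguments of the example command above.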
3. Get a new character set
Reference documentation:
The API description in the doc directory of the extracted package
--end--
Tesseract 3.02 OCR text recognition survey record