Tesseract font training materials
1. Create a box file.
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox
2. Start Training
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train
Or
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train.stderr
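The [lang].[fontname].exp[num] naming pattern used in both steps can be sketched as a small dry-run script. It only prints the command lines and does not call tesseract. The values eng and timesitalic and the page numbers 0 to 2 are placeholders chosen for illustration.

```shell
#!/bin/sh
# Hypothetical values: substitute your own language code and font name.
lang=eng
font=timesitalic

# Build (but do not run) the box-file and training commands for exp0..exp2.
# In a real run the generated box files are checked by hand between the two steps.
cmds=""
for num in 0 1 2; do
  base="${lang}.${font}.exp${num}"
  cmds="${cmds}tesseract ${base}.tif ${base} batch.nochop makebox
tesseract ${base}.tif ${base} box.train
"
done
printf '%s' "$cmds"
```

Printing the commands first makes it easy to spot a misnamed image before any training output is written.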
set_unicharset_properties
New in 3.03. This tool adds extra properties to the unicharset, mostly sizes obtained from fonts.
training/set_unicharset_properties -U input_unicharset -O output_unicharset --script_dir=training/langdata
font_properties
Each line of the font_properties file has the form:
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
where <fontname> is a string naming the font (no spaces allowed), and <italic>, <bold>, <fixed>, <serif> and <fraktur> are simple 0 or 1 flags indicating whether the font has the named property.
Example:
timesitalic 1 0 0 1 0
---- 3.03 ships a default font_properties file that covers about 3000 fonts (not necessarily accurately): training/langdata/font_properties.
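A quick format check can catch a malformed font_properties line before training starts. The sketch below writes the example entry to a temporary file and counts lines that are not a font name followed by five 0/1 flags. The check itself is an illustration and not part of the Tesseract tools.

```shell
#!/bin/sh
# Write a one-entry font_properties file (the entry is the example given in the text).
props=$(mktemp)
printf '%s\n' 'timesitalic 1 0 0 1 0' > "$props"

# Count lines that are NOT "<fontname> followed by exactly five 0/1 flags".
bad=$(awk 'NF != 6 || ($2 $3 $4 $5 $6) !~ /^[01][01][01][01][01]$/ { n++ }
           END { print n + 0 }' "$props")
echo "malformed lines: $bad"
```

Because awk splits on whitespace, a font name containing a space shows up as a line with more than six fields and is counted as malformed.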
Clustering
shapeclustering creates a master shape table by shape clustering and writes it to a file named shapetable.
shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...
---- If you get an error message like "index >= 0 && index < size_used_: Error: assert failed in genericvector.h, line 512", add the shapetable file to your language data file.
mftraining -F font_properties -U unicharset -O lang.unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...
The -U file is the unicharset generated by unicharset_extractor or by the set_unicharset_properties step above, and lang.unicharset is the output unicharset that will be given to combine_tessdata. mftraining will output two data files: inttemp (the shape prototypes) and pffmtable (the number of expected features for each character).
Output the normproto data file
cntraining lang.fontname.exp0.tr lang.fontname.exp1.tr ...
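All three clustering commands take the same list of .tr files, so it can be collected once with a shell glob. The following dry run is a sketch under assumptions: the language code eng and the two empty stand-in .tr files are made up for illustration, and the commands are printed and not executed.

```shell
#!/bin/sh
lang=eng   # hypothetical language code
work=$(mktemp -d)

# Empty stand-ins for the .tr files that box.train would have produced.
touch "$work/$lang.timesitalic.exp0.tr" "$work/$lang.timesitalic.exp1.tr"

cd "$work" || exit 1
# The unquoted glob expands to every matching .tr file, in sorted order.
tr_files=$(echo "$lang".*.exp*.tr)

echo "shapeclustering -F font_properties -U unicharset $tr_files"
echo "mftraining -F font_properties -U unicharset -O $lang.unicharset $tr_files"
echo "cntraining $tr_files"
```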
Dictionary data (optional)
Name | Type | Description
word-dawg | dawg | A dawg made from dictionary words from the language.
freq-dawg | dawg | A dawg made from the most frequent words which would have gone into word-dawg.
punc-dawg | dawg | A dawg made from punctuation patterns found around words. The "word" part is replaced by a single space.
number-dawg | dawg | A dawg made from tokens which originally contained digits. Each digit is replaced by a space character.
fixed-length-dawgs | dawg | Several dawgs of different fixed lengths, useful for languages like Chinese.
bigram-dawg | dawg | A dawg of word bigrams where the words are separated by a space and each digit is replaced by a ?.
unambig-dawg | dawg | TODO: describe.
user-words | text | A list of extra words to add to the dictionary. Usually left empty to be added by users if they require it; see tesseract(1).
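The digit substitutions described for number-dawg and bigram-dawg can be illustrated with sed. This only demonstrates the replacement rule stated in the table. It is not how Tesseract builds these dawgs, and the sample tokens are made up.

```shell
#!/bin/sh
# number-dawg rule: each digit becomes a space character.
number_pattern=$(printf '%s' 'mp3player' | sed 's/[0-9]/ /g')

# bigram-dawg rule: two words separated by a space, each digit becomes "?".
bigram_pattern=$(printf '%s' 'page 12' | sed 's/[0-9]/?/g')

printf '[%s]\n[%s]\n' "$number_pattern" "$bigram_pattern"
```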
wordlist2dawg frequent_words_list lang.freq-dawg lang.unicharset
wordlist2dawg words_list lang.word-dawg lang.unicharset
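The text does not say how frequent_words_list is produced. One possible way, sketched here with standard Unix tools, is to count the words in a text corpus and keep the most frequent ones. The two-line corpus and the cutoff of 2 words are assumptions made to keep the example small.

```shell
#!/bin/sh
work=$(mktemp -d)
printf '%s\n' 'the cat and the dog' 'the dog and the bird' > "$work/corpus.txt"

# One word per line, count occurrences, order by descending count
# (ties broken alphabetically), and keep the top 2 words.
freq_list=$(tr -s ' ' '\n' < "$work/corpus.txt" \
  | sort | uniq -c | awk '{ print $1, $2 }' \
  | sort -k1,1nr -k2,2 | head -n 2 | awk '{ print $2 }')

printf '%s\n' "$freq_list" > "$work/frequent_words_list"
cat "$work/frequent_words_list"
```

The resulting file has one word per line and can be passed as the first argument to the wordlist2dawg command shown above.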
References:
Wiki
https://code.google.com/p/tesseract-ocr/wiki/FAQ
Introduction
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#font_properties_(new_in_3.01)
wordlist2dawg(1) manual page
http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/wordlist2dawg.1.html
combine_tessdata(1) manual page
http://tesseract-ocr.googlecode.com/svn-history/r800/trunk/doc/combine_tessdata.1.html