Tesseract font training materials

Source: Internet
Author: User

1. Create a box file

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox
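When there are several training images, the makebox command can be wrapped in a small loop. The following is only a sketch: the names make_boxes, run, DRY_RUN and OCR_LANG are hypothetical helpers introduced for illustration, and the default language "eng" is an assumption. By default it only prints the commands so they can be checked before anything is run.

```shell
#!/bin/sh
# run: print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# make_boxes: issue one makebox command per training image.
# OCR_LANG is the language used for the first recognition pass
# (assumed to default to "eng" here).
make_boxes() {
    for img in "$@"; do
        base="${img%.tif}"   # [lang].[fontname].exp[num] without the extension
        run tesseract "$img" "$base" -l "${OCR_LANG:-eng}" batch.nochop makebox
    done
}

# Usage (prints the commands; set DRY_RUN=0 to execute them):
#   make_boxes eng.arial.exp0.tif eng.arial.exp1.tif
```

The generated box files still need to be checked and corrected by hand before training.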

 

2. Start Training

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train

Or

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train.stderr
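Training with box.train expects a box file with the same base name next to each image. A quick check before starting can save a failed run. This is a minimal sketch; missing_boxes is a hypothetical helper name, and the file names in the usage line are examples only.

```shell
#!/bin/sh
# missing_boxes: print every training image that has no matching .box file.
missing_boxes() {
    for img in "$@"; do
        box="${img%.tif}.box"
        if [ ! -f "$box" ]; then
            echo "$img"
        fi
    done
}

# Usage:
#   missing_boxes eng.arial.exp0.tif eng.arial.exp1.tif
```

An empty result means every listed image has its box file.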
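
set_unicharset_properties

This tool adds extra properties to the characters in the unicharset (mostly sizes obtained from fonts), using the data under training/langdata.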

training/set_unicharset_properties -U input_unicharset -O output_unicharset --script_dir=training/langdata

 

font_properties

Font properties file

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

<fontname> is a string naming the font (no spaces allowed); <italic>, <bold>, <fixed>, <serif> and <fraktur> are each a simple 0 or 1 flag indicating whether the font has that property.

Example:

timesitalic 1 0 0 1 0
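Since every line must have exactly six fields with 0 or 1 in the last five, a hand-written font_properties file is easy to sanity-check. The sketch below uses awk; check_font_properties is a hypothetical helper name, and treating blank lines as acceptable is an assumption of this example.

```shell
#!/bin/sh
# check_font_properties: verify that each non-blank line has the form
#   <fontname> <italic> <bold> <fixed> <serif> <fraktur>
# with the five flags being 0 or 1. Exits non-zero if any line is bad.
check_font_properties() {
    awk '
        NF == 0 { next }   # skip blank lines (assumption: they are harmless)
        NF != 6 {
            printf "line %d: expected 6 fields, got %d\n", NR, NF
            bad = 1
            next
        }
        {
            for (i = 2; i <= 6; i++)
                if ($i != "0" && $i != "1") {
                    printf "line %d: field %d must be 0 or 1\n", NR, i
                    bad = 1
                }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage:
#   check_font_properties font_properties && echo "font_properties looks fine"
```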

 

---- In 3.03 there is a default font_properties file, covering about 3000 fonts (not necessarily accurate), at training/langdata/font_properties.

 

Clustering

shapeclustering creates a master shape table by shape clustering and writes it to a file named shapetable.

shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...

---- If you get an error message like "index >= 0 && index < size_used_: Error: assert failed in genericvector.h, line 512", add the shapetable file to your language data file.

 

 

mftraining -F font_properties -U unicharset -O lang.unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...

The -U file is the unicharset generated by unicharset_extractor (or by set_unicharset_properties above), and lang.unicharset is the output unicharset that will be given to combine_tessdata. mftraining outputs two other data files: inttemp (the shape prototypes) and pffmtable (the number of expected features for each character).

 

Output the normproto data file

cntraining lang.fontname.exp0.tr lang.fontname.exp1.tr ...
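Before combine_tessdata can pick up the training outputs, the files shapetable, normproto, inttemp and pffmtable are normally renamed with the language prefix. The sketch below shows that step; prefix_outputs is a hypothetical helper name, and the language code is whatever prefix was used for the training files.

```shell
#!/bin/sh
# prefix_outputs: rename the clustering outputs to <lang>.<name>.
# Reports any expected file that is missing and returns non-zero in that case.
prefix_outputs() {
    lang="$1"
    status=0
    for f in shapetable normproto inttemp pffmtable; do
        if [ -f "$f" ]; then
            mv "$f" "$lang.$f"
        else
            echo "missing: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (run in the directory holding the training outputs):
#   prefix_outputs lang && combine_tessdata lang.
```

If shapeclustering was not run, shapetable will be reported as missing; whether that matters depends on the Tesseract version in use.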

 

 

Dictionary data (optional)

 

Name                Type  Description
word-dawg           dawg  A dawg made from dictionary words from the language.
freq-dawg           dawg  A dawg made from the most frequent words which would have gone into word-dawg.
punc-dawg           dawg  A dawg made from punctuation patterns found around words. The "word" part is replaced by a single space.
number-dawg         dawg  A dawg made from tokens which originally contained digits. Each digit is replaced by a space character.
fixed-length-dawgs  dawg  Several dawgs of different fixed lengths, useful for languages like Chinese.
bigram-dawg         dawg  A dawg of word bigrams where the words are separated by a space and each digit is replaced by ?.
unambig-dawg        dawg  TODO: describe.
user-words          text  A list of extra words to add to the dictionary. Usually left empty to be added by users if they require it; see tesseract(1).

wordlist2dawg frequent_words_list lang.freq-dawg lang.unicharset
wordlist2dawg words_list lang.word-dawg lang.unicharset
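The word lists given to wordlist2dawg are plain text files with one word per line. A raw list often contains blank lines, Windows line endings or duplicates, so a small cleanup pass can help. This is a sketch only; clean_wordlist is a hypothetical helper name, and removing duplicates and sorting are conveniences assumed here, not requirements stated by the tool.

```shell
#!/bin/sh
# clean_wordlist: normalise a word list to one unique word per line.
#  - strips carriage returns
#  - drops blank or whitespace-only lines
#  - sorts bytewise and removes duplicates
clean_wordlist() {
    tr -d '\r' < "$1" | sed '/^[[:space:]]*$/d' | LC_ALL=C sort -u
}

# Usage:
#   clean_wordlist raw_words.txt > words_list
#   wordlist2dawg words_list lang.word-dawg lang.unicharset
```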

 

References:

Wiki FAQ
https://code.google.com/p/tesseract-ocr/wiki/FAQ

Introduction (TrainingTesseract3, font_properties section)
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#font_properties_(new_in_3.01)

wordlist2dawg(1) manual page
http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/wordlist2dawg.1.html

combine_tessdata(1) manual page
http://tesseract-ocr.googlecode.com/svn-history/r800/trunk/doc/combine_tessdata.1.html

 

 

 
