Tesseract font training materials

Last Update: 2014-07-03 Source: Internet

Author: User


1. Create a .box file

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox
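The command above has to be repeated for each page image. A minimal sketch of that loop follows, assuming the images sit in the current directory and follow the [lang].[fontname].exp[num].tif naming. LANG_CODE, FONT and the two helper functions are illustrative names, not part of Tesseract, and the -l option is left out here, so add it back if you are bootstrapping from an existing language.

```shell
#!/bin/sh
# Build the [lang].[fontname].exp[num] base name used throughout training.
# LANG_CODE, FONT and the helper names are illustrative assumptions.
LANG_CODE=eng
FONT=timesitalic

base_name() {
    # $1 = page number (the [num] part)
    printf '%s.%s.exp%s' "$LANG_CODE" "$FONT" "$1"
}

make_box() {
    # $1 = page number; runs the makebox command from step 1 for one image
    base=$(base_name "$1")
    tesseract "$base.tif" "$base" batch.nochop makebox
}

# Process every page image that exists in the current directory.
for tif in "$LANG_CODE.$FONT".exp*.tif; do
    [ -e "$tif" ] || continue      # the glob matched nothing
    num=${tif##*.exp}              # strip everything up to ".exp"
    num=${num%.tif}                # strip the extension, leaving [num]
    make_box "$num"
done
```

Each generated .box file still has to be checked and corrected by hand before training.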

2. Start Training

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train

or

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train.stderr
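Training reads the .box file that sits next to each image, so it can help to skip images whose box file is missing. The following is only a sketch under that assumption. The helper name tr_name is made up for illustration, and the loop uses the plain box.train config.

```shell
#!/bin/sh
# Train on each page image that has a matching .box file.
# The helper name tr_name is an illustrative assumption.
tr_name() {
    # Map an image name to the .tr file that box.train writes for it.
    printf '%s.tr' "${1%.tif}"
}

for tif in *.exp*.tif; do
    [ -e "$tif" ] || continue      # the glob matched nothing
    base=${tif%.tif}
    if [ ! -e "$base.box" ]; then
        echo "skipping $tif: no $base.box" >&2
        continue
    fi
    tesseract "$tif" "$base" box.train
done
```

The resulting .tr files are the inputs to the clustering commands later in this article.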

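set_unicharset_properties

I am not sure exactly what this step does. According to the training wiki, it is a tool new in 3.03 that adds extra properties to the unicharset, mostly sizes obtained from fonts.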

training/set_unicharset_properties -U input_unicharset -O output_unicharset --script_dir=training/langdata

font_properties

The font properties file. Each line has the form:

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

Here <fontname> is a string naming the font, and <italic>, <bold>, <fixed>, <serif> and <fraktur> are simple 0 or 1 flags indicating whether the font has that property.

Example:

timesitalic 1 0 0 1 0

---- In 3.03, a default font_properties file covering about 3000 fonts (not necessarily accurate) is provided at training/langdata/font_properties.
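A malformed line in font_properties is easy to miss, so a quick format check may be worthwhile. The sketch below writes a two-line sample file and verifies that every line has a name followed by five 0/1 flags. The second font entry and the function name check_font_properties are assumptions made for illustration.

```shell
#!/bin/sh
# Write a small font_properties file and check each line's format.
# The arialbold entry and the function name are made-up examples.
cat > font_properties <<'EOF'
timesitalic 1 0 0 1 0
arialbold 0 1 0 0 0
EOF

check_font_properties() {
    # Succeeds only if every line is: fontname plus five 0/1 flags.
    awk 'NF != 6 { bad = 1 }
         { for (i = 2; i <= NF; i++) if ($i != "0" && $i != "1") bad = 1 }
         END { exit bad }' "$1"
}

if check_font_properties font_properties; then
    echo "font_properties: OK"
else
    echo "font_properties: malformed line found" >&2
fi
```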

Clustering

shapeclustering builds a master shape table by clustering shapes and writes it to a file named shapetable.

shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...

---- If you get an error message like "index >= 0 && index < size_used_: Error: assert failed in genericvector.h, line 512", add the shapetable file to your language data file.

mftraining -F font_properties -U unicharset -O lang.unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...

The -U file is the unicharset generated by unicharset_extractor (or the output of set_unicharset_properties above), and lang.unicharset is the output unicharset that will be given to combine_tessdata. mftraining also outputs two other data files: inttemp (the shape prototypes) and pffmtable (the number of expected features for each character).

Output the normproto data file

cntraining lang.fontname.exp0.tr lang.fontname.exp1.tr ...
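The shapeclustering, mftraining and cntraining commands all take the same list of .tr files, so they can be chained. The following sketch assumes the .tr files, font_properties and unicharset are in the current directory. LANG_CODE and the three function names are illustrative, and the list of expected outputs is simply the set of files named in the text above.

```shell
#!/bin/sh
# Run the clustering steps over every .tr file for one language.
# LANG_CODE and the helper function names are illustrative assumptions.
LANG_CODE=eng

expected_outputs() {
    # Files the text says these steps produce.
    printf '%s\n' shapetable inttemp pffmtable normproto "$LANG_CODE.unicharset"
}

run_clustering() {
    # "$@" = the .tr files produced by the box.train step
    shapeclustering -F font_properties -U unicharset "$@" &&
    mftraining -F font_properties -U unicharset -O "$LANG_CODE.unicharset" "$@" &&
    cntraining "$@"
}

report_missing() {
    # Print the name of each expected output that is absent.
    expected_outputs | while read -r f; do
        [ -e "$f" ] || echo "missing: $f"
    done
}

set -- "$LANG_CODE".*.exp*.tr
if [ -e "$1" ] && command -v mftraining >/dev/null 2>&1; then
    run_clustering "$@" && report_missing
else
    echo "no .tr files or training tools found; nothing to do"
fi
```

If report_missing prints nothing after a run, all five files were written.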

Dictionary data (optional)

Name	Type	Description
word-dawg	dawg	A dawg made from dictionary words from the language.
freq-dawg	dawg	A dawg made from the most frequent words which would otherwise have gone into word-dawg.
punc-dawg	dawg	A dawg made from punctuation patterns found around words. The "word" part is replaced by a single space.
number-dawg	dawg	A dawg made from tokens which originally contained digits. Each digit is replaced by a space character.
fixed-length-dawgs	dawg	Several dawgs of different fixed lengths; useful for languages like Chinese.
bigram-dawg	dawg	A dawg of word bigrams where the words are separated by a space and each digit is replaced by ?.
unambig-dawg	dawg	TODO: describe.
user-words	text	A list of extra words to add to the dictionary. Usually left empty to be added by users if they require it; see tesseract(1).

wordlist2dawg frequent_words_list lang.freq-dawg lang.unicharset
wordlist2dawg words_list lang.word-dawg lang.unicharset
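The two input lists are plain text with one word per line. One possible way to derive them from a text corpus is sketched below. The file corpus.txt, its sample contents and the "at least twice" cutoff are assumptions chosen for illustration. The wordlist2dawg calls only run if the tool and lang.unicharset are present.

```shell
#!/bin/sh
# Derive words_list and frequent_words_list from a plain-text corpus.
# corpus.txt, the sample text, and the cutoff of 2 are assumptions.
cat > corpus.txt <<'EOF'
the cat sat on the mat
the dog sat on the log
EOF

# One word per line, counted, most frequent first.
tr -s ' ' '\n' < corpus.txt | sed '/^$/d' | sort | uniq -c | sort -rn > word_counts

# Every distinct word, alphabetically.
awk '{ print $2 }' word_counts | sort > words_list

# Only the words that occur at least twice.
awk '$1 >= 2 { print $2 }' word_counts > frequent_words_list

if command -v wordlist2dawg >/dev/null 2>&1 && [ -e lang.unicharset ]; then
    wordlist2dawg frequent_words_list lang.freq-dawg lang.unicharset
    wordlist2dawg words_list lang.word-dawg lang.unicharset
fi
```

A real corpus would need a larger cutoff and some cleanup of punctuation before counting.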

References:

Wiki FAQ

https://code.google.com/p/tesseract-ocr/wiki/FAQ

Training introduction

https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#font_properties_(new_in_3.01)

wordlist2dawg(1) manual page

http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/wordlist2dawg.1.html

combine_tessdata(1) manual page

http://tesseract-ocr.googlecode.com/svn-history/r800/trunk/doc/combine_tessdata.1.html
