Tesseract 3.02 OCR text recognition survey record

Last Update: 2016-03-19 Source: Internet

Author: User


Installation and usage:

Tesseract

https://code.google.com/p/tesseract-ocr/

At the time of writing, the latest version is 3.02.

After downloading the Windows version, open a command prompt, change to the extracted directory, and run tesseract.exe from there.

Command format:

Usage: tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.

-l lang and/or -psm pagesegmode must occur before any configfile.

Single options:
-v --version: version info
--list-langs: list available languages for the tesseract engine

Example command:

F:\tesseract-ocr>tesseract.exe 2013-09-05_154628.jpg eng -l eng -psm 6
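The command format above can also be assembled programmatically. The following is a minimal Python sketch, not part of Tesseract itself: the helper name build_ocr_command and its parameters are hypothetical, and it only builds the argument list from the usage text (image, output base, then -l and -psm) without running anything.

```python
# Page segmentation modes, as listed in the Tesseract 3.02 usage text.
PSM_MODES = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD, or OCR",
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Single column of text of variable sizes",
    5: "Single uniform block of vertically aligned text",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    9: "Single word in a circle",
    10: "Single character",
}


def build_ocr_command(image, outputbase, lang=None, psm=None, exe="tesseract.exe"):
    """Return the argument list for one recognition run (hypothetical helper).

    The options are placed right after outputbase, so they come before
    any config file a caller might append afterwards.
    """
    if psm is not None and psm not in PSM_MODES:
        raise ValueError("pagesegmode must be an integer from 0 to 10")
    cmd = [exe, image, outputbase]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += ["-psm", str(psm)]
    return cmd


if __name__ == "__main__":
    # Rebuilds the example command from the article.
    print(" ".join(build_ocr_command("2013-09-05_154628.jpg", "eng", lang="eng", psm=6)))
```

If Tesseract is installed and on the path, the returned list can be passed to subprocess.run(cmd, check=True) from the standard library.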

List of related commands:

Function	Command
	ambiguous_words.exe
	classifier_tester.exe
	cntraining.exe
Combine training files	combine_tessdata.exe
	dawg2wordlist.exe
	mftraining.exe
	shapeclustering.exe
Recognition program	tesseract.exe
	unicharset_extractor.exe
	wordlist2dawg.exe

Font Training

Reference code for the required font files:

tesseract-ocr\ccutil\tessdatamanager.h

Format requirements for font-related configuration files:

ASCII or UTF-8 encoding without BOM

Unix end-of-line marker ('\n')

The last character must be an end-of-line marker ('\n'). Some text editors show this as an empty line at the end of the file. If you omit it, you will get an error message containing "last_char == '\n':Error:Assert failed..."
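The three format requirements above are easy to check before training starts. The sketch below is an illustrative Python check, not a Tesseract tool; the function name check_training_config is hypothetical, and it inspects the raw bytes of a configuration file for a BOM, invalid encoding, carriage returns, and a missing final '\n'.

```python
import codecs


def check_training_config(data):
    """Return a list of format problems found in the raw bytes of a config file."""
    problems = []
    # Requirement 1: ASCII or UTF-8 encoding without BOM.
    if data.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
    try:
        data.decode("utf-8")  # plain ASCII is also valid UTF-8
    except UnicodeDecodeError:
        problems.append("file is not valid ASCII/UTF-8")
    # Requirement 2: Unix end-of-line markers only.
    if b"\r" in data:
        problems.append("file contains carriage returns (non-Unix line endings)")
    # Requirement 3: the last character must be '\n'.
    if not data.endswith(b"\n"):
        problems.append("last character is not '\\n'")
    return problems


if __name__ == "__main__":
    # A Windows-style file without a final newline fails two checks.
    for problem in check_training_config(b"line one\r\nline two"):
        print(problem)
```

To check a file on disk, read it in binary mode (open(path, "rb").read()) so that Python does not translate the line endings before the check runs.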

Steps:

1. Create training images

Principles:

Each character should appear about 10 times in general; frequently used characters should appear at least 20 times, and rarely used characters at least 5 times.

Do not group special characters together; arrange them in combinations close to actual usage.

Keep sufficient spacing between characters and between lines; text that is packed too tightly may cause training to fail (this may have been fixed in versions after 3.0).

Group the training data by font: text in the same font goes in the same TIFF file (multi-page TIFFs are supported).

Training on different sizes is unnecessary unless the font is very small (height less than 15px).

Never mix multiple fonts in the same image file.

(See the boxtiff sample files on the download page.)

Next, print and scan the training page (or use some electronic rendering method) to create an image of it. Several training files can be used, each of which may contain multiple pages. It is best to create a mix of fonts and styles (in separate files), including italic and bold.

Generate a TIFF file
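The sample-count principle from this step can be verified on the training text before it is rendered to an image. The following is a small Python sketch under that assumption; the helper name undersampled_characters and the default threshold of 10 are illustrative choices taken from the guideline, not anything Tesseract provides.

```python
from collections import Counter


def undersampled_characters(text, minimum=10):
    """Map each non-whitespace character that occurs fewer than `minimum` times to its count."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if n < minimum}


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    # Print every character that needs more samples, in sorted order.
    for ch, n in sorted(undersampled_characters(sample).items()):
        print(ch, n)
```

Calling it again with minimum=20 for common characters, or minimum=5 for rare ones, covers the other two thresholds in the guideline.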

2. Create box files

Command to generate a box file:

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

Example:

tesseract eng.timesitalic.exp0.tif eng.timesitalic.exp0 batch.nochop makebox
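Because the image name and the output base must follow the same [lang].[fontname].exp[num] pattern, it can help to generate both from one place. This is a minimal Python sketch with hypothetical helper names (box_basename, build_makebox_command); it only builds the argument list shown above and does not run Tesseract.

```python
def box_basename(lang, fontname, num):
    """Return the shared base name, e.g. eng.timesitalic.exp0."""
    return "{}.{}.exp{}".format(lang, fontname, num)


def build_makebox_command(lang, fontname, num, exe="tesseract"):
    """Return the argument list that generates a box file for one training image."""
    base = box_basename(lang, fontname, num)
    return [exe, base + ".tif", base, "batch.nochop", "makebox"]


if __name__ == "__main__":
    # Rebuilds the example command from this step.
    print(" ".join(build_makebox_command("eng", "timesitalic", 0)))
```

Looping over several font names with this helper yields one command per training image, which matches the rule of one font per file.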

3. Get a new character set

Other

Reference Documentation:

The API description is in the doc directory of the extracted package.

--end--

