Tesseract 3.02 OCR text recognition survey record

Source: Internet
Author: User

    • Installation and usage

Tesseract

https://code.google.com/p/tesseract-ocr/

Currently the latest version is 3.02

After downloading and extracting the Windows version, open a command prompt, change to the extracted directory, and run the program from there.

Command format:

Usage: tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

pagesegmode values are:
    0 = Orientation and script detection (OSD) only.
    1 = Automatic page segmentation with OSD.
    2 = Automatic page segmentation, but no OSD, or OCR.
    3 = Fully automatic page segmentation, but no OSD. (Default)
    4 = Assume a single column of text of variable sizes.
    5 = Assume a single uniform block of vertically aligned text.
    6 = Assume a single uniform block of text.
    7 = Treat the image as a single text line.
    8 = Treat the image as a single word.
    9 = Treat the image as a single word in a circle.
    10 = Treat the image as a single character.

-l lang and/or -psm pagesegmode must occur before any configfile.

Single options:
    -v --version: version info
    --list-langs: list available languages for the tesseract engine

Examples of commands:

F:\tesseract-ocr>tesseract.exe 2013-09-05_154628.jpg eng -l eng -psm 6
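
When the same command has to be run over many images, it can help to assemble the argument list in a script. The following is a minimal sketch in Python, assuming the lowercase `-l` and `-psm` flags of tesseract 3.02; the function name `build_ocr_command` and the `exe` parameter are hypothetical helpers for illustration, not part of Tesseract.

```python
# Page segmentation modes accepted by -psm, as listed in the usage text.
PSM_MODES = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD, or OCR",
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Single column of text of variable sizes",
    5: "Single uniform block of vertically aligned text",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    9: "Single word in a circle",
    10: "Single character",
}


def build_ocr_command(image, outputbase, lang=None, psm=None, exe="tesseract.exe"):
    """Return the argument list for one recognition run.

    The options are placed after outputbase and before any config file,
    as the usage text requires.
    """
    if psm is not None and psm not in PSM_MODES:
        raise ValueError("unknown page segmentation mode: %r" % (psm,))
    cmd = [exe, image, outputbase]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += ["-psm", str(psm)]
    return cmd


if __name__ == "__main__":
    # Rebuilds the example command from this section.
    print(" ".join(build_ocr_command("2013-09-05_154628.jpg", "eng", "eng", 6)))
```

The returned list can be passed to `subprocess.call` from the extracted directory; the recognized text is then written to a file named after `outputbase`.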

List of related commands:

Command                     Function
ambiguous_words.exe
classifier_tester.exe
cntraining.exe
combine_tessdata.exe        Combines (integrates) training files
dawg2wordlist.exe
mftraining.exe
shapeclustering.exe
tesseract.exe               Recognition program
unicharset_extractor.exe
wordlist2dawg.exe

    • Font Training

Reference code for the required font files:

tesseract-ocr\ccutil\tessdatamanager.h

Format requirements for font-related configuration files:

ASCII or UTF-8 encoding without BOM

Unix end-of-line marker ('\n')

The last character must be an end-of-line marker ('\n'). Some text editors show this as an empty line at the end of the file. If you omit it, you will get an error message containing "last_char == '\n':Error:Assert failed..."
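
The three format requirements above can be checked before training. Below is a minimal Python sketch; the function name `check_config_bytes` and the wording of the messages are hypothetical and only illustrate the checks, they are not part of Tesseract.

```python
import codecs


def check_config_bytes(data):
    """Return a list of format problems found in the raw bytes of a file."""
    problems = []
    # Requirement 1: ASCII or UTF-8 encoding without BOM.
    if data.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid ASCII/UTF-8")
    # Requirement 2: Unix end-of-line markers only.
    if b"\r" in data:
        problems.append("file contains carriage returns (non-Unix line endings)")
    # Requirement 3: the last character must be '\n'.
    if not data.endswith(b"\n"):
        problems.append("last character is not an end-of-line marker")
    return problems


if __name__ == "__main__":
    import sys
    # Usage: python check_config.py file1 [file2 ...]
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            for problem in check_config_bytes(handle.read()):
                print("%s: %s" % (path, problem))
```

Reading the file in binary mode matters here, because text mode would silently translate line endings and hide the problem.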

Steps:

1. Create a training image

Several principles:

Make sure each character appears about 10 times in general, commonly used characters at least 20 times, and rarely used characters at least 5 times;

Do not group the special characters together; mix them into the text in combinations close to actual use;

Keep enough spacing between characters and between lines; text that is too tightly spaced may cause training to fail. (This may be fixed in versions after 3.0.)

Group the training data by font: all text in the same font goes into the same TIFF file (multi-page TIFF files are supported);

There is no need to train on different sizes unless the text is very small (height less than 15px);

Never mix multiple fonts in the same image file.

(You can refer to the box/TIFF sample files on the download page.)

Next, print and scan (or use some electronic rendering method) to create an image of your training page. Multiple training files can be used (each may contain multiple pages). It is best to create a mix of fonts and styles (but in separate files), including italic and bold.

Save the result as a TIFF file.
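
The sample-count guideline from the principles above (about 10 occurrences in general, 20 for common characters, 5 for rare ones) can be checked on the training text before it is rendered. This is a minimal Python sketch; the function name `undersampled_characters` and the way common and rare characters are passed in are assumptions made for illustration.

```python
from collections import Counter


def undersampled_characters(text, common="", rare=""):
    """Return {char: (count, required)} for characters with too few samples.

    Thresholds follow the guideline in this section: 20 for common
    characters, 5 for rare ones, 10 for everything else. Whitespace is
    ignored. Characters listed in `common` or `rare` that never occur in
    the text are reported with a count of 0.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    result = {}
    for ch in set(counts) | set(common) | set(rare):
        if ch in common:
            required = 20
        elif ch in rare:
            required = 5
        else:
            required = 10
        if counts[ch] < required:
            result[ch] = (counts[ch], required)
    return result


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog " * 5
    for ch, (have, need) in sorted(undersampled_characters(sample, common="e").items()):
        print("%r: %d of %d" % (ch, have, need))
```

Which characters count as common or rare depends on the language and the documents to be recognized, so those two sets have to be chosen by hand.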

2. Create the box files

Command to generate a box file:

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

Example:

tesseract eng.timesitalic.exp0.tif eng.timesitalic.exp0 batch.nochop makebox
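
With one TIFF file per font, the box-file command has to be repeated for each file. The sketch below builds the command from the [lang].[fontname].exp[num] naming pattern shown above; the helper names `training_basename` and `makebox_command` are hypothetical, and the executable is assumed to be reachable as `tesseract`.

```python
def training_basename(lang, fontname, num):
    """Build the [lang].[fontname].exp[num] base name used for training files."""
    return "%s.%s.exp%d" % (lang, fontname, num)


def makebox_command(lang, fontname, num, exe="tesseract"):
    """Return the argument list that generates a box file for one TIFF."""
    base = training_basename(lang, fontname, num)
    # Input image, output base, then the two config names from the command above.
    return [exe, base + ".tif", base, "batch.nochop", "makebox"]


if __name__ == "__main__":
    # Rebuilds the example command from this step.
    print(" ".join(makebox_command("eng", "timesitalic", 0)))
```

The list can be handed to `subprocess.call` in the directory that holds the TIFF files.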

3. Get a new character set

    • Other

Reference Documentation:

The API description is in the doc directory of the extracted package.

--end--
