Analysis Notes on the Tesseract Release Notes

ReleaseNotes: Release Notes. Updated by [email protected]

Introduction

This page keeps the most up-to-date release notes.

Tesseract Release Notes Feb 4 - v3.03 (rc1)
"The latest version has to be compiled from source, which is where the competition is."
    • Added new training tool text2image to generate box/tif file pairs from text and TrueType fonts.
    • Added support for PDF output with searchable text.
    • Removed entire IMAGE class and all code in image directory.
    • Tesseract executable: support for output to stdout; limited support for one-page images from stdin (especially on Windows).
    • Added Renderer to API to allow document-level processing and output of document formats, like hOCR and PDF.
    • Major refactor of word-level recognition, beam search, eliminating dead code.
    • Refactored classifier to make it easier to add new ones.
    • Generalized feature extractor to allow feature extraction from greyscale.
    • Improved sub/superscript treatment.
    • Improved baseline fit.
    • Added set_unicharset_properties to training tools.
    • Many bug fixes.
    • More training source data included.
Tesseract Release Notes Oct 2012 - v3.02.02
  • Moved ResultIterator/PageIterator to ccmain.
  • Added right-to-left/bidi capability in the output iterators for Hebrew/Arabic.
  • Added paragraph detection in layout analysis/post OCR.
  • Fixed inconsistent xheight during training and over-chopping.
  • Added simultaneous multi-language capability.
  • Refactored top-level word recognition module.
  • Added experimental equation detector.
  • Improved handling of resolution from input images.
  • Blamer module added for error analysis.
  • Cleaned up externally used namespace by removing includes from baseapi.h.
  • Removed dead memory management code.
  • Tidied up constraints on control parameters.
  • Added support for ShapeTable in classifier and training.
  • Refactored class pruner.
  • Fixed training leaks and randomness.
  • Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tab stop finding.
  • Improved line detection and removal.
  • Added fixed pitch chopper for CJK.
  • Added UNICHARSET to WERD_CHOICE to make multi-language handling easier.
  • Fixed problems with internally scaled images.
  • Added page and bbox to string in tr files to identify source of training data better.
  • Fixes to Hindi Shiroreka splitter.
  • Added word bigram correction.
  • Reduced stack memory consumption and eliminated some ugly typedefs.
  • Added new uniform classifier API.
  • Added new training error counter.
  • Fixed endian bug in dawg reader.
  • C API (Thanks to Tobias Müller)
  • New solution for VS. (Thanks to Tom Powers)
  • Many other fixes, including the way in which the chopper finds chops and messes with the outline when it does so.
Tesseract Release Notes Oct 2011 - v3.01
  • Thread-safety! Moved all critical globals and statics to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads), with the minor exception that some control parameters are still global and affect all threads.
  • Added Cube, a new recognizer for Arabic. Cube can also be used in combination with normal Tesseract for other languages, with an improvement in accuracy at the cost of (much) lower speed. There is no training module for Cube yet.
  • OcrEngineMode in Init replaces AccuracyVSpeed to control Cube.
  • Greatly improved segmentation search with consequent accuracy and speed improvements, especially for Chinese.
  • Added PageIterator and ResultIterator as cleaner ways to get the full results out of Tesseract, which are not currently provided by any of the TessBaseAPI::Get* methods. All other methods, such as the ETEXT_STRUCT in particular, are deprecated and will be deleted in the future.
  • ApplyBoxes totally rewritten to make training easier. It can now cope with touching/overlapping training characters, and a new boxfile format allows word boxes instead of character boxes, but to use that you have to have already bootstrapped the language with character boxes. "Cyclic" dependency on traineddata.
  • Auto orientation and script detection added to page layout analysis.
  • Deleted lots of dead code.
  • Fixxht module replaced with scalable data-driven module.
  • Output font characteristics accuracy improved.
  • Removed the double conversion at each classification.
  • Upgraded oldest structs to be classes and deprecated PBLOB.
  • Removed non-deterministic baseline fit.
  • Added fixed length dawgs for Chinese.
  • Handling of vertical text improved.
  • Handling of leader dots improved.
  • Table detection greatly improved.
  • Fixed a couple of memory leaks.
  • Fixed font labels on output text. (Not perfect, but a lot better than before.)
  • Cleanup and more bug fixes
  • Special treatments for Hindi.
  • Support for build in VS2010 with Microsoft Windows SDK for Windows 7 (thanks to Michael Lutz)
Tesseract Release Notes Sep 2010 - v3.00
    • Preparations for thread safety:
      • Changed TessBaseAPI methods to be non-static.
      • Created a class hierarchy for the directories to hold instance data, and began moving code into the classes.
      • Moved thresholding code to a separate class.
    • Added major new page layout analysis module.
    • Added hOCR output.
    • Added Leptonica as main image I/O and handling. Currently optional, but in future releases linking with Leptonica will be mandatory.
    • Ambiguity table rewritten to allow definite replacements in place of fix_quotes.
    • Added TessdataManager to combine data files into a single file.
    • Some dead code deleted.
    • VC++6 no longer supported. It can't cope with the use of templates.
    • Many more languages added.
    • Doxygenation of most of the function header comments.
Tesseract Release Notes June 2009 - v2.04
    • Integrated patches for portability and to remove some of the "access" macros.
    • Removed dependence on lua from the viewer, making it a lot faster. Also the viewer now compiles and works (on Linux). Also works on Windows via a pre-built ScrollView.jar.
    • Fixed the following issues: 1, 63, 67, 71, 76, 79, 81, 82, 84, 106, 108, 111, 112, 128, 129, 130, 133, 135, 142, 143, 145, 146, 147, 153, 154, 160, 165, 169, 170, 175, 177, 187, 192, 195, 199, 201, 205, 209.
    • The last version to support VC++6!
    • This may also be the last version to compile without Leptonica!
    • Windows version now outputs to stderr by default, fixing a lot of the problems with lack of visible meaningful error messages.
Tesseract Release Notes April 2008 - v2.03

2.02 was unrunnable, due to a last-minute 'simple' change. 2.03 fixes the problem and also adds an include check for Leptonica to make it more usable.

Tesseract Release Notes April 2008 - v2.02
    • Improvements to clustering, training and classifier.
    • Major internationalization improvements for large-character-set languages, e.g. Kannada.
    • Removed some compiler warnings.
    • Added multi-page TIFF support for training and running.
    • Updated graphics output to the new Java-based viewer.
    • Added ability to save n-best lists.
    • Added Leptonica support for more file types.
    • Improved Init/End to make them safe.
    • Reduced memory use of dictionaries.
    • Added some new APIs to TessBaseAPI.
    • Fixed namespace collisions with JPEG library (INT32).
    • Portability fixes for Windows for new code.
    • Updates to autoconf system for new code.
Tesseract Release Notes 2007 - v2.01

(See also release notes for 2.00 below for usage information)

No major functionality change. Just a bunch of bug fixes.

    • Fixed UTF8 input problems with box file reader.
    • Fixed various infinite loops and crashes in dawg code.
    • Removed include of config_auto.h from host.h.
    • Added automatic wctype encoding to unicharset_extractor.
    • Fixed dawg table too full error.
    • Removed svn files from tarball.
    • Added new functions to tessdll.
    • Increased maximum UTF8 string in a classification result to 8.
    • Added new functionality to TessBaseAPI for OCRopus.

No new data files for the original 6 languages. Use the files from v2.00. There are new data files for German Fraktur (deu-f) and Brazilian Portuguese (por).

STOP PRESS: There is a minor bug in unicharset_extractor. Since this is only applicable to training, the main tarball is fine unless you need to run training, in which case, overwrite your unicharset_extractor.cpp and unicharset_extractor.exe with the ones in tesseract-2.01.patch1.tar.gz.

Tesseract Release Notes Jul 2007 - v2.00

(See also release notes for 1.04 below for additional usage information)

First release of the international version. This version recognizes the following languages:

    • English - eng
    • French - fra
    • Italian - ita
    • German - deu
    • Spanish - spa
    • Dutch - nld [Chinese was not among the first supported languages; it remains to be seen who adds it]
The language codes follow ISO 639-2. The default language is English. To recognize another language:

-l langcode
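
As a minimal sketch of how the language code is passed, the Python snippet below builds (but does not run) a command line. It assumes the usual `tesseract imagename outputbase -l langcode` invocation form; the helper name `build_command`, the file names, and the choice of "eng" as the helper's default argument are illustrative assumptions, and the language table simply mirrors the six codes listed above.

```python
# The six languages listed for v2.00, keyed by their ISO 639-2 codes.
LANGUAGES = {
    "eng": "English",
    "fra": "French",
    "ita": "Italian",
    "deu": "German",
    "spa": "Spanish",
    "nld": "Dutch",
}


def build_command(image, output_base, lang="eng"):
    """Return an argv list for the tesseract executable (hypothetical helper).

    The language is always passed explicitly with -l so the command does
    not depend on whatever default the installed version uses.
    """
    if lang not in LANGUAGES:
        raise ValueError("unsupported language code: %s" % lang)
    return ["tesseract", image, output_base, "-l", lang]


# Example file names are placeholders.
print(build_command("page.tif", "page", lang="deu"))
```

The returned list can be handed to `subprocess.run` on a machine where the tesseract executable is installed.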

To train on a new language, see TrainingTesseract. More languages will be appearing over time.

List of changes in this release:

    • Converted internal character handling to UTF8.
    • Trained with 6 languages.
    • Added unicharset_extractor, wordlist2dawg.
    • Added boxfile creation mode.
    • Added UNLV regression test capability.
    • Fixed problems with copyright and registered symbols.
    • Fixed extern "C" declarations problem.
    • Made some improvements to consistency of accuracy across platforms.
    • Added VC++ Express support.

xx.00 Version Warning

Tesseract 2.00 has undergone more compatibility testing than any previous version. There have even been fixes to make the accuracy more consistent across platforms. Having said that, there have been many changes to the code, and portability may have been broken, so the 64-bit and Mac platforms may not work or even build as well as before.

Tesseract Release Notes May 2007 - v1.04

Windows users only "original version only supports Windows"

Added a DLL interface for Windows. Thanks to Glen at Jetsoft for contributing. To use the DLL, include tessdll.h, import tessdll.lib and put tessdll.dll somewhere where the system can find it. There is also a small dlltest program to test the DLL. Run with:

dlltest phototest.tif phototest.txt

It will output the text from phototest.tif with bounding box information.

New for Windows

The distribution now includes tesseract.exe and tessdll.dll, which might work out of the box! There are no guarantees, as you need VC++6 versions of MFC and the CRT (at least) for it to work. (Batteries not included, and certainly no InstallShield.)

Important Note for anyone building with make, i.e. anyone except DevStudio users

This release includes new standardization for the data directory. To enable Tesseract to find its data files, you must either:

./configure
make
make install

to move the data files to the standard place, or:

export TESSDATA_PREFIX="directory in which your tessdata resides/"

(or equivalent) in your .profile or whatever, or use setenv to set the environment variable. Note that the directory must end in a /.

"Data has always been very important; indeed, it is the essence of OCR."
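
The trailing-slash requirement is easy to get wrong in scripts. The following is a minimal Python sketch that assumes only what the note above states (the variable is named TESSDATA_PREFIX and its value must end in a /); the helper name `tessdata_env` and the example path are hypothetical.

```python
import os


def tessdata_env(directory, base_env=None):
    """Return a copy of the environment with TESSDATA_PREFIX set.

    Hypothetical helper: it appends the trailing "/" that the release
    notes say the directory must end in, and leaves the original
    environment mapping untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    if not directory.endswith("/"):
        directory += "/"
    env["TESSDATA_PREFIX"] = directory
    return env


# Example path only; substitute the directory in which your tessdata resides.
env = tessdata_env("/usr/local/share")
print(env["TESSDATA_PREFIX"])
```

The returned mapping can be passed as the `env=` argument of `subprocess.run` when launching the executable, which avoids changing the environment of the calling process.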

Having Tesseract and tessdata in the same directory does not work any more.

All Users

Fixed a bunch of name collisions, mostly with STL. Made some preliminary changes for Unicode compatibility. Includes a new data file (unicharset) and renames the other data files with an "eng." prefix to support different languages. There are also several other minor bug fixes and portability improvements for 64-bit, the latest Visual Studio compiler, etc. Thanks to all who have contributed these fixes.

Note: this is likely to be the last English-only release! Apologies in advance to non-Windows users for bloating the distribution with Windows executables. This will probably get fixed in the next release with the multi-language capability, since that will also bloat the distribution.
