Tesseract OCR 3.02 Learning Notes, Part One

Source: Internet
Author: User
Tags: tesseract ocr

Optical character recognition (OCR) is the process of scanning printed material and then analyzing the resulting image files to extract the text and layout information. OCR is a specialized technology, used mostly by people in the printing and publishing industries to convert paper documents into electronic data quickly. For Chinese OCR, the leading domestic vendors are Tsinghua Wen Tong, Han Wang, and Shang Shu; their products differ from one another, and none of them is cheap. OCR development started earlier abroad: large companies such as IBM, Microsoft, and HP may not sell standalone OCR products, but their R&D teams have mastered the core technology and built OCR functionality into their own software. Programmers like us rarely need the advanced features; being able to integrate basic OCR functionality during development is usually enough. Over the past two days I found a lot of free OCR software and class libraries and sorted through them. Today I will start with Tesseract.

1. Tesseract Overview

The Tesseract OCR engine was first developed by HP Labs in 1985, and by 1995 it had become one of the three most accurate recognition engines in the OCR industry. However, HP soon decided to abandon the OCR business, and Tesseract was left to gather dust.

Several years later, HP realized that rather than leaving Tesseract on the shelf, it would be better to contribute it to the open-source community and give it a new life. In 2005, Tesseract was taken over by the Nevada Institute of Information Technology, which turned to Google to improve it, fix bugs, and optimize it.

Tesseract is now published as an open-source project on Google Project. Its latest version, 3.0, already supports Chinese OCR and provides a command-line tool. This time we will test Tesseract 3.0. Because the command line is not very friendly to end users, I wrapped it in a simple WPF front end so that Chinese OCR is easy to run.

1.1. First, go to the Tesseract project home page and download the command-line tool, the source code, and the Chinese language pack.

1.2. Unzip the command-line tool. (The files 1.jpg and 1.txt used later are not included in the package.)

1.3. For Chinese OCR, copy the Simplified Chinese language pack to the "tessdata" directory.
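Step 1.3 can also be scripted. The following is a minimal Python sketch, not part of the original walkthrough. The function name install_language_pack is hypothetical, and the file name chi_sim.traineddata mentioned afterwards is an assumption about how the downloaded Simplified Chinese pack is named.

```python
import shutil
from pathlib import Path


def install_language_pack(pack_file, tesseract_dir):
    """Copy a language data file into <tesseract_dir>/tessdata.

    Hypothetical helper: both arguments are paths chosen by the caller.
    Returns the path of the copied file.
    """
    pack = Path(pack_file)
    if not pack.is_file():
        raise FileNotFoundError(f"language pack not found: {pack}")

    tessdata = Path(tesseract_dir) / "tessdata"
    # Create the directory if the unzipped tool did not ship with one.
    tessdata.mkdir(parents=True, exist_ok=True)

    # copy2 copies into the directory and keeps the original file name.
    return Path(shutil.copy2(str(pack), str(tessdata)))
```

For example, install_language_pack("chi_sim.traineddata", r"C:\tesseract") would place the file under C:\tesseract\tessdata. Both paths here are placeholders.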

1.4. In a DOS prompt, switch to the Tesseract command-line directory and look at the tesseract.exe command format:

imagename is the image to run OCR on; outputbase is the output file produced by OCR, which is a text file (.txt) by default; lang is the language pack to use; configfile is the configuration file.
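To make the four parameters concrete, here is a small Python sketch that assembles the argument list for one call. The function name build_tesseract_command is hypothetical, and placing the configuration file after the -l option is an assumption based on the order in which the parameters are listed above.

```python
def build_tesseract_command(image_name, output_base, lang=None,
                            config_file=None, executable="tesseract.exe"):
    """Return the argument list for one tesseract call (hypothetical helper).

    image_name  -- image to recognize, including its extension
    output_base -- output file name without extension
    lang        -- language pack name, passed with -l when given
    config_file -- optional configuration file name
    """
    args = [executable, image_name, output_base]
    if lang:
        args += ["-l", lang]
    if config_file:
        # Assumed position: after the language option.
        args.append(config_file)
    return args


if __name__ == "__main__":
    # Print the command line for a Simplified Chinese run on 1.jpg.
    print(" ".join(build_tesseract_command("1.jpg", "1", lang="chi_sim")))
```

Returning a list rather than a single string means the arguments can be handed to a process launcher without worrying about quoting file names that contain spaces.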

1.5. Now for a test. Prepare a picture in JPG format; here I put it in the same directory as Tesseract.

Type tesseract.exe 1.jpg 1 -l chi_sim and press Enter. OCR completes in a few seconds.

Note the format of the command: imagename must include the .jpg extension, while the output file and the language pack are given without an extension.
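The same call can be driven from a program instead of typed by hand. This is a rough Python sketch of such a wrapper, not the WPF wrapper mentioned earlier. The names ocr_image and output_path_for are hypothetical, the runner parameter exists only so the process launcher can be swapped out, and reading the result as UTF-8 is an assumption about the output encoding.

```python
import subprocess
from pathlib import Path


def output_path_for(output_base):
    """Return the text file tesseract writes: the output base plus .txt."""
    # Plain concatenation, so a base such as "scan.v2" becomes "scan.v2.txt".
    return Path(str(output_base) + ".txt")


def ocr_image(image_name, output_base, lang="chi_sim",
              executable="tesseract.exe", runner=subprocess.run):
    """Run tesseract on one image and return the recognized text.

    Hypothetical wrapper: image_name carries its extension,
    output_base and lang do not.
    """
    command = [executable, str(image_name), str(output_base), "-l", lang]
    # check=True raises CalledProcessError if tesseract exits with an error.
    runner(command, check=True)
    return output_path_for(output_base).read_text(encoding="utf-8")
```

With tesseract.exe and 1.jpg in the current directory, ocr_image("1.jpg", "1") would run the command from step 1.5 and return the contents of 1.txt.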

OCR results:

As you can see, the result is not ideal. The Chinese recognition is passable, but the English text and the numbers are mostly garbled. Still, for a veteran OCR engine this is already quite good, and I look forward to Google's continued upgrades and support.
