Optical character recognition (OCR) is the process of scanning printed material and then analyzing the resulting image files to extract the text and layout information. OCR is a specialized technology, used mostly by people in the printing and publishing industries to convert paper documents into electronic data quickly. For Chinese OCR, the leading domestic products are Tsinghua Wen Tong, Han Wang, and Shang Shu; they differ from one another, and none of them is cheap. OCR development started earlier abroad: large companies such as IBM, Microsoft, and HP may not sell standalone OCR products, but their R&D teams have mastered the core technology and built OCR into their own software. We programmers rarely need the advanced features; integrating basic OCR functionality into what we build is usually enough. Over the past couple of days I looked through a lot of free OCR software and class libraries and organized my notes. Today I'll start with Tesseract; the next post will cover the OCR API in OneNote 2010. A brief history of the development of OCR technology can be seen here.
Test code download
When reprinting, please cite the source: http://www.cnblogs.com/brooks-dotnet/archive/2010/10/05/1844203.html
1. Tesseract Overview
Tesseract's OCR engine was first developed by HP Labs in 1985, and by 1995 it had become one of the three most accurate recognition engines in the OCR industry. However, HP soon decided to drop its OCR business, and Tesseract was left to gather dust.
A few years later, HP realized that rather than leaving Tesseract on the shelf, it would be better to contribute it to the open-source community and give it a new life. In 2005, Tesseract was taken over by the Nevada Institute of Information Technology, and Google was asked to improve it, fix bugs, and optimize it.
Tesseract is currently published as an open-source project on Google Code (Google Project Hosting), where you can visit its project home page. The latest version, 3.0, already supports Chinese OCR and provides a command-line tool. This time we will test Tesseract 3.0. Because the command line is not very friendly to end users, I wrapped it in a simple WPF application that makes Chinese OCR easy.
1.1. First, go to the Tesseract project home page and download the command-line tool, the source code, and the Chinese language pack:
1.2. The extracted command-line tool looks like this (1.jpg and 1.txt are not included in the download):
1.3. For Chinese OCR, copy the Simplified Chinese language pack into the "tessdata" directory:
1.4. In a DOS prompt, switch to the Tesseract command-line directory and look at the tesseract.exe command format:
imagename is the image to be recognized; outputbase is the base name of the output file, which is a text file (.txt) by default; lang is the language pack to use; and configfile is a configuration file.
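To make the argument order concrete, here is a minimal sketch in Python (not the C#/WPF code used later in this post). The function name `build_tesseract_command` and its parameters are my own, chosen only for illustration; the argument order follows the description above, and `chi_sim` is the Simplified Chinese language pack.

```python
def build_tesseract_command(image_name, output_base, lang=None,
                            config_file=None, executable="tesseract.exe"):
    """Build the argument list: tesseract imagename outputbase [-l lang] [configfile].

    Hypothetical helper for illustration; it only assembles the command and
    does not run anything.
    """
    command = [executable, image_name, output_base]
    if lang:
        # The language pack is passed without a file extension.
        command += ["-l", lang]
    if config_file:
        command.append(config_file)
    return command


# The image name keeps its extension; the output base name does not.
print(" ".join(build_tesseract_command("1.jpg", "1", lang="chi_sim")))
# prints: tesseract.exe 1.jpg 1 -l chi_sim
```

Keeping the arguments in a list rather than one string avoids quoting problems when a path contains spaces.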
1.5. Now for a test. Prepare a picture in JPG format; here I put it in the same directory as tesseract.exe:
Type: tesseract.exe 1.jpg 1 -l chi_sim and press Enter; OCR completes in a few seconds:
Note the format of the command: imagename must include the extension (.jpg), while the output file and the language pack are given without extensions.
OCR results:
As you can see, the result is not ideal: the Chinese recognition is passable, but the English letters and digits mostly come out garbled. Still, for a veteran OCR engine this is quite good, and I look forward to Google's follow-up upgrades and support.
2. Wrapping the Tesseract command line with WPF
2.1. Since typing commands is error-prone and unfriendly to end users, I wrote a small WPF application that wraps the Tesseract command line:
Select the image and preview it on the left; select the output directory and view the OCR results on the right. Both local and network images can be previewed.
2.2. To let the picture preview support zooming and panning, I originally intended to use Microsoft's Zoom It API. Unfortunately it does not support WPF, so I used a third-party class:
Image zoom and pan helper class
2.3. In addition to the mouse, you can also use the scroll bar to adjust the picture preview:
Data binding
2.4. Because the Tesseract command line cannot OCR network pictures directly, download them first:
Image download
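The download step can be sketched as follows. This is an illustration in Python, not the C# code from the test project, and the helper names `is_network_image` and `ensure_local_image` are hypothetical. It assumes the image is reachable over plain http or https and saves it to the system temp directory unless another directory is given.

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse


def is_network_image(path):
    """Return True when the path is an http(s) URL rather than a local file."""
    return urlparse(path).scheme in ("http", "https")


def ensure_local_image(path, download_dir=None):
    """Return a local file path for the image, downloading it first if needed."""
    if not is_network_image(path):
        return path  # already local, nothing to do
    download_dir = download_dir or tempfile.gettempdir()
    # Reuse the file name from the URL; fall back to a fixed name if it has none.
    file_name = os.path.basename(urlparse(path).path) or "downloaded_image"
    local_path = os.path.join(download_dir, file_name)
    with urllib.request.urlopen(path) as response, open(local_path, "wb") as target:
        shutil.copyfileobj(response, target)
    return local_path
```

The returned local path is what gets passed to tesseract.exe as the image name.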
2.5. Use Process to invoke the Tesseract command line:
Calling the Tesseract command line
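The same idea, starting tesseract.exe as a child process and then reading the text file it writes, can be sketched in Python with `subprocess.run` (the WPF application itself uses the .NET Process class). The function `run_tesseract_ocr` and its `runner` parameter are hypothetical names introduced for this sketch, and it assumes the output file is UTF-8 text named outputbase plus ".txt".

```python
import subprocess


def run_tesseract_ocr(image_path, output_base, lang="chi_sim",
                      executable="tesseract.exe", runner=subprocess.run):
    """Run tesseract on one image and return the recognized text.

    Hypothetical helper: `runner` defaults to subprocess.run and can be
    swapped out, for example in tests where Tesseract is not installed.
    """
    command = [executable, image_path, output_base, "-l", lang]
    # check=True raises CalledProcessError if tesseract exits with an error.
    runner(command, check=True)
    # Tesseract appends ".txt" to the output base name itself.
    with open(output_base + ".txt", encoding="utf-8") as result_file:
        return result_file.read()


# Example call (requires tesseract.exe and the chi_sim language pack):
# text = run_tesseract_ocr("1.jpg", "1")
```

Waiting for the process to finish before opening the result file matters here, because the text file is only complete once tesseract.exe exits.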
2.6. Test a local image:
2.7. Test a network image:
Summary:
This time we briefly looked at how to use Tesseract. An open-source, free OCR engine that supports Chinese is quite rare. Although the recognition quality is not ideal, it is sufficient for small and medium-sized projects without demanding requirements. Here is a list of free OCR tools for interested readers to explore. Next time I'll test the OCR feature in OneNote 2010 and show how to call its API from your own project.