Brief introduction
Optical character recognition (OCR) refers to the process of scanning printed text and then analyzing and processing the resulting image files to extract the text and layout information. OCR is a highly specialized technology, used mostly by practitioners in the printing industry to convert paper documents into electronic data quickly. For Chinese OCR, the leading domestic vendors are Tsinghua Wen Tong, Han Wang, and Shang Shu; their products differ from one another, and none of them is cheap. OCR development started earlier abroad: large companies such as IBM, Microsoft, and HP may not have released standalone OCR products, but their research and development teams have mastered the core technology and built OCR functionality into their own software systems. We programmers rarely need anything that advanced; for most development work, integrating basic OCR functionality is enough.
Tesseract Overview
The Tesseract OCR engine was first developed at HP Labs in 1985, and by 1995 it had become one of the three most accurate recognition engines in the OCR industry. However, HP soon decided to abandon its OCR business, and Tesseract was left to gather dust. Some years later, HP realized that rather than leaving Tesseract on the shelf, it would be better to contribute it to the open-source community and give it a new life. In 2005, Tesseract was acquired by the Nevada Institute of Information Technology, which turned to Google to improve Tesseract, fix bugs, and optimize it. Tesseract was published as an open-source project on Google Project Hosting and is now hosted on GitHub, where the project homepage can be found. Version 3.0 already supports Chinese OCR and provides a command-line tool.
1. Referencing Tesseract
1.1. Create an empty project solution
[C#] Tesseract-OCR 3.02 usage example