Introduction and development of OCR technology

Source: Internet
Author: User

First, the development of OCR technology
Since the first generation of OCR products appeared in the early 1960s, more than 30 years of continuous development and improvement have produced remarkable results across OCR technologies, including handwriting recognition. The requirements placed on OCR products have grown as well: from an initial focus on recognition rate alone, they now cover the recognition speed of the whole system, a friendly user interface, ease of operation, product stability, adaptability, reliability, ease of upgrading, pre-sales service quality, and so on.
IBM was the first to develop OCR products and exhibited its OCR product, the IBM 1287, at the New York World's Fair in 1965. At that time the product could recognize only printed numerals, English letters and some symbols, and only in specified fonts. In the late 1960s, Hitachi and Fujitsu each developed their own OCR products. The world's first automatic letter sorting system based on handwritten ZIP code recognition was developed by Toshiba of Japan, and NEC introduced a similar system two years later. By 1974, the automatic sorting rate for mail had reached about 92%, and such systems were widely and successfully used in the postal system. In 1983, Toshiba released its OCR system OCRV595, which recognized printed Japanese characters at 70 to 100 characters per second with a recognition rate of 99.5%. Toshiba later began research on handwritten Japanese character recognition.
China's research on OCR technology started relatively late. Work on recognizing numerals, English letters and symbols began in the 1970s, and research on Chinese character recognition began in the late 1970s. In 1986, the information field of the National 863 Program organized the joint development of Chinese OCR software by three institutions: Tsinghua University, Beijing Information Engineering College and Shenyang Automation Institute. In 1989, Tsinghua University released the first Chinese OCR software package, Tsinghua Wen Tong TH-OCR 1.0, and Chinese OCR formally moved from the laboratory to the market. Tsinghua's printed Chinese character recognition software was followed by TH-OCR 92, a high-performance, practical, multi-font, multi-function recognition system for printed simplified and traditional Chinese characters, which marked significant progress in printed Chinese character recognition. In 1994 came TH-OCR 94, a high-performance recognition system for mixed Chinese-English printed text, which an expert appraisal described as "the first Chinese-English mixed printed text recognition system at home or abroad, at a world-leading level overall." In the mid to late 1990s, the Department of Electronic Engineering of Tsinghua University proposed and carried out comprehensive research on Chinese character recognition, which led to important achievements in printed text recognition, online handwritten Chinese character recognition, offline handwritten Chinese character recognition, and offline handwritten numeral and symbol recognition. The representative result is the TH-OCR 97 integrated Chinese character recognition system, which handles recognition input of multilingual (Chinese, English, Japanese) printed text, online handwritten Chinese characters, offline handwritten Chinese characters and handwritten numerals.
Over the past few years, in addition to Tsinghua Wen Tong TH-OCR, OCR software in various styles, such as SH-OCR, has come onto the market. The Chinese OCR market has expanded steadily, with users all over the world.
It can be said that printed OCR technology has reached a high level. OCR products have grown from early systems that recognized only printed numerals, English letters and some symbols into powerful tools for fast computer input, with automatic layout analysis, form recognition, and recognition of mixed text, multiple fonts, multiple type sizes and vertical layout. The recognition rate for printed Chinese characters is above 98%, and does not fall below 95% even for text of poor print quality. Such systems recognize simplified and traditional characters in Song, Hei, Kai, Fangsong and other typefaces, handle pages that mix several fonts and type sizes, and reach a recognition rate of more than 70% for handwritten Chinese characters. In particular, after more than 10 years of effort, China's Chinese character OCR technology has overcome the difficulties of a late start and an unusually large character set; the single-character recognition speed (the number of characters processed per unit time, from feature extraction to output of the recognition result) can exceed 70 characters per second. Because printed character recognition is now mature, OCR is widely used in news, printing, publishing, libraries, office automation and other industries.
Professional OCR products target specific industries, namely departments that must enter large volumes of forms every day, such as postal services, taxation, customs and statistics. In these industry-oriented systems the form formats are relatively fixed, the character set to be recognized is relatively small, and dedicated input devices are often used, so the systems are fast and efficient. Automatic mail sorting systems are one example.
Handwriting recognition did not become available until 1996 and 1997, when it was offered as an additional feature of printed document recognition products. Because people's writing habits vary so widely, unconstrained handwriting recognition is very difficult, so the handwritten OCR in practical use is mostly online handwriting recognition: the computer recognizes the characters as they are written, which makes it a real-time method.
Second, the basic principle of OCR

In short, the basic principle of OCR is that a scanner inputs the image of a document into the computer, and the computer then extracts the image of each character and converts it into the corresponding character code. In detail, the scanner converts the optical signal from the document into an electrical signal through a charge-coupled device (CCD); an analog-to-digital converter turns this into a digital signal, which is sent to the computer. The computer receives a digital image of the document, whose characters may be printed or handwritten, and then recognizes the characters in that image. For printed characters, the document is first converted optically into a raw black-and-white dot-matrix image file, and recognition software then converts the text in the image into a text format that word processing software can process further. Of these steps, character recognition is the key technique of OCR.


1. Two ways of OCR recognition
As with other data, everything a scanner captures is stored in the computer as 0s and 1s; a scanned page is just a collection of dots, or sample points, each saved as 0 or 1. An OCR program identifies the characters on the page mainly in two ways: pattern matching and feature extraction.
Pattern matching compares each character, non-strictly, against stored bitmaps of standard fonts and sizes. If the application has a large database of stored characters, it can select the appropriate one as the match. The software has to use some processing techniques to find the most similar match, usually by trying different versions of the same character. Some software can scan a page of text and learn each character in order to define a new font. Other software applies its own recognition techniques to identify as many characters on the page as it can, then presents the unrecognized characters for manual selection or direct input.
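As a rough illustration of pattern matching, the sketch below compares a scanned glyph with stored bitmaps and picks the one with the fewest differing pixels. The tiny 5x3 templates, the names `TEMPLATES`, `mismatch` and `match_template`, and the rejection threshold are all assumptions made for this example; real products store far larger bitmaps and normalize size and position first.

```python
# Hypothetical 5x3 template bitmaps: "1" = ink, "0" = background.
TEMPLATES = {
    "I": ["010", "010", "010", "010", "010"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}

def mismatch(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def match_template(glyph, templates=TEMPLATES, max_mismatch=3):
    """Return the closest template's character, or None if nothing is close."""
    best_char, best_score = None, None
    for char, bitmap in templates.items():
        score = mismatch(glyph, bitmap)
        if best_score is None or score < best_score:
            best_char, best_score = char, score
    if best_score is None or best_score > max_mismatch:
        return None  # leave the glyph for manual selection or input
    return best_char

noisy_l = ["100", "100", "110", "100", "111"]  # an "L" with one stray pixel
print(match_template(noisy_l))  # L
```

Returning `None` for a poor match mirrors the behavior described above, where characters the software cannot identify are handed to the user.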
Feature extraction decomposes each character into a number of distinct features, such as slanted strokes, horizontal lines and curves. These features are then matched against those of known (already learned) characters. As a simple example, if the application detects two horizontal lines, it "thinks" that the character may be "二" (two). The advantage of feature extraction is that it can recognize many kinds of fonts; for example, Chinese calligraphic typefaces are recognized using feature extraction.
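The two-horizontal-lines example can be sketched as follows. This is a deliberately simplified illustration, assuming a binary bitmap and a single feature (the number of horizontal strokes); the function name, the `min_fill` threshold and the `CANDIDATES` table are hypothetical.

```python
def count_horizontal_strokes(bitmap, min_fill=0.5):
    """Count groups of consecutive rows that are mostly ink (1 = ink)."""
    strokes, in_stroke = 0, False
    for row in bitmap:
        filled = sum(row) / len(row) >= min_fill
        if filled and not in_stroke:
            strokes += 1  # a new stroke starts on this row
        in_stroke = filled
    return strokes

# Hypothetical feature table: number of horizontal strokes -> candidate.
CANDIDATES = {1: "一", 2: "二", 3: "三"}

glyph = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],  # short upper stroke
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # long lower stroke
    [0, 0, 0, 0, 0, 0],
]
print(CANDIDATES.get(count_horizontal_strokes(glyph)))  # 二
```

Because the feature ignores the exact stroke length and thickness, the same test works for different fonts, which is the advantage the paragraph above describes.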
Most OCR applications include an intelligent syntax check, which further increases the recognition rate. It corrects spelling and grammar mainly by checking context: during recognition, the OCR application performs many contextual consistency checks, comparing each recognized string against the phrases and fixed word orders stored in the program. More advanced applications automatically replace words they judge to be wrong with the words they "consider" correct, restoring the meaning of the sentence.
2. Steps of character recognition
Character recognition includes the following steps: text input, preprocessing, character recognition, and post-processing.
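A minimal sketch of such a lexicon check is shown below, using `difflib.get_close_matches` from the Python standard library to replace a misrecognized word with the closest stored word. The lexicon contents, the function names and the `cutoff` value are assumptions for illustration; commercial products use much larger dictionaries and also take word order into account.

```python
import difflib

# Hypothetical lexicon of words the program already knows.
LEXICON = ["recognition", "character", "scanner", "document", "resolution"]

def correct_word(word, lexicon=LEXICON, cutoff=0.75):
    """Replace a word with its closest lexicon entry, if one is close enough."""
    if word in lexicon:
        return word
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_text(text):
    """Apply the lexicon check to every whitespace-separated word."""
    return " ".join(correct_word(word) for word in text.split())

# "0" read instead of "o", and "rn" read instead of "m".
print(correct_text("the rec0gnition of a docurnent"))
```

Words with no sufficiently similar lexicon entry are left unchanged, so the check does not rewrite text it knows nothing about.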
(1) Text input
Text input means using an input device to enter the document into the computer, that is, digitizing the original. The most commonly used device today is the scanner. The scanning quality of the document image is the precondition for correct recognition by OCR software. Choosing the scanning resolution and related parameters properly is the key to keeping the text clear and its features intact. In addition, the document should be placed as straight as possible, so that the tilt angle detected during preprocessing is small and tilt correction deforms the text image only slightly. These simple measures improve the recognition accuracy of the system. Conversely, with improper scanning settings, text with too many broken strokes may be split into images of half characters. Broken strokes and touching strokes both cause some features to be lost; when the character's features are compared with the feature library, the feature distance grows and the recognition error rate rises.
(2) Preprocessing
After a simple printed document has been scanned, the image of each character must be separated out and passed to the recognition module; this process is called image preprocessing. Preprocessing covers the preparatory work before character recognition, including image cleaning, which removes obvious noise (interference) from the original image. Its main tasks are measuring the tilt angle at which the document was placed, analyzing the page layout, confirming the typesetting of the selected text regions, separating horizontally and vertically set text lines, and cutting each line into individual character images and punctuation marks. This stage is very important, because the quality of its output directly affects recognition accuracy.
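One small part of image cleaning can be sketched as follows: removing isolated ink pixels, on the assumption that a speck with no inked neighbor is noise rather than part of a stroke. The function name and the 8-neighbor rule are choices made for this example, not a description of any particular product.

```python
def remove_isolated_pixels(image):
    """Clear ink pixels (1) that have no ink among their 8 neighbours."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            neighbours = sum(
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                cleaned[y][x] = 0  # treat the lone pixel as noise
    return cleaned

page = [
    [1, 0, 0, 0, 0],  # stray speck in the corner
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],  # vertical stroke
    [0, 0, 0, 1, 0],
]
for row in remove_isolated_pixels(page):
    print(row)
```

The speck in the top-left corner is cleared while the vertical stroke is kept, because each of its pixels touches another ink pixel.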
Layout analysis examines the text image as a whole: it finds all the text blocks in the document and distinguishes the text paragraphs and their reading order from the image and table regions. The boundary of each block (the coordinates of its start and end points in the image), its attribute (horizontal or vertical layout) and the connections between blocks are stored in a data structure, which is passed to the recognition module for automatic recognition. Text regions are recognized directly, table regions go through dedicated table analysis and recognition, and image regions are compressed or simply stored. Line and character segmentation first cuts the page image into rows and then separates individual characters from each row.
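The row-then-character cutting described above is commonly done with projection profiles: sum the ink in each row to find text lines, then sum the ink in each column of a line to find characters. The sketch below assumes a clean binary image with blank gaps between lines and characters; the function names are hypothetical.

```python
def runs_of_ink(profile):
    """Return (start, end) pairs, end exclusive, where the profile is non-zero."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment_lines(image):
    """Find text lines from the horizontal projection (ink per row)."""
    return runs_of_ink([sum(row) for row in image])

def segment_characters(image, line):
    """Find characters in one line from the vertical projection (ink per column)."""
    top, bottom = line
    columns = [sum(image[y][x] for y in range(top, bottom))
               for x in range(len(image[0]))]
    return runs_of_ink(columns)

page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
for line in segment_lines(page):
    print(line, segment_characters(page, line))
```

Touching or broken strokes defeat this simple scheme, which is one reason scan quality matters so much for the later steps.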
(3) Character recognition
Character recognition is the core technology of OCR. Detecting the character images in the scanned text and having the computer convert those graphics and images into standard character codes is the key to letting the computer "read"; this is what is meant by recognition technology. The human brain recognizes characters because it has stored their various features, such as their structure and strokes. In the same way, for a computer to recognize text, information about the characters must first be stored in the computer. Deciding what information to store and how to obtain it is a very complex problem, and the result must still achieve a very high recognition rate to be useful. The usual practice is to analyze features such as strokes, feature points, projection information and the regional distribution of points.
Thousands of Chinese characters are in common use. The recognition technique is feature comparison: the features of the character to be recognized are compared with those in the feature library, the most similar character is found, and its standard code is output as the recognition result. Comparison is one of the basic ways in which people come to know things, and Chinese character recognition consists of finding the similarities and differences between characters and grasping the relationships between quantity and quality and between time and space. For large character sets, Chinese character recognition generally uses multi-level classification with multiple features and omni-directional dynamic matching to find sets of similar characters, which ensures a high classification rate, strong adaptability and good stability. The fine classification stage concentrates on matching within the similar set, weighting, structural discrimination, quantitative and qualitative analysis, and the relationship between neighboring characters in a phrase, before making the final decision. Chinese character recognition is essentially an application of comparative or cognitive science within artificial intelligence, and its key technology is the recognition feature library. A computer equipped with such a feature library is able to read characters.
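The idea of multi-level classification can be sketched in two stages: a cheap coarse feature narrows the character set to a small similar set, and a finer feature vector is then compared by distance within that set. Everything in the sketch is an assumption for illustration: the `FEATURE_LIBRARY` entries, the choice of stroke count as the coarse key, and the two-component feature vectors (imagined here as horizontal and vertical ink ratios).

```python
import math

# Hypothetical feature library: character -> (stroke count, feature vector).
FEATURE_LIBRARY = {
    "二": (2, (1.0, 0.0)),
    "十": (2, (0.5, 0.5)),
    "人": (2, (0.1, 0.2)),
    "三": (3, (1.0, 0.0)),
    "王": (4, (0.75, 0.25)),
}

def classify(stroke_count, features, library=FEATURE_LIBRARY):
    """Coarse stage: keep characters with the same stroke count.
    Fine stage: return (character, distance) for the nearest feature vector."""
    candidates = [
        (char, math.dist(features, vector))
        for char, (strokes, vector) in library.items()
        if strokes == stroke_count
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda item: item[1])

print(classify(2, (0.55, 0.45))[0])  # 十
```

The coarse stage is what keeps "二" and "三" apart here even though their fine feature vectors are identical, and it also means the distance computation runs over a handful of candidates rather than the whole character set.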
Besides text and pictures, a document image sometimes contains tables. To digitize tables, the table regions need special handling during layout analysis: extracting the structure of the ruling lines, dividing the text regions inside the table, recognizing both the ruling lines and the text regions, and generating different file formats from the digitized ruling lines. Because tables in documents are arbitrary and come in many formats, some closed and some open, and especially because tables may contain diagonal lines, table analysis is quite difficult.
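Extraction of ruling-line structure can be illustrated with a very simple heuristic: a row (or column) whose longest unbroken run of ink spans most of the image is treated as a ruling line. The function names and the `min_length_ratio` threshold are assumptions, and the sketch ignores the diagonal lines and open tables mentioned above.

```python
def longest_run(pixels):
    """Length of the longest unbroken run of ink (1) in a row or column."""
    best = current = 0
    for pixel in pixels:
        current = current + 1 if pixel else 0
        best = max(best, current)
    return best

def find_horizontal_rules(image, min_length_ratio=0.8):
    """Return indices of rows containing a long horizontal ruling line."""
    width = len(image[0])
    return [y for y, row in enumerate(image)
            if longest_run(row) >= min_length_ratio * width]

def find_vertical_rules(image, min_length_ratio=0.8):
    """Return indices of columns containing a long vertical ruling line."""
    transposed = [list(column) for column in zip(*image)]
    return find_horizontal_rules(transposed, min_length_ratio)

table = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
]
print(find_horizontal_rules(table), find_vertical_rules(table))
```

The row and column indices found this way delimit the cells, whose contents can then be passed to the character recognizer as ordinary text regions.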
(4) Post-processing
Post-processing matches the recognized text, or several candidate recognition results, against phrases: the recognition results are segmented into words and compared with the phrases in a lexicon, in order to raise the system's recognition rate and lower its misrecognition rate.
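A minimal sketch of phrase matching over candidate results follows. It assumes the recognizer returns, for each character position, a short list of candidates ordered from most to least likely, and that a phrase lexicon is available; the `PHRASES` set and the function name are hypothetical.

```python
from itertools import product

# Hypothetical phrase lexicon.
PHRASES = {"文字", "识别", "技术"}

def pick_by_phrase(candidates, phrases=PHRASES):
    """Choose the first candidate combination that forms a known phrase.

    candidates: one list per character position, best guess first.
    Falls back to the top guess at every position if no phrase matches.
    """
    for combination in product(*candidates):
        word = "".join(combination)
        if word in phrases:
            return word
    return "".join(position[0] for position in candidates)

# The recognizer's top guesses spell "只别", which is not a phrase;
# the second guess for the first character yields the phrase "识别".
print(pick_by_phrase([["只", "识"], ["别", "另"]]))  # 识别
```

Trying every combination is only practical for short candidate lists; it is used here to keep the example small.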
Chinese character recognition is the most difficult problem in the field of character recognition. It is a comprehensive technology that draws on pattern recognition, image processing, digital signal processing, natural language understanding, artificial intelligence, fuzzy mathematics, information theory, computer science, Chinese information processing and other disciplines. In recent years, the single-character accuracy of printed Chinese character recognition systems has exceeded 95%. To raise the overall recognition rate further, image scanning, image preprocessing and post-processing have all been studied in depth; great progress has been made, effectively improving the overall performance of printed Chinese character recognition systems. Tsinghua University has become one of the most authoritative institutions in the world in this field because of its outstanding research achievements. At present, the full range of Tsinghua Unisplendour (Tsinghua Ziguang) scanners ships with Tsinghua OCR Millennium Software, which has reached a high level in recognition rate, form recognition and even standard handwriting recognition.
Third, OCR character recognition tips
In recent years, OCR technology has developed rapidly as scanners have become widespread; scanning and recognition software has grown more powerful and is steadily being upgraded toward intelligent operation. But to get correct scan results quickly and enter text efficiently, we still need to study the subject carefully, draw on practical experience, and work out a complete solution. Sometimes the recognition rate we get in practice is very low, nowhere near the 95% or more that the software claims. Do not blame the hardware or software; the cause is usually a lack of good scanning and OCR technique.
Here are some methods and techniques often used in character recognition work.
1. Setting the resolution is an important precondition for character recognition. Generally speaking, the more image information the scanner provides, the easier it is for the recognition software to produce results. But a higher scanning resolution does not always mean higher recognition accuracy. Choose a resolution of 300 dpi or 400 dpi for most document scanning. Note that when scanning text for recognition, the scanning resolution must not be set higher than the scanner's optical resolution; otherwise the loss outweighs the gain. The following typical settings are for reference only.
(1) For paragraphs set in size 1, 2 or 3 type, 200 dpi is recommended.
(2) For paragraphs set in size 4, small 4 or 5 type, 300 dpi is recommended.
(3) For paragraphs set in small 5 or size 6 type, 400 dpi is recommended.
(4) For paragraphs set in size 7 or 8 type, 600 dpi is recommended.
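The reference settings above can be captured in a small lookup, together with the rule that the chosen value should not exceed the scanner's optical resolution. The dictionary keys, the function name and the `optical_dpi` parameter are naming choices made for this sketch; the dpi values are the ones listed above.

```python
# Recommended scanning resolution by Chinese type size, from the list above.
RECOMMENDED_DPI = {
    "1": 200, "2": 200, "3": 200,
    "4": 300, "small 4": 300, "5": 300,
    "small 5": 400, "6": 400,
    "7": 600, "8": 600,
}

def recommended_dpi(type_size, optical_dpi):
    """Return the suggested resolution, capped at the optical resolution."""
    return min(RECOMMENDED_DPI[type_size], optical_dpi)

print(recommended_dpi("5", optical_dpi=1200))  # 300
print(recommended_dpi("8", optical_dpi=400))   # 400
```

In the second call the recommendation of 600 dpi is capped at 400 dpi, since scanning above the optical resolution only interpolates pixels and adds no real detail.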
2. Adjust the brightness and contrast appropriately when scanning so that the scanned file is cleanly black and white. This has the most critical effect on the recognition rate. The guiding principle for setting brightness and contrast is that the strokes of the Chinese characters in the scanned image should be thin but unbroken. Before recognition, check the quality of the text in the scanned image. If the image has black or dark spots, or the strokes are very thick and dark, the brightness value is too small; increase it and try again. If the strokes are uneven, broken, or even badly deformed, the brightness value is too large; reduce it and try again.
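The effect of brightness on stroke thickness can be illustrated with a simple global threshold applied to a grayscale scan. This is only a sketch of the principle, assuming 8-bit pixels where 0 is black and 255 is white; scanner drivers expose brightness rather than a threshold, but lowering the threshold here plays the same role as raising the brightness.

```python
def binarize(gray, threshold=128):
    """Map grayscale pixels (0 = black, 255 = white) to ink (1) or background (0)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]

# One row across a vertical stroke: dark centre with grey, blurred edges.
scan = [[250, 140, 40, 150, 250]]

print(binarize(scan, threshold=160))  # [[0, 1, 1, 1, 0]]  stroke too thick
print(binarize(scan, threshold=128))  # [[0, 0, 1, 0, 0]]  thin but unbroken
print(binarize(scan, threshold=30))   # [[0, 0, 0, 0, 0]]  stroke lost
```

The middle setting corresponds to the goal stated above: strokes that are thin but still continuous.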
3. Select the scanning software. Choosing OCR software that suits you is the basis of good recognition work. In general, do not use the OEM software bundled with the scanner: OEM OCR software has fewer features and poorer results, and some versions cannot even recognize Chinese. After comparison, I find that Tsinghua Unisplendour (Tsinghua Ziguang) OCR2003 Professional Edition and the OCR6.0 automatic text recognition input system stand out in recognition ability and features. Then choose an image editing program. OCR software has its own scanning interface, so why look for image software? First, OCR software does not recognize all scanners; second, and most important, images scanned through the scanning interface of image software are easy to process afterwards. Photoshop is the usual choice.
4. If the text to be recognized carries formatting, such as bold, italics or first-line indentation, some OCR software cannot recognize it, and the formatting will be lost or garbled. If you must scan formatted text, confirm in advance that your recognition software supports formatted text. You can also turn off style recognition, so that the software concentrates on finding the right characters and ignores fonts and formatting.





