Atitit: the principles of OCR recognition, an introduction (Attilax summary)

Source: Internet
Author: User


1.1. The OCR process

1.2. OCR techniques differ slightly in detail, but the principle is much the same: binarization -> line localization -> character segmentation -> comparison against a glyph model (highest confidence wins) -> output

1.4. Tesseract: image layout analysis, character segmentation and recognition

1.1. The OCR process

Preprocessing: the image containing the text is processed so that features can be extracted and learned later. The main purpose of this step is to reduce the useless information in the image and so make subsequent processing easier. It usually consists of these sub-steps: grayscale conversion (for color images), noise reduction, binarization, character segmentation, and normalization. After binarization the image has only two colors, black and white, one of which is the background and the other the text to be recognized. Noise reduction is very important at this stage, and the quality of the noise reduction algorithm has a great influence on feature extraction. Character segmentation divides the text in the image into individual characters, because recognition works one character at a time. If the lines of text are tilted, tilt correction is often needed as well. Normalization scales the individual character images to the same size and the same specification, so that one uniform algorithm can be applied to all of them.
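As an illustration of the binarization sub-step, here is a minimal sketch that uses Otsu's method to pick the threshold. Otsu's method is one common choice and is an assumption here, since the text does not name a specific algorithm. The function names are hypothetical, the image is a plain list of rows of gray levels (0 to 255), and dark text on a light background is assumed.

```python
def otsu_threshold(gray):
    """Return the threshold t (0-255) that maximizes the between-class variance.

    `gray` is a list of rows of integer gray levels in 0..255.
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with gray level <= t
    sum0 = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(gray):
    """Map dark pixels (<= threshold) to 1 (text) and light pixels to 0 (background)."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]
```

Real systems usually denoise before this step and may use a locally adaptive threshold when the lighting is uneven.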

Feature extraction and dimensionality reduction: features are the key information used to recognize characters; each character can be told apart from the others by its features. For digits and English letters, feature extraction is relatively easy, because there are only 10 digits and only 52 English letters (upper and lower case), both small character sets. For Chinese characters, feature extraction is difficult. First, Chinese is a large character set: the most commonly used first-level characters in the national standard (GB 2312) alone number 3,755. Second, Chinese characters have complex structures and many similar shapes. After deciding which features to use, it may also be necessary to reduce their dimensionality. If the dimension of the feature is too high (a feature is generally represented as a vector, and the dimension is the number of components of that vector), the efficiency of the classifier suffers badly, so dimensionality reduction is often needed to improve recognition speed. This step is important as well: the dimension has to come down, yet the reduced feature vector must still retain enough information to distinguish different characters.
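To make "a feature is a vector" concrete, here is a minimal sketch of one simple hand-crafted feature, the ink density per zone. It is an illustrative choice, not a feature the text prescribes, and the function name is hypothetical. The glyph is assumed to be a square binary image whose side is divisible by `zones`.

```python
def zoning_features(glyph, zones=2):
    """Split a square binary glyph into zones x zones cells and return
    the fraction of ink pixels in each cell, listed row by row."""
    size = len(glyph)
    cell = size // zones  # side length of one cell (size assumed divisible by zones)
    features = []
    for zr in range(zones):
        for zc in range(zones):
            ink = sum(
                glyph[r][c]
                for r in range(zr * cell, (zr + 1) * cell)
                for c in range(zc * cell, (zc + 1) * cell)
            )
            features.append(ink / (cell * cell))
    return features
```

A 4 x 4 glyph has 16 raw pixel values, while `zoning_features` with `zones=2` describes it with 4 components. This is a crude, fixed form of reduction; statistical methods such as PCA instead learn which directions to keep from the training data.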

Classifier design, training and actual recognition: the classifier does the recognizing. Following on from the second step, you take a character image, extract its features and hand them to the classifier; the classifier classifies them and tells you which character the features correspond to. The classifier has to be trained before actual recognition, which is a case of supervised learning. There are many mature classifiers: SVM, k-NN, neural networks, and so on.
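As a sketch of the simplest of the classifiers named above, here is a k-nearest-neighbor classifier over feature vectors. The function name and the layout of the training data are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_classify(sample, training_set, k=3):
    """Classify `sample` by majority vote among its k nearest training vectors.

    training_set is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(
        training_set,
        key=lambda item: math.dist(sample, item[0]),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For k-NN, "training" amounts to storing labeled examples. SVMs and neural networks instead fit parameters during training, which makes recognition much faster on large character sets.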

Post-processing: post-processing is used to improve the classification results. First, the classifier's output is not always correct (in practice it can never be completely correct). In Chinese character recognition, for example, many characters have near-identical shapes, so a character is easily recognized as one of its look-alikes. Post-processing can address this, for instance with a language model: if the classifier replaces one character of a common phrase with a look-alike, the language model finds that the resulting phrase is wrong and corrects it. Second, the image to be recognized usually contains a large amount of text with complex typesetting, font sizes and so on, and post-processing can try to format the recognition results according to the layout of the image. For example, take an image in which the text in the left half has nothing to do with the text in the right half: during segmentation, the first line of the left half often ends up followed by the first line of the right half, and so on.
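The language-model correction can be sketched with a much simpler stand-in, a word-frequency lexicon. The sketch assumes the classifier returns several candidate characters with confidences for each position. The function name, the data layout and the lexicon are all hypothetical.

```python
from itertools import product


def correct_word(candidates, lexicon):
    """Pick the most plausible word given per-character candidates.

    candidates: one list per character position of (char, confidence) pairs,
                best candidate first.
    lexicon:    dict mapping known words to a relative frequency.
    """
    best_word, best_score = None, 0.0
    for combo in product(*candidates):
        word = "".join(ch for ch, _ in combo)
        if word not in lexicon:
            continue
        # Score = word frequency times the product of the classifier confidences.
        score = lexicon[word]
        for _, conf in combo:
            score *= conf
        if score > best_score:
            best_word, best_score = word, score
    if best_word is None:
        # No combination forms a known word: keep the classifier's top choices.
        best_word = "".join(position[0][0] for position in candidates)
    return best_word
```

Enumerating every combination is only feasible for short words with few candidates per position. A practical system would use beam search or a Viterbi-style search over an n-gram or neural language model.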

 

1.2. OCR techniques differ slightly in detail, but the general principle is the same. The main technical pipeline is: binarization -> line localization -> character segmentation -> comparison against a glyph model (taking the match with the highest confidence) -> output
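The line localization stage of this pipeline is often done with a horizontal projection profile. The sketch below assumes that technique, which the text does not name, along with a binarized image in which 1 means ink, and text lines that are horizontal and separated by blank rows. The function names are hypothetical.

```python
def find_runs(profile):
    """Return (start, end) index pairs, end exclusive, of consecutive non-zero entries."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                 # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))   # the run ended at the previous index
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def find_text_lines(binary):
    """Locate text lines as runs of image rows that contain at least one ink pixel."""
    row_ink = [sum(row) for row in binary]  # horizontal projection profile
    return find_runs(row_ink)
```

Tilted text smears the profile so that the blank rows between lines disappear, which is why tilt correction comes before this step.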

1.3. The current mainstream method: CNN + LSTM

I currently work on OCR and STR at my company. The mainstream method today is a CNN (text detection based on feature maps) plus an LSTM (recognition of text lines as sequences of arbitrary length). The top results in the ICDAR 2015 text competition essentially all use this method. In addition, if you want end-to-end training and prediction, you can directly consider the simple, brute-force Faster R-CNN; filtering its results with a CNN is enough to reach the top 3 in several ICDAR challenges.

1.4. Tesseract: image layout analysis, character segmentation and recognition

Tesseract is a powerful engine. In a nutshell, its work can be divided into two parts:

Image layout analysis prepares the ground for character recognition. It uses a hybrid page layout analysis method based on tab-stop detection to tell apart the tables, text, pictures and other content in the image.

Character segmentation and recognition is the design goal of Tesseract as a whole and its most complex task. It starts with character segmentation, for which Tesseract adopts a two-step strategy:

·  Use the gaps between characters for a rough cut. This yields most of the characters, along with some touching characters and some wrongly cut ones. A first recognition pass follows: the type of each character region is determined, and the region is recognized by comparing it against the character library.

·  Based on the characters recognized so far, split the touching characters and merge the wrongly cut pieces, completing the fine-grained segmentation of the characters.
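The rough cut in the first step can be sketched as a cut at empty columns of a text-line image. This is a simplified illustration and not Tesseract's actual code. The function names are hypothetical, and the width-ratio heuristic used to flag possibly touching characters is an assumption made for the example.

```python
from statistics import median


def rough_slices(line):
    """Cut a binary text-line image (1 = ink) at its empty columns.

    Returns (start, end) column pairs, end exclusive.
    """
    height, width = len(line), len(line[0])
    col_ink = [sum(line[r][c] for r in range(height)) for c in range(width)]
    slices, start = [], None
    for c, ink in enumerate(col_ink):
        if ink and start is None:
            start = c
        elif not ink and start is not None:
            slices.append((start, c))
            start = None
    if start is not None:
        slices.append((start, width))
    return slices


def flag_suspects(slices, ratio=1.5):
    """Mark slices much wider than the median width as possibly touching characters."""
    typical = median(end - start for start, end in slices)
    return [(start, end, (end - start) > ratio * typical) for start, end in slices]
```

The flagged slices are the ones that the second step would try to split, using the recognition results as a guide.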

There is, of course, another description that divides the work more finely into four steps:

·  Analyze connected components

·  Find block regions

·  Find text lines and words

·  Recognize the text
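The first of these four steps, connected component analysis, can be sketched as a breadth-first flood fill that returns the bounding box of each ink region. This is a generic textbook version, not Tesseract's implementation. The function name is hypothetical and 4-connectivity is an assumption.

```python
from collections import deque


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right), all inclusive,
    of the 4-connected ink regions in a binary image (1 = ink)."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # New region found: flood-fill it while tracking its extent.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

Later steps group these boxes into blocks, text lines and words based on their sizes and positions.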

1.5. The process of printed Chinese character recognition mainly includes:


(1) Scanning the input text image;

(2) Preprocessing the image;

(3) Layout analysis and understanding of the image;

(4) Line segmentation and character segmentation of the image;

(5) Feature selection and extraction based on the single-character image;

(6) Pattern classification based on the features of the single-character image;

(7) Assigning the classified pattern to a recognition result;

(8) Editing and correcting the recognition result.

Preprocessing includes removing obvious noise (interference) from the original image, correcting the tilt of the scanned text lines, and so on. Layout analysis is an overall analysis of the text image that distinguishes the text paragraphs and their reading order as well as the image and table regions. Text regions go on to recognition, table regions go to dedicated table analysis and recognition, and image regions are compressed or simply stored. Line and character segmentation first cuts the page image into text lines and then separates the individual characters from each line image. Feature extraction is the most important link in the whole chain: it extracts statistical or structural features from the single-character image, and includes thinning and normalization of the character. The stability and validity of the extracted features directly determine recognition performance. Character recognition finds, in the existing feature library, the character class most similar to the character to be recognized. Post-processing corrects the recognition result using prior knowledge of the language, such as word meaning, word frequency, grammar rules or a corpus.
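The normalization mentioned above can be sketched as cropping the character to the bounding box of its ink and rescaling it to a fixed grid. Nearest-neighbor sampling is an assumption chosen for brevity, and the function name is hypothetical.

```python
def normalize_glyph(glyph, size=8):
    """Crop a binary glyph (1 = ink) to its ink bounding box and rescale it
    to size x size with nearest-neighbor sampling."""
    ink_rows = [r for r, row in enumerate(glyph) if any(row)]
    ink_cols = [c for c in range(len(glyph[0])) if any(row[c] for row in glyph)]
    if not ink_rows:
        return [[0] * size for _ in range(size)]  # blank input stays blank
    top, bottom = ink_rows[0], ink_rows[-1]
    left, right = ink_cols[0], ink_cols[-1]
    height, width = bottom - top + 1, right - left + 1
    # Each output pixel copies the source pixel at the proportional position.
    return [
        [glyph[top + (r * height) // size][left + (c * width) // size]
         for c in range(size)]
        for r in range(size)
    ]
```

This version stretches every character to a square, which distorts narrow glyphs. Practical systems often preserve the aspect ratio and pad the result instead.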

In this whole process, steps 4, 5 and 6 are the core techniques of printed Chinese character recognition. There are many ways to represent the patterns of Chinese characters and to build the corresponding dictionaries. Each representation can use different features, and each feature can be extracted in different ways. As a result, the decision methods, the criteria and the mathematical tools all differ, which has produced a wide variety of Chinese character recognition methods. In general, the choice of feature extraction and classifier design determines how a recognition system processes its input. The methods are usually divided into structural pattern recognition, statistical pattern recognition, combined statistical and structural recognition, and artificial neural network methods.

1.6. Character recognition: research on this started very early. The earliest approach was template matching; later, methods based mainly on feature extraction took over. Factors such as displacement of the text, stroke thickness, broken strokes, touching strokes and rotation greatly increase the difficulty of extracting features.
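To show why plain template matching is fragile under displacement, here is a minimal sketch that scores a glyph against stored templates by the fraction of pixels that agree. The function names and the tiny 3 x 3 templates in the usage note are hypothetical.

```python
def match_score(glyph, template):
    """Fraction of pixels on which two equal-sized binary images agree."""
    total = len(glyph) * len(glyph[0])
    same = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return same / total


def best_template(glyph, templates):
    """Return the (label, score) of the best-matching template.

    templates maps each label to a binary image of the same size as glyph.
    """
    return max(
        ((label, match_score(glyph, image)) for label, image in templates.items()),
        key=lambda pair: pair[1],
    )
```

With a centered vertical bar as the template for "I" and an L shape as the template for "L", a vertical bar shifted one column to the left agrees with the "L" template on more pixels than with the "I" template, so it is misread. Normalization and more robust features are meant to absorb exactly this kind of variation.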

Character segmentation:

Because of the limits of the conditions under which the photo is taken, characters often touch each other or have broken strokes, which severely limits the performance of the recognition system. Character recognition software therefore needs a character segmentation function.


Author: nickname Old Wow's claws (full name: AttilaxAkbar Al Rapanui Attilaksachanui)

Chinese name: Etila (Ayron), email: [email protected]

When reprinting, please credit the source: http://www.cnblogs.com/attilax/
