I. Background
With the widespread popularity of smartphones and the rapid development of the mobile Internet, accessing, retrieving, and sharing information through the cameras of mobile devices has gradually become a way of life. Camera-based applications emphasize understanding of the captured scene. In a scene where text and other objects coexist, the user usually pays more attention to the text, so correctly recognizing the text in the scene is key to understanding the user's shooting intention. In general, image-based text recognition includes optical character recognition (OCR) of scanned documents, scene text recognition, and recognition of the CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) widely used for site registration verification. By comparison, scanner-based OCR is the simplest, CAPTCHA recognition the most difficult, and scene text recognition lies between the two, as shown in Figure 1 [1].
Figure 1 Image-based text recognition
The biggest difference between scene text and scanned text is that the background is often complex and the position of the text is unknown to the device. Secondly, lighting has a strong effect on the appearance of the text. Finally, compared with traditional OCR input, scene characters are far more diverse and exhibit larger intra-class variation.
II. Two recognition schemes
A natural idea is to first detect and locate the text regions (word detection), and then feed the detected text blocks to an existing OCR engine for recognition (word recognition). However, the scene-text difficulties described above pose a challenge to this scheme: it completely separates word detection from recognition, and its accuracy relies heavily on the performance of word detection and segmentation.
In recent years, a very different end-to-end text localization and recognition approach has gradually attracted the attention of academia and industry. Such a system treats the problem from the perspective of object recognition, performing word detection and recognition simultaneously, and has achieved good results on scene text. This article takes English recognition as an example to briefly introduce end-to-end text detection and recognition systems.
III. End-to-end scene text recognition systems
An end-to-end system usually includes: a) character detection, and b) simultaneous word detection and recognition.
1. Character Detection
Character detection is primarily about determining whether an image patch is a character. Candidate patches can be obtained either by multi-scale scanning with a sliding window, or by connected component analysis (CCA). The sliding-window method is most classically associated with face detection, but it has two main problems: it produces many candidate regions, and it suffers from confusion between the gaps separating characters and the characters themselves, as shown in Figure 2 [2]. A window straddling the gap between two O's is easily mistaken for an X, while a window covering half of a B is easily mistaken for an E.
Figure 2 Confusion between character gaps and character interiors
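The multi-scale sliding-window scan described above can be sketched as follows. This is a toy illustration (names, window size, and scales are my own choices, and pyramid levels are built by simple striding rather than proper resampling); it mainly shows why the candidate count explodes.

```python
import numpy as np

def sliding_windows(image, win=32, step=8, scales=(1.0, 0.5, 0.25)):
    """Yield (x, y, scale, patch) character candidates over a crude
    multi-scale pyramid built by nearest-neighbour striding."""
    for s in scales:
        stride = max(1, int(round(1.0 / s)))
        level = image[::stride, ::stride]       # downsampled pyramid level
        h, w = level.shape[:2]
        for y in range(0, h - win + 1, step):
            for x in range(0, w - win + 1, step):
                yield x, y, s, level[y:y + win, x:x + win]

# Even a small 128x256 image yields hundreds of candidates -- the main
# drawback of sliding-window character detection noted above.
img = np.zeros((128, 256))
print(sum(1 for _ in sliding_windows(img)))
```

In a real detector each patch would then be scored by a trained character classifier, which is where the gap/interior confusions of Figure 2 arise.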
The CCA-based approach is less computationally expensive, but it is susceptible to background interference and cannot handle blurry images. For example, [3] uses extremal-region-based connected components to form text candidate regions.
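The core of CCA is ordinary connected component labelling on a binarized image, after which each component becomes a character candidate. A minimal dependency-free sketch (the function names and the size-filtering remark are illustrative, not from [3]):

```python
from collections import deque

def connected_components(binary):
    """4-connected component labelling on a binary grid (list of lists).
    Returns a list of components, each a list of (row, col) pixels.
    In a CCA-based detector these components become character candidates,
    later filtered by size/aspect-ratio heuristics."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] and not seen[r][c]:
                comp, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:            # breadth-first flood fill
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                comps.append(comp)
    return comps

# Two separate blobs -> two candidate regions.
grid = [[1, 1, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 1]]
print(len(connected_components(grid)))  # -> 2
```

The sensitivity to background clutter mentioned above shows up here directly: any noise blob that survives binarization becomes a (false) candidate.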
The image patches are usually described with histograms of oriented gradients (HOG); the classifier can be a support vector machine (SVM), nearest neighbor (NN), AdaBoost, and so on.
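To make the HOG-plus-classifier pairing concrete, here is a deliberately simplified sketch: a single orientation histogram (the core idea of HOG, without the cell/block structure and normalization of the real descriptor) fed to a 1-nearest-neighbor classifier. All names and the toy patches are my own.

```python
import numpy as np

def hog_like(patch, bins=9):
    """Toy gradient-orientation histogram over the whole patch."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)        # unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-9)    # L2-normalized descriptor

def nn_classify(feat, train_feats, train_labels):
    """1-nearest-neighbour classification in descriptor space."""
    dists = np.linalg.norm(np.asarray(train_feats) - feat, axis=1)
    return train_labels[int(np.argmin(dists))]

# Tiny demo: distinguish a vertical-edge patch from a horizontal one.
vert = np.tile([0.0, 0.0, 1.0, 1.0], (4, 1))   # intensity step left->right
horiz = vert.T                                  # intensity step top->bottom
train = [hog_like(vert), hog_like(horiz)]
print(nn_classify(hog_like(vert), train, ['vertical', 'horizontal']))
```

A real character detector would use the full HOG descriptor (cells, blocks, overlapping normalization) and a trained SVM or boosted classifier instead of this toy pair.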
2. Simultaneous detection and recognition of words
Since character detection is generally driven by bottom-up information, the detected character candidate regions contain some false positives. For this reason, the word detection and recognition module often needs to use top-down information, such as a lexicon [2,3,4].
In [2], a conditional random field (CRF) is built over the character detection results to jointly model the confidence of character recognition and the relationships (positional and linguistic) between characters. The energy function of the CRF combines a unary term and a pairwise term.
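A standard pairwise CRF energy of the kind used in [2] can be written as follows (the notation here is a conventional reconstruction, not copied from the paper):

```latex
E(\mathbf{x}) \;=\; \sum_{i \in \mathcal{V}} E_i(x_i)
              \;+\; \sum_{(i,j) \in \mathcal{E}} E_{ij}(x_i, x_j)
```

where x_i is the character label assigned to the i-th candidate region, and the graph edges connect neighboring candidates; minimizing E over all labelings yields the recognized word.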
The first term expresses the confidence of a single candidate region, while the second describes the relationship between two candidate regions, including their geometric overlap and the probability that the two letters appear together in the lexicon.
Figure 3 Simultaneous detection and recognition of words
With the CRF, the word in Figure 3 can be correctly recognized as DOOR rather than DOXR. In addition to CRFs, Wang et al. [4] use pictorial structures to jointly perform word detection and recognition.
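The effect of the top-down lexicon cue can be illustrated without a full CRF: score each lexicon word by how well the bottom-up character confidences support it, and pick the best. The probabilities below are invented for illustration (they are not from [2]); the third detector output slightly prefers 'x' over 'o', so greedy per-character decoding reads "doxr", while the lexicon constraint recovers "door".

```python
import math

def word_score(word, char_probs):
    """Sum of per-position log-probabilities; letters the detector never
    proposed receive a small smoothing probability."""
    if len(word) != len(char_probs):
        return -math.inf
    return sum(math.log(p.get(ch, 1e-6)) for ch, p in zip(word, char_probs))

def recognize(char_probs, lexicon):
    """Pick the lexicon word best supported by the bottom-up scores."""
    return max(lexicon, key=lambda w: word_score(w, char_probs))

# Illustrative detector outputs for four character candidates.
probs = [{'d': 0.9}, {'o': 0.8}, {'x': 0.55, 'o': 0.45}, {'r': 0.9}]
greedy = ''.join(max(p, key=p.get) for p in probs)
print(greedy, recognize(probs, ['door', 'down', 'boor']))  # doxr door
```

Unlike this exhaustive lexicon scan, the CRF of [2] also penalizes geometric inconsistencies between neighboring candidates, but the top-down correction it performs is the same in spirit.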
IV. Comparison of the two schemes
To compare the two schemes of Section II, Table 1 lists the recognition results of the three end-to-end systems mentioned above and of a traditional OCR system (the commercial software ABBYY, www.abbyy.com). The two datasets used are the Street View Text (SVT) database [1] and the ICDAR database (http://algoval.essex.ac.uk/icdar/RobustWord.html), examples of which are shown in Figure 4.
Figure 4 Examples from the SVT (left) and ICDAR (right) databases
The end-to-end systems clearly outperform traditional OCR on these benchmarks.
V. Some thoughts
Current end-to-end systems are mostly built for English recognition, mainly because the English character set is relatively small (62 classes: 26 uppercase letters, 26 lowercase letters, and 10 digits). How to extend these systems to the much larger set of Chinese characters remains an open question worth thinking about.
References
[1] K. Wang and S. Belongie. Word Spotting in the Wild. In Proc. ECCV, 2010.
[2] A. Mishra and K. Alahari. Top-Down and Bottom-Up Cues for Scene Text Recognition. In Proc. CVPR, 2012.
[3] L. Neumann and J. Matas. Real-Time Scene Text Localization and Recognition. In Proc. CVPR, 2012.
[4] K. Wang, B. Babenko, and S. Belongie. End-to-End Scene Text Recognition. In Proc. ICCV, 2011.