YunOS Scene Text Recognition


Abstract: This article first introduces common approaches to text recognition, then presents the progress, results, and technical scheme of YunOS for scene text recognition. It focuses on the two main parts of the technical solution: 1) a text line detection method based on fully convolutional networks that works from local blocks to whole lines, and 2) a text line recognition scheme based on BLSTM-CTC-Seq2Seq.

1 Overview

With the development of deep learning, especially convolutional neural networks (CNN) and recurrent neural networks (RNN), optical character recognition (OCR) and scene text recognition (STR) have advanced rapidly in recent years. Text recognition consists of two main steps: text detection and text recognition. For text detection, the main approaches are based on stroke features (stroke width transform, SWT), on stable regions (maximally stable extremal regions, MSER), and on fully convolutional networks (FCN). Text recognition methods mainly fall into character/word classification methods and sequence-based recognition methods.

Unlike traditional OCR services, YunOS is more concerned with the content of user photos, that is, text recognition in natural scenes. Our technical solution therefore pays more attention to the diversity of training samples and the performance of the underlying classification models, and relies less on prior knowledge about document structure.

The following sections briefly introduce existing methods, and then describe our approach and how it is applied.

2 Existing Methods

2.1 Text Detection

As mentioned above, text detection methods fall into three main categories: stroke-feature-based, stable-region-based, and fully-convolutional-network-based. Representative methods are described below.

1) Stroke Width Transform (SWT) [1]

SWT assumes that the stroke width within a character is roughly constant. Starting from a Canny edge point and walking along the gradient direction, the stroke width is considered valid if an edge point with a roughly opposite gradient can be found along the ray. After all edge points have been processed, text confidence is obtained by filtering on stroke width. The process is shown in Figures 1 and 2 below, and a simplified sketch follows the figures. The method is simple and effective, but it produces many false alarms in natural scenes, typically in text-like regions such as stripes, windows, bricks, and grids.


Figure 1-Local gradient direction of text
Figure 2-The SWT algorithm pipeline
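
The ray-casting idea described above can be sketched in a few lines of Python with OpenCV and NumPy. This is only an illustration, not the implementation from [1]; the Canny thresholds, the maximum ray length, and the opposite-gradient tolerance are illustrative assumptions.

import cv2
import numpy as np

def stroke_width_transform(gray, max_steps=50):
    edges = cv2.Canny(gray, 50, 150)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    theta = np.arctan2(gy, gx)                       # gradient direction at each pixel
    swt = np.full(gray.shape, np.inf, dtype=np.float32)

    ys, xs = np.nonzero(edges)
    for y, x in zip(ys, xs):
        dx, dy = np.cos(theta[y, x]), np.sin(theta[y, x])
        ray = [(x, y)]
        for step in range(1, max_steps):             # walk along the gradient direction
            cx, cy = int(x + dx * step), int(y + dy * step)
            if not (0 <= cx < gray.shape[1] and 0 <= cy < gray.shape[0]):
                break
            ray.append((cx, cy))
            if edges[cy, cx]:
                # valid stroke only if the opposite edge has a roughly opposite gradient
                if np.cos(theta[y, x] - theta[cy, cx]) < -0.9:
                    width = np.hypot(cx - x, cy - y)
                    for px, py in ray:                # record the stroke width along the ray
                        swt[py, px] = min(swt[py, px], width)
                break
    return swt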

2) Maximally Stable Extremal Regions (MSER) [2-3]

MSER first extracts the maximally stable extremal regions of the image as candidates, then filters out invalid candidate regions with a classifier, and finally merges the remaining candidates into text lines through a series of post-processing and connection rules (a rough sketch is given after the figure). The algorithm uses little prior knowledge and is relatively robust across languages and scripts. However, like SWT, it is prone to false alarms against complex backgrounds, which affects the subsequent steps.

Figure 3-Text detection based on maximally stable extremal regions (MSER)
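
A rough sketch of the candidate-extraction stage, using OpenCV's MSER detector. The simple geometric filters below stand in for the trained classifier mentioned above, and all thresholds are illustrative assumptions.

import cv2

def mser_candidates(gray):
    mser = cv2.MSER_create()
    regions, boxes = mser.detectRegions(gray)
    kept = []
    for (x, y, w, h) in boxes:
        aspect = w / float(h)
        if h < 8 or h > 300:            # implausible character heights
            continue
        if aspect < 0.1 or aspect > 10:  # implausible character shapes
            continue
        kept.append((x, y, w, h))
    # a real system would score each region with a text/non-text classifier here,
    # then merge the surviving regions into text lines with connection rules
    return kept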

3) Text-classification CNN with connected-component analysis [4-5]

In this scheme, the image is fed into a text-classification CNN to obtain a text confidence map, and connected-component analysis is then used to derive the text lines (a sketch of this step follows the figure). The CNN is trained on positive and negative samples, which requires patch cropping, negative-sample mining, and other steps; this is inefficient, and performance is sensitive to the diversity of the positive and negative samples.


Figure 4-Text-classification CNN with connected-component analysis
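
The connected-component step can be sketched as follows, assuming the CNN has already produced a per-pixel text confidence map; the threshold and minimum area are illustrative values.

import cv2
import numpy as np

def lines_from_confidence(conf_map, thresh=0.5, min_area=50):
    mask = (conf_map > thresh).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    boxes = []
    for i in range(1, n):                             # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:
            boxes.append((x, y, w, h))                # candidate text-line boxes
    return boxes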

4) FCN-based methods [6-9]

This approach treats detection as a generalized segmentation problem, avoids the positive/negative sample preparation of 3), and predicts the text-region mask in an end-to-end way (as shown in Figures 5 and 6). On top of the mask, coordinate regression, word segmentation, and line analysis are performed to obtain the final text-line coordinates. The approach is robust to scale, orientation, and complex backgrounds, and has become the mainstream text detection method.


Figure 5-detection effect of approximate horizontal text

Figure 6-Detection of text at any angle
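
For illustration, here is a minimal fully convolutional network in PyTorch (the framework and layer sizes are assumptions, not the network from [6-9]). Because there are no fully connected layers, the same network accepts images of any size and predicts a per-pixel text score map end to end.

import torch
import torch.nn as nn

class TinyTextFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.score = nn.Conv2d(64, 1, 1)              # 1-channel text confidence
        self.up = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)

    def forward(self, x):
        return torch.sigmoid(self.up(self.score(self.features(x))))

mask = TinyTextFCN()(torch.randn(1, 3, 256, 512))     # (1, 1, 256, 512) score map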

5) Joint approach based on FCN and RNN

Building on FCN, [10] introduces an RNN into the text detection task to obtain a larger horizontal receptive field. The results show that this works very well for horizontal text, but it is difficult to generalize to arbitrary angles.


Figure 7-Joint method based on CNN and RNN

2.2 Text Recognition

Text recognition methods fall into three categories: character recognition based on shallow models, character/word recognition based on deep networks, and sequence-based recognition. Representative methods are described below.

1) Character recognition based on shallow models

After a text line has been segmented into characters, recognition can be treated as a character classification problem. Common practice is to extract descriptive features of each character and then classify them with a classifier; commonly used features include local moments, HOG, and SIFT [11-12] (a feature-plus-classifier sketch follows the figure). In [13], Shi et al. propose a character representation based on the Deformable Part Model (DPM), which adapts to font changes and is robust to noise, blur, and other degradations.


Figure 8-DPM Sample for numbers and English letters
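
As a concrete example of the feature-plus-classifier recipe, the sketch below trains a linear SVM on HOG features using scikit-image and scikit-learn. The library choice and parameters are assumptions; the cited works also use local moments, SIFT, and DPM-based representations.

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def train_char_classifier(char_images, labels):
    # char_images: list of equally sized grayscale character crops, e.g. 32x32
    feats = [hog(img, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2)) for img in char_images]
    clf = LinearSVC()
    clf.fit(np.stack(feats), labels)
    return clf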

2) Character/word recognition based on deep networks

Replacing hand-crafted features with deep networks greatly improves recognition rates. The first approach is similar to the traditional one and uses a CNN to classify characters (see Figure 9); the other classifies whole words (Figure 10). Training a character classification network is relatively easy, but accuracy is limited because semantic information is not considered at all.

The expressive power of CNNs makes word-level classification feasible even though the number of classes is large (e.g., about 90,000 English words). [14] applied this brute-force approach to English recognition and achieved the best results at the time. The method has two main shortcomings: a) it relies on a pre-defined dictionary and cannot recognize out-of-vocabulary words; b) for very long words the input image is often heavily distorted, which hurts the recognition rate.



Figure 9-Mixed Chinese, English, and digit character recognition network

Figure 10-English word recognition network
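
A minimal character-classification CNN corresponding to the first approach above, written in PyTorch as an assumption. The input size (32x32 grayscale) and layer widths are illustrative; a word-level classifier would simply replace the final layer with a softmax over a fixed dictionary (e.g., ~90,000 English words).

import torch.nn as nn

def char_cnn(num_classes=6803):                       # e.g. Chinese + letters + digits
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),                                  # assumes 32x32 input -> 64*8*8 features
        nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
        nn.Linear(256, num_classes),                   # one logit per character class
    )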

3) Sequence-based text recognition methods

In addition, the performance of character- and word-based recognition depends heavily on the precision of text segmentation. Sequence-based recognition methods address this problem. Much like speech recognition, they treat the text line as a whole, perform no segmentation, and recognize the character sequence directly, either in batch or incrementally. This allows the contextual associations within the text sequence to be exploited and eliminates the irreversible errors caused by incorrect character segmentation. Under this framework the training set is also easier to prepare: only the text content of each line needs to be labeled, not the position of every character.

There are two main structures for text line recognition: CNN+LSTM+CTC [15-16] (see Figure 11) and CNN+LSTM+Seq2Seq [17] (see Figure 12); a sketch of the first structure is given after the figures. The two share the same backbone: a CNN first learns the relationships between neighboring pixels, and a bidirectional long short-term memory network (BLSTM) then learns long-span context (a receptive field covering the whole line). Finally, CTC (connectionist temporal classification) or Seq2Seq is used as the objective function to optimize the parameters of the whole network.


Figure 11-English word recognition based on CNN+LSTM+CTC

Figure 12-English word recognition based on CNN+LSTM+Seq2Seq
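
The CNN+BLSTM+CTC structure can be sketched as follows in PyTorch (the framework, layer sizes, and alphabet size are assumptions, not the networks from [15-16]): the CNN collapses the image height into a sequence of frames along the width, the BLSTM models long-span context, and nn.CTCLoss aligns the frame-wise predictions with the unsegmented transcription.

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.cnn = nn.Sequential(                      # (B, 1, 32, W) -> (B, 128, 1, W/4)
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((1, None)),
        )
        self.blstm = nn.LSTM(128, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_classes + 1)      # +1 for the CTC blank label

    def forward(self, x):
        f = self.cnn(x).squeeze(2).permute(0, 2, 1)    # (B, T, 128), T = frames along width
        h, _ = self.blstm(f)
        return self.fc(h).log_softmax(-1)              # per-frame class log-probabilities

model = CRNN(num_classes=100)                          # illustrative alphabet size
logits = model(torch.randn(2, 1, 32, 128))             # (B, T, C+1)
# CTC expects (T, B, C); targets and lengths come from the line-level transcriptions:
# loss = nn.CTCLoss(blank=100)(logits.permute(1, 0, 2), targets, input_lens, target_lens)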

3 Technical Scheme

Our solution follows the current mainstream approach and is divided into three steps: training data synthesis, text line detection, and text line recognition; the flow is shown in Figure 13. The details and features of the detection and recognition steps are described below, with the synthesis of training data covered along the way.


Figure 13-Technical solution flow

3.1 Detection Step

3.1.1 Training Data Generation

The performance of deep networks depends heavily on large-scale, diverse data, while manual labeling of text data is relatively expensive (about 0.1 yuan per text box) and slow (about 125 boxes per hour). To save cost and improve efficiency, we adopted a strategy of relying mainly on synthetic data, supplemented by manually labeled data.

The synthesis process is essentially the same as in [18] and consists of the following steps (a minimal rendering sketch follows the list):
1) Prepare a large number of text-free background images;
2) Prepare a rich variety of fonts;
3) Prepare a rich corpus and organize it into lines;
4) Perform region segmentation and depth estimation on the background images;
5) Randomly select the background, font, region, and other parameters, and render randomly selected text lines onto the background image;
6) Record the coordinates of each text box during rendering.
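
A minimal sketch of the rendering step using Pillow; the library choice, font handling, and random parameter ranges are assumptions, and the real pipeline additionally uses the segmentation and depth estimates to choose plausible placement regions.

import random
from PIL import Image, ImageDraw, ImageFont

def render_line(background_path, font_path, text):
    img = Image.open(background_path).convert("RGB")
    font = ImageFont.truetype(font_path, size=random.randint(16, 48))
    draw = ImageDraw.Draw(img)
    w, h = draw.textbbox((0, 0), text, font=font)[2:]  # rendered text size (Pillow >= 8)
    x = random.randint(0, max(0, img.width - w))
    y = random.randint(0, max(0, img.height - h))
    draw.text((x, y), text, font=font,
              fill=tuple(random.randint(0, 255) for _ in range(3)))
    return img, (x, y, x + w, y + h)                   # image plus ground-truth box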

The whole process is automatic, and within a few days we synthesized more than 500,000 training samples. As shown in Figure 14, the synthetic samples are quite close to real scenes, which lays a good foundation for network training.


Figure 14-Synthetic text detection training data

3.1.2 Network Training

As mentioned earlier, FCN is currently the mainstream text detection method. Faster R-CNN, another very popular object detection method, performs unsatisfactorily on text detection, mainly because text usually forms a long, thin strip and is not well suited to a network whose receptive field has an aspect ratio close to 1:1.

If the receptive field of the network is too small, long text is hard to detect (green box in Figure 15); if the receptive field is enlarged, much of it is wasted along the short side and multi-line text becomes harder to separate (red box in Figure 15). Because the orientation of the text cannot be predicted in advance, constructing an elongated rectangular receptive field is not feasible. For English words the aspect ratio is generally within an acceptable range (e.g., 1:5 to 5:1), so this contradiction is not prominent; for Chinese or mixed Chinese/English lines, however, effectively detecting strip-shaped objects is the key to improving performance.


Figure 15-The contradiction between receptive field and resolution

To solve this problem, in October 2016 we proposed a scheme that combines FCN-based local block detection with line generation under geometric constraints to detect text lines of arbitrary length (as shown in Figure 16). The main idea is to split a text line into local, roughly square blocks, use the FCN to detect these square text blocks, which it handles well, and at the same time predict the relationship between each text block and its neighbors.


Figure 16-Text line detection process

The front of the FCN consists of ordinary convolution and pooling layers; the back splits into three branches that output, respectively, the text confidence, the local text box coordinates, and the local angle. The confidence branch uses binary cross entropy as its objective function, the coordinate branch uses L2 or L1 distance, and the angle branch uses correlation (or cosine similarity). The overall objective function is a weighted sum of the three.
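
The three-branch objective can be written roughly as follows in PyTorch (an assumption; the article does not name a framework). The branch weights and the smooth-L1 choice are illustrative; the article only states that the total loss is a weighted sum of the confidence, coordinate, and angle terms.

import torch
import torch.nn.functional as F

def detection_loss(conf_pred, conf_gt, box_pred, box_gt, ang_pred, ang_gt,
                   w_conf=1.0, w_box=1.0, w_ang=10.0):
    # conf_*: (B, 1, H, W); box_*: (B, 4, H, W); ang_*: (B, 1, H, W)
    l_conf = F.binary_cross_entropy_with_logits(conf_pred, conf_gt)
    pos = conf_gt > 0.5                                # regress only on text pixels
    l_box = F.smooth_l1_loss(box_pred[pos.expand_as(box_pred)],
                             box_gt[pos.expand_as(box_gt)])
    l_ang = 1.0 - torch.cos(ang_pred[pos] - ang_gt[pos]).mean()  # cosine-similarity term
    return w_conf * l_conf + w_box * l_box + w_ang * l_ang       # assumes some text pixels exist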

In the test phase, a small number of candidate blocks (a few hundred) are first selected by confidence; a graph is then built from the geometric relationships between the candidate blocks, which can include distance, relative size, angle consistency, and so on; finally the graph is partitioned, for which the simplest connected-subgraph method suffices, although more complex methods such as graph cut can also be used.
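
The test-phase grouping can be sketched as a union-find over pairwise geometric compatibility; all thresholds below are illustrative assumptions, and a real system may replace the connected-subgraph step with graph cut.

def group_blocks(boxes, scores, angles, conf_thresh=0.7):
    # boxes: list of (cx, cy, w, h); angles in radians; scores in [0, 1]
    keep = [i for i, s in enumerate(scores) if s > conf_thresh]
    parent = {i: i for i in keep}

    def find(i):                                       # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def compatible(a, b):                              # simple geometric constraints
        (ax, ay, aw, ah), (bx, by, bw, bh) = boxes[a], boxes[b]
        close = abs(ax - bx) < 2.0 * max(aw, bw) and abs(ay - by) < 0.5 * max(ah, bh)
        similar = 0.5 < ah / bh < 2.0                  # comparable heights
        aligned = abs(angles[a] - angles[b]) < 0.2     # consistent orientation
        return close and similar and aligned

    for i in keep:                                     # link compatible neighbours
        for j in keep:
            if i < j and compatible(i, j):
                parent[find(i)] = find(j)

    lines = {}
    for i in keep:                                     # connected components = text lines
        lines.setdefault(find(i), []).append(i)
    return list(lines.values())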

Figure 17 and Figure 18 show the intermediate and final detection results of this scheme in detail.


Figure 17-Intermediate and final results of detection


Figure 18-Intermediate and final results of detection

The disadvantage of this scheme is that it splits detection into two separate steps, local detection and merging, and the gap between them leads to accumulated errors. Although the direction information output by the network provides an important clue for the merging step, we believe that connecting the two steps organically would bring a large performance improvement. In addition, as can be seen in the images above, the text confidence output by the FCN shows a checkerboard effect; this is generally caused by the deconvolution (also called transposed convolution) operation [19], can be avoided with some smoothing techniques, and fixing it may help performance.
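
One common smoothing technique for the checkerboard effect, discussed in [19], is to replace transposed convolution with plain upsampling followed by convolution. The sketch below contrasts the two in PyTorch with arbitrary channel sizes; it is not the exact network used here.

import torch.nn as nn

# transposed convolution: prone to checkerboard patterns when kernel size and stride interact badly
deconv_upsample = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)

# upsample-then-convolve alternative: same output resolution, smoother score maps
smooth_upsample = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(64, 32, kernel_size=3, padding=1),
)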

It is worth noting that academia has recently begun to use similar methods [8, 20] to detect scene text, with good results. This kind of local-to-global detection has also been applied successfully to human pose estimation [21].

3.2 Recognition Step

3.2.1 Training Data Generation

Similar to text detection, the recognition step follows the same strategy of relying mainly on synthetic data, supplemented by manually labeled data. For recognition, balance among the sample classes is very important, but the character distribution in real corpora and images is very uneven, typically a long-tail distribution (as shown in Figure 19). A model trained only on natural data is usually accurate on common characters but poor on long-tail ones. To improve accuracy on long-tail characters, synthesis is almost the only feasible solution.


Figure 19-A typical long tail distribution of Chinese, English, and other characters

We synthesize text lines with steps similar to those in 3.1.1, covering 6,803 character classes including Chinese characters, letters, punctuation, and digits, and adding blur, tilt, perspective, stretching, shadows, outlines, frames, random noise, and other variations; in total about 5 million text line images were synthesized. During synthesis, sampling the corpus appropriately yields a training set with a nearly uniform character distribution (a simple sampling sketch follows). Examples are shown in Figure 20.
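
One simple way to flatten the character distribution is to weight corpus lines by the inverse frequency of the characters they contain. The sketch below is an assumption about how such sampling could look, not necessarily the exact scheme used.

import random
from collections import Counter

def balanced_sampler(corpus_lines):
    freq = Counter(ch for line in corpus_lines for ch in line)
    # lines rich in rare characters receive larger sampling weights
    weights = [sum(1.0 / freq[ch] for ch in line) / max(len(line), 1)
               for line in corpus_lines]
    while True:
        yield random.choices(corpus_lines, weights=weights, k=1)[0]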


Figure 20-Synthetic text line recognition training data

3.2.2 Network Training

As mentioned earlier, sequence-based text line recognition is the current mainstream and performs best in natural scenes. The objective function is generally either CTC or Seq2Seq: CTC converges quickly, while Seq2Seq converges more slowly but is usually more accurate. Following the recent MERL approach in speech recognition [22], we combine CTC with Seq2Seq and introduce an attention mechanism.

The overall network structure is shown in Figure 21. The encoder, consisting of a CNN and a BLSTM, converts the image into an abstract feature representation; the decoder decodes characters from these features. The decoder has two branches: the CTC branch aligns the features with the label to compute its loss, while the Seq2Seq branch uses the attention mechanism to focus on parts of the features and decode characters step by step. Experiments show that the joint training scheme is more accurate, with convergence speed comparable to CTC alone.
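
The joint objective can be sketched as a weighted combination of the two decoder losses, following the multi-task idea of [22]. The tensor shapes and the weight lambda below are illustrative assumptions.

import torch.nn.functional as F

def joint_loss(ctc_log_probs, ctc_targets, input_lens, target_lens,
               seq2seq_logits, seq2seq_targets, blank, lam=0.5):
    # ctc_log_probs: (T, B, C) from the CTC branch; seq2seq_logits: (B, T', C) from the decoder
    l_ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lens, target_lens, blank=blank)
    l_att = F.cross_entropy(seq2seq_logits.flatten(0, 1),       # (B*T', C) vs (B*T',)
                            seq2seq_targets.flatten())
    return lam * l_ctc + (1.0 - lam) * l_att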


Figure 21-Joint CTC+Seq2Seq (with attention)

3.3 ICDAR RRC

To verify the effectiveness of the scheme, in November 2016 we evaluated and submitted the English module on three scenarios of the ICDAR Robust Reading Competition 2015 [23], ranking first in one and second in the other two. The results are shown in Figure 22.


Figure 22-Results under the ICDAR RRC generic protocol (as of November 2016)
