A note before starting:
The papers I read are mostly computer vision and deep learning papers, and I am still basically at the introductory stage, so some of my understanding may be incorrect. My knowledge is, after all, quite limited; if there are mistakes or misunderstandings, corrections and criticism are very welcome! E-mail: [email protected]
Paper structure:
Abstract
1. Introduction
2. Related Work
3. CNN Text Recognition Models
3.1 Character Sequence Model Review
3.2 Bag-of-N-grams Model Review
4. Joint Model
5. Evaluation
5.1 Datasets
5.2 Implementation Details
5.3 Experiments
6. Conclusion
"Deep structured Output Learning for unconstrained Text recognition"
1. Content Overview
This paper presents a method for recognizing unconstrained text, i.e. arbitrary character strings, in natural scene images (PS: I was not sure how best to translate "text" here... word? vocabulary? It felt awkward, so I keep the original English word; the meaning is simply a string of characters to be recognized). "Unconstrained" means that there is no fixed dictionary (lexicon; PS: this term appears constantly in papers on natural scene text recognition, often as "lexicon-free"), and the length of the words is not known in advance.
The paper proposes a model that combines a convolutional neural network (CNN) with a conditional random field (CRF), taking the whole word image as input. The unary terms of the CRF are provided by a CNN that predicts the character at each position, and the higher-order terms are provided by another CNN that detects the presence of n-grams. The whole model (CRF, character predictor, n-gram predictor) can be optimized jointly by back-propagating a structured output loss; in essence the system is asked to perform multiple tasks at once, and training requires only synthetically generated data.
Compared with a model that only predicts characters (i.e. without the n-gram detector), the proposed model is more accurate on standard text recognition benchmarks. In addition, it achieves state-of-the-art accuracy in the lexicon-constrained setting (where a fixed dictionary is available), even though the method was designed for the lexicon-free, unconstrained case; the lexicon-constrained setting was also evaluated.
2. Methods
(1) CNN text recognition models (CNN Text Recognition Models)
A. Character sequence model (Character Sequence Model)
A word w of length N is modeled as a sequence of characters w = (c_1, c_2, ..., c_N), where c_i is the i-th character of the word, drawn from the set of 10 digits and 26 letters. Each c_i can be predicted by a classifier. Because the word length N is variable, it is bounded by the maximum word length in the training set (N_max = 23), and a null character class is introduced, so that every word can be represented as a string of exactly N_max symbols, padded with nulls.
For a given input image x, the model returns the predicted word w* that maximizes P(w|x) (PS: I understand this P as a probability or overall confidence; the prediction should match the input image as closely as possible). Assuming the characters are independent, we have:
Formula (1): P(w|x) = ∏_{i=1..N_max} P(c_i | Φ(x))
where P(c_i | Φ(x)) (PS: my understanding is that this represents the predictive confidence for the i-th character) is given by a classifier acting at position i on the shared CNN features Φ(x). The predicted word w* simply takes the most probable character at every position: c_i* = argmax_{c_i} P(c_i | Φ(x)). (PS: these two paragraphs express the input and output structure of the character sequence model in mathematical form.)
Figure 1 shows the character sequence model implemented with a CNN (referred to as the CHAR model). The word image is resized to a fixed size (ignoring the aspect ratio) and used as the model input. The shared CNN features are fed into N_max independent fully connected layers, each covering every character class plus the null class. These fully connected layers use softmax normalization, so their outputs can be interpreted as the probabilities P(c_i|x) given the input image x. The CNN is trained with a multinomial logistic regression loss, back-propagation, and stochastic gradient descent (SGD).
Figure 1. Character sequence model. A word image is recognized by predicting the character output at each position, spelling out the word one character at a time.
The classifier at each position is learned independently, but all positions share a set of jointly optimized features.
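PS: to make the CHAR head concrete for myself, here is a minimal PyTorch-style sketch (my own illustration, not the authors' code): N_max independent fully connected classifiers on top of a shared feature vector, each over the 36 character classes plus the null class. The feature dimension and all names are assumptions.

```python
import torch
import torch.nn as nn

N_MAX = 23            # maximum word length used in the paper
NUM_CLASSES = 37      # 10 digits + 26 letters + 1 null class

class CharHead(nn.Module):
    """N_max independent per-position classifiers over shared CNN features."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        # one fully connected layer per character position
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, NUM_CLASSES) for _ in range(N_MAX)]
        )

    def forward(self, shared_feats):
        # shared_feats: (batch, feat_dim) from the base CNN
        # returns per-position logits: (batch, N_MAX, NUM_CLASSES)
        return torch.stack([head(shared_feats) for head in self.heads], dim=1)

def char_loss(logits, target_chars):
    # multinomial logistic regression loss at every position;
    # target_chars: (batch, N_MAX) integer labels, null class used for padding
    return sum(nn.functional.cross_entropy(logits[:, i], target_chars[:, i])
               for i in range(N_MAX))

def decode(logits):
    # formula (1): take the most probable character at each position
    return logits.argmax(dim=-1)   # (batch, N_MAX) predicted character ids
```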
B. Bag-of-n-grams model (Bag-of-N-grams Model)
This section discusses the second recognition model used in the paper, which captures the compositionality of a word. A word can be viewed as an unordered collection of its n-grams, i.e. a bag-of-n-grams, with no explicit character positions.
Some basic definitions:
i. If s and w are two strings, s ⊆ w denotes that s is a substring of w.
ii. An n-gram of a word w is a substring of w of length n, i.e. s ⊆ w with |s| = n.
iii. G_N(w) denotes the set of all substrings of w of length at most N, i.e. the set of all n-grams of w with n ≤ N (for example, G_2(cat) = {c, a, t, ca, at}; this small example is my own illustration, not the paper's).
iv. G_N denotes the set of all such n-grams in the language.
Even with a small N, G_N(w) encodes each word almost uniquely. For example, with N = 4 there are only 7 collisions in a dictionary of 90k words. The encoding can be represented as a |G_N|-dimensional binary vector of n-gram occurrences. This vector is very sparse, since on average a word contains only a small number of n-grams relative to |G_N|.
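PS: a small Python sketch of my own showing how G_N(w) and its binary occurrence vector could be computed, given an assumed pre-selected dictionary of modeled n-grams:

```python
def ngrams_of(word, max_n=4):
    """Return G_N(word): all substrings of length 1..max_n (as a set)."""
    return {word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}

def encode_bag_of_ngrams(word, ngram_to_index, max_n=4):
    """Binary occurrence vector over the modeled n-gram vocabulary."""
    vec = [0] * len(ngram_to_index)
    for g in ngrams_of(word, max_n):
        idx = ngram_to_index.get(g)   # n-grams outside the modeled set are ignored
        if idx is not None:
            vec[idx] = 1
    return vec

# tiny example with a toy vocabulary (the real model uses ~10k n-grams)
vocab = {g: i for i, g in enumerate(["c", "a", "t", "ca", "at", "cat"])}
print(ngrams_of("cat", max_n=2))           # {'c', 'a', 't', 'ca', 'at'}
print(encode_bag_of_ngrams("cat", vocab))  # [1, 1, 1, 1, 1, 0]
```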
A CNN can be used to predict G_N(w) for the word w depicted in an input image x. The structure of this CNN is similar to the one above (see Figure 2), except that its last fully connected layer has one neuron per entry of the encoding vector. Applying a logistic function to each neuron turns the scores of the fully connected layer into the probability that the corresponding n-gram is present in the image. The CNN therefore learns the visual appearance of each n-gram, acting as an n-gram detector.
Figure 2. N-gram encoding model. The recognized text is represented as the collection of its n-grams (bag-of-n-grams).
The model can be viewed as 10k independently trained binary classifiers built on a set of shared, jointly learned features, each trained to detect the occurrence of one particular n-gram.
Through the logistic function, training reduces to |G_N| binary classification problems, each back-propagating a logistic regression loss for its own n-gram class. Because n-gram frequencies span a huge range (some n-grams occur very frequently, others rarely), the gradient of each n-gram class is scaled by the inverse of its frequency in the word corpus.
In this model, the |G_N| n-grams to be modeled are selected from the space of all possible n-grams, so the selection itself incorporates language statistics. This can be viewed as compressing the representation space with a language model, but it does not limit the predictive power needed for unconstrained recognition: the encoding of a real word almost always corresponds to a unique word in the natural language, while a non-language string contains only a very small number of the modeled n-grams and therefore often maps to a non-unique encoding.
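PS: a hedged PyTorch-style sketch of how I read this training setup: one linear layer producing |G_N| scores, a per-n-gram logistic (binary cross-entropy) loss, and gradients scaled by the inverse corpus frequency of each n-gram. All names and the exact weighting scheme are my assumptions.

```python
import torch
import torch.nn as nn

class NgramHead(nn.Module):
    """|G_N| independent binary detectors over shared CNN features."""
    def __init__(self, feat_dim=4096, num_ngrams=10000):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_ngrams)   # one raw score g_s(x) per modeled n-gram

    def forward(self, shared_feats):
        return self.fc(shared_feats)

def ngram_loss(scores, targets, ngram_counts):
    # targets: float binary occurrence vectors G_N(w); ngram_counts: corpus frequencies
    weights = 1.0 / ngram_counts.clamp(min=1.0)      # inverse-frequency scaling
    per_class = nn.functional.binary_cross_entropy_with_logits(
        scores, targets, reduction="none")           # logistic loss per n-gram class
    return (per_class * weights).mean()
```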
(2) Joint model (Joint Model)
Maximizing the posterior probability of the character sequence (formula (1)) is equivalent to maximizing the log-score, i.e. the sum over positions of the logarithm of the posterior probability of the i-th character. The graph associated with this function is just a set of nodes, one unary term per character position, with no edges; maximizing the function therefore means maximizing each unary term independently.
Here the model is extended by combining it with the n-gram detector, which encodes the n-grams present in the word image x. The n-gram scoring function S^e_s(x) assigns a score to every string s of length at most N, where N is the maximum order of the n-gram model. Note that, unlike the unary terms defined above, this score does not depend on position; it is, however, applied repeatedly to the word at every position i:
Formula (2): S(w, x) = Σ_{i=1..N_w} S^c_{c_i, i}(x) + Σ_{i=1..N_w} Σ_{n=1..N} S^e_{c_i...c_{i+n-1}}(x), i.e. the sum of the per-position character scores plus the scores of every n-gram (of order up to N) contained in w.
The unary terms S^c_{c_i, i}(x) are obtained from the CNN character predictor and, as shown in Figure 3, the edge terms S^e_s(x) are obtained from the CNN n-gram predictor. Note that the n-gram scoring function is only defined on the subset of n-grams modeled by the CNN; for any other n-gram the score is 0.
Figure 3. Construction of the path score S(camel, x) for the word camel. The unary and edge terms contributing to the score are selected by the path through the character-position graph shown in the upper right corner.
The values of these terms, S^c_{c_i, i}(x) and S^e_s(x), are given by the outputs of the character sequence CNN (CHAR CNN) and the n-gram encoding CNN (NGRAM CNN).
The graph associated with function (2) contains higher-order terms of order up to N. Therefore, for moderate N, beam search is used to maximize (2) and find the predicted word w*.
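PS: to make the path score and the beam search concrete for myself, here is a small Python sketch (my own simplification, not the authors' implementation). char_scores is assumed to be a list of per-position dictionaries mapping each character (including the null placeholder "_") to its unary score, and ngram_scores a dictionary mapping each modeled n-gram to its edge score; anything outside the modeled set scores 0.

```python
NULL = "_"  # placeholder for the null character class
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz" + NULL

def path_score(word, char_scores, ngram_scores, max_n=4):
    """Formula (2)-style score: per-position character scores plus the scores
    of every modeled n-gram contained in the word (unmodeled n-grams score 0)."""
    s = sum(char_scores[i][c] for i, c in enumerate(word))
    for n in range(2, max_n + 1):   # unigrams folded into the unary terms in this sketch
        for i in range(len(word) - n + 1):
            s += ngram_scores.get(word[i:i + n], 0.0)
    return s

def beam_search(char_scores, ngram_scores, beam_width=5, max_n=4):
    """Approximately maximize the path score, expanding one position at a time."""
    beams = [("", 0.0)]
    for _ in range(len(char_scores)):
        candidates = []
        for prefix, _score in beams:
            for c in ALPHABET:
                word = prefix + c
                candidates.append(
                    (word, path_score(word, char_scores, ngram_scores, max_n)))
        beams = sorted(candidates, key=lambda t: t[1], reverse=True)[:beam_width]
    best_word, best_score = beams[0]
    return best_word.rstrip(NULL), best_score
```

The paper reports a beam width of 5 during training and 10 at test time.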
Structured output loss (structured output loss). The unary and edge scoring functions should be derived from the outputs of the character sequence model and the n-gram encoding model, respectively. A simple way to do this is to weight the CNN outputs after removing the softmax normalization and the logistic loss:
Formula (3): S(w, x) = Σ_{i} α_{c_i, i} f_{c_i, i}(x) + Σ_{s ∈ G_N(w)} β_s g_s(x)
Here f_{c_i, i}(x) is the output of the character sequence CNN for character c_i at position i, and g_s(x) is the output of the n-gram encoding CNN for the n-gram s. If desired, the character weights α and the edge weights β can be constrained to be shared across different characters, across character positions, across n-grams of the same order, or across all n-grams.
The weights α, β of formula (3), or of any of its weight-sharing variants, can be learned in a structured output learning framework: the score of the ground-truth word should be greater than or equal to the score of the highest-scoring incorrect word plus a margin, i.e. S(w_gt, x) ≥ μ + S(w, x) for all w ≠ w_gt, where μ is the margin. Turning this hard constraint into a soft one gives a convex hinge loss:
Formula (4): L(x, w_gt) = max_{w ≠ w_gt} max(0, μ + S(w, x) − S(w_gt, x))
Averaging this loss over the M training samples gives the regularized empirical risk objective:
Formula (5): E = (1/M) Σ_{m=1..M} max_{w ≠ w_m} max(0, μ + S(w, x_m) − S(w_m, x_m))
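PS: a minimal sketch of formulas (4)-(5), assuming hypothetical helpers score_fn (which evaluates S(w, x)) and best_incorrect_fn (which returns the highest-scoring word other than the ground truth, e.g. via the beam search sketched above):

```python
def structured_hinge_loss(score_gt, score_best_incorrect, margin=1.0):
    """Formula (4): max(0, margin + S(w*, x) - S(w_gt, x))."""
    return max(0.0, margin + score_best_incorrect - score_gt)

def empirical_risk(samples, score_fn, best_incorrect_fn, margin=1.0):
    """Formula (5): average hinge loss over M training samples."""
    losses = []
    for image, gt_word in samples:
        s_gt = score_fn(gt_word, image)
        _w_star, s_star = best_incorrect_fn(image, exclude=gt_word)
        losses.append(structured_hinge_loss(s_gt, s_star, margin))
    return sum(losses) / len(losses)
```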
However, instead of keeping explicit weights as in formula (3), in most cases the weights can be absorbed into the CNN functions f and g that produce the score:
Formula (6): S(w, x) = Σ_{i} f_{c_i, i}(x) + Σ_{s ∈ G_N(w)} g_s(x)
The functions f and g are defined by CNNs, so their parameters can be optimized to reduce the loss of formula (5). This can be done simply with standard back-propagation and SGD. The derivative of the loss L with respect to the score S is given by:
Formula (7): when the hinge is active (μ + S(w*, x_m) − S(w_gt, x_m) > 0, with w* the highest-scoring incorrect word), ∂L/∂S(w_gt, x_m) = −1 and ∂L/∂S(w*, x_m) = +1; otherwise the derivatives are 0.
The derivatives of the score with respect to the outputs of the character sequence model and the n-gram encoding model then follow from the scoring function of formula (6):
Formula (8): ∂S(w, x)/∂f_{c, i}(x) = 1 if w has character c at position i (and 0 otherwise); ∂S(w, x)/∂g_s(x) equals the number of occurrences of the n-gram s in w.
This allows errors to be propagated back through the entire network.
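PS: since the score of formula (6) is linear in the CNN outputs, formula (8) just counts occurrences. A small illustration of my own (char_classes and modeled_ngrams are assumed sets of the modeled characters and n-grams):

```python
from collections import Counter

def score_gradients(word, char_classes, modeled_ngrams, max_n=4):
    """dS/df and dS/dg for the linear score of formula (6)."""
    # dS/df_{c,i}: indicator that the word has character c at position i
    d_f = {(c, i): 1.0 for i, c in enumerate(word) if c in char_classes}
    # dS/dg_s: occurrence count of each modeled n-gram s in the word
    counts = Counter(word[i:i + n]
                     for n in range(1, max_n + 1)
                     for i in range(len(word) - n + 1))
    d_g = {s: float(counts[s]) for s in counts if s in modeled_ngrams}
    return d_f, d_g
```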
Using the structured output loss allows all of the model's parameters to be optimized jointly within the structure given by formula (6). Figure 4 shows the training architecture. Because of the higher-order terms in formula (6), searching the entire space of possible paths for w* is costly even with dynamic programming, so beam search is used to find the approximately highest-scoring path.
Figure 4. Training architecture of the joint model, combining the character sequence model (CHAR) and the n-gram encoding model (NGRAM) with a structured output loss.
The path-select layer computes the score of the ground-truth word by summing the corresponding inputs.
The beam search layer selects the highest-scoring path from its inputs via beam search.
The hinge loss implements a ranking loss that constrains the maximum-scoring path relative to the ground-truth path, and it can be back-propagated through the entire network so that all parameters are learned jointly.
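PS: my own rough outline of a single training step in the architecture of Figure 4, assuming PyTorch-style autograd and the hypothetical helpers path_score_fn and beam_search_fn from the sketches above (here they would have to operate on tensors so gradients can flow); this is only how I picture it, not the authors' code:

```python
import torch

def joint_training_step(image, gt_word, char_cnn, ngram_cnn, optimizer,
                        path_score_fn, beam_search_fn, margin=1.0):
    # forward pass: the two CNNs produce f (per-position char scores) and g (n-gram scores)
    f = char_cnn(image)       # unary terms
    g = ngram_cnn(image)      # edge terms
    # path-select layer: score of the ground-truth path (formula (6))
    s_gt = path_score_fn(gt_word, f, g)
    # beam search layer: best-scoring path (in practice the best *incorrect* path)
    w_star, s_star = beam_search_fn(f, g)
    # hinge (ranking) loss of formula (4)
    loss = torch.clamp(margin + s_star - s_gt, min=0.0)
    if w_star != gt_word and loss.item() > 0:
        optimizer.zero_grad()
        loss.backward()       # errors propagate into both CNNs through f and g
        optimizer.step()
    return loss.item()
```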
3. Experiments and Results
(1) Datasets: ICDAR 2003, ICDAR 2013, Street View Text (SVT), IIIT 5K-Word, Synth90k
(2) Implementation details
The character sequence model is denoted CHAR, the n-gram encoding model is denoted NGRAM, and the joint model is denoted JOINT.
CHAR and NGRAM share the same base CNN architecture: 5 convolutional layers and 2 fully connected layers. The word image is resized to a 32×100 grayscale image (ignoring the aspect ratio) before being fed into the model. Rectified linear units (ReLU) are used after every layer except the last. The convolutional layers have 64, 128, 256, 512, and 512 square filters, with kernel sizes 5, 5, 3, 3, and 3, respectively. The convolution stride is 1, and the input feature maps are padded so that the spatial dimensions are preserved. 2×2 max-pooling layers follow the 1st, 2nd, and 3rd convolutional layers. The fully connected layers have 4,096 units. On top of this base CNN, the CHAR model has 23 independent fully connected layers with 37 units each, so it can recognize words up to N_max = 23 characters long. The NGRAM model selects the 10k most frequent n-grams of length at most 4 (those occurring at least 10 times in the Synth90k word corpus, comprising 36 1-grams, 522 2-grams, 3,965 3-grams, and 5,477 4-grams), so the last fully connected layer of its base CNN has 10k units. During training the beam width is 5; at test time it is 10. If a lexicon is used to constrain the output, then instead of running beam search, formula (6) is evaluated on the paths corresponding to the lexicon words, and the highest-scoring word is taken as the final result.
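PS: a hedged PyTorch sketch of this base CNN as I understand it (padding, pooling placement, and the flattened feature size are my own guesses from the description above, not the released architecture):

```python
import torch.nn as nn

def make_base_cnn(out_units):
    """5 conv + 2 FC base CNN for 32x100 grayscale word images."""
    return nn.Sequential(
        nn.Conv2d(1,   64, kernel_size=5, stride=1, padding=2), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(64, 128, kernel_size=5, stride=1, padding=2), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.Flatten(),
        # after three 2x2 poolings a 32x100 input becomes roughly 4x12 spatially
        nn.Linear(512 * 4 * 12, 4096), nn.ReLU(),
        nn.Linear(4096, 4096), nn.ReLU(),
        # output layer: 10k units for NGRAM; CHAR instead uses 23 separate 37-unit heads
        nn.Linear(4096, out_units),
    )
```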
All three models are trained with SGD and dropout regularisation. The JOINT model is initialized with the weights of the trained CHAR and NGRAM networks, and the convolutional layer weights are kept unchanged ("frozen") during joint training (PS: the paper uses the word "frozen"; I am not sure whether my understanding here is correct).
(3) Experiments and results
A. Recognition accuracy of CHAR and JOINT
B. Comparison of the method in this paper with methods from related papers
"Paper notes" deep structured Output Learning for unconstrained Text recognition