Weilin Huang, "arXiv 2016": Accurate Text Localization in Natural Image with Cascaded Convolutional Text Network
Contents
- Author and related links
- Background introduction
- Method Summary
- Method details
- Experimental results
- Summary and takeaways
- Reference documents
Author and related links
- Personal homepages: Tong He, Weilin Huang, Yu Qiao, Jian Yao
- Brief author information:
- Paper download: paper link
Background introduction
- General flow of the bottom-up approach
- Step 1: Extract candidate regions using a sliding window or MSER/SWT
- Step 2: Classify the candidates with a character-level classifier (SVM, CNN, etc.)
- Step 3: Post-process, e.g. text line formation (word clustering, character grouping), word splitting, etc.
- Disadvantages of the bottom-up approach
- Step 1 generally relies on low-level (pixel-level) features, which are not robust: candidate regions cannot be extracted for targets with uneven illumination, large deformation, and similar conditions
- Step 1 often generates many candidate regions, which puts heavy pressure on the subsequent character-level classifier; the more candidate regions, the lower the overall efficiency
- Post-processing is often complex, requires many hand-crafted rules and parameters, and does not generalize; when the dataset changes substantially, the parameters are likely to need retuning
- A multi-step pipeline easily accumulates errors, and overall performance is limited by each step
- Improvements from traditional methods to CNN-based methods
- Disadvantages of character-level CNNs: unreliable, inefficient, complicated, not robust
- Improvement idea 1: move from character-level CNNs to string-level CNNs (text-line-level CNN, text-block-level CNN)
- Uses contextual information of the text region, so it is more robust;
- Needs no complex post-processing, so it is more reliable and general;
- Improvement idea 2: modify the CNN structure, from the classic conv + pool + FC to an FCN (fully convolutional network)
- Computation is shared, so it is more efficient
- With the FC layers removed, inputs of various scales can be handled
- The CNN no longer only classifies; it also regresses location
Method Summary
Figure 1. Two-step coarse-to-fine text localization results by the proposed Cascaded Convolutional Text Network (CCTN). A coarse text network detects text regions (which may include multiple or single text lines) from an image, while a fine text network further refines the detected regions by accurately localizing each individual text line. The orange bounding box indicates a detected region by the coarse text network. We have two options for each text region: (i) directly output the bounding box as a final detection (solid orange); (ii) refine the detected region by the fine text network (dashed orange), and generate an accurate location for each text line (red solid central line). The refined regions may include multiple text lines or an ambiguous text line (e.g., very small-scale text).
- The method has two major steps. First, a coarse CNN detects rough text regions (text blocks), shown as the dashed orange boxes in Figure 1. Then a fine CNN extracts the text lines inside each region, shown as the red lines in Figure 1. The solid orange boxes in the figure are text regions from the coarse CNN that can be output directly as text lines.
- Key points: modifying VGG16 into the coarse/fine CNN
- The convolution kernels change from 3*3 only to three shapes: 3*7, 3*3, and 7*3 (multi-shape). The three convolutions run in parallel, not one after another.
- Two 1*1 convolution layers replace the original fully connected layers. Because everything is convolutional and nothing is fully connected, the input image size can be arbitrary.
- Multi-layer fusion (multi-scale): Pool5 uses 2*2 pooling, so its output can be upsampled and fused with Pool4.
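To make the first two changes concrete, here is a minimal pure-Python sketch of a multi-shape block followed by a 1*1 convolution. It is only an illustration under assumptions, not the paper's implementation: the helper names (`conv2d_same`, `box_kernel`, `multi_shape_block`, `conv1x1`) are hypothetical, averaging kernels stand in for learned weights, and 3*7 is read as 3 rows by 7 columns.

```python
def conv2d_same(image, kernel):
    """Zero-padded 'same' 2-D cross-correlation on nested lists (odd kernel sizes)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    if 0 <= yy < h and 0 <= xx < w:  # outside pixels count as zero
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def box_kernel(kh, kw):
    """Averaging kernel of kh rows by kw columns (stand-in for learned weights)."""
    return [[1.0 / (kh * kw)] * kw for _ in range(kh)]


def multi_shape_block(image):
    """Apply 3x7, 3x3 and 7x3 kernels in parallel; return one feature map per shape."""
    shapes = [(3, 7), (3, 3), (7, 3)]
    return [conv2d_same(image, box_kernel(kh, kw)) for kh, kw in shapes]


def conv1x1(channels, weights):
    """1x1 convolution: per-pixel weighted sum over channels, for any H x W."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[sum(wt * ch[y][x] for wt, ch in zip(weights, channels))
             for x in range(w)] for y in range(h)]


image = [[1.0] * 9 for _ in range(5)]          # any height and width works
feature_maps = multi_shape_block(image)        # three maps, each 5 x 9
fused = conv1x1(feature_maps, [1 / 3, 1 / 3, 1 / 3])
```

Because every branch uses "same" padding, the three outputs keep the input's spatial size and can be stacked as channels, and the 1*1 convolution never depends on the image dimensions, which is why a fully convolutional network accepts inputs of arbitrary size.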
Method details
The method has two steps: the coarse CNN detects candidate text regions, and the fine CNN then finds the exact text line positions within each candidate region.
The coarse CNN and the fine CNN share the same network structure, with an input image size of 500*500. The differences are:
- For the coarse CNN, the final loss layer uses only text-region supervision, i.e. the text-region ground truth (GT), and the output is a heat map, as shown on the left. The fine CNN has two losses and two outputs: one is the same text-region supervision as the coarse CNN, the other is text-line supervision, as shown on the right. The text-line GT is 1 on the center line of the text line and decays upward and downward following a Gaussian distribution, with a radius of half the height of the bounding box. As a result, the text-line GT encodes both the position of the text line and the height of the text block.
Ground truth used by the coarse CNN (left) and the fine CNN (right)
Coarse CNN output (b) and fine CNN outputs (e and f)
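The text-line ground truth described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the function name `text_line_gt` is hypothetical, the box is taken to be axis-aligned, and the Gaussian width (`sigma_scale`) is an assumed value that the notes do not specify.

```python
import math


def text_line_gt(height, width, box, sigma_scale=0.5):
    """Build a text-line ground-truth map for one axis-aligned box.

    box = (top, left, bottom, right), inclusive pixel coordinates.
    The map is 1 on the horizontal center line and decays vertically as a
    Gaussian; it stays 0 outside the box, i.e. beyond a radius of half the
    box height. sigma = sigma_scale * radius is an assumption.
    """
    top, left, bottom, right = box
    center = (top + bottom) / 2.0
    radius = (bottom - top) / 2.0
    sigma = max(sigma_scale * radius, 1e-6)  # avoid division by zero
    gt = [[0.0] * width for _ in range(height)]
    for y in range(top, bottom + 1):
        d = abs(y - center)
        value = math.exp(-(d * d) / (2.0 * sigma * sigma))
        for x in range(left, right + 1):
            gt[y][x] = value
    return gt


gt = text_line_gt(11, 20, (2, 3, 8, 15))
center_value = gt[5][10]   # on the center line
edge_value = gt[2][10]     # on the top edge of the box
```

The peak row gives the text line position, and the vertical extent of the nonzero band gives the box height, which matches the remark that this GT carries both kinds of information.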
- The coarse CNN input is the whole image resized directly to 500*500. The fine CNN input is a candidate region from the coarse CNN, padded by 50 at the boundary and then resized to 500*500.
Figure 3. (b) A resized input image and the actual receptive field of the new Pool-5, computed as the response area in the input image by propagating the error of a single neuron in the new Pool-5.
- For a text region from the coarse CNN, how is it decided whether to refine it (run the fine CNN) or output it directly as a single text line?
- Binarize the coarse CNN heatmap (threshold 0.3)
- Compute the area ratio and the side-length (aspect) ratio of the region; if the former is greater than 0.7 and the latter is greater than 5, the region is output directly as a single text line
- Otherwise the region is refined: it is cropped from the image at 1.2 times its size, zero-padded by 50 at the boundary, resized to 500*500, and fed to the fine CNN to obtain finer text line output
- For the two heatmaps produced by the fine CNN, how are they combined into the exact text line (bounding box) output?
- A minimum-area rectangle (MAR) is fitted to each heatmap (the text line height is multiplied by 2)
- The rectangles from the two heatmaps are combined (the authors do not say how) to give the accurate text line output
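The refine-or-output rule for coarse regions can be sketched as below. The thresholds 0.3, 0.7, 5 and the 1.2 enlargement come from the notes above; everything else is an assumption for illustration. The function names are hypothetical, the heatmap is assumed to hold a single region, the area ratio is taken as foreground pixels over the area of the axis-aligned bounding box, and the aspect ratio is taken as box width over box height.

```python
def region_stats(heatmap, threshold=0.3):
    """Binarize a heatmap; return (area_ratio, aspect_ratio, box) or None if empty.

    box = (top, left, bottom, right), inclusive, around the foreground pixels.
    """
    points = [(y, x) for y, row in enumerate(heatmap)
              for x, v in enumerate(row) if v > threshold]
    if not points:
        return None
    ys = [p[0] for p in points]
    xs = [p[1] for p in points]
    top, bottom, left, right = min(ys), max(ys), min(xs), max(xs)
    box_h, box_w = bottom - top + 1, right - left + 1
    area_ratio = len(points) / (box_h * box_w)   # assumed definition
    aspect_ratio = box_w / box_h                 # assumed: width over height
    return area_ratio, aspect_ratio, (top, left, bottom, right)


def needs_refinement(heatmap, threshold=0.3, min_area_ratio=0.7,
                     min_aspect_ratio=5.0):
    """Return False when the region looks like a single text line, else True."""
    stats = region_stats(heatmap, threshold)
    if stats is None:
        return False  # nothing detected, nothing to refine
    area_ratio, aspect_ratio, _ = stats
    single_line = area_ratio > min_area_ratio and aspect_ratio > min_aspect_ratio
    return not single_line


def enlarge_box(box, scale=1.2):
    """Enlarge a box about its center by scale (float coordinates, not clipped)."""
    top, left, bottom, right = box
    cy, cx = (top + bottom) / 2.0, (left + right) / 2.0
    half_h = (bottom - top) / 2.0 * scale
    half_w = (right - left) / 2.0 * scale
    return (cy - half_h, cx - half_w, cy + half_h, cx + half_w)


# A long, solid strip is output directly; a square blob goes to the fine CNN.
strip = [[0.9 if 1 <= y <= 2 else 0.0 for _ in range(12)] for y in range(4)]
blob = [[0.9] * 6 for _ in range(6)]
strip_refine = needs_refinement(strip)
blob_refine = needs_refinement(blob)
```

In a full pipeline the enlarged box would still need clipping to the image, 50-pixel zero padding, and a resize to 500*500 before it reaches the fine CNN; those steps are left out here to keep the sketch short.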
Experimental results
- Running time: 1.3 s
- Coarse CNN vs. fine CNN
- ICDAR 2011 and ICDAR 2013 test results
- Multi-lingual and multi-directional test results
Summary and takeaways
- This paper has two highlights. The first is the shift from a bottom-up pipeline to the now popular top-down approach: first detect candidate text block regions, then find finer text lines inside each rough text region. This improves robustness, reliability, and efficiency, and reduces complexity. The second is adapting a conventional CNN to detect text regions, with three changes: varied aspect ratios for the convolution kernels, fully convolutional layers in place of fully connected ones, and multi-layer fusion.