Paper Reading (Weilin Huang, arXiv 2016, "Accurate Text Localization in Natural Image with Cascaded Convolutional Text Network")


Weilin Huang, arXiv 2016: "Accurate Text Localization in Natural Image with Cascaded Convolutional Text Network"

Directory
    • Author and Related Links
    • Background Introduction
    • Method Summary
    • Method Details
    • Experimental Results
    • Summary and Takeaways
    • References
Author and Related Links
    • Authors' homepages: Tong He, Weilin Huang, Yu Qiao, Jian Yao
    • Brief author information:

    • Paper Download: Paper Portal
Background Introduction
  • General pipeline of bottom-up approaches
    • Step 1: extract candidate regions using a sliding window or MSER/SWT methods
    • Step 2: character-level classifier (SVM, CNN, etc.)
    • Step 3: post-processing, such as text-line formation (word clustering, character grouping), word partition, etc.
  • Disadvantages of bottom-up approaches
    • Step 1 generally relies on low-level (pixel-level) features and is not robust; for text with uneven illumination, large deformation, and similar difficulties, candidate regions may not be extracted at all
    • Step 1 usually produces a large number of candidate regions, which puts heavy pressure on the subsequent character-level classifier; the more candidate regions there are, the lower the overall efficiency
    • Post-processing is often complex and relies on many hand-crafted rules and parameters, so it does not generalize well; in particular, when the dataset changes substantially, the parameters are likely to need re-tuning
    • A multi-step pipeline easily accumulates errors, and the overall performance is limited by every step
  • Improvements from traditional methods to CNN-based methods
    • Disadvantages of character-level CNNs: unreliable, inefficient, complicated, not robust
    • Improvement 1: from character-level CNN to string-level CNN (text-line-level CNN, text-block-level CNN)
      • Exploits the contextual information of the text region, so it is more robust
      • Needs no complex post-processing, so it is more reliable and general
    • Improvement 2: modify the CNN structure from the classic conv+pool+FC to an FCN (fully convolutional network)
      • Computation is shared, so it is more efficient
      • Removing the FC layers allows inputs of arbitrary size
      • The CNN no longer only classifies; it also regresses locations
Method Summary
    • Basic process

Figure 1. Two-step coarse-to-fine text localization results by the proposed Cascaded Convolutional Text Network (CCTN). A coarse text network detects text regions (which may include multiple or single text lines) from an image, while a fine text network further refines the detected regions by accurately localizing each individual text line. The orange bounding box indicates a region detected by the coarse text network. There are two options for each text region: (i) directly output the bounding box as a final detection (solid orange); (ii) refine the detected region with the fine text network (dashed orange) and generate an accurate location for each text line (red solid central line). The refined regions may include multiple text lines or an ambiguous text line (e.g., very small-scale text).

      • The method consists of two main steps: first a coarse CNN detects rough text regions (text blocks), shown as the dashed boxes in Figure 1; then a fine CNN extracts the text lines inside each text region, shown as the red lines in Figure 1. The solid boxes in the figure indicate that some text regions obtained by the coarse CNN can be output directly as text lines.
    • Key point: modifying VGG16 into the coarse/fine text network (a sketch of this structure follows the list below)

      • The convolution kernels change from only 3×3 to three shapes: 3×7, 3×3, and 7×3 (multi-shape), and the multiple convolutions are applied in parallel, not in sequence
      • Two 1×1 convolution layers replace the original fully connected layers: the input image can be of arbitrary size because the network is fully convolutional, with no fully connected layers
      • Multi-layer fusion (multi-scale): pool5 uses 2×2 pooling, so its output can be upsampled and fused with pool4
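To make these three modifications concrete, here is a minimal PyTorch sketch under stated assumptions, not the authors' released network: the channel widths follow plain VGG16, the parallel 3×7 / 3×3 / 7×3 branches are summed (the paper only says they run in parallel), and the pool5 branch is upsampled and added to a pool4 score map before the final heat map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiShapeConv(nn.Module):
    """Parallel 3x7, 3x3 and 7x3 convolutions; summing the branches is an
    assumption for illustration, the paper only states they run in parallel."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv3x7 = nn.Conv2d(in_ch, out_ch, kernel_size=(3, 7), padding=(1, 3))
        self.conv3x3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv7x3 = nn.Conv2d(in_ch, out_ch, kernel_size=(7, 3), padding=(3, 1))

    def forward(self, x):
        return F.relu(self.conv3x7(x) + self.conv3x3(x) + self.conv7x3(x))

class CoarseTextNet(nn.Module):
    """Fully convolutional VGG16-style sketch: 1x1 convs replace the FC layers,
    and the pool5 branch is upsampled and fused with pool4 (multi-layer fusion)."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(MultiShapeConv(3, 64), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(MultiShapeConv(64, 128), nn.MaxPool2d(2))
        self.stage3 = nn.Sequential(MultiShapeConv(128, 256), nn.MaxPool2d(2))
        self.stage4 = nn.Sequential(MultiShapeConv(256, 512), nn.MaxPool2d(2))  # -> pool4
        self.stage5 = nn.Sequential(MultiShapeConv(512, 512), nn.MaxPool2d(2))  # -> pool5
        self.fc6 = nn.Conv2d(512, 1024, kernel_size=1)   # 1x1 conv instead of FC
        self.fc7 = nn.Conv2d(1024, 1024, kernel_size=1)  # 1x1 conv instead of FC
        self.score5 = nn.Conv2d(1024, 1, kernel_size=1)
        self.score4 = nn.Conv2d(512, 1, kernel_size=1)

    def forward(self, x):
        pool4 = self.stage4(self.stage3(self.stage2(self.stage1(x))))
        pool5 = self.stage5(pool4)
        s5 = self.score5(F.relu(self.fc7(F.relu(self.fc6(pool5)))))
        s5_up = F.interpolate(s5, size=pool4.shape[2:], mode='bilinear', align_corners=False)
        return torch.sigmoid(s5_up + self.score4(pool4))  # fused text heat map

# A 500x500 input yields a coarse text-region heat map.
net = CoarseTextNet()
print(net(torch.zeros(1, 3, 500, 500)).shape)  # torch.Size([1, 1, 31, 31])
```

Because there are no fully connected layers, the same sketch also accepts inputs of other sizes, which is exactly the point of the FCN modification.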
Method Details
    • The method is two-step: the coarse CNN detects candidate text regions, and the fine CNN then finds the exact text-line positions within those candidate regions.

    • The coarse CNN and the fine CNN use the same network structure with a 500×500 input image; the differences are:

      • For the coarse CNN, the final loss layer uses only the text-region supervision, i.e., the text-region ground truth, and the network outputs the heat map shown on the left. The fine CNN has two losses and two outputs: one uses the same text-region supervision as the coarse CNN, the other uses text-line supervision, as shown in the image on the right. The text-line ground truth is 1 along the central line of each text line and decays towards the top and bottom following a Gaussian distribution, with a radius of half the height of the bounding box. As a result, the text-line ground truth actually encodes both the position of the text line and the height of the text block (a sketch of this construction is given after this list).

Ground truth used by the coarse CNN (left) and the fine CNN (right)

Coarse CNN output (b) and fine CNN outputs (e and f)

      • The coarse CNN input is the whole image directly resized to 500×500, while the fine CNN input is a candidate region obtained by the coarse CNN; the candidate region is padded by 50 pixels at its boundary and the whole patch is then resized to 500×500.

Figure 3. (b) A resized input image and the actual receptive field of the new pool-5, which is computed as the response area in the input image by back-propagating the error of a single neuron in the new pool-5.

    • For a text region produced by the coarse CNN, how is it decided whether to refine it (run the fine CNN) or to output it directly as a single text line? (A sketch of this decision rule is given after this list.)
      • Binarize the coarse CNN heat map (threshold 0.3)
      • Compute the area ratio and the edge-length ratio of the binarized region; if the former is greater than 0.7 and the latter is greater than 5, the region is output directly as a single text line
      • Otherwise the region is refined: the region is cropped with a 1.2× enlargement, padded with 50 pixels of zeros at the boundary, and the whole patch is resized to 500×500 and fed to the fine CNN to obtain a more precise text-line output
    • Given the two heat maps produced by the fine CNN, how are they combined into the exact text-line (bounding-box) output?
      • A minimum area rectangle (MAR) is fitted to each heat map, and the text-line height is multiplied by 2
      • The rectangles from the two heat maps are combined (the authors do not say how) to obtain the accurate text-line output (a rough sketch of the MAR step is given below)
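To make the text-line supervision above concrete, here is a minimal NumPy sketch of how such a ground-truth map could be built: value 1 along the central line of each text line, decaying vertically with a Gaussian whose radius is half the box height. The exact Gaussian width is not stated in the note, so the sigma below is an assumption.

```python
import numpy as np

def text_line_gt(heat_shape, boxes):
    """Text-line ground truth: 1 on the central line of each box, Gaussian decay
    up and down, limited to a radius of half the box height (sigma is assumed)."""
    gt = np.zeros(heat_shape, dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cy = (y1 + y2) / 2.0               # central line of the text line
        radius = (y2 - y1) / 2.0           # half of the box height
        sigma = max(radius / 2.0, 1e-6)    # assumed: the radius spans ~2 sigma
        ys = np.arange(heat_shape[0])
        profile = np.exp(-((ys - cy) ** 2) / (2 * sigma ** 2))
        profile[np.abs(ys - cy) > radius] = 0.0   # cut the decay off at the radius
        gt[:, x1:x2] = np.maximum(gt[:, x1:x2], profile[:, None])
    return gt

# Example: one 20-pixel-high text line in a 100x100 map.
gt = text_line_gt((100, 100), [(10, 40, 90, 60)])
print(gt[50, 10:90].max())   # 1.0 on the central row
```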
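The refine-or-output decision can be summarized in a few lines. The sketch below assumes the coarse heat map and the image are at the same resolution, and it interprets the "edge-length ratio" as the width/height aspect ratio of the region's bounding box, which the note does not spell out; refine_or_output is a hypothetical helper, not code from the paper.

```python
import numpy as np
import cv2

def refine_or_output(heatmap, image, threshold=0.3):
    """Binarize the coarse heat map, then either output the region directly as a
    text line (area ratio > 0.7 and aspect ratio > 5) or build the padded,
    resized patch that would be fed to the fine CNN."""
    mask = heatmap > threshold
    ys, xs = np.nonzero(mask)
    x1, x2, y1, y2 = xs.min(), xs.max() + 1, ys.min(), ys.max() + 1
    box_w, box_h = x2 - x1, y2 - y1

    area_ratio = mask.sum() / float(box_w * box_h)   # text pixels / box area
    aspect_ratio = box_w / float(box_h)              # assumed "edge-length ratio"

    if area_ratio > 0.7 and aspect_ratio > 5:
        return ('text_line', (x1, y1, x2, y2))       # output directly

    # Otherwise refine: enlarge the crop by 1.2x, zero-pad 50 px, resize to 500x500.
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = box_w * 1.2 / 2.0, box_h * 1.2 / 2.0
    ax1, ax2 = max(int(cx - half_w), 0), min(int(cx + half_w), image.shape[1])
    ay1, ay2 = max(int(cy - half_h), 0), min(int(cy + half_h), image.shape[0])
    patch = image[ay1:ay2, ax1:ax2]
    patch = np.pad(patch, ((50, 50), (50, 50), (0, 0)), mode='constant')
    patch = cv2.resize(patch, (500, 500))
    return ('refine', patch)   # feed this patch to the fine CNN
```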
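For the last step, fitting a minimum area rectangle to a heat map can be done with OpenCV. This is a rough sketch under assumptions: the fine heat map is binarized with the same 0.3 threshold, and the factor-of-2 height scaling is applied to the short side of the rectangle returned by cv2.minAreaRect. How the rectangles from the two heat maps are merged is not described by the authors, so that part is left out.

```python
import numpy as np
import cv2

def min_area_rects(heatmap, threshold=0.3, height_scale=2.0):
    """Fit a minimum area rectangle (MAR) to each blob of the binarized heat map
    and scale its short side (taken as the text-line height) by height_scale."""
    mask = (heatmap > threshold).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    rects = []
    for cnt in contours:
        (cx, cy), (w, h), angle = cv2.minAreaRect(cnt)
        if w >= h:
            h *= height_scale          # short side = text-line height
        else:
            w *= height_scale
        rects.append(((cx, cy), (w, h), angle))
    return rects
```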

Experimental Results
    • Running time: 1.3s
    • Coarse CNN vs Fine CNN

    • ICDAR 2011 and ICDAR 2013 test results

    • Multi-lingual and multi-directional test results

    • Example results

Summary and Takeaways
    • This paper has two highlights. First, the problem-solving approach changes from the bottom-up pipeline to the now-popular top-down one: candidate text-block regions are detected first, and finer text lines are then located within those rough regions, which improves robustness, reliability, efficiency, and simplicity. Second, a conventional CNN is adapted to detect text regions, with three improvements: changing the aspect ratio of the convolution kernels, replacing the fully connected layers with fully convolutional 1×1 layers, and fusing multiple layers.
