Paper Reading (Weilin Huang, TIP 2016: "Text-Attentional Convolutional Neural Network for Scene Text Detection")



Directory
    • Author and related links
    • Method Summary
    • Innovation points and contributions
    • Method details
    • Experimental results
    • Question Discussion
    • Author and related links
    • Summary and Harvest Point
    • Author Supplemental Information
    • Reference documents

    • Author and related links
      • Paper download
      • Authors: Tong He, Weilin Huang, Yu Qiao, Jian Yao

    • Method Summary
      1. Extract candidate character regions with an improved version of MSER (CE-MSERs, contrast-enhanced MSERs);
      2. Filter out non-text regions with a new CNN (Text-CNN, a text-attentional CNN trained with three kinds of supervision: pixel-level region information, multi-class character labels, and binary text/non-text labels);
      3. Link the remaining characters into text lines and then split the lines into words (using the methods of Ref. 1 and Ref. 2; this is not the focus of the paper).
    • Innovation points and contributions
      • Motivation:
        • When a person judges whether a patch contains text, the process generally has three steps. First, separate the text region from the background (pixel-level segmentation); if the text and the background are almost completely stuck together, it is impossible to tell whether the patch is a character. Second, determine which character the region is (character recognition); if we recognize a character as 'a', we are very confident that it is a character and not a random symbol. Imagine a person who cannot read: their accuracy in judging whether something is a character will be lower than that of a person who can read, precisely because they lack the character class information. Third, make the decision: character or noise.
        • MSER has two disadvantages. First, it is vulnerable to background interference, which leads to broken characters and characters that stick to the background. Second, when the contrast between a character and the background is very low, the region is not "stable" enough to be detected by MSER, so the character is missed. To address both problems, the local contrast of text regions can be enhanced; extracting MSERs on the contrast-enhanced maps improves recall.

(Figure, not reproduced here: if you do not know these characters, it is hard to tell whether each one is really a 'character' or just a random stroke.)

      • Innovative points:
        • A contrast-enhanced version of MSER (CE-MSERs) is proposed to improve recall.
        • A multi-task Text-CNN model is proposed, together with a new training mechanism that integrates low-level pixel information (a segmentation problem), high-level multi-class character information (a 62-class character recognition problem), and character/non-character information (a 2-class classification problem) into one Text-CNN model. The result is a text detector with stronger discriminative power and robustness.
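To make the 62-class label space concrete, here is a minimal sketch. It assumes the 62 classes are the 10 digits plus the 26 uppercase and 26 lowercase English letters; the ordering and the helper name `label_index` are illustrative choices, not taken from the paper.

```python
import string

# Assumption: 62 classes = 10 digits + 26 uppercase + 26 lowercase letters.
# The ordering below is illustrative only.
CHAR_CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase

def label_index(ch):
    """Return the class index of a character under the assumed ordering."""
    return CHAR_CLASSES.index(ch)

print(len(CHAR_CLASSES))  # 62
print(label_index("a"))   # 36
```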
    • Method details
      • Text-cnn
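        • Network structure diagram (figure not reproduced here)

        • The 3 tasks (figure not reproduced here)

        • Loss functions for the 3 tasks (in the original figure, from top to bottom: binary text/non-text, character label, region); the equations are not reproduced here

        • Total loss function (equation not reproduced here)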
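As a rough illustration of how three such losses can be combined, the sketch below uses cross-entropy for the two classification tasks and a squared L2 distance for the region (regression) task, then adds them with weights λ1 and λ2. The exact loss forms and all function names here are assumptions made for illustration; only the use of λ1 (label task) and λ2 (region task) as weights comes from these notes.

```python
import numpy as np

def binary_loss(p_text, y_text):
    """Cross-entropy for the text/non-text (main) task; p_text is P(text)."""
    return -(y_text * np.log(p_text) + (1 - y_text) * np.log(1 - p_text))

def label_loss(p_classes, y_class):
    """Cross-entropy for the 62-way character label task."""
    return -np.log(p_classes[y_class])

def region_loss(pred_mask, true_mask):
    """Squared L2 distance between predicted and ground-truth region masks."""
    return np.sum((pred_mask - true_mask) ** 2)

def total_loss(l_binary, l_label, l_region, lam1, lam2):
    """Main-task loss plus the auxiliary losses weighted by lam1 and lam2."""
    return l_binary + lam1 * l_label + lam2 * l_region

# Toy example: one sample, uniform class prediction, all-zero predicted mask.
lb = binary_loss(0.9, 1)
ll = label_loss(np.full(62, 1.0 / 62), 5)
lr = region_loss(np.zeros((2, 2)), np.ones((2, 2)))
print(total_loss(lb, ll, lr, lam1=1.0, lam2=0.3))
```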

      • Network structure of the 3 tasks:
        • Pixel-level segmentation task: conv1 → conv2 → deconv1 → deconv2 → loss (5) (two convolutions, two deconvolutions)
        • Character label task: conv1 → conv2 → pool2 → conv3 → fc1 → fc2 → loss (4) (three convolutions, one pooling, two fully connected layers)
        • Text/non-text task: conv1 → conv2 → pool2 → conv3 → fc1 → fc2 → loss (3) (three convolutions, one pooling, two fully connected layers)
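The layer paths listed above make it easy to see which layers all three tasks have in common. The sketch below just encodes the paths as written in these notes and computes their common prefix; the names `TASK_PATHS` and `shared_prefix` are hypothetical, and whether the fully connected layers of the two classification tasks are actually shared is not stated here.

```python
# Layer paths copied from the notes above (loss layers omitted).
TASK_PATHS = {
    "region": ["conv1", "conv2", "deconv1", "deconv2"],
    "label": ["conv1", "conv2", "pool2", "conv3", "fc1", "fc2"],
    "text/non-text": ["conv1", "conv2", "pool2", "conv3", "fc1", "fc2"],
}

def shared_prefix(paths):
    """Return the leading layers that every path has in common."""
    prefix = []
    for layers in zip(*paths):
        if len(set(layers)) != 1:
            break
        prefix.append(layers[0])
    return prefix

print(shared_prefix(TASK_PATHS.values()))  # ['conv1', 'conv2']
```

The region branch splits off right after conv2, which is consistent with the pooling layer being placed only after that point.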
      • Reasons for the pooling layer design
        • Pooling is irreversible, so deconvolution cannot recover the information it discards. A pooling layer therefore cannot be placed before the deconvolution layers, and the only pooling layer comes after the second convolutional layer.
        • After the third convolutional layer the feature map is already very small, so there is no need for another pooling layer.
        • Experiments show that with this single pooling layer, performance does not drop and speed improves.
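The irreversibility of pooling can be checked with a small example: two different inputs produce the same max-pooled output, so no later layer can tell them apart. This is a generic NumPy sketch of non-overlapping 2x2 max pooling, not the paper's implementation.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2-D array with even side lengths."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Same maxima per 2x2 block, but at different positions inside each block.
a = np.array([[1, 0, 0, 2],
              [0, 0, 0, 0],
              [0, 0, 3, 0],
              [4, 0, 0, 0]])
b = np.array([[0, 0, 0, 0],
              [0, 1, 2, 0],
              [0, 4, 0, 0],
              [0, 0, 0, 3]])

print(max_pool_2x2(a))
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```

The position of each maximum inside its block is lost, which is exactly the spatial detail a pixel-level segmentation branch needs.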
      • Training process
        • Pre-train: the label task and the region task are trained jointly with a loss weight ratio of 10:3 (λ1 = 1, λ2 = 0.3) on the synthetic dataset CharSynthetic, for 30k iterations
        • Train: the label task and the main task are trained jointly with a loss weight ratio of 3:10 (λ1 = 0.3) on the real dataset CharTrain, for 70k iterations
        • The reason for this schedule: the three tasks rely on different features (the region task uses pixel-level, low-level features) and converge at different speeds. If the region task were trained for as many iterations as the main task, it would overfit. After the first phase, which trains two tasks, the model parameters already encode the pixel-level information. A figure in the original post (not reproduced here) shows how the losses of the three tasks change with the number of iterations during training.
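The two-phase schedule can be written down as a small lookup of loss weights per iteration. This is a sketch under assumptions: a task that is not trained in a phase is given weight 0, the iteration counter runs continuously across both phases, and the names `PRETRAIN_ITERS`, `TRAIN_ITERS`, and `loss_weights` are hypothetical.

```python
PRETRAIN_ITERS = 30_000  # pre-training on CharSynthetic
TRAIN_ITERS = 70_000     # training on CharTrain

def loss_weights(iteration):
    """Return (w_main, lam1_label, lam2_region) for a global iteration index."""
    if not 0 <= iteration < PRETRAIN_ITERS + TRAIN_ITERS:
        raise ValueError("iteration outside the schedule")
    if iteration < PRETRAIN_ITERS:
        # Phase 1: label and region tasks only, weights 1 : 0.3.
        return (0.0, 1.0, 0.3)
    # Phase 2: main and label tasks only, weights 1 : 0.3 (region dropped).
    return (1.0, 0.3, 0.0)

print(loss_weights(0))       # (0.0, 1.0, 0.3)
print(loss_weights(30_000))  # (1.0, 0.3, 0.0)
```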

      • CE-MSERs
        • Algorithm steps (main):
          • STEP1: Use the cluster-based method with contrast cues and spatial cues (Ref. 3) to generate the contrast region map Map1
          • STEP2: Use color space smoothing (Ref. 4) to generate the contrast region map Map2
          • STEP3: Run MSER on the original image, Map1, and Map2 respectively.
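The overall flow of the steps above (run the same detector on the original image and on each enhanced map, then pool the candidates) can be sketched as follows. Everything concrete here is a stand-in: `toy_detect` replaces a real MSER detector, `stretch` and `inverted_stretch` replace the two contrast maps of Ref. 3 and Ref. 4, and merging by exact box equality is an assumption made to keep the example short.

```python
import numpy as np

def ce_msers(image, detect, enhancers):
    """Run `detect` on the image and on each enhanced map; merge the results."""
    maps = [image] + [enhance(image) for enhance in enhancers]
    candidates, seen = [], set()
    for m in maps:
        for box in detect(m):
            if box not in seen:  # drop exact duplicates only
                seen.add(box)
                candidates.append(box)
    return candidates

def toy_detect(img, thresh=128):
    """Stand-in detector: one 1x1 box (x, y, w, h) per pixel above thresh."""
    ys, xs = np.nonzero(img > thresh)
    return [(int(x), int(y), 1, 1) for y, x in zip(ys, xs)]

def stretch(img):
    """Stand-in enhancer: linear stretch to 0..255 (assumes a non-flat image)."""
    lo, hi = img.min(), img.max()
    return (img - lo) * 255.0 / (hi - lo)

def inverted_stretch(img):
    """Stand-in enhancer: stretched and inverted, so dark pixels become bright."""
    return 255.0 - stretch(img)

image = np.array([[50, 100],
                  [50, 140]])
boxes = ce_msers(image, toy_detect, [stretch, inverted_stretch])
print(len(toy_detect(image)), len(boxes))  # 1 4
```

In this toy case the original image yields one candidate, while the enhanced maps add three more, which mirrors the recall argument made for CE-MSERs. The detector runs three times, so the cost is roughly three times that of a single pass.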

    • Experimental results
      • The experimental results show that the multi-task model (c) is better than both the conventional CNN (a) and the CNN that uses only one additional task, the character recognition task (b)

      • The experiments show that the Text-CNN of this paper learns the key features that distinguish characters from non-characters.

      • ICDAR2015

      • ICDAR2011 (CE-MSERs are better than MSERs, and Text-CNN trained with three tasks is better than single-task and dual-task training)

      • ICDAR2013

      • MSRA-TD500

    • Question Discussion
      • What are the pros and cons of using a pooling layer?
        • Pros: reduces the number of parameters and the model complexity
        • Cons: loses spatial information, and pooling is irreversible
      • Why is the region task a regression problem?
      • Why are the region task and the label task used during training but not at test time?
      • How is CE-MSERs implemented?
      • Why does the label task have 62 classes rather than 63 (with an extra noise class)?
      • For a negative sample, what does the mask in the ground truth of the region task look like? What is the class of a negative sample in the label task?
    • Author and related links
      • Author information
        • Tong He, Weilin Huang, Yu Qiao, Jian Yao

    • Summary and Harvest Point
      • CE-MSERs show that increasing contrast improves recall, but the implementation is not ideal. MSER itself is relatively time-consuming, and running it two more times on the contrast-enhanced maps is clearly expensive. A better approach might be to change MSER's internal algorithm, for example by modifying the definition of "stable" or by applying some contrast enhancement to each component as it is extracted.
      • The multi-task training approach of this paper is a useful reference: different tasks share some layers.
      • The idea of merging pixel-level information and character class information into detection is worth adopting.

    • Reference documents
    1. W. Huang, Y. Qiao, and X. Tang, "Robust scene text detection with convolution neural network induced MSER trees," in Proc. 13th Eur. Conf. Comput. Vis. (ECCV), pp. 497–511.
    2. C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, "Detecting texts of arbitrary orientations in natural images," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), June, pp. 1083–1090.
    3. H. Fu, X. Cao, and Z. Tu, "Cluster-based co-saliency detection," IEEE Trans. Image Process., pp. 3766–3778, Oct. 2013.
    4. M. M. Cheng, G. X. Zhang, N. J. Mitra, X. Huang, and S. M. Hu, "Global contrast based salient region detection," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), June, pp. 409–416.
