CTPN: Detecting Text in Natural Image with Connectionist Text Proposal Network

Source: Internet
Author: User

CTPN came up earlier, so here I study it the usual way, by working through the paper. The original English paper is available for everyone to read at https://arxiv.org/abs/1609.03605.

I used to assume that an abbreviation simply takes the first letter of each word in the title, so to avoid confusion: CTPN stands for Connectionist Text Proposal Network, and the paper's full title is "Detecting Text in Natural Image with Connectionist Text Proposal Network". In other words, it is a network that detects text through connected text proposals (a literal translation reads too awkwardly).

The authors state that their method accurately localizes text lines in natural images. The basic approach is to detect a text line as a sequence of fine-scale text proposals generated directly on the convolutional feature maps.

One highlight of the paper is the vertical anchor mechanism, which jointly predicts the location and the text/non-text score of each fixed-width proposal and considerably improves localization accuracy. That raises a small question: since the method relies on proposals, how do they become an ordered sequence?

The authors explain that the sequential proposals are connected by a recurrent neural network (RNN), which combines seamlessly with the CNN to form an end-to-end trainable model. They also point out that the RNN lets CTPN explore rich context information in the image, so it can detect even blurred or ambiguous text, which looks promising. As for further strengths, according to the authors CTPN works on multi-scale and multi-language text, and, conveniently, it needs no further post-processing.

The authors argue that the vertical anchor mechanism accurately predicts text locations at a fine scale. The in-network recurrent architecture then connects these fine-scale text proposals in sequence, which allows rich context information to be encoded.

In generic object detection, a bounding box is usually counted as correct if its overlap with the ground truth is greater than 0.5, because an object can be recognized from its main part. Text detection, by contrast, must cover the full region of a text line or word. The authors mention that text detection is therefore evaluated with a stricter standard, such as the Wolf standard. By extending the RPN architecture, they localize text lines accurately.

The authors summarize their work in four parts. First, they cast text detection as the problem of localizing a sequence of fine-scale text proposals. Personally I see this as breaking the problem down (turning a complex problem into simpler ones is an idea worth learning from). To make this work, they propose an anchor regression mechanism that jointly predicts the vertical location and the text/non-text score of each text proposal, which gives better localization. By contrast, plain RPN localizes text poorly.

Second, the authors propose an in-network recurrence mechanism that connects sequential text proposals directly on the CNN feature maps, which makes it possible to explore the context of a text line in a meaningful way. Third, they stress that both methods fit the sequential nature of text and combine into a single end-to-end trainable model. The fourth part covers the results the authors achieved; interested readers can find them in the paper, so I will not repeat them here.

Traditional text detection methods fall into two categories: connected-component (CC) based methods and sliding-window based methods. CC methods use a fast filter to separate text pixels from non-text pixels, then group the text pixels into strokes or character candidates using low-level properties (intensity, color, gradient). Sliding-window methods move a multi-scale window densely over the image and classify each window as character or non-character with a pre-trained classifier, using either manually designed features or features from late CNN layers.

The authors point out the drawbacks of both approaches. Errors accumulate through the subsequent component filtering and text line construction steps, which leads to larger errors overall, and it is very difficult to remove non-character components while still detecting text lines accurately. A further major problem with sliding windows is computational cost, because the classifier has to be run on a large number of windows.

The authors note that in object detection a common approach is to generate proposals from low-level features and then pass them to a convolutional network for further classification and refinement. Selective search (SS) is a widely used way to generate such proposals. Faster R-CNN instead uses an RPN, which generates proposals directly from the CNN feature maps. Because it shares convolutional computation, the RPN is relatively fast. However, RPN proposals are not discriminative and need further classification and refinement.

Now for the main point, the structure of CTPN. The network has three key components: detecting text in fine-scale proposals, recurrent connectionist text proposals, and side-refinement. Let's look at how the authors implement each of them in turn.

Detecting Text in Fine-scale Proposals

The authors explain that, like RPN, CTPN is a fully convolutional network that accepts input images of arbitrary size. As mentioned earlier, CTPN detects a text line by sliding a small window densely over the CNN feature maps and outputs a sequence of fine-scale text proposals (fixed width of 16 pixels, variable height). The authors build on VGG16. (Why VGG16? It was trained on a very large dataset, several orders of magnitude larger than the data we usually have at hand, so VGG16 is commonly used for transfer learning and fine-tuning.)

They slide a 3×3 spatial window over the feature maps of the last convolutional layer (conv5 of VGG16). The size of the conv5 feature maps is determined by the size of the input image, while the total stride and the receptive field are fixed by the network architecture at 16 and 228 pixels respectively. There is an implicit question here: sliding windows were said earlier to be computationally expensive, so why use one here? The reason is that a window sliding over a convolutional layer shares the convolutional computation, which reduces the cost.

Generic sliding-window methods detect objects of different sizes by using windows of different sizes. Ren et al. proposed the anchor regression mechanism, which lets RPN detect multi-scale objects with a single-scale window. The key idea is to use a set of flexible anchors to predict objects across a wide range of scales and aspect ratios.

Here the authors point out that text detection differs from generic object detection. Text has no well-defined closed boundary; it is a sequence, and it may contain multi-level components such as strokes, characters, words, text lines, and text regions that are not clearly distinguished from one another. Text detection is therefore defined at the word or text line level.

It is difficult for RPN to predict word-level boxes accurately, because the characters in a word are isolated from each other and the start and end of a word are hard to tell apart. The authors therefore treat a text line as a sequence of fine-scale text proposals, each of which generally covers a small part of the line. They also argue that predicting only the vertical location of each proposal is more accurate, because horizontal locations are hard to predict. Compared with RPN, which predicts four coordinates of an object, this reduces the search space.

The authors propose a vertical anchor mechanism that simultaneously predicts the text/non-text score and the y-axis location of each proposal. Detecting a fixed-width text proposal is easier than detecting an individual or isolated character, and detecting a text line as a sequence of fixed-width proposals also handles multiple scales and aspect ratios.

The text proposals are designed as follows. First, the detector densely examines every spatial location in conv5. Each text proposal has a fixed width of 16 pixels in the input image, which matches moving the detector densely over the conv5 feature maps, whose total stride is exactly 16 pixels. Next, k vertical anchors are designed for each proposal to predict its y-coordinates. The k anchors share the same horizontal location with a fixed width of 16 pixels, but vary vertically over k different heights; the authors use 10 anchors, whose heights range from 11 to 273 pixels. The vertical coordinates are computed from the height and the y-axis center of a proposal bounding box. The relative vertical coordinates of the predicted bounding box with respect to an anchor are given by the following formula:
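Written out in LaTeX with the symbols used in the parameter description, the relative vertical coordinates defined in the paper are as below. This is a transcription of the paper's definition under that notation, not something preserved in this post.

```latex
\begin{aligned}
v_c   &= (c_y   - c_y^a) / h^a, & v_h   &= \log(h   / h^a) \\
v_c^* &= (c_y^* - c_y^a) / h^a, & v_h^* &= \log(h^* / h^a)
\end{aligned}
```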

Parameter description: v = {v_c, v_h} and v* = {v_c*, v_h*} are the predicted and ground-truth relative coordinates, respectively. c_y^a and h^a are the y-axis center and the height of the anchor box, which can be computed in advance from the input image; c_y and h are the predicted y-axis center and height, and c_y* and h* are the ground-truth ones. Each predicted text proposal therefore has a bounding box of size h×16 in the input image. In general, a text proposal is much smaller than the 228×228 receptive field.
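As a minimal sketch of how the vertical anchors and their relative coordinates fit together, the Python below generates 10 anchor heights between 11 and 273 pixels and converts a box to and from anchor-relative coordinates. The geometric spacing of the heights and all function names are assumptions made for illustration; the post only gives the range and the number of anchors.

```python
import math


def anchor_heights(k=10, h_min=11.0, h_max=273.0):
    """Return k anchor heights from h_min to h_max.

    Assumption: heights are spaced geometrically (constant ratio).
    The post only states the count (10) and the range (11 to 273 pixels).
    """
    ratio = (h_max / h_min) ** (1.0 / (k - 1))
    return [h_min * ratio ** i for i in range(k)]


def encode_vertical(cy, h, cy_a, h_a):
    """Relative coordinates (v_c, v_h) of a box w.r.t. an anchor."""
    v_c = (cy - cy_a) / h_a
    v_h = math.log(h / h_a)
    return v_c, v_h


def decode_vertical(v_c, v_h, cy_a, h_a):
    """Invert encode_vertical: recover the box center and height."""
    cy = v_c * h_a + cy_a
    h = math.exp(v_h) * h_a
    return cy, h


if __name__ == "__main__":
    for height in anchor_heights():
        print(round(height, 1))
    v_c, v_h = encode_vertical(cy=50.0, h=40.0, cy_a=48.0, h_a=32.0)
    print(decode_vertical(v_c, v_h, cy_a=48.0, h_a=32.0))
```

Because the width is fixed at 16 pixels, only two numbers per anchor are regressed, which is the reduced search space discussed earlier.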

To summarize the detection process: given an input image, conv5 produces feature maps of size W×H×C. The detector slides a 3×3 window densely over conv5, and each window takes a 3×3×C convolutional feature to make its prediction. For each prediction, the horizontal location and the locations of the k anchors are fixed, and can be computed in advance by mapping the window position on the conv5 feature map back to the input image. At each window position, the detector outputs the text/non-text scores and the predicted y-coordinates for the k anchors. The detected text proposals are generated from the anchors whose text/non-text score is greater than 0.7, followed by non-maximum suppression (NMS). With the vertical anchors and the fine-scale strategy, the detector handles text lines over a wide range of scales and aspect ratios from a single-scale image, which further saves computation and time.
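The last selection step can be sketched as follows: keep anchors scoring above 0.7, then apply greedy NMS. This is an illustrative sketch, not the authors' code; the box format (x1, y1, x2, y2), the function names, and the NMS overlap threshold of 0.3 are assumptions, since the post gives only the 0.7 score threshold.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_proposals(boxes, scores, score_thr=0.7, nms_thr=0.3):
    """Return indices of proposals kept after score filtering and greedy NMS.

    score_thr comes from the text; nms_thr is a hypothetical value.
    """
    candidates = [i for i, s in enumerate(scores) if s > score_thr]
    # Visit candidates from the highest score to the lowest.
    order = sorted(candidates, key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[j]) <= nms_thr for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    boxes = [(0, 10, 16, 30), (0, 12, 16, 32), (16, 10, 32, 30), (32, 10, 48, 30)]
    scores = [0.95, 0.80, 0.90, 0.50]
    print(select_proposals(boxes, scores))
```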

That completes the first part. Next comes Recurrent Connectionist Text Proposals.

To improve localization accuracy, the authors split a text line into a sequence of fine-scale text proposals and predict each of them separately. Why should the proposals be treated as a sequence? Because judging each proposal in isolation is not robust: non-text patterns whose structure resembles text (the authors give bricks as an example) are easily predicted as text, which causes false detections. Using the sequence also helps avoid missing patterns that carry only weak text information.

[Figure: detection results with ordered (RNN-connected) proposals compared with independent proposals]

The goal of this structure is to encode context information directly in the convolutional layer, forming a tight in-network connection between the text proposals. An RNN encodes this information recurrently through its hidden layer. The authors therefore place an RNN layer on top of conv5, which takes the convolutional feature of each window as sequential input and recurrently updates its internal state in the hidden layer, H_t. The formula is as follows:
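In LaTeX, the recurrent update described here can be written as below, where \varphi stands for the non-linear recurrent function. The notation follows the paper and the parameter description; it is a transcription, not text preserved in this post.

```latex
H_t = \varphi(H_{t-1}, X_t), \qquad t = 1, 2, \ldots, W
```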

Parameter description: X_t is the input conv5 feature from the t-th sliding window (3×3×C). The window slides densely from left to right, producing sequential features t = 1, 2, ..., W for each row, where W is the width of conv5. H_t is the recurrent internal state, computed jointly from the current input X_t and the previous state H_{t-1}. The authors use a long short-term memory (LSTM) network as the RNN layer. In CTPN they further extend it to a bidirectional LSTM, which encodes context in both directions so that the connectionist receptive field covers the whole image width. Each LSTM has a 128-dimensional hidden layer, so the bidirectional RNN has a 256-dimensional hidden state. The internal state H_t is mapped to the following fully connected (FC) layer and then to the output layer, which computes the predictions for the t-th proposal. As the paper's results show, the extra effort pays off.
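To illustrate only the data flow (left-to-right pass, right-to-left pass, concatenation into a 256-dimensional state per window), here is a small pure-Python sketch. It deliberately swaps the LSTM for a plain tanh recurrent cell with random weights, so it is not the authors' layer; the function names and the toy input size are assumptions.

```python
import math
import random


def make_cell(input_dim, hidden_dim, rng):
    """Random weights for a plain tanh recurrent cell (stand-in for an LSTM)."""
    w_x = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)] for _ in range(hidden_dim)]
    w_h = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
    return w_x, w_h


def step(cell, h_prev, x_t):
    """One recurrent update: H_t computed from H_{t-1} and X_t."""
    w_x, w_h = cell
    return [
        math.tanh(sum(a * b for a, b in zip(wx_row, x_t))
                  + sum(a * b for a, b in zip(wh_row, h_prev)))
        for wx_row, wh_row in zip(w_x, w_h)
    ]


def run(cell, xs, hidden_dim):
    """Run the cell over a sequence and return the state at every step."""
    h = [0.0] * hidden_dim
    states = []
    for x_t in xs:
        h = step(cell, h, x_t)
        states.append(h)
    return states


def bidirectional(xs, fwd_cell, bwd_cell, hidden_dim):
    """Concatenate forward and backward states for each window position."""
    forward = run(fwd_cell, xs, hidden_dim)
    backward = run(bwd_cell, xs[::-1], hidden_dim)[::-1]
    return [f + b for f, b in zip(forward, backward)]


if __name__ == "__main__":
    rng = random.Random(0)
    width, input_dim, hidden_dim = 5, 9, 128  # toy input size, 128-D per direction
    xs = [[rng.random() for _ in range(input_dim)] for _ in range(width)]
    fwd = make_cell(input_dim, hidden_dim, rng)
    bwd = make_cell(input_dim, hidden_dim, rng)
    out = bidirectional(xs, fwd, bwd, hidden_dim)
    print(len(out), len(out[0]))
```

Each of the W window positions ends up with a state that depends on every other position in the row, which is how context from the whole image width reaches each proposal.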

Most of the method has now been covered; what remains is side-refinement.

According to the authors, text lines are very easy to build by connecting consecutive text proposals whose text/non-text score is greater than 0.7.

Text line construction works as follows. First, a definition: B_j is a paired neighbor of B_i, written B_j -> B_i, when three conditions hold: (i) B_j is the proposal nearest to B_i in horizontal distance, (ii) that distance is less than 50 pixels, and (iii) their vertical overlap is greater than 0.7. With that definition in place, two proposals are grouped into a pair if B_j -> B_i and B_i -> B_j, and a text line is built by sequentially connecting pairs that share a proposal.

The fine-scale detection and the RNN connection predict locations accurately in the vertical direction. Horizontally, however, the image is divided into a sequence of equal-width proposals of 16 pixels, so localization can be inaccurate when the proposals at the two horizontal ends are not exactly covered by the ground-truth text line area.
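The pairing rule can be sketched in Python as below. This is a simplified illustration, not the authors' implementation: boxes are assumed to be (x1, y1, x2, y2), horizontal distance is measured between left edges, and vertical overlap is taken as intersection height divided by union height; all three choices and the function names are assumptions.

```python
def vertical_overlap(a, b):
    """Assumed definition: intersection height / union height."""
    inter = min(a[3], b[3]) - max(a[1], b[1])
    union = max(a[3], b[3]) - min(a[1], b[1])
    return max(0.0, inter) / union if union > 0 else 0.0


def nearest_neighbor(i, boxes, direction, max_gap=50, min_overlap=0.7):
    """Index of the nearest qualifying proposal to the right (+1) or left (-1)."""
    best, best_dist = None, None
    for j, box in enumerate(boxes):
        if j == i:
            continue
        dist = (box[0] - boxes[i][0]) * direction
        if dist <= 0 or dist >= max_gap:
            continue
        if vertical_overlap(boxes[i], box) <= min_overlap:
            continue
        if best is None or dist < best_dist:
            best, best_dist = j, dist
    return best


def build_text_lines(boxes):
    """Group proposals into lines by chaining mutually nearest neighbors."""
    right = {}
    for i in range(len(boxes)):
        j = nearest_neighbor(i, boxes, +1)
        # Pair only if the relation holds in both directions.
        if j is not None and nearest_neighbor(j, boxes, -1) == i:
            right[i] = j
    has_left = set(right.values())
    lines = []
    for start in range(len(boxes)):
        if start in has_left:
            continue  # not the leftmost proposal of its line
        chain = [start]
        while chain[-1] in right:
            chain.append(right[chain[-1]])
        lines.append(chain)
    return lines


def line_box(chain, boxes):
    """Bounding box enclosing all proposals of one line."""
    xs1, ys1, xs2, ys2 = zip(*(boxes[i] for i in chain))
    return (min(xs1), min(ys1), max(xs2), max(ys2))


if __name__ == "__main__":
    boxes = [(0, 10, 16, 30), (16, 10, 32, 30), (32, 11, 48, 31),
             (200, 10, 216, 30), (48, 100, 64, 120)]
    for chain in build_text_lines(boxes):
        print(chain, line_box(chain, boxes))
```

Links always point to a proposal further right, so the chains cannot loop. Isolated proposals come out as single-element lines; a real pipeline might filter those.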

This inaccuracy matters little in generic object detection, but it cannot be neglected in text detection, especially for small text. Side-refinement is proposed to solve this problem: it accurately estimates the horizontal offset of the left or right side for each anchor/proposal. The offset is calculated as follows:
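In LaTeX, using the symbols x_side, c_x^a and w^a from the explanation that follows, the relative side offsets defined in the paper are as below. As with the other formulas, this is a transcription of the paper's definition rather than text preserved in this post.

```latex
o = (x_{side} - c_x^a) / w^a, \qquad o^* = (x_{side}^* - c_x^a) / w^a
```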

x_side is the predicted x-coordinate of the horizontal side (left or right) nearest to the current anchor. x*_side is the ground-truth side coordinate on the x-axis, computed in advance from the ground-truth bounding box and the anchor location. c_x^a is the x-axis center of the anchor, and w^a is the anchor width, fixed at 16. The side-proposals are the first and last proposals when a sequence of detected text proposals is connected into a text line (which finally explains where the name "side" comes from). The authors use only the offsets of the side-proposals to refine the final text line bounding box.
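A minimal sketch of how such an offset could be encoded and then applied to the two ends of a text line is shown below. The function names and the (x1, y1, x2, y2) box format are hypothetical; only the relation between the offset, the anchor center and the anchor width of 16 comes from the text.

```python
def encode_side_offset(x_side, cx_a, w_a=16.0):
    """Relative offset o of a side coordinate w.r.t. an anchor."""
    return (x_side - cx_a) / w_a


def decode_side_offset(o, cx_a, w_a=16.0):
    """Invert encode_side_offset: recover the absolute x-coordinate."""
    return o * w_a + cx_a


def refine_text_line(line_box, left_cx_a, o_left, right_cx_a, o_right, w_a=16.0):
    """Replace the left and right edges of a line box using side offsets.

    Only the two side-proposals contribute, as described in the text.
    """
    x1 = decode_side_offset(o_left, left_cx_a, w_a)
    x2 = decode_side_offset(o_right, right_cx_a, w_a)
    return (x1, line_box[1], x2, line_box[3])


if __name__ == "__main__":
    # A line built from three 16-pixel proposals spanning x = 0..48.
    box = (0, 10, 48, 30)
    print(refine_text_line(box, left_cx_a=8.0, o_left=-0.25,
                           right_cx_a=40.0, o_right=0.375))
```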

That covers most of the paper's method, so let's look at the outputs of CTPN and its loss function.

CTPN has three outputs, all connected to the last fully connected layer. They jointly predict the text/non-text scores, the vertical coordinates, and the side-refinement offsets. With k anchors predicting each of the three, the output layer produces 2k, 2k, and k parameters respectively. The authors use multi-task learning to jointly optimize the model parameters. The objective function is as follows:
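In LaTeX, the multi-task objective from the paper can be written as below. N_s, N_v and N_o are normalization terms (the number of anchors used by each loss); they appear in the paper but are not mentioned in this post, so treat them as an addition from the source paper.

```latex
L(s_i, v_j, o_k) =
    \frac{1}{N_s} \sum_i L_s^{cl}(s_i, s_i^*)
  + \frac{\lambda_1}{N_v} \sum_j L_v^{re}(v_j, v_j^*)
  + \frac{\lambda_2}{N_o} \sum_k L_o^{re}(o_k, o_k^*)
```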

Each anchor is a training sample. i is the index of an anchor in a mini-batch. s_i is the predicted probability that anchor i is true text, and s_i* ∈ {0, 1} is the ground truth. j is the index of an anchor in the set of valid anchors used for y-coordinate regression, defined as follows.

A valid anchor is either a positive anchor (s_j* = 1) or an anchor whose Intersection-over-Union (IoU) overlap with a ground-truth text proposal is greater than 0.5. v_j and v_j* are the predicted and ground-truth y-axis coordinates for the j-th anchor. k is the index of a side-anchor; the side-anchors are the anchors within a horizontal distance (e.g. 32 pixels) of the left or right side of a ground-truth text line bounding box. o_k and o_k* are the predicted and ground-truth x-axis offsets for the k-th anchor. L_s is a softmax loss that distinguishes text from non-text, and L_v and L_o are both regression losses. λ1 and λ2 are loss weights, set empirically to 1.0 and 2.0.
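As a rough sketch of how the three terms combine, the Python below computes the weighted sum for small lists of predictions. It assumes smooth L1 for the two regression losses (the paper's choice, not stated in this post), treats the softmax loss as cross-entropy on the predicted text probability, and normalizes each term by its number of samples; the function names are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere (assumed regression loss)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def ctpn_loss(s, s_star, v, v_star, o, o_star, lambda1=1.0, lambda2=2.0):
    """Weighted sum of classification, vertical and side-refinement losses.

    s      : predicted text probabilities for all anchors in the mini-batch
    s_star : ground-truth labels (0 or 1) for the same anchors
    v      : predicted (v_c, v_h) pairs for the valid anchors
    v_star : ground-truth (v_c, v_h) pairs for the valid anchors
    o      : predicted side offsets for the side-anchors
    o_star : ground-truth side offsets for the side-anchors
    """
    eps = 1e-12  # avoids log(0)
    l_cls = sum(
        -math.log(max(p if t == 1 else 1.0 - p, eps)) for p, t in zip(s, s_star)
    ) / max(len(s), 1)
    l_v = sum(
        smooth_l1(a - b) for pred, tgt in zip(v, v_star) for a, b in zip(pred, tgt)
    ) / max(len(v), 1)
    l_o = sum(smooth_l1(a - b) for a, b in zip(o, o_star)) / max(len(o), 1)
    return l_cls + lambda1 * l_v + lambda2 * l_o


if __name__ == "__main__":
    loss = ctpn_loss(
        s=[0.9, 0.2], s_star=[1, 0],
        v=[(0.1, 0.0)], v_star=[(0.0, 0.2)],
        o=[0.05], o_star=[0.0],
    )
    print(loss)
```

The three sums run over different anchor sets (all anchors, valid anchors, side-anchors), which is why the inputs are passed as separate lists.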

CTPN can be trained end-to-end with standard back-propagation and stochastic gradient descent (SGD). As with RPN, the training samples are anchors, whose locations can be computed in advance from the input image, so the training labels of each anchor can be computed from the corresponding ground-truth box.

For text/non-text classification, a binary label is assigned to each anchor: positive (text) or negative (non-text). The labels are determined by the IoU overlap with the ground-truth bounding box. A positive anchor is defined as (i) an anchor whose IoU overlap with a ground-truth box is greater than 0.7, or (ii) the anchor with the highest IoU overlap with a ground-truth box (this second condition ensures that even a very small text pattern is assigned a positive anchor). Negative anchors are those whose IoU overlap is less than 0.5. That covers the theory behind CTPN; the paper also presents experimental results and discussion, which you can study further at the paper address given earlier.
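The labeling rule for positive and negative anchors can be sketched in Python as follows. The function names, the (x1, y1, x2, y2) box format, and the use of -1 for anchors that are neither positive nor negative are assumptions for illustration; the 0.7 and 0.5 thresholds come from the text.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.5):
    """Return 1 (text), 0 (non-text) or -1 (unused) for each anchor."""
    labels = []
    for anchor in anchors:
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best > pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)  # in between: assumed to be left out of training
    # Rule (ii): the best-matching anchor of every ground-truth box is positive,
    # so that small text still receives a positive sample.
    for gt in gt_boxes:
        overlaps = [iou(anchor, gt) for anchor in anchors]
        if overlaps and max(overlaps) > 0:
            labels[overlaps.index(max(overlaps))] = 1
    return labels


if __name__ == "__main__":
    gt_boxes = [(0, 0, 16, 20), (200, 0, 216, 8)]
    anchors = [(0, 0, 16, 20), (0, 0, 16, 32), (0, 0, 16, 60),
               (100, 100, 116, 120), (200, 0, 216, 16)]
    print(label_anchors(anchors, gt_boxes))
```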
