Xiu-Shen Wei
Links: https://zhuanlan.zhihu.com/p/21824299
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
Speaking of Tesla, you might immediately think of the fatal accident this May involving a Tesla Model S driving on Autopilot. Preliminary investigations revealed that, against the brightly lit sky, neither the driver nor the Autopilot system noticed the white body of a tractor-trailer crossing the highway, so the brakes were never applied in time. Because the trailer's body rode high off the ground, the Model S passed underneath it, and the collision between the windshield and the underside of the trailer killed the driver.
Coincidentally, on August 8 a man in Missouri, Tesla Model X owner Joshua Neally, suffered a sudden pulmonary embolism on the road. With the help of the Model X's Autopilot function, he reached the hospital safely. This pair of stories, one tragic and one fortunate, is truly memorable and gives much food for thought.
Curious readers are bound to wonder: what lies behind this one failure and one success? Which part of the Autopilot system went wrong and caused the accident? And which technologies support the automated driving process?
Today, let us talk about one of the core technologies in autonomous driving systems: image semantic segmentation. As an important part of image understanding in computer vision, semantic segmentation plays an increasingly prominent role in industry and is one of the hot topics in academia today.
What is image semantic segmentation?
Image semantic segmentation can be called the cornerstone of image understanding technology. It plays a pivotal role in autonomous driving systems (specifically, street-scene recognition and understanding), drone applications (judging landing sites), and wearable devices.
As we all know, an image is made up of many pixels, and "semantic segmentation", as the name implies, groups/segments those pixels according to the semantic meaning they express in the image. Consider an example taken from Pascal VOC, one of the standard datasets in the image segmentation field. The left image is the original; the right image is the ground truth for the segmentation task: the red region marks pixels with the semantics "person", the cyan region marks the "motorbike" semantics, black means "background", and white (edges) marks unlabeled areas. Clearly, in the image semantic segmentation task, the input is a three-channel color image and the output is a matrix of the same spatial size, in which each element indicates the semantic category (semantic label) of the corresponding pixel in the original image. Image semantic segmentation is therefore also called "image semantic labeling", "pixel semantic labeling", or "semantic pixel grouping".
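To make this input/output contract concrete, here is a minimal Python sketch. The sizes and regions are toy values; only the Pascal VOC class indices (0 = background, 14 = motorbike, 15 = person) follow the dataset's convention.

```python
import numpy as np

# A minimal sketch of the input/output contract of semantic segmentation.
H, W = 4, 6
image = np.random.randint(0, 256, size=(H, W, 3), dtype=np.uint8)  # RGB input

labels = np.zeros((H, W), dtype=np.int64)  # everything starts as "background"
labels[1:3, 1:4] = 15                      # a "person" region
labels[2:4, 4:6] = 14                      # a "motorbike" region

# The output is one class id per pixel: same height/width, no color channels.
assert labels.shape == image.shape[:2]
print(np.unique(labels))  # -> [ 0 14 15 ]
```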
From the figure we can clearly see that the difficulty of the semantic segmentation task lies in the word "semantic". In real images, an object expressing a single semantic concept is often composed of different parts (building, motorbike, person, etc.), and those parts often differ in color, texture, and even brightness (as with the building), which makes precisely segmenting the semantics of an image difficult and challenging.
Semantic segmentation in the pre-DL era
From the simplest pixel-level "thresholding" methods, through clustering-based segmentation methods, to "graph partitioning" segmentation methods, work on image semantic segmentation "blossomed everywhere" before deep learning (DL) "unified the field". Here we take just two classic graph-partitioning segmentation methods, "Normalized cut" [1] and "Grab cut" [2], as examples to introduce pre-DL-era semantic segmentation research.
- The Normalized cut (N-cut) method is one of the most famous graph-partitioning-based semantic segmentation methods, proposed by Jianbo Shi and Jitendra Malik and published in 2000 in TPAMI, the top journal in the field. In general, traditional graph-partitioning-based semantic segmentation methods abstract an image into a graph (with pixels as nodes and pixel similarities as edge weights) and then partition it semantically using the theory and algorithms of graph theory. A common choice is the classic minimum cut (min-cut) algorithm. However, the classic min-cut criterion considers only local information when weighing a cut. In the example of a bipartition (dividing the graph into two disjoint parts), if only local information is considered, cutting off a single isolated point is obviously a minimum cut, so the resulting partition favors such outliers, whereas from a global point of view the two groups one actually wants are the left and right halves.
To address this, N-cut proposes a graph-partitioning criterion that takes global information into account: the cut between the two segments A and B is normalized by the total connection weight between each segment and all the nodes V of the graph:

$$\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)} + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)},$$

where cut(A, B) is the total weight of edges crossing between A and B, and assoc(A, V) is the total weight of edges connecting A to the whole graph.
Thus, in the outlier partition, one of the two terms approaches 1, so such a cut no longer yields a small Ncut value; by considering global information, the criterion avoids splitting off outliers. This operation is analogous to feature normalization in machine learning, hence the name "normalized cut". Moreover, N-cut is not limited to two-class segmentation: by extending the bipartition to a k-way partition, it can perform multi-class image semantic segmentation.
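As a sanity check on this formula, here is a small self-contained sketch (the graph, weights, and partitions are invented for illustration) that compares the Ncut value of a balanced cut against an outlier cut:

```python
import numpy as np

def ncut_value(W, mask):
    """Ncut value of the bipartition (A, B) of a weighted graph.

    W    : (n, n) symmetric matrix of non-negative edge weights.
    mask : boolean vector, True for nodes in A, False for nodes in B.
    """
    A, B = mask, ~mask
    cut = W[A][:, B].sum()   # cut(A, B): total weight crossing the partition
    assoc_A = W[A].sum()     # assoc(A, V): connections from A to all nodes
    assoc_B = W[B].sum()     # assoc(B, V): connections from B to all nodes
    return cut / assoc_A + cut / assoc_B

# Two tight clusters {0,1,2} and {3,4,5} joined by one weak edge, plus an
# outlier node 6 attached to node 5 by an even weaker edge.
n = 7
W = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0   # strong intra-cluster edges
W[2, 3] = W[3, 2] = 0.1       # weak bridge between the clusters
W[5, 6] = W[6, 5] = 0.05      # outlier attachment

balanced = np.array([True] * 3 + [False] * 4)  # {0,1,2} vs {3,4,5,6}
outlier = np.array([False] * 6 + [True])       # {6} vs everything else

print(ncut_value(W, balanced))  # ~0.03: a good, balanced cut
print(ncut_value(W, outlier))   # ~1.00: the outlier cut is heavily penalized
```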
- Grab cut is a famous interactive image semantic segmentation method proposed in 2004 by Rother et al. at Microsoft Research [2]. Like N-cut, Grab cut is also based on graph partitioning, but it can be seen as an improved, iterative version of such algorithms. Grab cut exploits both the texture (color) information and the boundary (contrast) information in an image, so that a small amount of user interaction yields a good foreground/background segmentation.
In Grab cut, the foreground and background of an RGB image are each modeled with a Gaussian mixture model (GMM). The two GMMs characterize the probability that a pixel belongs to the foreground or the background, and the number of Gaussian components in each GMM is typically set to 5. Next, the whole image is described by a Gibbs energy function, whose parameters are optimized iteratively to obtain the optimal parameters of the two GMMs. Once the GMMs are determined, the probability that each pixel belongs to the foreground or the background is determined as well.
For user interaction, Grab cut provides two modes: one uses a bounding box, the other uses scribbled lines as auxiliary information. For example, when the user first provides a bounding box, Grab cut assumes that the box contains the main object/foreground; the iterative graph optimization is then solved to extract that foreground. One finds that even for images with somewhat complex backgrounds, Grab cut still performs decently.
However, on harder images the segmentation produced by Grab cut is not satisfactory. In that case, extra human input is needed as a stronger auxiliary signal: background regions are marked with red lines/dots and foreground regions with white lines. On this basis, running the Grab cut algorithm again to obtain the optimal solution yields a satisfactory semantic segmentation result. Although Grab cut works well, its shortcomings are obvious: first, it can only handle two-class (foreground/background) segmentation; second, it requires human intervention and cannot be fully automated.
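For readers who want to try this, OpenCV ships an implementation of the algorithm. Below is a minimal bounding-box sketch; the file name and rectangle are placeholders you would adapt to your own image.

```python
import cv2
import numpy as np

# Minimal bounding-box GrabCut with OpenCV's built-in implementation.
img = cv2.imread("example.jpg")
mask = np.zeros(img.shape[:2], dtype=np.uint8)

# Scratch buffers in which OpenCV stores the two GMMs (background, foreground).
bgd_model = np.zeros((1, 65), dtype=np.float64)
fgd_model = np.zeros((1, 65), dtype=np.float64)

rect = (50, 50, 300, 400)  # (x, y, w, h): user-supplied box around the object

cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Keep pixels labeled definite or probable foreground; zero out the rest.
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
cv2.imwrite("foreground.png", img * fg[:, :, None])
```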
Semantic segmentation in the DL era
In fact, it is not hard to see that pre-DL-era semantic segmentation relied on the low-level visual cues of the image pixels themselves. Because such methods have no training stage, their computational complexity is usually modest, but on harder segmentation tasks (without human auxiliary information) their results are far from satisfactory.
Once computer vision entered the deep learning era, semantic segmentation also entered a new stage of development: a series of "trainable" convolutional-neural-network-based semantic segmentation methods, represented by the fully convolutional network (FCN), were proposed one after another, and the accuracy records for image semantic segmentation were refreshed again and again. Below are three representative lines of work in DL-era semantic segmentation.
- Fully convolutional network [3]
The fully convolutional network (FCN) can be called the pioneering work of deep learning on the image semantic segmentation task. It came from Trevor Darrell's group at UC Berkeley and was published at CVPR 2015, the top conference in computer vision, where it received a Best Paper Honorable Mention.
The idea behind FCN is straightforward: perform end-to-end, pixel-level semantic segmentation directly, building on mainstream deep convolutional neural network (CNN) models. As the name "fully convolutional network" suggests, in FCN the traditional fully connected layers fc6 and fc7 are implemented as convolutional layers, and the final fc8 layer is replaced by a 21-channel 1x1 convolutional layer that serves as the network's final output. There are 21 channels because the Pascal VOC data contains 21 categories (20 object categories plus one "background" category). As the FCN architecture diagram shows, a stack of convolution and pooling operations produces an activation tensor corresponding to the original image, with the number of channels determined by the layer in question. One finds that, owing to the downsampling effect of the pooling layers, the height and width of this activation tensor are smaller than those of the original image, which poses a problem for direct pixel-level training.
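A minimal PyTorch sketch of this "convolutionalization" (not the authors' code; the feature size is a stand-in for VGG's pool5 output) might look like this:

```python
import torch
import torch.nn as nn

# fc6/fc7 become convolutions and fc8 becomes a 21-channel 1x1 convolution,
# so the network emits a score map rather than a single class vector.
num_classes = 21  # 20 Pascal VOC object classes + 1 background class

head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7),          # fc6 re-expressed as 7x7 conv
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),         # fc7 re-expressed as 1x1 conv
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, num_classes, kernel_size=1),  # fc8 -> 21-channel 1x1 conv
)

features = torch.randn(1, 512, 16, 16)  # stand-in for pool5 activations
scores = head(features)
print(scores.shape)  # torch.Size([1, 21, 10, 10]): one score map per class
```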
To overcome this downsampling problem, FCN uses bilinear interpolation to upsample the activation tensor back to the original height and width. Furthermore, to predict image details better, FCN also draws on the shallower activations in the network: specifically, the responses of pool4 and pool3 are tapped and combined with the original FCN-32s output to form the FCN-16s and FCN-8s outputs, which make the final semantic segmentation predictions (as shown).
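In spirit, the skip fusion looks like the sketch below. FCN itself uses learnable deconvolution layers initialized to bilinear interpolation; plain bilinear interpolation stands in for them here, and all tensor sizes are hypothetical.

```python
import torch
import torch.nn.functional as F

score32 = torch.randn(1, 21, 7, 7)    # coarse class scores at stride 32
score16 = torch.randn(1, 21, 14, 14)  # class scores predicted from pool4

# Upsample the coarse map 2x, add the pool4 scores, then upsample the sum
# 16x back to the (assumed) 224x224 input size: this is the FCN-16s output.
fused = F.interpolate(score32, scale_factor=2, mode="bilinear",
                      align_corners=False) + score16
fcn16s = F.interpolate(fused, scale_factor=16, mode="bilinear",
                       align_corners=False)
print(fcn16s.shape)  # torch.Size([1, 21, 224, 224])
```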
The figure shows the semantic segmentation results obtained when different layers serve as the output; the fineness of the segmentation varies with the downsampling factor of the pooling layers. FCN-32s, taken from FCN's final, most heavily pooled layer, has the highest downsampling factor, so its segmentation result is the coarsest, while FCN-8s, with a lower downsampling factor, obtains a finer segmentation result.
- Dilated convolution [4]
One drawback of FCN is that, because of the pooling layers, the spatial size (height and width) of the activation tensor keeps shrinking; yet FCN is designed to produce output of the same size as the input, so FCN has to upsample. Upsampling, however, cannot losslessly recover all the information discarded by pooling.
Dilated convolution offers a good solution here: since the downsampling performed by pooling causes information loss, simply remove the pooling layers. But removing pooling shrinks the receptive field of the subsequent layers, which reduces the prediction accuracy of the whole model. The main contribution of dilated convolution is to show how to remove the downsampling operations without reducing the network's receptive field.
Take a 3x3 convolution kernel as an example. In a traditional convolution, the kernel multiplies and sums over a "contiguous" patch of the input tensor (as in (a), where the red dots mark the input "pixels" covered by the kernel and green marks its receptive field in the original input). In a dilated convolution, by contrast, the kernel convolves over input patches whose elements are separated by a certain number of pixels. As shown in (b), after one pooling layer is removed, the traditional convolution layer that followed it is replaced by a dilated convolution layer with dilation = 2, so the kernel samples the input tensor at every other "pixel" position to form its input patch. One finds that the corresponding receptive field in the original input has been enlarged (dilated). Similarly, if another pooling layer is removed, the convolution layers after it must be replaced by dilated convolution layers with dilation = 4, as shown in (c). In this way, the pooling layers can be removed while the network's receptive field is preserved, and with it the accuracy of image semantic segmentation.
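The following small PyTorch sketch (tensor sizes invented) shows the basic trade: a dilation = 2 convolution keeps the spatial resolution while each output position sees a 5x5 input window instead of 3x3.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Padding is adjusted so both layers preserve the spatial size.
conv_plain = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
conv_dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(conv_plain(x).shape)    # torch.Size([1, 64, 32, 32])
print(conv_dilated(x).shape)  # torch.Size([1, 64, 32, 32]): same size, but
                              # with an enlarged (dilated) receptive field
```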
From several image semantic segmentation examples one can see that this dilated-convolution technique greatly improves both the recognition of semantic categories and the granularity of the segmentation details.
- Post-processing represented by conditional random fields
Nowadays, much deep-learning-based image semantic segmentation work uses a conditional random field (CRF) as a final post-processing step to refine the semantic segmentation results.
In general, the CRF treats the category of each pixel in the image as a random variable and considers the relationship between any two of these variables, forming a fully connected graph (as shown).
In this fully connected CRF model, the corresponding energy function is:

$$E(\mathbf{x}) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j).$$
The first term $\psi_u(x_i)$ is the unary term, representing the cost of the semantic category assigned to pixel i; these categories can be obtained from FCN or other semantic segmentation models. The second term $\psi_p(x_i, x_j)$ is the pairwise term, which captures the semantic relationships between pixels. For example, pixels labeled "sky" and "bird" should have a higher probability of being adjacent in physical space than pixels labeled "sky" and "fish". Finally, minimizing the CRF energy function refines the pixel-wise semantic predictions of FCN and yields the final semantic segmentation result. It is worth mentioning that there is already work [5] that embeds the CRF step, originally separate from deep model training, into the neural network itself, integrating the FCN+CRF pipeline into a single end-to-end system. The benefit is that the CRF energy on the final prediction can directly guide the training of the FCN's parameters, achieving better image semantic segmentation results.
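To make the bookkeeping of this energy concrete, here is a toy evaluation for a hypothetical 3-pixel image with 2 classes. Real pairwise potentials (e.g. in [5]) also depend on pixel positions and colors; the fixed label-compatibility table below is purely illustrative.

```python
import numpy as np

# Toy evaluation of E(x) = sum_i psi_u(x_i) + sum_{i<j} psi_p(x_i, x_j).
unary = np.array([[0.2, 1.6],   # per-pixel class costs, e.g. from FCN scores
                  [1.2, 0.3],
                  [0.9, 0.8]])
compat = np.array([[0.0, 1.0],  # penalty for a pixel pair labeled (a, b)
                   [1.0, 0.0]])

def energy(x):
    e = unary[np.arange(len(x)), x].sum()  # sum of unary terms
    for i in range(len(x)):                # pairwise terms over all i < j
        for j in range(i + 1, len(x)):
            e += compat[x[i], x[j]]
    return e

print(energy(np.array([0, 1, 1])))  # 1.3 unary + 2.0 pairwise = 3.3
print(energy(np.array([0, 0, 0])))  # 2.3 unary + 0.0 pairwise = 2.3
```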
Outlook
As the saying goes, there is "no free lunch". Deep-learning-based image semantic segmentation achieves a leap in segmentation quality over traditional methods, but its demand for annotation is steep: it needs not only massive image data but also accurate pixel-level semantic labels. Therefore, more and more researchers have begun to turn their attention to image semantic segmentation under weakly supervised conditions. In this setting, each image needs only image-level annotations (e.g., "has person", "has car", "no television") rather than expensive pixel-level information, while the goal is segmentation accuracy comparable to existing methods.
In addition, instance-level image semantic segmentation is also a popular topic. It requires not only separating objects of different semantics but also separating different individuals of the same semantics (for example, the pixels of the nine chairs appearing in the figure should each be marked in a different color).
Finally, video-based foreground/object segmentation is another new hotspot in semantic segmentation within computer vision, and it fits the real application environment of autonomous driving systems even better.
(First published on the Jiqizhixin (Synced) WeChat account; see: Column | "Image semantic segmentation": from Tesla to computer vision)
Xiu-Shen Wei
References:
[1] Jianbo Shi and Jitendra Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, 2000.
[2] Carsten Rother, Vladimir Kolmogorov and Andrew Blake. "GrabCut": Interactive Foreground Extraction Using Iterated Graph Cuts. ACM Transactions on Graphics, 2004.
[3] Jonathan Long, Evan Shelhamer and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[4] Fisher Yu and Vladlen Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. International Conference on Learning Representations, 2016.
[5] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang and Philip H. S. Torr. Conditional Random Fields as Recurrent Neural Networks. International Conference on Computer Vision, 2015.
"Image semantic segmentation" from Tesla to computer vision