SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

Abstract
We present a novel and practical deep fully convolutional neural network architecture for pixel-wise semantic segmentation, named SegNet. The core trainable segmentation engine consists of an encoder network, a corresponding decoder network, followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low-resolution encoder feature maps to full-input-resolution feature maps. Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need to learn to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN architecture and with the well-known DeepLab-LargeFOV and DeconvNet architectures; this comparison reveals the memory versus accuracy trade-offs involved in achieving good segmentation performance.
SegNet was motivated primarily by scene understanding applications. It is therefore designed to be efficient in terms of memory and computation time during inference. It also has far fewer trainable parameters than competing architectures and can be trained end-to-end using stochastic gradient descent. We also perform controlled benchmarks of SegNet and other architectures on road scene and SUN RGB-D indoor scene segmentation tasks. These quantitative evaluations show that SegNet provides competitive inference time and the most efficient inference memory usage compared to other architectures. We also provide a Caffe implementation and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/.
1 Introduction
Semantic segmentation has a wide range of applications, ranging from scene understanding and inferring support relationships among objects to autonomous driving. Early methods that relied on low-level visual cues have been superseded by popular machine learning algorithms. In particular, deep learning has seen great success in handwritten digit recognition, speech, categorizing whole images, and detecting objects in images [VGG][GoogLeNet]. There is now an active interest in applying these methods to pixel-wise semantic segmentation [CRFasRNN][ParseNet], among others. However, many of these recent approaches have tried to directly adopt architectures designed for image classification for semantic segmentation. The results, although encouraging, appear coarse [DeepLab]. This is primarily because max-pooling and sub-sampling reduce the resolution of the feature maps. Our motivation for designing SegNet arises from the need to map low-resolution feature maps back to input resolution for semantic segmentation. This mapping must also produce features that are useful for accurate boundary localization.
Our architecture, SegNet, is designed to be an efficient architecture for semantic segmentation. It is primarily motivated by road scene understanding applications, which require the ability to model appearance (road, building) and shape (car, pedestrian), and to understand the spatial relationships (context) between different classes such as road and sidewalk. In typical road scenes, the majority of pixels belong to large classes such as road and building, and hence the network must produce smooth segmentations. The engine must also be able to delineate objects based on their shape despite their small size. It is therefore important to retain boundary information in the extracted image representation. From a computational perspective, the network needs to be efficient in terms of both memory and computation time during inference. The ability to train end-to-end, jointly optimizing all the weights in the network using an efficient weight update technique such as stochastic gradient descent (SGD), is an additional benefit since it is more easily repeatable. The design of SegNet arose from the need to meet these criteria.
The encoder network in SegNet is topologically identical to the 13 convolutional layers of VGG16. We remove the fully connected layers, which makes SegNet significantly smaller and easier to train than many other recent architectures [FCN][DeconvNet][ParseNet][Decoupled]. The key component of SegNet is the decoder network, which consists of a hierarchy of decoders, one corresponding to each encoder. Each decoder uses the max-pooling indices received from the corresponding encoder to perform non-linear upsampling of its input feature maps. This idea was inspired by an architecture designed for unsupervised feature learning. Reusing max-pooling indices in the decoder network has several practical benefits: (1) it improves boundary delineation, (2) it reduces the number of parameters, enabling end-to-end training, and (3) this form of upsampling can be incorporated into any encoder-decoder architecture [FCN][CRFasRNN] with only a little modification.
One of the main contributions of this paper is our analysis of the SegNet decoding technique and the widely used FCN decoding technique, in order to convey the practical trade-offs involved in designing segmentation architectures. Most recent deep architectures for segmentation use identical encoder networks, such as VGG16, but differ in the form of the decoder network, training, and inference. Another common feature is that they have trainable parameters on the order of hundreds of millions, which makes end-to-end training difficult [DeconvNet]. The difficulty of training has led to multi-stage training [FCN], appending networks such as RNNs to pre-trained core segmentation networks [CRFasRNN], the use of supporting aids such as region proposals at inference time [DeconvNet], disjoint training of classification and segmentation networks [Decoupled], and the use of additional training data, either for pre-training [ParseNet] or for full training [CRFasRNN]. In addition, performance-boosting post-processing techniques have also been popular. Although all these factors improve performance on benchmarks such as VOC, their quantitative results make it difficult to isolate the key design factors necessary for good performance. We therefore analyze the decoding techniques used in these approaches [FCN][DeconvNet] and reveal their advantages and disadvantages.
We evaluate the performance of SegNet on two scene segmentation tasks, CamVid road scene segmentation and SUN RGB-D indoor scene segmentation. Pascal VOC12 has been the benchmark challenge for segmentation over the years. However, the majority of this task has one or two foreground classes surrounded by a highly varied background. This implicitly favours techniques used for detection, as shown by recent work on a decoupled classification-segmentation network [Decoupled], where the classification network can be trained with a large set of weakly labelled data and the performance of the independent segmentation network is improved. The method of [DeepLab] also uses the feature maps of the classification network with an independent CRF post-processing technique to perform segmentation. Performance can also be boosted by additional inference aids such as region proposals [DeconvNet][Edge Boxes]. Scene understanding, in contrast, is geared towards performing robust segmentation by exploiting the co-occurrence of objects and other spatial context. To demonstrate the efficiency of SegNet, we present a real-time online demo of road scene segmentation into 11 classes of interest for autonomous driving (Figure 1). Figure 1 shows segmentation results on some random road images from Google and on some random indoor test scene images from SUN RGB-D.
The remainder of the paper is organized as follows. In Section 2 we review recent related literature. In Section 3 we describe the SegNet architecture and our analysis of it. In Section 4 we evaluate the performance of SegNet on outdoor and indoor scene datasets. Section 5 is a general discussion of our approach, with pointers to future work. Section 6 concludes.
2 Literature Review
Semantic segmentation is a very active research topic, fuelled in large part by challenging datasets [PascalVOC][SUN RGB-D][KITTI]. Before the arrival of deep learning, the best-performing methods mostly relied on hand-engineered features to classify pixels independently. Typically, a patch is fed into a classifier such as random forest or boosting to predict the class probabilities of the center pixel. Features based on appearance or SfM have been explored for the CamVid road scene understanding test. The per-pixel noisy predictions from these classifiers (often called unaries) are then smoothed using a pairwise or higher-order CRF to improve accuracy. More recent approaches have aimed to produce high-quality unaries by predicting the labels of all the pixels in a patch, rather than only the center pixel. This improves the results of random-forest-based unaries, but thin structured classes are classified poorly. The best-performing technique on the CamVid test addresses the imbalance among label frequencies by combining object detection outputs with classifier predictions in a CRF framework. The results of all these techniques indicate the need for improved classification features.
Indoor RGBD pixel-wise semantic segmentation has also gained popularity since the release of the NYU dataset. This dataset showed the usefulness of the depth channel for improving segmentation. Their approach used features such as RGB-SIFT, depth-SIFT, and pixel location as input to a neural network classifier to predict pixel unaries. The noisy unaries are then smoothed using a CRF. Improvements were made using a richer feature set, including LBP and region segmentation, to obtain higher accuracy, again followed by a CRF. Other methods exist as well; the common attribute of all these approaches is the classification of RGB or RGBD images using hand-engineered features.
The recent success of deep convolutional neural networks for object classification has led researchers to exploit their feature learning capabilities for structured prediction problems such as segmentation. There have also been attempts to apply networks designed for object classification to segmentation, particularly by replicating the deepest layer features in blocks to match the image dimensions. However, the resulting classification is blocky. Another approach uses recurrent neural networks [RNN] to merge several low-resolution predictions into an input-image-resolution prediction. These techniques are already an improvement over hand-engineered features, but their ability to delineate boundaries is poor.
Newer deep architectures [FCN][DeconvNet][CRFasRNN][Decoupled] particularly designed for segmentation have advanced the state of the art by learning to decode, or map, low-resolution image representations to pixel-wise predictions. The encoder networks that produce these low-resolution representations are, in all of these architectures, based on the VGG16 classification network structure (13 convolutional layers and 3 fully connected layers). The encoder network weights are typically pre-trained on ImageNet. The decoder network varies between these architectures and is the part responsible for producing multi-dimensional features for each pixel for classification.
Each decoder in the fully convolutional network (FCN) architecture learns to upsample its input feature maps and combines them with the corresponding encoder feature maps to produce the input to the next decoder. The architecture has a large number of trainable parameters in the encoder network (134M) but a very small decoder network (0.5M parameters). The overall large size of this network makes it hard to train end-to-end on a relevant task. The authors therefore adopt a stage-wise training process, in which each decoder in the decoder network is progressively added to an existing trained network. The network is grown until no further increase in performance is observed. This growth stopped after three decoders; thus ignoring high-resolution feature maps certainly leads to loss of edge information [DeconvNet]. Apart from training-related issues, the need to reuse the encoder feature maps in the decoder makes the network memory-intensive at test time. We study this network in more detail since it is the core of other recent architectures [CRFasRNN][ParseNet].
The predictive performance of FCN has been further improved by appending a recurrent neural network (RNN) to it [CRFasRNN] and fine-tuning on large datasets [VOC][COCO]. The RNN layers mimic the sharp boundary delineation capabilities of CRFs while exploiting the feature representation power of the FCN. They show significant improvement over FCN-8, but also show that this difference is reduced when more training data is used to train FCN-8. The main advantage of the CRF-RNN is revealed when it is jointly trained with an architecture such as FCN-8. The fact that joint training helps is also shown in other recent results. Interestingly, the deconvolutional network [DeconvNet] performs significantly better than FCN, although at the cost of more complex training and inference. This raises the question of whether the perceived advantage of the CRF-RNN would diminish as the core feed-forward segmentation engine is improved. In any case, the CRF-RNN network can be appended to any deep segmentation architecture, including SegNet.
Multi-scale deep architectures are also widely pursued. They come in two flavours: (i) those which use input images at several scales and corresponding deep feature extraction networks, and (ii) those which combine feature maps from different layers of a single deep architecture [ParseNet]. The common idea is to use features extracted at multiple scales to provide both local and global context [zoom-out], with the feature maps of the early encoding layers retaining more high-frequency detail, leading to sharper class boundaries. Some of these architectures are difficult to train due to their parameter size, so a multi-stage training process is employed along with data augmentation. Inference is also expensive due to the multiple convolutional pathways used for feature extraction. Others append a CRF to their multi-scale network and train them jointly. However, these are not feed-forward at test time and require optimization to determine the MAP labels.
Several recently proposed segmentation architectures are not feed-forward at inference time [DeconvNet][DeepLab][Decoupled]. They require MAP inference over a CRF, or aids such as region proposals [DeconvNet]. We believe the perceived performance increase obtained by using a CRF is due to the lack of good decoding techniques in their core feed-forward segmentation engines. SegNet, on the other hand, uses decoders to obtain accurate pixel-wise classification.
The recently proposed deconvolutional network [DeconvNet] and its semi-supervised variant, the decoupled network [Decoupled], use the max locations of the encoder feature maps (pooling indices) to perform non-linear upsampling in the decoder network. The authors of these architectures proposed the decoder network idea independently of SegNet (first submitted to CVPR 2015). However, their encoder network retains the fully connected layers of the VGG-16 network, which account for approximately 90% of the parameters of the entire network. This makes training of their network very difficult, thus requiring additional aids such as the use of region proposals to enable training. Moreover, these proposals are also used at inference time, which significantly increases inference time. From a benchmarking perspective, this also makes it difficult to evaluate their architecture (the encoder-decoder network) without the additional aids. In this work we discard the fully connected layers of the VGG16 encoder network, which enables us to train the network using SGD with the relevant training set. Another recent approach [DeepLab] shows the benefit of significantly reducing the number of parameters without sacrificing performance, thereby reducing memory consumption and improving inference time.
Our work is inspired by the unsupervised feature learning architecture proposed by Ranzato et al. This architecture was used for unsupervised pre-training for classification. However, that approach did not attempt unsupervised feature training with a deep encoder-decoder network, since the decoder was discarded after each encoder was trained. Here, SegNet differs from these architectures in that the deep encoder-decoder network is trained jointly for a supervised learning task, and hence the decoders are an integral part of the network at test time.
3 Architecture
The SegNet architecture is illustrated in Figure 2.
The encoder network uses the first 13 convolutional layers of VGG16 and can be initialized with weights pre-trained on ImageNet. We also discard the fully connected layers, which helps retain higher-resolution feature maps at the deepest encoder output. Compared with other recent architectures [FCN][DeconvNet], this also considerably reduces the number of parameters in the SegNet encoder network (from 134M to 14.7M), as shown in Table 6.
Each encoder layer has a corresponding decoder layer, and hence the decoder network has 13 layers. The final decoder output is fed to a multi-class soft-max classifier to produce class probabilities for each pixel independently.
Each encoder consists of a convolution layer, a batch normalization layer, and a ReLU, followed by max-pooling with a 2×2 window and stride 2 (non-overlapping windows); the output is sub-sampled by a factor of 2. Max-pooling is used to achieve translation invariance over small spatial shifts in the input image, and sub-sampling results in a large input image context (spatial window) for each pixel in the feature map. However, max-pooling and sub-sampling incur a loss of boundary detail, so boundary information must be captured and stored before sub-sampling. For efficiency, we store only the max-pooling indices, which in principle can be done using just 2 bits for each 2×2 pooling window and is thus much more efficient than storing the feature maps in float precision. As we show later in this work, this lower memory storage results in a slight loss of accuracy but is still suitable for practical applications.
The decoding technique used by SegNet is illustrated in Figure 3.
Each decoder in the decoder network upsamples its input feature maps using the stored max-pooling indices from the corresponding encoder feature maps. This step produces sparse feature maps. These feature maps are then convolved with a trainable decoder filter bank to produce dense feature maps. The final decoder produces a multi-channel feature map, rather than a 3-channel (RGB) one, which is fed to a soft-max classifier. This soft-max classifies each pixel independently; its output is a K-channel image of probabilities, where K is the number of classes. The predicted segmentation corresponds to the class with maximum probability at each pixel.
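To make these mechanics concrete, below is a minimal sketch in PyTorch of one encoder-decoder pair (our own illustration, not the authors' released Caffe code): the encoder stores the max-pooling indices, and the decoder uses them to upsample before a trainable convolution densifies the sparse maps. All module and variable names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)

    def forward(self, x):
        x = F.relu(self.bn(self.conv(x)))
        # indices record the argmax location within each 2x2 window
        # (2 bits per window in principle), far cheaper to store than
        # the full float-precision feature map.
        x, indices = self.pool(x)
        return x, indices

class DecoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x, indices):
        x = self.unpool(x, indices)           # sparse upsampled map
        return F.relu(self.bn(self.conv(x)))  # densified by learned filters

# Toy usage: one stage, then per-pixel scores for K classes.
K = 11
enc, dec = EncoderBlock(3, 64), DecoderBlock(64, 64)
classifier = nn.Conv2d(64, K, kernel_size=1)
img = torch.randn(1, 3, 360, 480)
feat, idx = enc(img)
logits = classifier(dec(feat, idx))   # shape (1, K, 360, 480)
probs = F.softmax(logits, dim=1)      # per-pixel class probabilities
```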
Compared to SegNet, U-Net (proposed for the medical imaging community) does not reuse pooling indices but instead transfers the entire feature map (at the cost of more memory) to the corresponding decoder and concatenates it with the upsampled (via deconvolution) decoder feature map. Architecturally, there is also no conv5 and max-pool 5 block in U-Net as there is in the VGG net. SegNet, on the other hand, uses all of the pre-trained convolutional layer weights from the VGG net as pre-trained weights.
3.1 Decoder Variants
Many segmentation architectures [FCN][DeepLab][DeconvNet] share the same encoder network and differ only in the form of their decoder networks. We choose to compare the SegNet decoding technique with the widely used fully convolutional network (FCN) decoding technique [FCN][CRFasRNN].
To analyze SegNet and compare its performance with FCN decoder variants, we use a smaller version of SegNet, termed SegNet-Basic, which has 4 encoders and 4 decoders. Furthermore, we choose a constant kernel size of 7×7 over all the encoder and decoder layers to provide a wide context for smooth labelling; that is, a pixel in the deepest layer feature map (layer 4) can be traced back to a context window of 106×106 pixels in the input image (as the worked check below shows). The small size of SegNet-Basic allows us to explore many different (decoder) variants and train them in a reasonable time. Similarly, we create FCN-Basic, a comparable version of FCN for our analysis, which shares the same encoder network as SegNet-Basic but uses the FCN decoding technique (see Figure 3) in all of its decoders. A smaller variant uses single-channel decoder filters, i.e., each filter convolves only its corresponding upsampled feature map. This variant (SegNet-Basic-SingleChannelDecoder) significantly reduces the number of trainable parameters and the inference time.
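The 106×106 receptive field figure can be verified with a short piece of arithmetic; the sketch below is our own check, not code from the paper.

```python
# Receptive field of the deepest SegNet-Basic feature map: four stages,
# each a 7x7 convolution (stride 1) followed by 2x2 max-pooling (stride 2).
def receptive_field(layers):
    r, j = 1, 1  # receptive field size and cumulative stride ("jump")
    for kernel, stride in layers:
        r += (kernel - 1) * j
        j *= stride
    return r

stages = [(7, 1), (2, 2)] * 4       # (kernel, stride) per layer
print(receptive_field(stages))      # -> 106
```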
An important design element of the FCN model is the dimensionality reduction step applied to the encoder feature maps. This compresses the encoder feature maps, which are then used in the corresponding decoders. The upsampling kernels are initialized with bilinear interpolation weights.
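As an illustration of this decoding step, the sketch below (our PyTorch rendering with hypothetical names; the paper's experiments use Caffe) performs the dimensionality reduction of an encoder feature map, upsamples the decoder input with a transposed convolution initialized to bilinear interpolation, and adds the two element-wise.

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, k):
    """Bilinear interpolation weights for a k x k upsampling kernel."""
    factor = (k + 1) // 2
    center = factor - 1 if k % 2 == 1 else factor - 0.5
    og = torch.arange(k, dtype=torch.float32)
    filt = 1 - (og - center).abs() / factor
    w = torch.zeros(channels, channels, k, k)
    for c in range(channels):
        w[c, c] = filt[:, None] * filt[None, :]  # each channel upsampled independently
    return w

class FCNDecoderStep(nn.Module):
    def __init__(self, enc_ch, K):
        super().__init__()
        self.reduce = nn.Conv2d(enc_ch, K, kernel_size=1)  # compress encoder map
        self.up = nn.ConvTranspose2d(K, K, kernel_size=4, stride=2, padding=1)
        with torch.no_grad():                              # bilinear init,
            self.up.weight.copy_(bilinear_kernel(K, 4))    # still trainable

    def forward(self, x, enc_feat):
        return self.up(x) + self.reduce(enc_feat)          # skip addition

x = torch.randn(1, 11, 45, 60)             # decoder input (K channels)
enc_feat = torch.randn(1, 64, 90, 120)     # corresponding encoder feature map
out = FCNDecoderStep(64, 11)(x, enc_feat)  # -> (1, 11, 90, 120)
```

Dropping the `+ self.reduce(enc_feat)` term gives the no-addition variant discussed next.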
We can also create a variant of the FCN-Basic model which discards the encoder feature map addition step and learns only the upsampling kernels (FCN-Basic-NoAddition).
In addition to the above variants, we study upsampling using fixed bilinear interpolation weights, which requires no learning for upsampling (Bilinear-Interpolation). At the other extreme, we can add the 64 encoder feature maps at each layer to the corresponding output feature maps of the SegNet decoder to create a more memory-intensive variant of SegNet (SegNet-Basic-EncoderAddition). Here the max-pooling indices are used for upsampling, followed by a convolution step to densify the sparse input. The result is then added element-wise to the corresponding encoder feature maps to produce the decoder output.
Another, more memory-intensive, FCN-Basic variant (FCN-Basic-NoDimReduction) is one where the encoder feature maps are not reduced in dimensionality. This implies that, unlike FCN-Basic, the final encoder feature maps are not compressed to K channels before being passed to the decoder network. Therefore, the number of channels at the end of each decoder is the same as for the corresponding encoder (i.e., 64).
We also tried other generic variants where feature maps were simply upsampled by replication, or by using a fixed (and sparse) array of indices for upsampling. These performed quite poorly in comparison to the variants above. A variant without max-pooling and sub-sampling in the encoder network (where the decoders are redundant) consumed more memory, took longer to converge, and performed poorly. Finally, please note that, to encourage reproduction of our results, we have released the Caffe implementations of all the variants.
3.2 Training
We use the CamVid road scene dataset to benchmark the performance of the decoder variants. This dataset is small, consisting of 367 training and 233 testing RGB images (daytime and dusk scenes) at 360×480 resolution. The challenge is to segment 11 classes such as road, building, cars, pedestrians, signs, poles, and sidewalk. We perform local contrast normalization on the RGB input.
The encoder and decoder weights are all initialized using the technique of He et al. To train all the variants we use stochastic gradient descent (SGD) with a fixed learning rate of 0.1 and momentum of 0.9, using our Caffe implementation of SegNet-Basic. Before each epoch, the training set is shuffled, and each mini-batch (12 images) is then picked in order, thus ensuring that each image is used only once per epoch. We select the model which performs best on the validation dataset.
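A minimal training-loop sketch under these settings might look as follows (PyTorch, our illustration; `evaluate` is a user-supplied validation metric such as global accuracy, and the class weights described below can be passed through the loss).

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, val_set, class_weights, evaluate, epochs=100):
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    # The paper sums the cross-entropy loss over all pixels in the mini-batch.
    loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights, reduction='sum')
    loader = DataLoader(train_set, batch_size=12, shuffle=True)  # reshuffled per epoch
    best_score, best_state = float('-inf'), None
    for epoch in range(epochs):
        model.train()
        for images, labels in loader:   # each image used once per epoch
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
        score = evaluate(model, val_set)
        if score > best_score:          # keep the best model on validation
            best_score, best_state = score, model.state_dict()
    return best_state
```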
We use the cross-entropy loss as the objective function for training the network. The loss is summed over all the pixels in a mini-batch. When there is large variation in the number of pixels in each class in the training set (e.g., road, sky, and building pixels dominate the CamVid dataset), the loss needs to be weighted differently based on the true class. This is termed class balancing. We use median frequency balancing, where the weight assigned to a class in the loss function is the ratio of the median of the class frequencies, computed on the entire training set, to the class frequency. This implies that larger classes in the training set have a weight smaller than 1, and the weights of the smallest classes are the highest. We also experimented with training the different variants without class balancing, or equivalently using natural frequency balancing.
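The weight computation can be sketched as follows (our reading of median frequency balancing, with freq(c) taken as the pixel count of class c divided by the total pixels of the images where c is present; labels are assumed to lie in [0, num_classes)).

```python
import numpy as np

def median_freq_weights(label_images, num_classes):
    pixel_count = np.zeros(num_classes)   # pixels of each class
    image_pixels = np.zeros(num_classes)  # total pixels of images containing the class
    for lab in label_images:
        for c in np.unique(lab):
            pixel_count[c] += np.sum(lab == c)
            image_pixels[c] += lab.size
    freq = pixel_count / image_pixels
    # weight > 1 for rare classes, < 1 for frequent ones
    return np.median(freq) / freq
```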
3.3 Analysis
To quantify the performance of the different decoder variants, we use the following measures: global accuracy (G), which measures the percentage of pixels correctly classified over the whole dataset; class average accuracy (C), the mean of the predictive accuracy over all classes; and the mean intersection over union (mIoU) as used in the Pascal VOC12 challenge. The mIoU metric is a more stringent measure than class average accuracy since it penalizes false positive predictions. However, the mIoU metric is not optimized directly by the class-balanced cross-entropy loss.
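All three region measures follow from a single confusion matrix; the sketch below (ours) assumes `M[i, j]` counts pixels with true class i predicted as class j.

```python
import numpy as np

def segmentation_metrics(M):
    tp = np.diag(M).astype(float)
    G = tp.sum() / M.sum()                           # global accuracy
    C = np.mean(tp / M.sum(axis=1))                  # class average accuracy
    iou = tp / (M.sum(axis=1) + M.sum(axis=0) - tp)  # per-class intersection/union
    return G, C, iou.mean()                          # G, C, mIoU
```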
The mIoU metric, otherwise known as the Jaccard Index, is most commonly used in benchmarking. However, Csurka et al. note that this metric does not always conform to human qualitative judgments (ranks) of segmentation quality. They show with examples that mIoU favours region smoothness and does not evaluate boundary accuracy, a point the FCN authors have also recently noted. Hence they propose to complement the mIoU metric with a boundary measure based on the Berkeley contour matching score, commonly used to evaluate unsupervised image segmentation quality. Csurka et al. simply extend it to semantic segmentation and show that a measure of semantic contour accuracy used in conjunction with the mIoU metric agrees better with human ranking of segmentation outputs.
The key idea in computing a semantic contour score is to evaluate the F1 measure, which involves computing the precision and recall values between the predicted and ground truth class boundaries given a pixel tolerance distance. We used a value of 0.75% of the image diagonal as the tolerance distance. The F1 measure for each class present in the ground truth test image is averaged to produce an image F1 measure. The boundary F1 measure (BF) for the whole test set is then the average of the image F1 measures.
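A hedged sketch of the per-class boundary F1 computation (our approximation using SciPy distance transforms, not the exact benchmark implementation):

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def class_boundary(mask):
    return mask & ~binary_erosion(mask)   # one-pixel-wide class boundary

def boundary_f1(pred, gt, cls, tol_frac=0.0075):
    tol = tol_frac * np.hypot(*gt.shape)  # 0.75% of the image diagonal
    pb, gb = class_boundary(pred == cls), class_boundary(gt == cls)
    if not pb.any() or not gb.any():
        return 0.0
    d_to_gt = distance_transform_edt(~gb)    # distance to nearest GT boundary
    d_to_pred = distance_transform_edt(~pb)  # distance to nearest predicted boundary
    precision = (d_to_gt[pb] <= tol).mean()  # predicted boundary pixels near GT
    recall = (d_to_pred[gb] <= tol).mean()   # GT boundary pixels near prediction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```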
Although we train the variants with class balancing, it is still important to achieve high global accuracy, since this implies an overall smooth segmentation. We also observe that the reported numerical performance, when the class average accuracy is highest, often corresponds to low global accuracy, indicating a perceptually noisy segmentation output.
Table 1 shows the results of our analysis.
In the best case, when both memory and inference time are not constrained, larger models such as FCN-Basic-NoDimReduction and SegNet-EncoderAddition are more accurate than the other variants. In particular, discarding dimensionality reduction in the FCN-Basic model results in the best performance among the FCN-Basic variants, with a high BF score. This once again emphasizes the trade-off between memory and accuracy in segmentation architectures.
We can now summarize the above analysis with the following general points:
1) The best performance is achieved when the encoder feature maps are stored in full. This is reflected most clearly in the semantic contour delineation metric (BF).
2) When memory at inference is constrained, compressed forms of the encoder feature maps (dimensionality reduction, max-pooling indices) can be stored and used with an appropriate decoder (e.g., the SegNet type) to improve performance.
3) Larger decoders increase performance for a given encoder network.
4 Benchmarking
We compare SegNet with FCN, DeepLab-LargeFOV, and DeconvNet. Qualitative results are shown in Figure 4.
The qualitative results show the ability of the proposed architecture to segment smaller classes in road scenes while producing a smooth segmentation of the overall scene. Indeed, under the controlled benchmark setting, SegNet shows superior performance compared to some of the larger models. DeepLab-LargeFOV is the most efficient model, and coupled with CRF post-processing it can produce competitive results, although smaller classes are lost. FCN with learned deconvolution is clearly better than with fixed bilinear upsampling. DeconvNet is the largest model and the most inefficient to train. Its predictions do not retain small classes. DeconvNet has higher boundary delineation accuracy, but SegNet is much more efficient by comparison. This can be seen from the compute statistics in Table 6: the networks with fully connected layers (turned into convolutional layers) train much more slowly and have comparable or higher forward-backward pass times than SegNet. Here we also note that convergence was not an issue when training these larger models, since their metrics showed an increasing trend over a number of iterations comparable to SegNet.
For the FCN model, learning the deconvolutional kernels, instead of fixing them with bilinear interpolation weights, improves performance, particularly the BF score. It also achieves higher metrics in a shorter training time.
Surprisingly, DeepLab-LargeFOV, which predicts labels at 45×60 resolution, produces competitive results given that it is the smallest parameterized model and also has the fastest training time, as shown in Table 6. However, its boundary accuracy is poorer, which is shared by the other architectures. DeconvNet's BF score is higher than those of the other networks, at the cost of long training time.
Adding a CRF increases the G and mIoU values and greatly improves the BF value, but decreases the C value.
5 Discussion and Future Work
Deep learning models have increasingly been successful owing to the availability of massive datasets and expanding model depth and parameterization. In practice, however, factors such as memory and computation time during training and testing are important when choosing a model from a large bank of models. Training time becomes an important consideration, particularly when, as our experiments show, the performance gain is not commensurate with the increased training time. Test-time memory and computational load are important for deploying models on specialized embedded devices, for example in AR applications. From an overall efficiency viewpoint, smaller and more memory- and time-efficient models for real-time applications such as road scene understanding and AR have received less attention. This was the primary motivation behind the proposal of SegNet, which is significantly smaller and faster than other competing architectures while demonstrating its efficiency on tasks such as road scene understanding.
Segmentation challenges such as Pascal and MS-COCO are object segmentation challenges in which only a few classes are present in any given test image. Scene segmentation is more challenging due to the high variability of indoor scenes and the need to segment more classes simultaneously. The task of outdoor and indoor scene segmentation is also more practically oriented towards current applications such as autonomous driving, robotics, and AR.
We have chosen to benchmark a selection of deep segmentation architectures using metrics such as the boundary F1 measure (BF) to supplement the existing metrics, which are more biased towards region accuracy. It is evident from our experiments and other independent benchmarks that outdoor scene images captured from a moving car are easier to segment, and deep architectures perform robustly on them. We hope our experiments encourage researchers to turn their attention to the more challenging indoor scene segmentation task.
An important choice we had to face when benchmarking deep architectures with differing parameterizations was how to train them. Many of these architectures have used a host of supporting techniques and multi-stage training recipes to arrive at high accuracies on datasets, but this makes it difficult to gather evidence about their true performance under time and memory constraints. Instead, we chose to perform a controlled benchmarking, where we used batch normalization and the same solver (SGD) to train end-to-end. However, we note that this approach cannot entirely disentangle the effects of the model versus the solver (optimization) in achieving a particular result. This is mainly because training these networks involves gradient back-propagation, which is imperfect, and the optimization is a very large-scale non-convex problem. Acknowledging these shortcomings, we hope this controlled analysis complements other benchmarks and reveals the practical trade-offs involved in different well-known architectures.
For the future, we would like to exploit the understanding of segmentation architectures gathered from our analysis to design more efficient architectures for real-time applications. We are also interested in estimating the model uncertainty of the predictions from deep segmentation architectures.
6 Conclusion
We presented SegNet, a deep convolutional network architecture for semantic segmentation. The main motivation behind SegNet was the need to design an efficient architecture for road and indoor scene understanding which is efficient both in terms of memory and computation time. We analyzed SegNet and compared it with other important variants to reveal the practical trade-offs involved in designing architectures for segmentation, particularly training time, memory, and accuracy. Architectures which store the encoder network feature maps in full perform best but consume more memory at inference time. SegNet, on the other hand, is more efficient since it only stores the max-pooling indices of the feature maps and uses them in its decoder network to achieve good performance. On large and well-known datasets SegNet performs competitively, achieving high scores for road scene understanding. End-to-end learning of deep segmentation architectures is a harder challenge, and we hope to see more attention paid to this important problem.
"Thesis translation" Segnet:a deep convolutional encoder-decoder Architecture for Image segmentation