Real-Time Style Transfer and Super-Resolution Based on Perceptual Loss Functions
Article: "Perceptual Losses for Real-Time Style Transfer and Super-Resolution"
From: Real-time style conversion and super-resolution reconstruction based on perceptual loss function (Zhwhong)
Abstract: We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on image style transfer, where a feed-forward network is trained to solve, in real time, the optimization problem proposed by Gatys et al. Compared to the optimization-based method, our network gives similar qualitative results but is three orders of magnitude faster. We also experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gives visually pleasing results.
Keywords: style transfer, super-resolution, deep learning
1. Introduction
Many classic problems can be framed as image transformation tasks, in which a system receives an input image and transforms it into an output image. Examples from image processing include denoising, super-resolution, and colorization, where the input is a degraded image (noisy, low-resolution, or grayscale) and the output is a high-quality color image. Examples from computer vision include semantic segmentation and depth estimation, where the input is a color image and the output image encodes semantic or geometric information about the scene.
One approach to image transformation tasks is to train a feed-forward convolutional neural network in a supervised manner, using a per-pixel loss function to measure the difference between the output and ground-truth images. This approach has been used by Dong et al. for super-resolution, by Cheng et al. for colorization, by Long et al. for segmentation, and by Eigen et al. for depth and surface normal prediction. Such approaches are efficient at test time, requiring only a single forward pass through the trained network.
However, the per-pixel losses used by these methods do not capture perceptual differences between output and ground-truth images. For example, consider two identical images offset from each other by one pixel: although they are perceptually the same, they are very different as measured by a per-pixel loss.
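The one-pixel-offset argument can be checked numerically. The following is a minimal sketch in plain Python, assuming a toy 8x8 grayscale image of one-pixel-wide stripes; the helper names `mse` and `shift_right` are illustrative and do not come from the paper.

```python
def mse(a, b):
    """Mean squared per-pixel difference between two H x W grayscale images."""
    n = len(a) * len(a[0])
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


def shift_right(img):
    """Shift every row one pixel to the right, wrapping the last column around."""
    return [row[-1:] + row[:-1] for row in img]


# Toy image: vertical stripes, one pixel wide, alternating 0 and 255.
stripes = [[255 * (x % 2) for x in range(8)] for _ in range(8)]
shifted = shift_right(stripes)

print(mse(stripes, stripes))  # 0.0
print(mse(stripes, shifted))  # 65025.0, the largest value possible for 8-bit pixels
```

The shifted image shows the same stripe pattern, yet every pixel disagrees with the original, so the per-pixel loss reaches its maximum.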
In parallel, recent work has shown that high-quality images can be generated using perceptual loss functions, which are based not on differences between pixels but on differences between high-level image features extracted from a pretrained CNN. Images are generated by minimizing the loss function. This strategy has been applied to feature inversion (Mahendran et al.), feature visualization (Simonyan et al.), and texture synthesis and style transfer [9,10] (Gatys et al.). These methods produce very high-quality images, but they are slow because generating each image requires a lengthy iterative optimization.
In this paper we combine the benefits of these two approaches. We train feed-forward networks for image transformation tasks, but instead of per-pixel loss functions that depend only on low-level pixel information, we use perceptual loss functions that depend on high-level features from a pretrained loss network. During training, perceptual losses measure image similarity more robustly than per-pixel losses; at test time, the transformation networks run in real time.
We experiment on two tasks: style transfer and single-image super-resolution. Both are inherently ill-posed: style transfer has no single correct output, and in super-resolution many high-resolution images could have produced the same low-resolution input. Moreover, both tasks require semantic reasoning about the input image. For style transfer, the output must be semantically similar to the input despite drastic changes in color and texture. For super-resolution, fine details must be inferred from visually ambiguous low-resolution inputs. In principle, a high-capacity neural network trained for either task could implicitly learn the relevant semantics of the input image; in practice, however, we need not learn from scratch, because perceptual loss functions allow semantic knowledge to be transferred directly from the loss network to the transformation network.
Figure 1: Example results. Top row: style transfer. Bottom row: x4 super-resolution.
For style transfer, our feed-forward networks are trained to solve the optimization problem of Gatys et al.; our results are similar to theirs, both qualitatively and as measured by objective function value, but are three orders of magnitude faster to generate. For super-resolution, we show that replacing the per-pixel loss with a perceptual loss gives visually pleasing results for x4 and x8 super-resolution.
2. Related Work
Feed-forward image transformation: In recent years, a wide variety of image transformation tasks have been solved by training deep convolutional neural networks with per-pixel loss functions.
Semantic segmentation methods [3,5,12,13,14,15] produce dense scene labels by running a network in a fully convolutional manner over the input image, trained with a per-pixel classification loss. One approach moves beyond per-pixel losses by framing CRF inference as a recurrent layer (CRF-as-RNN) that is trained jointly with the rest of the network. The architecture of our transformation networks is inspired by this line of work: it uses in-network downsampling to reduce the spatial extent of the feature maps, followed by in-network upsampling to produce the final output image.
Recent methods for depth estimation [5,4,16] and surface normal estimation [5,17] are similar: they transform a color input image into a geometrically meaningful output image using a feed-forward convolutional network trained with a per-pixel regression [4,5] or classification loss. Some methods move beyond per-pixel losses by penalizing image gradients or by using a CRF loss layer to enforce local consistency in the output image. A feed-forward model has also been trained with a per-pixel loss to colorize grayscale images.
Perceptual optimization: A number of recent papers use optimization to generate images where the objective is perceptual, depending on high-level features extracted from a CNN. Images can be generated to maximize class prediction scores [7,8] or individual features, in order to understand the functions encoded in trained networks. Similar optimization techniques can also be used to generate high-confidence fooling images [18,19].
Mahendran and Vedaldi invert features from convolutional networks by minimizing a feature reconstruction loss, in order to understand the image information retained by different network layers. Similar methods had previously been used to invert local binary descriptors and HOG features.
The work of Dosovitskiy and Brox is particularly relevant to ours: they train a feed-forward neural network to invert convolutional features, quickly approximating a solution to the optimization problem posed in [6]. However, their feed-forward network is trained with a per-pixel reconstruction loss, whereas our networks directly optimize the feature reconstruction loss of [6].
Style transfer: Gatys et al. perform artistic style transfer, combining the content of one image with the style of another by jointly minimizing a feature reconstruction loss and a style reconstruction loss, both based on high-level features extracted from a pretrained network. A similar method had previously been used for texture synthesis. Their method produces very high-quality results, but it is computationally expensive because each step of the iterative optimization requires a forward and a backward pass through the entire pretrained network. To overcome this computational burden, we train a feed-forward neural network that quickly approximates a solution to their optimization problem.
Image super-resolution: Image super-resolution is a classic problem for which a wide variety of techniques have been developed. Yang et al. provide a detailed evaluation of the techniques that prevailed before the widespread adoption of CNNs. They group super-resolution techniques into prediction-based methods (bilinear, bicubic, Lanczos), edge-based methods [25,26], statistical methods [27,28,29], patch-based methods [25,30,31,32,33], and sparse dictionary methods [37,38]. More recently, state-of-the-art performance on single-image super-resolution was achieved with a three-layer convolutional neural network trained with a per-pixel loss. Other recent state-of-the-art methods include [39,40,41].
3. Method
As shown in Figure 2, our system consists of two components: an image transformation network f_W and a loss network φ that is used to define several loss functions L_1, ..., L_k. The image transformation network is a deep residual convolutional network parameterized by weights W; it transforms an input image x into an output image ŷ via the mapping ŷ = f_W(x). Each loss function computes a scalar value L_i(ŷ, y_i) that measures the difference between the output image ŷ and a target image y_i. The image transformation network is trained with stochastic gradient descent to minimize a weighted combination of the loss functions.
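The training objective described in this paragraph can be written compactly. The formula below is a sketch in the notation of this section; the weights \lambda_i and the symbol W^{*} for the trained weights are assumptions introduced here for illustration.

```latex
W^{*} = \arg\min_{W} \; \mathbf{E}_{x,\{y_i\}} \left[ \sum_{i} \lambda_i \, L_i\big(f_W(x),\, y_i\big) \right]
```

The expectation is taken over training inputs x and their targets y_i, and stochastic gradient descent approximates it with mini-batches.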
Figure 2: System overview. Left: the image transformation network. Right: the loss network, a pretrained VGG-16 that stays fixed during training.
To address the shortcomings of per-pixel losses and to let our loss functions better measure perceptual and semantic differences between images, we draw inspiration from recent work that generates images by optimization [6,7,8,9,10]. The key insight of these methods is that a CNN pretrained for image classification has already learned to encode the perceptual and semantic information we would like to measure in our loss functions. We therefore use a network φ pretrained for image classification as a fixed loss network to define our loss functions. Our transformation network is thus a CNN trained with loss functions that are themselves CNNs.
The loss network φ is used to define a feature reconstruction (content) loss Lfeat and a style reconstruction loss Lstyle, which measure differences in content and in style between images. For each input image x we have a content target yc and a style target ys. For style transfer, the content target yc is the input image x, and the output image ŷ should combine the content of x = yc with the style of ys; we train one network per style target. For single-image super-resolution, the input image x is a low-resolution image, the content target yc is the ground-truth high-resolution image, and the style reconstruction loss is not used; we train one network per super-resolution factor.
3.1 Image Transformation Network
Our image transformation networks roughly follow the architectural guidelines of Radford et al. We do not use any pooling layers; instead we use strided and fractionally strided convolutions (http://www.jiqizhixin.com/article/1417) for in-network downsampling and upsampling. The body of the network consists of five residual blocks. All non-residual convolutional layers are followed by spatial batch normalization [45] and a ReLU nonlinearity, with the exception of the output layer, which instead uses a scaled tanh to ensure that the pixels of the output image lie in the range [0, 255]. Other than the first and last layers, which use 9x9 kernels, all convolutional layers use 3x3 kernels. The exact architectures of all our networks can be found in the supplementary material.
Inputs and outputs: For style transfer, the input and output are both color images of shape 3x256x256. For super-resolution with an upsampling factor f, the output is a high-resolution image patch of shape 3x288x288 and the input is a low-resolution patch of shape 3x(288/f)x(288/f). Because the image transformation networks are fully convolutional, at test time they can be applied to images of any resolution.
Downsampling and upsampling: For super-resolution with an upsampling factor f, we use several residual blocks followed by log2(f) convolutional layers with stride 1/2. This differs from earlier work that upsamples the low-resolution input with bicubic interpolation before passing it to the network. Rather than relying on a fixed upsampling function, fractionally strided convolution allows the upsampling function to be learned jointly with the rest of the network.
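As a quick check of the sizes involved, the sketch below computes the number of stride-1/2 layers and the low-resolution input shape for a given factor. The function names are hypothetical, and the 288-pixel patch size is the one quoted in the text.

```python
import math


def num_upsampling_layers(f):
    """Number of stride-1/2 convolutions for upsampling factor f (a power of two)."""
    n = int(math.log2(f))
    if 2 ** n != f:
        raise ValueError("f must be a power of two")
    return n


def lowres_shape(f, hr_size=288):
    """Shape (channels, height, width) of the low-resolution input patch."""
    return (3, hr_size // f, hr_size // f)


for f in (4, 8):
    print(f, num_upsampling_layers(f), lowres_shape(f))
# 4 2 (3, 72, 72)
# 8 3 (3, 36, 36)
```

Each stride-1/2 layer doubles the spatial size, so log2(f) of them recover the 288x288 output from the reduced input.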
Figure 3: Similar to [6], we use optimization to find an image ŷ that minimizes the feature reconstruction loss for several layers of the pretrained VGG-16 loss network. As we reconstruct from higher layers, image content and overall spatial structure are preserved, but color, texture, and exact shape are not.
For style transfer, our networks use two stride-2 convolutions to downsample the input, followed by several residual blocks, and then two convolutional layers with stride 1/2 to upsample. Although the input and output have the same size, there are several benefits to networks that downsample and then upsample.
The first benefit is computational. With a naive implementation, a 3x3 convolution with C filters on an input of size C x H x W requires 9HWC^2 multiply-adds, which is the same cost as a 3x3 convolution with DC filters on an input of size DC x H/D x W/D. After downsampling, we can therefore use a larger network for the same computational cost.
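The cost equivalence follows from 9 (H/D)(W/D)(DC)^2 = 9HWC^2, and it can be verified with concrete numbers. The sketch below assumes a convolution with as many filters as input channels; the function name and the example sizes are illustrative.

```python
def conv3x3_multiply_adds(channels, height, width):
    """Multiply-adds of a naive 3x3 convolution with `channels` input
    channels and the same number of filters on a height x width input."""
    return 9 * height * width * channels ** 2


C, H, W = 64, 256, 256
for D in (1, 2, 4):
    print(D, conv3x3_multiply_adds(D * C, H // D, W // D))
# Every line reports the same cost: 2415919104
```

Downsampling by D therefore lets each layer carry D times as many channels at no extra cost.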
The second benefit is effective receptive field size. High-quality style transfer requires changing large parts of the image in a coherent way, so it is advantageous for each pixel of the output to have a large effective receptive field in the input. Without downsampling, each additional 3x3 convolutional layer increases the effective receptive field size by 2. After downsampling by a factor of D, each 3x3 convolution instead increases the effective receptive field size by 2D, giving larger effective receptive fields with the same number of layers.
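The receptive field claim can be reproduced with the standard recurrence for stacked convolutions. This is a minimal sketch that assumes layers are described by hypothetical (kernel size, stride) pairs listed from input to output; it ignores padding, which does not affect the receptive field size.

```python
def receptive_field(layers):
    """Receptive field size, in input pixels, of one output unit.

    `layers` is a list of (kernel_size, stride) pairs from input to output.
    """
    size, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
    return size


plain = [(3, 1)] * 5                       # five 3x3 layers, no downsampling
down = [(3, 2), (3, 2)] + [(3, 1)] * 5     # two stride-2 layers first, so D = 4

print(receptive_field(plain))                                  # 11
print(receptive_field(down))                                   # 47
print(receptive_field(down + [(3, 1)]) - receptive_field(down))  # 8, i.e. 2D
```

With D = 4, each extra 3x3 layer adds 8 input pixels to the receptive field instead of 2.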
Residual connections: He et al. use residual connections to train very deep networks for image classification. They argue that residual connections make it easy for the network to learn the identity function; this is an appealing property for image transformation networks, since in most cases the output image should share structure with the input image. The body of our network therefore consists of several residual blocks, each containing two 3x3 convolutional layers. The residual block design we use is described in the supplementary material.
3.2 Perceptual Loss Functions
We define two perceptual loss functions that measure high-level perceptual and semantic differences between two images. Both make use of a loss network φ pretrained for image classification. In our experiments, φ is the VGG-16 network pretrained on the ImageNet dataset.
Figure 4: Similar to Gatys et al., we use optimization to find an image ŷ that minimizes the style reconstruction loss for several layers of the pretrained VGG-16 loss network. The image ŷ preserves stylistic features but not spatial structure.
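Feature reconstruction (content) loss: Rather than encouraging the pixels of the output image ŷ to match the pixels of the target image y exactly, we encourage the two images to have similar high-level feature representations as computed by the VGG loss network. This is the same content loss used in the artistic style work of Gatys et al., which used VGG-19 features. Let φj(x) be the activations of the j-th layer of φ for the input x, a feature map of shape Cj x Hj x Wj. The feature reconstruction loss is the squared Euclidean distance between φj(ŷ) and φj(y), divided by Cj Hj Wj.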
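Written out, the feature reconstruction loss takes the following form. This is a reconstruction from the definitions in this section, where \phi_j(x) denotes the activations of layer j of the loss network \phi for input x, with shape C_j x H_j x W_j; the exact typesetting is an assumption.

```latex
L_{feat}^{\phi,j}(\hat{y}, y) = \frac{1}{C_j H_j W_j} \left\| \phi_j(\hat{y}) - \phi_j(y) \right\|_2^2
```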
As demonstrated in [6] and reproduced in Figure 3, finding an image ŷ that minimizes the feature reconstruction loss for early layers tends to produce images that are visually indistinguishable from y. When reconstructing from higher layers, image content and overall spatial structure are preserved, but color, texture, and exact shape are not. Training our image transformation networks with a feature reconstruction loss encourages the output image ŷ to be perceptually similar to the target image y without forcing the two to match exactly.
Style reconstruction loss: The feature reconstruction loss penalizes the output image ŷ when it deviates in content from the target y. We also want to penalize differences in style: colors, textures, common patterns, and so on. To achieve this, Gatys et al. propose the style reconstruction loss defined as follows.
Let φj(x) be the activations at the j-th layer of the network φ for the input x, a feature map of shape Cj x Hj x Wj. Define the Gram matrix Gj(x) to be the Cj x Cj matrix whose entry (c, c') is the inner product of feature channels c and c' of φj(x), summed over all spatial positions and divided by Cj Hj Wj.
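The elements of the Gram matrix can be written explicitly. The formula below is reconstructed from the surrounding definitions (a C_j x C_j matrix, normalized by C_j H_j W_j, equal to \psi\psi^{T} / C_j H_j W_j); the index names h, w, c, c' are notation assumed here.

```latex
G_j(x)_{c,c'} = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \phi_j(x)_{h,w,c} \; \phi_j(x)_{h,w,c'}
```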
If we interpret φj(x) as giving a Cj-dimensional feature for each point on an Hj x Wj grid, then Gj(x) is proportional to the uncentered covariance of these Cj-dimensional features, with each grid location treated as an independent sample. It therefore captures which features tend to activate together. The Gram matrix can be computed efficiently by reshaping φj(x) into a matrix ψ of shape Cj x HjWj; then Gj(x) = ψψ^T / (Cj Hj Wj). The style reconstruction loss is the squared Frobenius norm of the difference between the Gram matrices of the output and target images.
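The reshape-and-multiply computation can be sketched in a few lines. The example below uses plain nested lists instead of a tensor library to stay dependency-free; the function names are illustrative, and the style loss follows the squared-Frobenius-norm definition of Gatys et al. rather than any code from the paper.

```python
def gram_matrix(features):
    """Return the C x C Gram matrix of a C x H x W feature map (nested lists)."""
    c, h, w = len(features), len(features[0]), len(features[0][0])
    # Reshape to psi with shape C x (H*W): one flattened row per channel.
    psi = [[value for row in channel for value in row] for channel in features]
    norm = c * h * w
    # G = psi psi^T / (C * H * W)
    return [[sum(a * b for a, b in zip(psi[i], psi[k])) / norm for k in range(c)]
            for i in range(c)]


def style_loss(features_a, features_b):
    """Squared Frobenius norm of the difference between two Gram matrices."""
    ga, gb = gram_matrix(features_a), gram_matrix(features_b)
    return sum((p - q) ** 2 for ra, rb in zip(ga, gb) for p, q in zip(ra, rb))


small = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]                        # 2 channels, 2x2
large = [[[1] * 3 for _ in range(3)], [[0] * 3 for _ in range(3)]]  # 2 channels, 3x3

print(gram_matrix(small))        # [[3.75, 0.625], [0.625, 0.25]]
print(style_loss(small, large))  # defined even though the spatial sizes differ
```

Because the Gram matrix depends only on the number of channels, feature maps of different spatial sizes can be compared directly.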
The style reconstruction loss is well defined even when the output ŷ and the target y have different sizes, because their Gram matrices Gj(ŷ) and Gj(y) always have the same shape.
As demonstrated by Gatys et al. and reproduced in Figure 5, generating an image ŷ that minimizes the style reconstruction loss preserves the stylistic features of the target image but not its spatial structure.
To perform style reconstruction from a set of layers J rather than from a single layer j, we define Lstyle(ŷ, y) to be the sum of the per-layer losses over all layers j in J.
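In symbols, the single-layer and multi-layer style reconstruction losses can be written as follows. This is a reconstruction that assumes G_j(x) denotes the Gram matrix of the layer-j features and \| \cdot \|_F the Frobenius norm, following the definition of Gatys et al.

```latex
L_{style}^{\phi,j}(\hat{y}, y) = \left\| G_j(\hat{y}) - G_j(y) \right\|_F^2
\qquad
L_{style}^{\phi,J}(\hat{y}, y) = \sum_{j \in J} L_{style}^{\phi,j}(\hat{y}, y)
```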
3.3 Simple Loss Functions
In addition to the perceptual losses, we define two simple loss functions that depend only on low-level pixel information.
Pixel loss: The pixel loss is the normalized Euclidean distance between the output image ŷ and the target y. If both have shape C x H x W, then the pixel loss is Lpixel(ŷ, y) = ||ŷ - y||₂² / (CHW). It can only be used when we have a ground-truth target y that the network is expected to match.
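A direct transcription of this formula, as a minimal sketch on nested C x H x W lists; the function name and the toy images are illustrative.

```python
def pixel_loss(y_hat, y):
    """Squared Euclidean distance between two C x H x W images, divided by C*H*W."""
    flat_hat = [v for channel in y_hat for row in channel for v in row]
    flat = [v for channel in y for row in channel for v in row]
    if len(flat_hat) != len(flat):
        raise ValueError("images must have the same shape")
    return sum((a - b) ** 2 for a, b in zip(flat_hat, flat)) / len(flat)


ones = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]   # 3 x 2 x 2, all ones
zeros = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]  # 3 x 2 x 2, all zeros

print(pixel_loss(ones, zeros))  # 1.0
print(pixel_loss(ones, ones))   # 0.0
```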
Total variation regularization: To encourage spatial smoothness in the output image ŷ, we follow prior work on feature inversion [6,20] and super-resolution [48,49] and use a total variation regularizer LTV(ŷ). (Total variation regularization is commonly used in signal denoising.)
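The text does not give the exact form of the regularizer. The sketch below uses one common discrete choice, the anisotropic total variation (the sum of absolute differences between horizontally and vertically adjacent pixels) on a single-channel image; both this choice and the function name are assumptions made for illustration.

```python
def total_variation(img):
    """Anisotropic total variation of an H x W image (nested lists)."""
    h, w = len(img), len(img[0])
    horizontal = sum(abs(img[i][j + 1] - img[i][j])
                     for i in range(h) for j in range(w - 1))
    vertical = sum(abs(img[i + 1][j] - img[i][j])
                   for i in range(h - 1) for j in range(w))
    return horizontal + vertical


flat = [[5] * 4 for _ in range(4)]                              # constant image
checker = [[(i + j) % 2 for j in range(4)] for i in range(4)]   # checkerboard

print(total_variation(flat))     # 0
print(total_variation(checker))  # 24
```

A smooth image scores low and a noisy one scores high, which is why adding this term to the objective favors spatially smooth outputs.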
4. Experiments
We experiment on two image transformation tasks: style transfer and single-image super-resolution. Prior work on style transfer has used optimization to generate images; our feed-forward networks give similar qualitative results but are up to three orders of magnitude faster. Prior work on single-image super-resolution with convolutional neural networks has used a per-pixel loss; we show encouraging qualitative results by using a perceptual loss instead.
4.1 Style Transfer
The goal of style transfer is to generate an image that combines the content of a content image with the style of a style image. We train one image transformation network per style for several hand-picked styles, and compare our results against the baseline method of Gatys et al.
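Baseline: As a baseline, we reimplement the method of Gatys et al. Given style and content targets ys and yc, and layers j and J at which to perform feature and style reconstruction, an image ŷ is generated by minimizing a weighted combination of the feature reconstruction loss, the style reconstruction loss, and the total variation regularizer (Equation 5).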
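The baseline objective, referred to in the text as Equation 5, can be reconstructed from the losses defined in Section 3. The subscripts on the scalar weights \lambda_c, \lambda_s, and \lambda_{TV} are notation assumed here.

```latex
\hat{y} = \arg\min_{y} \; \lambda_c \, L_{feat}^{\phi,j}(y, y_c) + \lambda_s \, L_{style}^{\phi,J}(y, y_s) + \lambda_{TV} \, L_{TV}(y)
```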
The weights λ in this objective are scalars, y is initialized with white noise, and optimization is performed with L-BFGS. We found that unconstrained optimization of Equation 5 usually produces images whose pixel values fall outside the range [0, 255]. For a fairer comparison with our method, whose output is constrained to this range, the baseline uses projected L-BFGS, clipping the image y to [0, 255] at each iteration. In most cases the optimization converges to satisfactory results within 500 iterations. This method is slow because each L-BFGS iteration requires a forward and a backward pass through the VGG-16 loss network.
Training details: Our style transfer networks are trained on the COCO dataset. We resize each of the 80,000 training images to 256x256 and train with a batch size of 4 for 40,000 iterations, which gives roughly two epochs over the training data. We use Adam with a learning rate of 0.001. The output images are regularized with total variation regularization with a strength between 1e-6 and 1e-4, chosen by cross-validation. We do not use weight decay or dropout, because the model does not overfit within two epochs. For all style transfer experiments we compute the feature reconstruction loss at layer relu2_2 and the style reconstruction loss at layers relu1_2, relu2_2, relu3_3, and relu4_3 of the VGG-16 loss network. Our implementation uses Torch and cuDNN; training takes roughly 4 hours on a single GTX Titan X GPU.
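The "about two epochs" figure follows directly from the numbers in this paragraph. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def epochs(iterations, batch_size, dataset_size):
    """Number of passes over the dataset made by a training run."""
    return iterations * batch_size / dataset_size


# 40,000 iterations with batches of 4 images over 80,000 training images.
print(epochs(40_000, 4, 80_000))  # 2.0
```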
Qualitative results: In Figure 6 we show qualitative examples comparing our results with those of the baseline method for a variety of style and content images. In all cases the hyperparameters λ are the same for both methods, and all content images are taken from the MS-COCO 2014 validation set. Overall, our results are qualitatively similar to those of the baseline.
Although our models are trained on 256x256 images, they can be applied to images of any size at test time. In Figure 7 we show examples of style transfer using our models on 512x512 images.
Figure 6: Style transfer examples using our image transformation networks. Our results are qualitatively similar to those of Gatys et al. but are much faster to generate (see Table 1). All generated images are 256x256 pixels.
Figure 7: Example results for style transfer on 512x512 images. The model is applied in a fully convolutional manner to high-resolution images at test time. The style images are the same as in Figure 6.
These results make it clear that the trained style transfer networks are aware of the semantic content of images. For example, in the beach image in Figure 7 the people remain clearly recognizable, while the background is distorted by the style; similarly, the cat's face is clearly recognizable, but its body is not. One explanation is that the VGG-16 loss network was trained for classification, so it represents the main subjects of an image (people and animals) more faithfully than the background.
Quantitative results: The baseline and our method both minimize Equation 5. The baseline performs explicit optimization over the output image, while our method is trained to find a solution for any content image yc in a single forward pass. We can therefore compare the two methods quantitatively by measuring how successfully they minimize Equation 5.
We run our method and the baseline on 50 images from the MS-COCO validation set, using The Muse by Pablo Picasso as the style image. For the baseline we record the value of the objective function at each iteration of optimization; for our method we record the value of Equation 5 for each image. We also compute the value of Equation 5 when y is equal to the content image yc. The results are shown in Table 5. The content image yc achieves a very high loss, and our method achieves a loss comparable to that of 50 to 100 iterations of explicit optimization.
Although our networks are trained on 256x256 images, they also succeed at minimizing the objective when applied to 512x512 and 1024x1024 images, as shown in Table 5. Even at these higher resolutions, our model achieves a loss comparable to that of 50 to 100 iterations of the baseline method.
Table 1: Speed (in seconds) of our style transfer network compared with the optimization-based baseline. Our method gives results of similar quality (see Figure 6) but is hundreds of times faster. Both methods are benchmarked on a GTX Titan X GPU.
Speed: In Table 1 we compare the runtime of our method and the baseline for several image sizes. Across all image sizes, our method is roughly twice as fast as a single iteration of the baseline method. Compared to 500 iterations of the baseline, our method is three orders of magnitude faster. Our method processes 512x512 images at 20 FPS, which makes it feasible for real-time image transformation or video.
4.2 Single-Image Super-Resolution
In single-image super-resolution, the task is to generate a high-resolution output image from a low-resolution input. This is an inherently ill-posed problem, since each low-resolution image may correspond to many high-resolution images. The ambiguity grows with the super-resolution factor: for large factors (x4, x8), the fine details of the high-resolution image may leave little or no evidence in its low-resolution version.
To overcome this problem, we train super-resolution networks not with the per-pixel loss typically used, but with a feature reconstruction loss (see Section 3), which allows semantic knowledge to be transferred from the pretrained loss network to the super-resolution network. We focus on x4 and x8 super-resolution, since larger factors require more semantic reasoning about the input.
The traditional metrics for evaluating super-resolution are PSNR and SSIM, both of which have been found to correlate poorly with human assessment of visual quality [55,56,57,58,59]. PSNR and SSIM rely only on low-level differences between pixels and assume additive Gaussian noise, an assumption that may not hold for super-resolution. In addition, PSNR is equivalent to the per-pixel loss, so as measured by PSNR a model trained to minimize the per-pixel loss should always outperform a model trained with a feature reconstruction loss. We therefore emphasize that the goal of these experiments is not to achieve state-of-the-art PSNR or SSIM results, but to showcase the qualitative difference between models trained with per-pixel and perceptual losses.
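The link between PSNR and the per-pixel loss is visible in the standard PSNR definition, which is a decreasing function of the mean squared error. The sketch below assumes 8-bit images given as flat lists of pixel values; the function names and sample values are illustrative.

```python
import math


def mse(y_hat, y):
    """Mean squared error between two flat lists of pixel values."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)


def psnr(y_hat, y, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    error = mse(y_hat, y)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


target = [10, 200, 30, 120]
close = [12, 198, 30, 121]   # small per-pixel error
far = [40, 150, 90, 60]      # large per-pixel error

print(psnr(close, target) > psnr(far, target))  # True
```

Since PSNR depends on the images only through the mean squared error, minimizing the pixel loss is the same as maximizing PSNR, which is why a model trained on the pixel loss is favored by this metric.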
Model details: We train models for x4 and x8 super-resolution by minimizing the feature reconstruction loss at layer relu2_2 of the VGG-16 loss network. We train on 288x288 patches from 10,000 images of the MS-COCO training set, and prepare the low-resolution inputs by blurring with a Gaussian kernel (σ = 1.0) and downsampling with bicubic interpolation. We train with a batch size of 4 for 200,000 iterations using Adam with a learning rate of 0.001, without weight decay or dropout. As a post-processing step, we perform histogram matching between the network output and the low-resolution input.
Baselines: As a baseline model we use SRCNN for its state-of-the-art performance. SRCNN is a three-layer convolutional network trained with a per-pixel loss on 33x33 patches from the ILSVRC 2013 dataset. SRCNN was not trained for x8 super-resolution, so we can only compare against it at x4.
SRCNN was trained for more than 100 million iterations, which is not computationally feasible for our models. To account for the differences between SRCNN and our models in data, training, and architecture, we also train image transformation networks for x4 and x8 super-resolution with the per-pixel loss Lpixel; these networks use the same data, architecture, and training procedure as the networks trained to minimize Lfeat.
Evaluation: We evaluate the models on the standard Set5, Set14, and BSD100 [41] datasets. The PSNR and SSIM values we report are computed only on the Y channel after conversion to the YCbCr color space, following [1,39].
Results: We show results for x4 super-resolution in Figure 8. Compared to the other methods, our model trained with the feature reconstruction loss does a very good job of reconstructing sharp edges and fine details, such as the eyelashes in the first image and the individual elements in the second image. The feature reconstruction loss gives rise to a slight cross-hatch pattern that is visible under magnification, which harms its PSNR and SSIM relative to the baseline methods.
Results for x8 super-resolution are shown in Figure 9. Again our Lfeat model does a good job on edges and fine details, such as the feet of the horse. The Lfeat model does not sharpen edges indiscriminately: compared to the Lpixel model, it sharpens the boundary edges of the horse and rider, but the trees in the background are not sharpened. This suggests that the Lfeat model is more aware of the semantic content of the image.
Because our Lpixel and Lfeat models share the same architecture, data, and training procedure, all differences between them are due to the difference between the Lpixel and Lfeat losses. The Lpixel loss gives fewer visual artifacts and higher PSNR values, while the Lfeat loss does a better job of reconstructing fine details, leading to pleasing visual results.
5. Conclusion
In this paper we have combined the benefits of feed-forward image transformation networks and optimization-based methods for image generation by training feed-forward networks with perceptual loss functions. Applied to style transfer, this approach achieves comparable quality at drastically improved speed relative to existing methods. Applied to single-image super-resolution, it shows that training with a perceptual loss allows the model to better reconstruct fine details and edges.
In future work we hope to explore the use of perceptual loss functions for other image transformation tasks, such as colorization and semantic segmentation. We also plan to investigate different loss networks, for example networks trained on different tasks or datasets, to see whether they impart different kinds of semantic knowledge to the image transformation network.