Paper reading record: Automatic Image colorization sig16


SIGGRAPH paper reading record: Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification (SIGGRAPH 2016). Paper introduction.

Paper Home: http://hi.cs.waseda.ac.jp/~iizuka/projects/colorization/en/

The authors are Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa of Waseda University.

This SIGGRAPH paper uses deep learning for colorization. There is a lot of related work, including several other CNN-based papers, but the results of this one are outstanding: judging from Figure 1, the colors restored from grayscale images are stunning.

Figure 1:

Let's take a general look at the main points of this paper:

Key Point:
  • First, the paper presents an automatic colorization method based on global priors and local image features. A convolutional neural network (CNN) extracts local information from small image patches and computes a global prior from the entire image. The whole pipeline is learned end to end and requires no pre- or post-processing.

  • Traditional colorization algorithms require user interaction, graph cuts, and similar techniques, whereas this data-driven, deep-learning-based method colorizes fully automatically. That said, after testing some results, I think usability would be greatly enhanced by adding interactivity on top of the automatic output.

  • The proposed method can colorize images of any resolution, unlike most CNN-based approaches. This is of course related to the network structure, which is described later.

  • The proposed method can also use the global information to transfer the style of one image onto another; this too follows from the network structure, described later.

  • The model is trained on a large-scale scene-classification dataset, and the classification labels in that dataset are used for more effective, discriminative global learning. Category information such as indoor, outdoor, daytime, or night guides the network to learn image semantics (semantic context), so the network can distinguish images from different scenes and improve performance. The paper also shows that this works much better than using local information alone.

  • The color space used is CIE L*a*b*. The paper shows that L*a*b* gets closer to the ground truth than RGB or YUV, and this choice also subtly reduces the learning difficulty of the network while ensuring the output resolution matches the input, as described later.

  • The final model generalizes quite well: it works not only on grayscale images captured by current devices, but also on photos taken decades or even a century ago.

  • The model consists of 4 parts (the entire framework is composed of 4 sub-networks):

    • Low-level features network: the bottom convolutional layers, split into two branches whose weights are shared, used to extract the most basic features from the image; each branch then feeds a different network.

    • Mid-level features network: a two-layer convolutional network (the separate name feels a bit unnecessary). Its output is blended with that of the global-level features network to form the fusion layer.

    • Global-level features network: a stack of several conv and FC layers that classifies the fixed-size input image; together with the mid-level features network it forms the fusion layer, which lets the colorization network see global features.

    • Colorization network: a deconvolution-style (upsampling) network that reconstructs the target image from the feature maps.

Network structure:

This paper is built on a convolutional neural network whose structure the authors carefully designed as a directed acyclic graph. They divide it into four main parts: the low-level features network, the mid-level features network, the global-level features network, and the colorization network; together these accomplish the points listed above. Let's take a closer look at the network the authors designed:

The figure lays out the entire network structure in detail; let's go through it following the authors' division into parts:

low-level feature network:

A 6-layer convolutional network that extracts basic features directly from the input image (activation function: ReLU). This part is split into two branches whose weights are shared.

I think it is split into two branches because the networks behind them differ: one branch feeds the global-level features network, which classifies the image and computes the global prior, while the other branch feeds the mid-level features network + colorization network to do the actual coloring.

The global-level features network takes a fixed-size input (the image is rescaled to 224 × 224 before being fed in). I think the size is fixed because this part of the network determines the image's environment (it learns the semantic context); plainly put, it is an image-classification network, so it adopts an AlexNet-like structure of fixed-size conv + FC layers.

The mid-level features network + colorization network path is not fixed-size (its output is H/2 × W/2). This is also easy to understand: the colorization network is designed to handle inputs of any size and is a fully convolutional network (FCN), so its spatial size depends on the original image.

Because of this design, even though the global features are computed from a fixed-size image, the fusion layer blends the global features with the local features, making it possible to process images of any resolution.

This part of the network does not downsample with max-pooling layers; instead it uses convolution layers with increased strides, which works better than max pooling and is the more popular choice nowadays.
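As a rough illustration (not the authors' code; the layer widths here are arbitrary), a stride-2 convolution halves the spatial resolution just like a conv + max-pooling pair would, while keeping the downsampling learnable:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 224, 224)  # a single-channel (grayscale) input

# Downsampling with a stride-2 convolution, as in the paper's low-level network
strided = nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1)
print(strided(x).shape)   # torch.Size([1, 64, 112, 112])

# The more traditional alternative: stride-1 convolution followed by max pooling
pooled = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(kernel_size=2),
)
print(pooled(x).shape)    # torch.Size([1, 64, 112, 112])
```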

mid-level feature network:

This part of the network is a two-layer conv (activation function: ReLU); giving it a separate name feels a bit far-fetched, perhaps just to describe the overall network more clearly. Its input is the H/8 × W/8 × 512 features output by the low-level features network; after the two convolution layers it outputs H/8 × W/8 × 256 features, which are fused with the 256-dimensional vector output by the global-level features network to form the fusion layer.

global-level Feature Network:

The authors use the Places scene dataset with category labels (2,448,872 images, 205 categories); these labels let the network learn the semantic context of a photo as a global feature. Therefore the global-level features network is followed by a classification network that performs scene classification, trained with a cross-entropy loss.

So this part of the network is a classification pipeline consisting of 4 conv layers and 3 FC layers (activation function: ReLU). Plainly put, it classifies the fixed-size input image, outputs a 256-dimensional vector, and together with the mid-level features network forms the fusion layer, so that the colorization network can see global features.
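A rough PyTorch sketch of this structure (layer widths and the exact head arrangement follow my reading of the paper, not the released model): four conv layers shrink the fixed-size low-level feature map, FC layers reduce it to the 256-dimensional global feature, and a small classification head predicts the 205 scene categories.

```python
import torch
import torch.nn as nn

class GlobalFeatures(nn.Module):
    """Global features network: fixed-size low-level features -> 256-d vector + scene logits."""
    def __init__(self, num_classes=205):
        super().__init__()
        # conv layers that keep shrinking the fixed-size low-level feature map
        self.convs = nn.Sequential(
            nn.Conv2d(512, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        )
        self.fc = nn.Sequential(
            nn.Linear(512 * 7 * 7, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 512), nn.ReLU(inplace=True),
        )
        self.to_global = nn.Sequential(nn.Linear(512, 256), nn.ReLU(inplace=True))
        self.classifier = nn.Linear(512, num_classes)   # scene-classification head

    def forward(self, low_level_feats):                  # (N, 512, 28, 28) from the 224x224 path
        h = self.convs(low_level_feats)                   # -> (N, 512, 7, 7)
        h = self.fc(h.flatten(1))                         # -> (N, 512)
        return self.to_global(h), self.classifier(h)      # 256-d global feature, 205-way logits
```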

fusing Global and Local Features (fusing Layer):

The authors keep emphasizing the integration of global and local features, and I think this fusing layer is one of the key points of the framework. By combining features from different networks, the model learns complementary features and improves performance. For example, if the global features indicate an indoor image, the local features will prefer not to attempt sky or grass colors, but rather furniture colors.

The mid-level features network takes the H/8 × W/8 × 512 features output by the low-level features network and, after two convolution layers, produces H/8 × W/8 × 256 features; the global feature output by the global-level features network is a 256-dimensional vector. The two are fused by the fusing layer proposed by the authors, enabling the colorization network to colorize based on both local and global information.

The Fusion formula is:

$$y^{\mathrm{fusion}}_{u,v} = \sigma\!\left( b + W \begin{bmatrix} y^{\mathrm{global}} \\ y^{\mathrm{mid}}_{u,v} \end{bmatrix} \right)$$

The global-level features network and mid-level features network are fused through this single layer; the figure in the paper illustrates it clearly.
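A minimal sketch of what the fusion does, assuming PyTorch and taking the activation σ to be a ReLU (that choice is my assumption): the key idea is broadcasting the single global vector to every spatial position before applying the linear map W, b from the formula.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Fuse a 256-d global feature vector with H/8 x W/8 x 256 mid-level features."""
    def __init__(self):
        super().__init__()
        # W and b from the fusion formula: one linear map on the concatenated features
        self.linear = nn.Linear(256 + 256, 256)

    def forward(self, y_mid, y_global):
        # y_mid: (N, 256, H/8, W/8), y_global: (N, 256)
        n, c, h, w = y_mid.shape
        # broadcast the global vector to every spatial position (u, v)
        g = y_global.view(n, 256, 1, 1).expand(n, 256, h, w)
        fused = torch.cat([g, y_mid], dim=1)              # (N, 512, H/8, W/8)
        fused = fused.permute(0, 2, 3, 1)                  # channels last for nn.Linear
        out = torch.relu(self.linear(fused))               # sigma(b + W[.]) at each position
        return out.permute(0, 3, 1, 2)                     # back to (N, 256, H/8, W/8)
```

Because the same W and b are applied at every (u, v), the fused feature map keeps whatever spatial size the mid-level features have, which is what allows arbitrary input resolutions.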

colorization Network:

This part of the network is a deconvolution-style network composed of a series of conv layers and upsampling layers, similar to the decoder half of an auto-encoder, going from the feature maps back to the target image. The upsampling layers use nearest-neighbour interpolation; conv and upsample layers alternate until the output size is half that of the input image.

The output layer of the colorization network is a conv layer with a sigmoid activation function that outputs the chrominance of the original grayscale image.

At the end of the network, this chrominance is combined with the input luminance (the input grayscale image) to produce the final color image.
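Here is a rough sketch of the decoder pattern just described (layer widths are illustrative, not necessarily the paper's exact numbers): convolutions interleaved with nearest-neighbour upsampling, ending in a sigmoid conv that outputs two chrominance channels at half the input resolution.

```python
import torch.nn as nn

# Illustrative decoder: conv layers interleaved with nearest-neighbour upsampling.
# Input: fused features at H/8 x W/8; output: 2 chrominance channels at H/2 x W/2.
colorization_net = nn.Sequential(
    nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2, mode='nearest'),            # H/8 -> H/4
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2, mode='nearest'),             # H/4 -> H/2
    nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(32, 2, 3, padding=1), nn.Sigmoid(),            # a*b* chrominance in [0, 1]
)
```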

This is a very clever part of the paper. As mentioned earlier, the authors use the CIE L*a*b* color space; why not the usual RGB or YUV? (The paper actually includes comparison experiments against RGB and YUV.)

The network's output here is chrominance, i.e. the a*b* channels of CIE L*a*b*, rather than RGB directly. Since the grayscale image itself is the L* channel, using CIE L*a*b* means the network only has to learn a*b* and never has to learn L*; this not only reduces the learning difficulty but also leaves the original L* untouched.
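A minimal sketch of that recombination step, assuming the network outputs a*b* scaled to [0, 1] (the exact scaling used by the released model is an assumption here):

```python
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def recombine(gray_rgb, predicted_ab):
    """gray_rgb: H x W x 3 grayscale image in [0, 1]; predicted_ab: H x W x 2 in [0, 1]."""
    L = rgb2lab(gray_rgb)[:, :, 0]           # keep the input luminance untouched
    ab = (predicted_ab - 0.5) * 2 * 110      # undo the [0, 1] scaling (assumed a*b* range ~[-110, 110])
    lab = np.dstack([L, ab])                  # L from the input, a*b* from the network
    return lab2rgb(lab)                       # final color image
```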

Comparison experiment of RGB, YUV, and L*a*b* in the paper:

Training: objective function:

Let's take a look at the objective function:

The objective function consists of two loss terms: one is the Euclidean distance between the colorization network's prediction and the target image, and the other is the cross-entropy loss of the global-level features network + classification network. Both are very common loss functions and need little explanation; an α weight balances the contributions of the two networks.
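A minimal sketch of how the two terms could be combined (the variable names and the exact value of α are mine, not taken from the paper's code):

```python
import torch.nn.functional as F

alpha = 1 / 300  # small weight on the classification term; treat this value as an assumption

def total_loss(pred_ab, target_ab, class_logits, class_labels):
    color_loss = F.mse_loss(pred_ab, target_ab)               # Euclidean / MSE color loss
    class_loss = F.cross_entropy(class_logits, class_labels)  # scene-classification loss
    return color_loss + alpha * class_loss
```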

The authors also mention the alternative of using only the Euclidean distance between the final color result and the ground truth as the loss and backpropagating it through the entire network, but this prevents the global features from being learned well.

And I think the classification network here does not only play the role of learning global features and semantic context; a more important role is that it mitigates gradient problems during backpropagation to some extent, making such a large network easier to train.

During backpropagation, the colorization loss affects the whole network, while the classification loss affects only the classification network, the global-level features network, and the shared low-level features network; it does not affect the colorization network or the mid-level features network.

Training Optimization:

The paper also mentions some optimizations of the training process:

One is the choice of training image size: if the input images are 224 × 224 pixels, the two branches of the low-level features network become identical, so their outputs can be shared. Only one branch needs to be computed during training, and its output is then fed into the two different downstream networks. For the relatively large dataset the authors use, saving one branch's computation is quite useful, so they train on 224 × 224 images.

Then, to help the network converge (or converge faster), the paper also mentions using batch normalization, which was proposed in 2015 and became very popular within a year (300+ citations); for details see this blog post:
http://blog.csdn.net/hjimce/article/details/50866313

The ADADELTA optimizer is used to optimize the objective function and speed up training.
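For illustration, a hypothetical PyTorch training step with ADADELTA, reusing the total_loss sketch from the objective-function section above (`model` and `loader` are placeholders, not the authors' code):

```python
import torch

# Hypothetical setup: `model` is the full colorization network described above,
# `loader` yields batches of 224x224 grayscale images, target a*b* maps, and scene labels.
optimizer = torch.optim.Adadelta(model.parameters())  # ADADELTA needs no hand-tuned learning rate

for grayscale, target_ab, scene_label in loader:
    optimizer.zero_grad()
    pred_ab, class_logits = model(grayscale)
    loss = total_loss(pred_ab, target_ab, class_logits, scene_label)
    loss.backward()
    optimizer.step()
```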

Time:

The paper mentions training the network on a dataset with 2,448,872 training images and 20,500 validation images, covering 205 scene categories such as monasteries, conference centers, and volcanoes. Training uses a batch size of 128 and about 200,000 iterations (11 epochs), and took roughly three weeks on an NVIDIA Tesla K80 GPU.

As for computation time, the authors say it is close to real time on a GPU; their GPU was an NVIDIA GeForce GTX TITAN X and the CPU an Intel Core i7-5960X.

Style conversion:

As mentioned before, the paper says the method can transfer the style of one image onto another, which is really a clever use of the network structure. Because the low-level features network consists of two branches, one taking the original input for local feature extraction and one taking a fixed-size input for the global classification, feeding different images to the two inputs grafts the style of the fixed-size input image onto the original input. This works because the global-level features network learns the global features and semantic context of the fixed-size input image.
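Conceptually, the trick could look like this (a hypothetical sketch built from the sub-network sketches above; the function names are placeholders, not the authors' actual interface):

```python
# Hypothetical sketch of the "style transfer" trick: the local path sees the image to be
# colorized, while the global path sees a different (rescaled 224x224) image whose
# semantic context we want to borrow.
low_content = low_level_net(content_gray)       # original-resolution grayscale input
low_style   = low_level_net(style_gray_224)     # fixed-size input from a different image
y_mid       = mid_level_net(low_content)
y_global, _ = global_net(low_style)              # global features of the *other* image
fused       = fusion(y_mid, y_global)
ab          = colorization_net(fused)            # colors now follow the style image's context
```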

Here are some of the style conversion results:

Results:

The paper shows a number of results on the validation set, and they look very good:

Even on some photos taken about a hundred years ago, the results are good, which shows the model's generalization ability is quite strong:

To demonstrate the effect of adding global features, the paper also runs experiments that use only the upper half of the network, without the global information. Here are some comparisons; the baseline is obtained by setting α to 0, i.e. discarding the global features.

You can see that in the image above the baseline colors the indoor ceiling as if it were sky; with global information added this does not happen, and the warm indoor lighting is simulated very well.

In addition, because the mapping from a color image to a grayscale image is not invertible, a given grayscale image can correspond to many different colorizations, so a gap from the ground truth is perfectly normal, as in this example from the paper:

Of course, that is the most unsuccessful example mentioned in the paper. I downloaded the authors' model from their GitHub and casually tried it on some photos of my own school and images found online. For grass, sky, wood textures and other natural scenes, as well as human skin, the colorization is very good; for buildings and people's clothing it is worse: buildings sometimes come out in very strange colors, and clothes are often brown. For cluttered pictures, and even paintings, the model seems to fill everything with brown. Below are some of the results.

Let's start with a few good results:

Some natural landscapes found on the Internet:

GT:

proposed:

GT:

proposed:

GT:

proposed:

GT:

proposed:

A school building (heh): I tried several buildings and the effect is not good, but the grass and trees are very lifelike.
GT:

proposed:

The textures here are more complex, and the model presumably has not learned these semantics:
GT:

proposed:

I also tried some photos of classmates; the skin comes out really well, but since they involve other people's privacy I won't upload them.

Old photos:

Input:

proposed:

Input:

proposed:

Old graduation photos from the Internet:
Input:

proposed:

Input:

proposed:

Input:

proposed:

Input:

proposed:
