Computer Vision (CV) is the science of making machines "see". The first doctoral thesis in the field, Larry Roberts's "Machine Perception of Three-Dimensional Solids" (MIT, 1963), marked the beginning of CV as a new direction within AI. Now, after more than fifty years of development, let's talk about a few interesting attempts to make computer vision conjure things "out of nothing": 1) super-resolution reconstruction; 2) image colorization; 3) image captioning; 4) portrait restoration; 5) automatic image generation. As you will see, these five attempts build on one another step by step, and both the difficulty and the fun grow along the way. (Note: this article only discusses the vision problems themselves, without going into specific technical details; if you are interested in any particular part, I may write a separate article on it in the future :)
Super-Resolution Reconstruction (Image Super-Resolution)
Last summer, an application called Waifu2x became a real hit in the anime and computer-graphics communities. Waifu2x uses a deep convolutional neural network (CNN) to double an image's resolution while also reducing image noise. Simply put, the computer fills in, "out of nothing", pixels that the original image never had, making the artwork look sharper and more realistic. Take a look at the image below: how I wish the Dragon Ball episodes of my childhood had been this high-definition.
It should be pointed out, however, that research on image super-resolution began around 2009; it was only with the rise of deep learning that Waifu2x could achieve such good results. Concretely, to train the CNN, the low-resolution image is used as the input and the corresponding high-resolution image as the target, forming training "image pairs"; the super-resolution reconstruction model is then obtained through model training. Waifu2x's deep network prototype is based on the work of Xiaoou Tang's team at the Chinese University of Hong Kong [1]. Interestingly, [1] points out that the deep model can be interpreted qualitatively in terms of traditional methods. As shown below, a low-resolution image first goes through convolution and pooling operations to obtain low-resolution feature maps. From these, a non-linear mapping from low-resolution to high-resolution feature maps is realized, again through convolution and pooling. The final step reconstructs the high-resolution image from the high-resolution feature maps. These three steps are, in fact, consistent with the three stages of traditional super-resolution reconstruction methods.
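To make the three-stage pipeline concrete, here is a minimal PyTorch sketch of an SRCNN-style network in the spirit of [1]. The layer sizes follow the commonly cited 9-1-5 configuration; this is an illustrative assumption, not Waifu2x's exact architecture.

```python
import torch
import torch.nn as nn

class SRCNNSketch(nn.Module):
    """Three-stage super-resolution CNN: feature extraction,
    non-linear mapping, reconstruction (cf. [1])."""
    def __init__(self):
        super().__init__()
        self.extract = nn.Conv2d(3, 64, kernel_size=9, padding=4)      # low-res feature maps
        self.mapping = nn.Conv2d(64, 32, kernel_size=1)                # low-res -> high-res features
        self.reconstruct = nn.Conv2d(32, 3, kernel_size=5, padding=2)  # rebuild the RGB image
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: a low-resolution image already upscaled (e.g. bicubic) to target size
        x = self.relu(self.extract(x))
        x = self.relu(self.mapping(x))
        return self.reconstruct(x)

# Training "image pairs": upscaled low-res input vs. original high-res target
model = SRCNNSketch()
lr = torch.rand(1, 3, 128, 128)                 # stand-in for an upscaled low-res image
sr = model(lr)                                  # predicted high-resolution image
loss = nn.MSELoss()(sr, torch.rand(1, 3, 128, 128))
```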
Image Coloring (Image Colorization)
As the name suggests, image colorization fills color into black-and-white images that originally had "none". Colorization also uses a convolutional neural network: the inputs are pairs of black-and-white images and their corresponding color versions. However, determining the fill color merely by matching grayscale pixels against RGB pixels does not work well, because the filled colors need to be consistent with our perceptual habits; painting a celebrity's hair green, for example, would look rather odd. So a recent work from Waseda University [2], published at SIGGRAPH 2016 (the top international conference in computer graphics), adds a "classification network" on top of the original deep model to first predict the categories of the objects in the image, and uses these predictions as a "prior" for the color filling. The figure below shows the model's architecture and colorization demos; the restored colors are quite realistic. In addition, this kind of work can also be used to restore color in black-and-white films: simply extract each frame of the video and colorize it.
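Below is a hedged PyTorch sketch of that idea: a per-pixel branch predicts chroma while a global branch, trained to classify the scene, supplies the category "prior" that is fused back in. It is a much-simplified stand-in for the architecture in [2]; all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ColorizationSketch(nn.Module):
    """Grayscale-to-color net with a classification branch as a 'prior'
    (a simplified sketch in the spirit of [2], not the exact architecture)."""
    def __init__(self, num_classes=205):
        super().__init__()
        # Local branch: per-pixel features from the grayscale (L) channel
        self.local = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        # Global branch: image-level features, also trained to classify the scene
        self.global_feats = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.classifier = nn.Linear(128, num_classes)  # auxiliary class prediction
        # Fusion + decoder: combine local and global features, predict chroma
        self.decode = nn.Sequential(
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 3, padding=1), nn.Tanh(),  # a, b chroma channels
        )

    def forward(self, gray):
        local = self.local(gray)                        # (B, 128, H, W)
        g = self.global_feats(local)                    # (B, 128)
        logits = self.classifier(g)                     # scene-class "prior"
        g_map = g[:, :, None, None].expand(-1, -1, *local.shape[2:])
        ab = self.decode(torch.cat([local, g_map], 1))  # fuse and predict chroma
        return ab, logits
```

Training would combine a pixel loss on the predicted chroma with a classification loss on the logits, so the color choices stay consistent with what the objects actually are.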
"Look at the picture and talk" (Image Caption)
As the saying goes, a story is best told with both pictures and words: text is, besides the image itself, another way of describing the world. Recently, a line of research called "image caption" has been gradually heating up. Its main task is to automatically generate a natural-language description of a given picture using computer vision and machine learning methods, that is, "looking at a picture and speaking". It is worth mentioning that at this year's top international CV conference, CVPR, image captioning was listed as a session of its own, which shows how hot the topic is. Generally speaking, in image captioning a CNN is used to extract image features, which are then fed as input to a language model, an LSTM (a kind of RNN); the whole pipeline is trained jointly as an "end-to-end" structure and finally outputs a natural-language description of the image (see the figure below).
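Here is a minimal sketch of that generic CNN-encoder-plus-LSTM-decoder structure in PyTorch. The toy encoder, vocabulary size, and dimensions are all hypothetical; in practice a pretrained CNN such as VGG or ResNet would supply the image feature.

```python
import torch
import torch.nn as nn

class CaptionerSketch(nn.Module):
    """CNN encoder + LSTM decoder for image captioning
    (a minimal sketch of the generic architecture, hypothetical sizes)."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Toy CNN encoder; a pretrained network is used in practice
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature acts as the first "word" fed to the LSTM
        feat = self.encoder(images).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)               # (B, T, E)
        seq = torch.cat([feat, words], dim=1)      # prepend image feature
        out, _ = self.lstm(seq)                    # (B, T+1, H)
        return self.to_vocab(out)                  # next-word logits

model = CaptionerSketch()
imgs = torch.rand(2, 3, 224, 224)
caps = torch.randint(0, 10000, (2, 12))
logits = model(imgs, caps)  # train with cross-entropy against shifted captions
```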
The current best results in the image-caption field [3] come from Professor Chunhua Shen's team at the University of Adelaide, Australia. Compared with previous image-caption work, their improvement is similar in spirit to the color restoration just mentioned: they also use the objects' predicted categories as a more accurate "prior" to generate better natural-language descriptions, namely the part circled in red in the figure below. The rapid development of image captioning not only accelerates the integration of CV and NLP within AI, but also lays a solid technical foundation for augmented-reality applications. Beyond that, we would be even happier to see increasingly mature image-caption technology embedded in wearable devices someday, so that blind people could indirectly "see the light".
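As a rough sketch of that "prior" idea, the snippet below replaces the raw CNN feature that seeds the LSTM in the previous sketch with a vector of predicted attribute probabilities. The attribute count and layers are hypothetical assumptions, not the configuration used in [3].

```python
import torch
import torch.nn as nn

# Hedged sketch of attribute-conditioned captioning in the spirit of [3]:
# a vector of predicted attribute probabilities (e.g. "dog", "grass",
# "running") seeds the LSTM instead of the raw image feature.
num_attributes, embed_dim = 256, 256
attr_predictor = nn.Sequential(          # stands in for a multi-label CNN
    nn.Flatten(), nn.Linear(3 * 64 * 64, num_attributes), nn.Sigmoid(),
)
attr_to_seed = nn.Linear(num_attributes, embed_dim)  # attribute "prior" -> LSTM input

image = torch.rand(1, 3, 64, 64)
attrs = attr_predictor(image)            # per-attribute probabilities in [0, 1]
seed = attr_to_seed(attrs)               # replaces `feat` in the previous sketch
```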
Portrait Restoration (Sketch Inversion)
At the beginning of June, Dutch scientists published their latest work on arXiv [4]: using a deep network to "restore" a face from its contour sketch. As shown in the figure below, in the training phase, a real face image is first run through an off-the-shelf edge detection method to obtain the corresponding face contour sketch; the "image pairs" of original photos and contour sketches serve as the deep network's input, and the model training resembles that of super-resolution reconstruction. In the prediction phase, the input is a face contour sketch (second column); after layer-by-layer abstraction in the convolutional neural network and the subsequent "restoration", a photo-like face image is gradually recovered (rightmost column). Compared with the ground-truth face on the far left, it looks real enough to pass for the actual person. The model's flow chart also shows some portrait-restoration results: the left column contains the real portraits, the middle column the face contour sketches hand-drawn by an artist and used as the network's input, and the right column the restored results. Forensic sketch artists may no longer need to practice their drawing skills.
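A minimal encoder-decoder sketch of this sketch-to-photo mapping is given below, assuming a plain convolution/deconvolution stack trained with a pixel-wise loss; it illustrates the idea rather than the exact network of [4].

```python
import torch
import torch.nn as nn

class SketchInversionSketch(nn.Module):
    """Encoder-decoder that maps a 1-channel contour sketch to a 3-channel
    face image (a simplified sketch of the idea in [4], not their exact net)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(    # abstract the sketch layer by layer
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(    # "restore" a photo-like image
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, sketch):
        return self.decoder(self.encoder(sketch))

# Training pairs: (edge-detected sketch, original photo); a pixel loss drives learning
model = SketchInversionSketch()
sketch = torch.rand(1, 1, 96, 96)        # stand-in for a face contour drawing
photo = model(sketch)                    # restored face image, same spatial size
```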