Twenty Years of Progress in Computer Vision (1995~2015)
The two main branches of computer vision are geometry and recognition; here we survey the progress of computer vision from 1995 to 2015.
1. The birth of image feature point detectors and descriptors: SIFT (1999, 2004)
The scale-invariant feature transform (SIFT), an image feature point detector and descriptor, was first presented by David Lowe, a professor at UBC, in 1999, and further refined and published in 2004. The birth of SIFT is a milestone in the progress of computer vision: it brought leap-forward improvements to homography estimation, structure from motion, epipolar geometry, and robotics (SLAM), precisely because SIFT matched points more accurately than any descriptor before it. SIFT is not only applied to geometry; it was later widely used in object recognition as well (see below).
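SIFT's detector stage finds keypoints as local extrema of a difference-of-Gaussians (DoG) scale stack. Below is a minimal NumPy sketch of that one idea, not Lowe's full algorithm: there is no octave pyramid, orientation assignment, subpixel refinement, or descriptor, and the sigma values and threshold are illustrative choices.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """1-D Gaussian kernel, normalised to sum to 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur via two 1-D convolutions."""
    k = gaussian_kernel(sigma, radius=int(3 * sigma) + 1)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

def dog_extrema(img, sigmas=(0.5, 0.8, 2.0, 5.0), thresh=0.02):
    """Pixels that are extrema of the DoG stack across space AND scale.

    The sigmas here are illustrative; real SIFT uses a geometric
    progression of scales within each octave.
    """
    stack = np.stack([blur(img, s) for s in sigmas])
    dog = stack[1:] - stack[:-1]                    # DoG layers
    keypoints = []
    for s in range(1, dog.shape[0] - 1):            # interior scales only
        for y in range(1, img.shape[0] - 1):
            for x in range(1, img.shape[1] - 1):
                cube = dog[s-1:s+2, y-1:y+2, x-1:x+2]
                v = dog[s, y, x]
                if abs(v) > thresh and (v == cube.max() or v == cube.min()):
                    keypoints.append((y, x, s))
    return keypoints
```

Running this on an image containing a single bright blob yields keypoints at the blob, which is exactly the blob-detection behaviour the DoG operator is known for.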
2. Feature engineering: the widespread emergence of descriptors (1995~2010)
Before deep learning (feature learning) became prevalent, scholars manually crafted many descriptors for points, image patches, spatio-temporal volumes, and 3D meshes. These descriptors are generally robust to noise and insensitive to rotation, illumination, scale, contrast, and so on. Besides SIFT, other notable descriptors include:
(1) Shape context: presented by Serge Belongie (now a professor at Cornell Tech) in 2002, it describes the shape context around a point using a binning scheme common in computer vision: uniform bins in the angular direction and log-polar bins in the radial direction, so that, intuitively, points closer to the reference point have a greater influence on its description. Shape context is a very successful shape descriptor; for 2D shape recognition it achieved the best results of its time on MNIST handwritten digit recognition.
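The log-polar binning just described can be sketched as follows. The bin counts and radial limits are illustrative, and the radii are assumed to be pre-normalised (the original method normalises by the mean pairwise distance, omitted here):

```python
import numpy as np

def shape_context(points, index, n_theta=12, n_r=5, r_min=0.125, r_max=2.0):
    """Log-polar histogram of the positions of all other points, as seen
    from points[index]: uniform angular bins, log-spaced radial bins."""
    rel = np.delete(points, index, axis=0) - points[index]
    r = np.linalg.norm(rel, axis=1)
    theta = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)
    r_edges = np.logspace(np.log10(r_min), np.log10(r_max), n_r + 1)
    t_bin = np.floor(theta / (2 * np.pi / n_theta)).astype(int)
    r_bin = np.searchsorted(r_edges, r) - 1          # log radial bin
    hist = np.zeros((n_r, n_theta))
    valid = (r_bin >= 0) & (r_bin < n_r)             # drop out-of-range radii
    for rb, tb in zip(r_bin[valid], t_bin[valid]):
        hist[rb, min(tb, n_theta - 1)] += 1
    return hist
```

Because the radial bins are logarithmic, nearby points are spread across many fine bins while distant points are lumped together, which is the "closer points matter more" intuition in the text.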
(2) HOG: Its full name is histogram of oriented gradients; it was introduced in 2005 by Dalal & Triggs and applied to pedestrian detection. HOG differs from SIFT in two ways: HOG describes a whole patch rather than a keypoint neighborhood as SIFT does, and HOG is not rotation-invariant. HOG was later widely used for other object recognition tasks; its most successful extension is the HOG-based deformable parts model (DPM, presented by Professor Felzenszwalb in 2010), which was the best object detection & recognition method before deep learning.
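A stripped-down version of the per-cell orientation histogram at HOG's core, omitting the block normalisation and overlapping blocks of the full Dalal-Triggs pipeline (the cell size and bin count follow their common defaults, but are assumptions here):

```python
import numpy as np

def hog(patch, cell=8, n_bins=9):
    """Unsigned-gradient HOG: per-cell histograms of gradient orientation,
    each pixel weighted by its gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0     # unsigned orientation
    H, W = patch.shape
    cells = np.zeros((H // cell, W // cell, n_bins))
    bin_idx = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    for i in range(H // cell * cell):
        for j in range(W // cell * cell):
            cells[i // cell, j // cell, bin_idx[i, j]] += mag[i, j]
    return cells.reshape(-1)   # descriptor for the whole patch, not a keypoint
```

Note how the returned vector covers the entire patch, illustrating the contrast with SIFT drawn above.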
(3) Spin image: a descriptor for 3D meshes, presented by Andrew Johnson in 1997 and perfected in 1999, used for surface matching. Today laser scanners are becoming more common and cheaper, so point cloud data is increasingly available, and spin images can be applied directly to point cloud matching. The spin image descriptor is built in a local coordinate system: its XY plane is the tangent plane at the point, Z is the point's normal, and the directions of the X and Y axes need not be determined (unlike SIFT, whose descriptor must be aligned to the dominant direction). As a result, the spin images of two points from point clouds in different global coordinate systems can be compared directly by Euclidean distance.
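The pose invariance comes from describing each neighbour by only two local-frame quantities: its radial distance alpha from the normal axis and its signed height beta along the normal. A minimal sketch (bin size and image extent are illustrative choices):

```python
import numpy as np

def spin_image(points, p, normal, bin_size=0.1, n_alpha=10, n_beta=10):
    """Accumulate neighbours of the oriented point (p, normal) into an
    (alpha, beta) histogram; rotations and translations of the whole
    cloud leave (alpha, beta), and hence the image, unchanged."""
    n = normal / np.linalg.norm(normal)
    d = points - p
    beta = d @ n                                        # height along normal
    alpha = np.sqrt(np.maximum(np.sum(d * d, axis=1) - beta**2, 0.0))
    img = np.zeros((n_beta, n_alpha))
    a = (alpha / bin_size).astype(int)
    b = ((beta / bin_size) + n_beta // 2).astype(int)   # centre beta = 0
    ok = (a < n_alpha) & (b >= 0) & (b < n_beta)
    for ai, bi in zip(a[ok], b[ok]):
        img[bi, ai] += 1
    return img
```

Rotating the cloud, the base point, and the normal together produces exactly the same histogram, which is the property that allows direct Euclidean comparison across global coordinate systems.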
(4) In addition to these very successful descriptors, others include STIP (space-time interest points, 2005), HOF (histogram of oriented optical flow), and MBH (motion boundary histogram, 2013).
3. Object recognition (2005~2010)
Before 2010, when deep learning came to be used for object recognition, there was no large-scale image database (ImageNet was collected in 2009). The first database for object recognition was Caltech101, collected by Fei-Fei Li (now a professor at Stanford) during her PhD; it has 101 object categories, each with 40~800 images. Although tiny compared with today's ImageNet, it made an indelible contribution to recognition in computer vision: Caltech101 opened the era of object recognition, and many interesting descriptors and recognition algorithms were born on it. The two main families of recognition algorithms were (1) bag-of-visual-words (BoW) and (2) template matching. BoW is inspired by topic modeling in the text domain: the main idea is to sample patches from the image; these patches are called visual words, and the image can be seen as composed of these visual words, just as a document is made up of many words. Let us discuss some representative object recognition papers:
(1) LDA: latent Dirichlet allocation, originally presented by Professor David Blei of Princeton in 2003 for unsupervised topic modeling of text. In 2005 Fei-Fei Li, then still a PhD student, used LDA for scene classification in vision; this is a typical paper applying a bag-of-visual-words algorithm to classification;
(2) SPM (spatial pyramid matching), by Professor Lazebnik (now at UIUC), uses a very simple spatial grid to divide the image into blocks, computes a BoW histogram for each block, and finally concatenates these histograms, so that the resulting image descriptor carries spatial structure information, no longer lacking it as the earlier BoW descriptors did. It is very concise, yet very effective;
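The grid-and-concatenate step of SPM can be sketched as below, assuming the visual-word ids and the normalised patch positions have already been computed by an earlier quantisation step; the original method also weights the levels differently, which is omitted here:

```python
import numpy as np

def spm_descriptor(word_ids, positions, n_words, levels=(1, 2, 4)):
    """Spatial pyramid: split the unit square into L x L grids, build a
    visual-word histogram per cell, and concatenate all histograms."""
    parts = []
    for L in levels:
        cell = np.minimum((positions * L).astype(int), L - 1)  # cell per word
        for gy in range(L):
            for gx in range(L):
                mask = (cell[:, 0] == gy) & (cell[:, 1] == gx)
                parts.append(np.bincount(word_ids[mask], minlength=n_words))
    return np.concatenate(parts)
```

The level-1 part of the output is exactly the old spatially blind BoW histogram; the finer levels are what add the spatial structure information.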
(3) Improved BoW-based image encodings: from 2006 to 2009, scholars used sparse coding and Fisher vector techniques to improve the traditional BoW image encoding. These descriptors are more discriminative and brought some progress, but they still belong to the BoW family of methods;
(4) Pyramid match kernel (PMK): presented by Professor Grauman of UT Austin. Although its first step is also to extract visual words (SIFT), it differs from BoW: PMK defines a similarity kernel that measures the similarity of two images directly from their extracted SIFT descriptors, followed by SVM classification. Notably, in PMK an individual image does not have a descriptor of its own (no encoded vector per image).
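A one-dimensional toy version of the pyramid match idea: intersect histograms of the two feature sets at doubling bin widths, and credit only the matches newly formed at each level, weighted by 1/2^i. (The real PMK operates on high-dimensional SIFT features; the feature range and level count below are illustrative.)

```python
import numpy as np

def pyramid_match(x, y, d_max=16, levels=5):
    """Toy pyramid match kernel for 1-D feature sets in [0, d_max)."""
    prev = 0.0
    k = 0.0
    for i in range(levels):
        width = 2 ** i                              # bins double each level
        n_bins = int(np.ceil(d_max / width))
        hx = np.bincount((x // width).astype(int), minlength=n_bins)
        hy = np.bincount((y // width).astype(int), minlength=n_bins)
        inter = np.minimum(hx, hy).sum()            # matches at this level
        k += (inter - prev) / (2 ** i)              # newly formed matches only
        prev = inter
    return k
```

Identical feature sets match fully at the finest level and get weight 1; features that only co-occur in coarse bins contribute much less, so the kernel rewards fine-grained agreement.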
(5) DPM: the deformable parts model, presented by Professor Felzenszwalb in 2010, is a completely different object recognition algorithm. Its core idea is template matching: it defines a root template and several part templates, and uses a latent SVM to model the relationship between the root and the parts; after latent training, the discriminative SVM parameters can be used for classification. DPM was the best object recognition algorithm before deep learning, and it was followed by a number of DPM acceleration algorithms for fast object detection.
4. Automatic feature learning: deep learning becomes popular in vision (2010~2015)
The renewed popularity of deep learning broke the existing pattern of object recognition algorithms, making both BoW and DPM things of the past; deep learning became the leader in the field. First, BoW carries no structural information about the object at all; and while DPM can be regarded as a two-layer structure (root + parts), compared with the depth of deep networks (usually 10~20 layers) it is still a shallow one. Four people were indispensable to deep learning's rise: Geoffrey Hinton, Yann LeCun, Yoshua Bengio, and Andrew Ng.
Here we highlight the deep convolutional neural network (CNN), which Yann LeCun used for handwriting recognition as early as 1990. Yet until 2012 CNNs were not taken seriously, for two reasons: (1) the elegant theory of the SVM and its leading classification ability eclipsed other classifiers, including CNNs; (2) the limited computational performance of computer hardware, coupled with the absence of large amounts of labeled data, prevented CNNs from achieving very good results. In 2012 Krizhevsky (a student of Geoffrey Hinton at the University of Toronto) presented CNN results for object recognition at NIPS that directly halved the error of the best existing recognition algorithm, causing an uproar and heated discussion; today CNNs have been accepted by the entire computer vision community as a universal method for object recognition.
The structure of a CNN is N x (convolution layer + pooling layer) + several fully connected layers. This deep structure is inspired by the hierarchical structure of the human visual system in recognizing targets: LGN-V1-V2-V4-IT. Simple oriented stimuli tend to make lower-level neurons fire, while more abstract shape stimuli excite neurons in the higher V4 region. A CNN's deep structure exploits the following property: many natural signals are hierarchical, with complex features at the top and simple features at the lower levels. The convolutional layer in a CNN is a distributed representation, while the pooling layer makes the deep structure insensitive to small shifts and distortions of the image. A CNN trains its parameters by error back-propagation, using first-order stochastic gradient descent. We will explain CNNs in technical detail later.
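The "N x (convolution + pooling) + fully connected" structure can be sketched as a single forward pass in NumPy. This is one single-channel stage with randomly chosen, illustrative sizes; real CNNs stack many multi-channel stages and learn all the weights by back-propagation:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D cross-correlation, single channel."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: the source of shift insensitivity."""
    H, W = x.shape
    H, W = H - H % size, W - W % size
    return x[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

def tiny_cnn_forward(img, kernel, w_fc, b_fc):
    """One (convolution + pooling) stage, then a fully connected layer."""
    feat = np.maximum(conv2d(img, kernel), 0.0)   # convolution + ReLU
    feat = max_pool(feat)                         # pooling
    return feat.reshape(-1) @ w_fc + b_fc         # fully connected scores
```

The max pooling step is what gives the small-shift and small-distortion tolerance described in the text: moving an edge by one pixel within a pooling window leaves the pooled feature map unchanged.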