The fifth lecture of Professor Geoffrey Hinton's Neural Networks for Machine Learning course mainly introduces the difficulties of object recognition and the methods for overcoming them, focusing on the convolutional networks used for digit recognition and object recognition.
Why object recognition is difficult
Identifying objects in real-world scenes is hard, and this section introduces some of the causes of that difficulty.
- Segmentation: In an image it is difficult to separate one object from the others. In real life we have two eyes and our bodies move, so stereo and motion cues make it easy to segment objects visually. A static image lacks these cues, and an object may be partially occluded by another, so information is missing.
- Lighting: The intensity of a pixel is determined as much by the lighting as by the object itself, so different illuminations of the same object produce very different images.
- Deformation: Objects can deform in non-affine ways, which also makes recognition difficult; a handwritten 2, for example, can be written with a large loop or just a cusp.
- Affordances: Many object classes are defined by how they are used rather than by how they look; chairs, for example, come in an enormous variety of physical shapes.
- Viewpoint: Changes in viewpoint cause changes in the image that standard learning methods cannot cope with.
Achieving viewpoint invariance
In this section we discuss viewpoint invariance. We usually see an object from different viewpoints, so the same object lands on different pixels each time. This makes object recognition very different from most other machine learning tasks, and here we discuss several ways of attacking the problem.
We humans are so good at viewpoint invariance that it is hard to even appreciate how difficult it is. It is one of the main challenges in making computers perceive, and neither engineering nor psychology has a generally accepted solution yet. Several approaches are listed below:
- Use redundant invariant features.
- Put a box around the object and use normalized pixels.
- Lecture 5c: use replicated features with pooling. This is called "convolutional neural nets".
- Lecture 5e: use a hierarchy of parts that have explicit poses relative to the camera.
The first method is the invariant feature approach: extract a large, redundant set of features that are invariant under transformations such as translation, rotation, and scaling. With enough invariant features, there is only one way to assemble them into an object (we do not need to represent the relationships between features explicitly, because those relationships are already captured by other features). For recognition, however, we must avoid forming features from parts of different objects.
The second method is the judicious normalization approach. Put a box around the object and use it as a coordinate frame for a standard set of normalized pixels. This solves the dimension-hopping problem: if the box is chosen correctly, the same part of the object always corresponds to the same normalized pixels. The box can also provide invariance to many degrees of freedom: translation, rotation, scale, shear, stretch, and so on. Choosing the box is very hard, however, because of segmentation errors, occlusion, unusual orientations, and similar problems.
A related method is the brute force normalization approach: train the recognizer on well-segmented, upright images, and at test time try all possible boxes over a range of positions and scales.
The third and fourth methods are described in the following subsections.
Convolutional nets for digit recognition
This section introduces the use of convolutional networks for digit recognition. Deep convolutional networks have achieved high accuracy in handwritten digit recognition and have been applied in practice.
Convolutional neural networks are based on the idea of replicated features: if a feature detector is useful in one part of the image, it is likely to be useful in other parts as well. The idea is therefore to build many copies of the same feature detector at all the different positions in the image. For example, three detectors can process different parts of the image, each looking at nine pixels, with all three sharing the same nine weights. Replicating across scale and orientation is possible but tricky and expensive; replicating across position greatly reduces the number of free parameters to learn (for example, three detectors covering 27 pixels have only 9 parameters between them). We can use several different feature maps, where each map replicates one feature across all positions and different maps learn different features. A minimal sketch of this weight sharing follows.
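As a concrete illustration, here is a minimal numpy sketch of one replicated feature detector (the function name and the array shapes are my own, chosen for clarity rather than speed): a single set of nine weights is applied at every position of the image.

```python
import numpy as np

def replicated_feature_map(image, kernel):
    """Apply one shared feature detector at every image position.

    Every output unit uses the *same* kernel weights, so no matter how
    many positions the detector is replicated across, there are only
    kernel.size free parameters to learn.
    """
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.randn(28, 28)
kernel = np.random.randn(3, 3)    # the 9 shared weights
fmap = replicated_feature_map(image, kernel)
print(fmap.shape)                 # (26, 26): 676 detector copies, 9 parameters
```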
Backpropagation is easily modified to handle linear constraints between weights, which is how weight sharing is implemented: compute the gradients as usual, then combine the gradients of the tied weights so that the constraints still hold after the update. For example, to keep $w_1 = w_2$, start with $w_1 = w_2$ and use $\partial E/\partial w_1 + \partial E/\partial w_2$ as the update for both weights; a toy sketch follows.
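The following toy sketch (my own construction, with made-up gradient values standing in for real backpropagated gradients) shows the constrained update: the two tied weights start equal and stay equal because both receive the sum of their gradients.

```python
import numpy as np

# Two weights constrained to stay equal: w[0] == w[1].
w = np.array([0.5, 0.5])
lr = 0.1

# Stand-in values; in a real net these are dE/dw1 and dE/dw2 from backprop.
g = np.array([0.2, -0.05])

# Use (dE/dw1 + dE/dw2) as the update for *both* copies, so the
# constraint w[0] == w[1] is preserved after the step.
w -= lr * g.sum()
print(w)    # [0.485 0.485]
```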
It is easy to be confused about exactly what replicated feature detectors achieve. Many people assume translation invariance, but replicated features actually give equivariance, not invariance: if the input image is translated, the pattern of activities is translated by the same amount. What is invariant is the knowledge in the weights: if you know how to detect a feature in one part of the image, you know how to detect the same feature everywhere else. In other words, in the activities we achieve equivariance, and in the weights we achieve invariance.
If we want some invariance in the activities, we can get it by pooling: average (or take the maximum of) neighbouring replicated detectors. Pooling shrinks the input to the next layer, greatly reducing the number of parameters to be learned, but after several levels of pooling we lose precise positional information. For example, after pooling we can still roughly detect eyes, a nose, and a mouth, which is enough to decide that an image contains a face; but to decide whose face it is we need the precise spatial relations between the eyes, nose, and mouth, and that information may have been thrown away by pooling. A small demonstration follows.
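Here is a small numpy demonstration (my own) of non-overlapping max pooling. The feature map is equivariant: shifting the input shifts the active unit. The pooled output, by contrast, is unchanged by this small shift, which is exactly the local invariance (and the loss of precise position) described above.

```python
import numpy as np

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size blocks."""
    H, W = fmap.shape
    fmap = fmap[:H - H % size, :W - W % size]   # trim to a multiple of size
    return fmap.reshape(H // size, size, W // size, size).max(axis=(1, 3))

fmap = np.zeros((4, 4))
fmap[0, 0] = 1.0                     # feature detected at the top-left corner
shifted = np.roll(fmap, 1, axis=1)   # same feature, one pixel to the right

print(max_pool(fmap))                # [[1. 0.] [0. 0.]]
print(max_pool(shifted))             # identical: pooling absorbed the shift
```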
The following describes LeNet, proposed by Yann LeCun and his team: a handwritten digit recognition system that uses backpropagation in a feedforward network.
The figure shows the architecture of LeNet-5. The input is a 32×32 image; the C1 layer maps the input to 6 different 28×28 feature maps; the S2 layer subsamples C1 into 6 different 14×14 maps; and so on, until the output layer is reached.
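For orientation, here is a LeNet-5-style network sketched in PyTorch. This is a modern stand-in rather than LeCun's original implementation (the original used trainable subsampling layers and an RBF output layer), but the layer sizes follow the diagram above.

```python
import torch
import torch.nn as nn

# A LeNet-5-style sketch; shapes match the diagram: 32x32 input,
# C1 = 6 maps of 28x28, S2 = 6 maps of 14x14, down to 10 outputs.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # C1: 32x32 -> 6 x 28x28
    nn.Tanh(),
    nn.AvgPool2d(2),                   # S2: -> 6 x 14x14
    nn.Conv2d(6, 16, kernel_size=5),   # C3: -> 16 x 10x10
    nn.Tanh(),
    nn.AvgPool2d(2),                   # S4: -> 16 x 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),        # C5
    nn.Tanh(),
    nn.Linear(120, 84),                # F6
    nn.Tanh(),
    nn.Linear(84, 10),                 # one output per digit class
)

x = torch.randn(1, 1, 32, 32)          # one 32x32 grayscale image
print(lenet(x).shape)                  # torch.Size([1, 10])
```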
Of the 10,000 test cases, LeNet-5 made only 82 errors. Looking at those 82 examples, you can see that even human eyes would not necessarily classify all of them correctly.
The lecture also discusses how to inject prior knowledge into the network: either by designing the connectivity, the weight constraints, and the activation functions, or by using the prior knowledge to generate additional synthetic training data.
Does a lower error rate always mean a model is better? The answer is no; it depends on which particular test cases the models get wrong, and McNemar's test makes this precise by looking only at the cases where the two models disagree. In the example, both comparisons have model 1 making 40 errors and model 2 making 30. In the first comparison there is only 1 case that model 1 gets right and model 2 gets wrong, against 11 cases that model 1 gets wrong and model 2 gets right, so model 2 is convincingly better. In the second comparison the disagreements split 25 to 15, which is much weaker evidence, even though the raw error counts are the same. A self-contained computation follows.
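McNemar's exact test is easy to compute by hand: under the null hypothesis that the two models are equally good, each disagreement is equally likely to fall either way, so the split of disagreements follows a Binomial(n, 0.5) distribution. A small sketch (the helper name is my own):

```python
from math import comb

def mcnemar_p(only_model1_wrong, only_model2_wrong):
    """Two-sided exact McNemar test, using only the disagreements."""
    n = only_model1_wrong + only_model2_wrong
    k = min(only_model1_wrong, only_model2_wrong)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Same raw error counts (40 vs 30) in both comparisons, very different evidence:
print(mcnemar_p(11, 1))    # ~0.006: model 2 is convincingly better
print(mcnemar_p(25, 15))   # ~0.15:  the difference could easily be chance
```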
Convolutional nets for object recognition
In this section convolutional neural networks are applied to object recognition. The handwritten digits of the previous section are essentially 2-D patterns, whereas real 3-D objects lose a great deal of information when projected into 2-D images, which adds many difficulties. The lecture lists what changes in moving from 2-D handwritten digits to 3-D objects: many more classes, much larger images, cluttered scenes that require segmentation, and multiple objects per image.
The competition on ImageNet is mentioned here, which provides 1,200,000 high-resolution training images. In the classification task, given an image, the model produces its 5 best guesses out of 1000 classes and is scored correct if one of them is the true label. In the localization task, the box the model places around the object must overlap the true bounding box by at least 50%. A sketch of scoring both criteria follows.
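A sketch of how the two criteria could be scored (the helper names are my own, and treating "at least 50% overlap" as intersection-over-union is an assumption, since the notes do not spell out the exact overlap measure):

```python
import numpy as np

def top5_correct(scores, true_label):
    """Classification: correct if the true label is among the 5 highest scores."""
    return true_label in np.argsort(scores)[-5:]

def localization_correct(pred_box, true_box, threshold=0.5):
    """Localization: correct if the boxes overlap enough. Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred_box[0], true_box[0]), max(pred_box[1], true_box[1])
    ix2, iy2 = min(pred_box[2], true_box[2]), min(pred_box[3], true_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(pred_box) + area(true_box) - inter)
    return iou >= threshold

scores = np.random.randn(1000)    # one score per class
print(top5_correct(scores, true_label=42))
print(localization_correct((10, 10, 60, 60), (20, 20, 70, 70)))  # IoU ~0.47 -> False
```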
As an example of a CNN on ImageNet, the lecture shows Alex Krizhevsky's deep convolutional network, which has seven hidden layers and uses rectified linear units.
Two tricks for improving generalization are listed: training on random 224×224 patches of the 256×256 images together with their left-right reflections to get more data, and using dropout to regularize the globally connected layers. A sketch of the first trick follows.
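A minimal numpy sketch of the first trick (the helper name is my own): every call produces a different crop, and half the time a mirror image, so one labelled image yields many distinct training cases.

```python
import numpy as np

def random_patch(image, size=224):
    """Training-time augmentation: a random crop plus a possible reflection."""
    H, W, _ = image.shape
    top = np.random.randint(0, H - size + 1)
    left = np.random.randint(0, W - size + 1)
    patch = image[top:top + size, left:left + size]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]         # left-right reflection
    return patch

image = np.random.rand(256, 256, 3)    # a dummy 256x256 RGB image
print(random_patch(image).shape)       # (224, 224, 3)
```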
Finally, the hardware is mentioned: the network used a very efficient GPU implementation of convolutional nets (two Nvidia GTX 580s), which made training on ImageNet feasible in about a week.
The lecture closes with the problem of finding roads in high-resolution aerial images, which is hard because of occlusion by buildings, trees, and cars, as well as shadows and lighting changes.
For further reading, here is an expert's introduction to convolutional neural networks.
Neural Networks for Machine Learning, Lecture 5 notes.