This topic includes two articles by Jamie Shotton, Antonio Criminisi, and Sebastian Nowozin of Microsoft Research Cambridge. The first article appears below; the second will be posted here later.
Computer vision is the technology of automatically understanding the content of images through computer algorithms; it originated in artificial intelligence and cognitive neuroscience in the 1960s. In 1966, MIT posed the problem of "solving" computer vision as a summer project, but it soon became clear that the road ahead would be much longer. Today, almost 50 years later, the general image-understanding task still cannot be solved perfectly. Significant progress has been made, however, and as computer vision algorithms have been successfully commercialized, their products have reached a wide range of users: image segmentation (for example, the background-removal feature in Microsoft Office), image retrieval, face detection for camera autofocus, and human motion capture with Kinect. It is almost certain that these recent advances in computer vision have benefited greatly from the rapid development of machine learning over the last 15 to 20 years.
This first article explores the challenges of computer vision and introduces a very important machine learning technique: decision forests for pixel-wise classification.
Image classification
Imagine trying to answer the following image-classification question: "Is there a car in this picture?" To a computer, a picture is just a grid of three primary colors (red, green, blue), where each color channel takes a value from 0 to 255. These values depend not only on whether the object is present in the picture, but also on nuisance factors such as camera viewpoint, lighting conditions, background, and object pose. A further problem that must be dealt with is that different kinds of cars have different shapes: a car could be a station wagon, a pickup truck, or a sports car, each producing very different pixel values.
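As a minimal illustration of the point above (not from the article itself), here is how a computer "sees" a picture: a hypothetical 480x640 image is just a three-dimensional array of numbers, one value per color channel, each between 0 and 255. All image dimensions and pixel values here are made up for the example.

```python
import numpy as np

# A hypothetical 480x640 RGB image: to the computer it is nothing but
# a grid of numbers, one value per colour channel, in the range 0..255.
height, width = 480, 640
image = np.zeros((height, width, 3), dtype=np.uint8)

# Set one pixel to a reddish colour (values chosen arbitrarily).
image[100, 200] = (180, 40, 30)

# The red, green and blue values of that pixel:
r, g, b = image[100, 200]
print(int(r), int(g), int(b))  # -> 180 40 30
```

Every nuisance factor mentioned above (viewpoint, lighting, background, pose) changes these raw numbers, which is what makes recognition hard.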
Fortunately, supervised machine learning algorithms provide an alternative to hand-coding rules for all of these possibilities. By collecting a training set of images and labelling each training image appropriately by hand, we can use a suitable machine learning algorithm to work out which pixel patterns are relevant to the object we want to recognize and which are nuisance factors. We hope the resulting algorithm will generalize to new images it has never seen during training, and will be robust to noise. Great strides have been made on two fronts: the development of new computer vision algorithms and the collection of labelled data sets.
Decision forests for pixel-wise classification
A picture contains detail at many levels. As mentioned earlier, we can ask whether a particular object class (such as a car) appears anywhere in the picture. But we can also ask a much harder question: what exactly is in this picture? This is the well-known problem of semantic image segmentation: delineating all the objects in the scene, as in the street scene pictured below.
You can imagine how this could help you select and edit parts of a photo, or paste objects into a new picture, and many more applications spring to mind.
There are many ways to approach semantic segmentation, but one of the most effective is pixel-wise classification: training a classifier to predict, at every pixel, a distribution over object classes (such as car, road, tree, or wall). This task poses some computational challenges for machine learning, especially since images can contain a very large number of pixels (the Nokia Lumia 1020 smartphone, for example, captures 41 megapixels). It means that the effective number of training and test samples for the classification task is millions of times the number of images.
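To make the idea of pixel-wise classification concrete, here is a sketch (not from the article) of what a semantic segmentation output looks like: a label map with the same height and width as the input image, one class index per pixel. The class names and the trivial stand-in "classifier" rule are invented for illustration; the article's actual classifier is the decision forest described next.

```python
import numpy as np

CLASSES = ["road", "car"]  # hypothetical label set

def segment(image):
    """Stand-in per-pixel classifier: call a pixel 'car' if its red
    channel outweighs the other two channels combined."""
    red = image[:, :, 0].astype(int)
    rest = image[:, :, 1].astype(int) + image[:, :, 2].astype(int)
    return np.where(red > rest, 1, 0)  # one class index per pixel

# A tiny 2x3 test image with one strongly red pixel.
image = np.zeros((2, 3, 3), dtype=np.uint8)
image[0, 0] = (255, 10, 10)

labels = segment(image)
print(labels.shape)           # (2, 3): one label per pixel
print(CLASSES[labels[0, 0]])  # car
```

Note the scale issue the text raises: a real classifier must make one such prediction per pixel, so a 41-megapixel photograph means 41 million classifications per image.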
The scale of the problem prompted us to look for a particularly efficient classification model: the decision forest (also known as a random forest or randomized decision forest). A decision forest is an ensemble of separately trained decision trees, as shown in the illustration below.
Each tree has a root node, multiple internal "split" nodes, and multiple leaf nodes. Classifying a test example starts at the root node, where a binary "split function" is computed; this function may be as simple as "is this pixel redder than its neighbouring pixel?" Depending on this binary decision, the example branches left or right, the next split function is evaluated, and the process repeats. When a leaf node is finally reached, a stored prediction, usually a histogram over class labels, is produced as the output. (See also a recent excellent paper from Chris Burges on improved variants of decision trees applied to search ranking.)
The beauty of decision trees lies in their test-time efficiency: although there are very many possible paths from the root to the leaves, any individual test pixel passes along only one of them. Furthermore, the computation of each split function is conditioned on what came before: the classifier effectively uses the answers to earlier split decisions to ask the right next question. This is much like the game of twenty questions: when you are allowed only a few questions, you quickly learn to adapt your next question to the answers you have already received.
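The root-to-leaf traversal described above can be sketched as follows. This is a minimal illustration under invented assumptions: the node layout, the "is this pixel redder than its neighbour?" split function, the neighbour offsets, and the class histograms are all hypothetical, not the actual Kinect implementation.

```python
import numpy as np

class Node:
    """One node of a decision tree: either an internal split node
    (offset set, histogram None) or a leaf (histogram set)."""
    def __init__(self, offset=None, left=None, right=None, histogram=None):
        self.offset = offset        # (dy, dx) of the neighbour pixel to compare
        self.left = left            # child taken when the split test is False
        self.right = right          # child taken when the split test is True
        self.histogram = histogram  # class distribution stored at a leaf

def classify_pixel(tree, image, y, x):
    """Walk from the root to a leaf along a single path and return
    the class histogram stored at that leaf."""
    node = tree
    h, w, _ = image.shape
    while node.histogram is None:
        dy, dx = node.offset
        ny = min(max(y + dy, 0), h - 1)  # clamp the neighbour to the image
        nx = min(max(x + dx, 0), w - 1)
        # Split function: "is this pixel redder than its neighbour?"
        node = node.right if image[y, x, 0] > image[ny, nx, 0] else node.left
    return node.histogram

# A tiny hand-built tree: one split node and two leaves.
leaf_car = Node(histogram={"car": 0.9, "road": 0.1})
leaf_road = Node(histogram={"car": 0.2, "road": 0.8})
root = Node(offset=(0, 1), left=leaf_road, right=leaf_car)

image = np.zeros((4, 4, 3), dtype=np.uint8)
image[1, 1] = (200, 0, 0)  # one strongly red pixel

print(classify_pixel(root, image, 1, 1))  # -> {'car': 0.9, 'road': 0.1}
print(classify_pixel(root, image, 0, 0))  # -> {'car': 0.2, 'road': 0.8}
```

In a full forest, each tree would classify the pixel independently and the leaf histograms would be averaged; because every pixel follows just one path per tree, the work per pixel is only the depth of the trees, which is what makes the approach fast enough for per-pixel use.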
With this technique we have been able to tackle a variety of problems: semantic segmentation of photographs, segmentation of street scenes, segmentation of human anatomy in 3D medical scans, camera relocalization, and body-part segmentation from the Kinect depth camera. For Kinect, the test-time efficiency of decision forests was key: we had a very tight computational budget, but the ability to process the pixels in parallel on the Xbox GPU meant we could make it work [1].
In the second part of this topic, we will focus on a hot subject, deep learning for image classification, and gaze into the "crystal ball" to see what might come next. Also, if you want to get started with machine learning on a cloud platform, visit our Machine Learning Center.
Thank you for your attention,
Jamie, Antonio and Sebastian.
[1] Body-part classification is only one stage of skeleton tracking; the full skeleton-tracking pipeline was built by the fantastic team of engineers at Xbox.