"Editor's note" the author of this article, David Cardinal,david, has more than 20 years of experience in technology industry. He is a collaborative developer for Windows Digitalpro, the first professional image management solution in Windows. This article is a combination of Microsoft's development experience at this year's build convention to understand the development of the Kinect and leap Motion's body sense.
When David Holz, the CTO of Leap Motion, first told me not to pay too much attention to their hardware, because most of the know-how was in the software, I was skeptical. Then, at Microsoft's Build conference, Alisson Sol, who leads Kinect development, described in detail how the first-generation Kinect evolved into what it is today, and I was finally convinced.
The focus of motion-sensing development is software. The hardware has to be adequate, of course, but it does not need to be top-notch. Teardowns of the Leap show it contains just three off-the-shelf LEDs and two ordinarily priced cameras. Most of the development team's hard work goes into making the algorithms accurately recognize the user's actions. Microsoft's grip-gesture recognition on the Kinect is a good case study in how motion sensing is developed.
Machine learning
For years, programmers have spent a great deal of time conserving computational resources with ingenious, carefully tuned algorithms. Machine learning, to some extent, leaves that past behind: we can now throw large amounts of data at the computer and let it work things out itself, while the resulting recognizer consumes only a small amount of CPU and GPU. Of course, it is more complicated than that, and doing it well is not easy. But the results apply to a large class of recognition problems, including motion sensing.
Before machine learning can start, the first task is to collect a large amount of high-quality data. In the Kinect's case, that meant many gigabytes of depth-tagged video in which multiple testers perform various gestures and motions. These video clips then have to be labeled by hand to indicate what the tester is doing at each point, a process similar to traditional data collection. This hand-labeled data is called the "ground truth": the standard against which the recognizer is judged.
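As a rough illustration only (Microsoft has not published its actual training-data format, so every field name below is hypothetical), one hand-labeled sample might pair a depth frame with the state a reviewer assigned to it:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LabeledFrame:
    """One hand-labeled training sample; the fields are hypothetical."""
    depth: np.ndarray      # depth image for the frame, e.g. shape (480, 640)
    timestamp_ms: int      # position of the frame within the clip
    hand_state: str        # label assigned by a human reviewer: "grip" or "release"

# The "ground truth" is simply the full collection of such labeled frames.
ground_truth = [
    LabeledFrame(depth=np.zeros((480, 640), dtype=np.uint16),
                 timestamp_ms=0, hand_state="release"),
]
```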
At the same time, it is important to frame the problem clearly. The Microsoft team originally tried to determine whether the palm was open or closed, and for a long time made little progress with grip recognition. Sol's team switched to directly detecting the grab and release events, and that turned out to be the key to making the grip interface work.
Turning data into features
Once you have enough labeled data, the next step is deciding which attributes (or features) of the data to use to determine a gesture. This is as much art as technology, and genuinely hard to get right. The chosen features also have to be cheap to compute: the Windows Kinect team, for example, had only two milliseconds in which to recognize the grip gesture.
For grip recognition, the team first used the number of pixels within a certain distance of the palm (located by the skeleton-tracking subsystem) as the main feature for the machine learning algorithm. But they then ran into a problem: the reported position of the hand is not stable, so this feature is hard to capture accurately. They had to develop an auxiliary algorithm to compensate, taking into account the various hand positions and the displacement that occurs during recognition. Unfortunately, this approach did not work well.
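A minimal sketch of that first kind of feature, assuming a depth image plus a palm position and depth reported by the skeleton-tracking subsystem; the function name and thresholds are invented for illustration, not Microsoft's:

```python
import numpy as np

def pixels_near_palm(depth, palm_xy, palm_depth_mm, radius_px=64, depth_tol_mm=100):
    """Count depth pixels close to the palm both in the image plane and in depth,
    as a crude stand-in for 'how much hand is visible'."""
    ys, xs = np.indices(depth.shape)
    near_in_plane = (xs - palm_xy[0]) ** 2 + (ys - palm_xy[1]) ** 2 <= radius_px ** 2
    near_in_depth = np.abs(depth.astype(int) - palm_depth_mm) <= depth_tol_mm
    return int(np.count_nonzero(near_in_plane & near_in_depth))
```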
In the end, Sol's team used the depth-tagged data to judge the action from the pixel differences between the frame in which the grab or release occurs and the frames around it; that is, the change in each pixel relative to the previous frame contributes to the joint decision about whether the action happened.
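A simplified sketch of that frame-difference idea; in the real recognizer the per-pixel changes feed a learned decision rather than a fixed threshold, and the names and numbers here are invented:

```python
import numpy as np

def per_pixel_change(prev_frame, frame, threshold_mm=30):
    """Feature vector for the learner: which pixels in the hand window changed
    in depth by more than a small amount since the previous frame."""
    diff = np.abs(frame.astype(int) - prev_frame.astype(int))
    return (diff > threshold_mm).astype(np.uint8).ravel()
```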
Let the output write the code for you
Unlike ordinary results-oriented programming, a machine learning system such as the Kinect's relies on a set of ideal outputs (the initially labeled data) to generate a recognizer, which is machine-generated code. The generated recognizer then identifies the target gesture in real applications.
At the same time, you will quickly find that computing the attributes chosen in the previous step becomes a big-data problem. At 30 frames per second, the test video accumulates more than 100,000 frames per hour, and each frame has roughly 300,000 pixels (more on the newer Kinect).
Even if you focus only on the 128x128 region around each hand, there are more than 16,000 pixels to analyze per hand, or over 65,000 per frame (with up to four hands in view). The extracted features are then fed into the machine learning system (said to be a variant of several open systems).
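The arithmetic behind those figures, assuming the original Kinect's 640x480 depth stream:

```python
frames_per_hour = 30 * 60 * 60           # 30 fps test video -> 108,000 frames per hour
pixels_per_frame = 640 * 480             # original Kinect depth resolution -> 307,200 pixels
hand_window = 128 * 128                  # region analyzed around each hand -> 16,384 pixels
hands = 4                                # up to four hands in view

print(hand_window * hands)               # 65,536 hand pixels per frame
print(frames_per_hour * hand_window * hands)  # ~7 billion hand pixels to process per hour of video
```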
Sol did not dwell on the differences between the various machine learning algorithms; put simply, given enough data, different algorithms produce similar results (which is easy to understand). In their project they used the ID3 algorithm to build a decision tree. ID3 computes the information gain of each attribute and selects the attribute with the highest gain as the test attribute for the given set.
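A minimal sketch of ID3's attribute-selection step, using made-up discrete features rather than Microsoft's actual ones; in the real system this selection is applied recursively to grow the full decision tree:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(samples, labels, attribute):
    """Reduction in entropy from splitting the samples on one attribute.
    Each sample is a dict of attribute -> discrete value."""
    base = entropy(labels)
    total = len(samples)
    partitions = {}
    for sample, label in zip(samples, labels):
        partitions.setdefault(sample[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return base - remainder

def best_attribute(samples, labels, attributes):
    """ID3's core step: pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(samples, labels, a))

# Toy example with invented per-frame features:
samples = [
    {"pixels_changed": "many", "hand_speed": "slow"},
    {"pixels_changed": "few",  "hand_speed": "slow"},
    {"pixels_changed": "many", "hand_speed": "fast"},
    {"pixels_changed": "few",  "hand_speed": "fast"},
]
labels = ["grip", "open", "grip", "open"]
print(best_attribute(samples, labels, ["pixels_changed", "hand_speed"]))  # -> "pixels_changed"
```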
If the initially chosen features are sufficient to decide whether the action occurred, the system-generated code is then run against more ground truth data. If they are not, you have to go back to the feature-selection step.
Don't just test, analyze
Many research papers on machine learning end with a simple test, such as distinguishing a finger pointing up from one that is not. For a consumer product like the Kinect, Sol says, that is nowhere near enough. To reach the standard the market demands, Microsoft used thousands of test cases and built tools to analyze the different types of errors, then fed that analysis back into improving the algorithm. He gave hand speed as an example: when the hand moves quickly, the captured hand position is noticeably more skewed, so the grip recognition algorithm has to take that into account.
In testing, because a huge number of frames have to be analyzed, even a recognition rate of 99.9% still produces a large number of errors every hour. Fixing these glitches requires repeated changes to the recognizer, the code used for action recognition.
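A quick back-of-the-envelope calculation shows why, assuming errors are counted per frame of 30 fps video:

```python
frames_per_hour = 30 * 60 * 60          # 108,000 frames of 30 fps video per hour
error_rate = 1 - 0.999                  # 99.9% per-frame accuracy
print(frames_per_hour * error_rate)     # ~108 misclassified frames every hour
```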
One of the refinements Sol mentioned is distinguishing the left and right hands in image recognition: one cannot simply be treated as a mirror image of the other, because light and shadow are not symmetrical.
As you can imagine, running these tests takes a long time: Sol says testing a grip recognizer takes a week on an 80-core machine.
Finally, the Kinect team also described how it continues to improve the recognizer's speed. The end result is that the Kinect for Windows SDK 1.7 ships with grip control, which is both useful in itself and instructive for developers. Similarly, although Leap has not been as open about its development process as the Kinect team, it is clear that its software likewise turns a pile of off-the-shelf components into one of the most capable motion-sensing devices on the market.
Via: ExtremeTech
Related:
Gesture recognition with Wi-Fi.