The Kinect microphone array sits along the bottom of the Kinect device. It consists of four independent microphones distributed horizontally. Although each microphone captures the same audio signal, the array as a whole can determine the direction a sound comes from, so it can be used to isolate sound arriving from a particular direction. The audio stream captured by the microphone array is processed by a sophisticated audio-enhancement pipeline that removes irrelevant background noise. All of this complex processing is handled between the Kinect hardware and the Kinect SDK, which makes voice commands usable across a large space, even when the speaker is some distance from the microphone.
Skeleton tracking and speech recognition were the most anticipated features of the Kinect SDK when the Kinect was first released as an Xbox 360 peripheral, but the microphone array's power for speech recognition was overlooked in comparison to skeleton tracking. This is partly because the skeleton-tracking system was so exciting, and partly because the Xbox dashboard and Kinect games did not take full advantage of the Kinect's audio processing.
For a developer starting to build applications with the Kinect, the microphone array makes Kinect-based functionality considerably more powerful. As impressive as the Kinect's visual analysis is, it is still not well suited to precise pointing and selection. Each time we have moved from one human-computer interface to the next, from command-line interaction to tabbed interfaces, to the mouse-driven graphical user interface, to touch interfaces, the most basic operation each interface provides, and makes easier to perform, is selection. Each new interaction style has improved our ability to select objects. Oddly, the Kinect has bucked this trend.
In Kinect applications, selection is one of the most complex and hard-to-master interactions. The initial selection gesture on the Xbox 360 was to hold a hand over a specific spot for a period of time. The game "Dance of the Forest" slightly improved on this with a brief pause followed by a swipe, an improvement that was later adopted in the Xbox dashboard as well. Other attempts to improve selection include dedicated gestures, such as raising an arm.
These problems can be solved relatively simply by combining speech commands with the skeleton-tracking system to create compound postures: hold a pose, then confirm it by voice. Menus can also be designed by displaying the menu items first and letting the user speak the name of the item to select it, an approach many Xbox games already use. Predictably, both application developers and game studios will increasingly apply this composite approach to new interaction designs, instead of relying on point-and-click selection as before.
1. Microphone array
The speech-recognition components are installed automatically along with the Microsoft Kinect SDK. The Kinect microphone array builds on speech-recognition class libraries that have been available since Windows Vista, including the voice-capture DirectX Media Object (DMO) and the Speech Recognition API (SAPI).
In C#, the Kinect SDK provides a wrapper around the voice-capture DMO. The voice-capture DMO was originally designed to give microphone arrays an API supporting features such as acoustic echo cancellation (AEC), automatic gain control (AGC), and noise suppression (NS). These features can be found in the SDK's audio-control classes. Audio processing in the Kinect SDK is a thin wrapper around the voice-capture DMO, optimized for the Kinect sensor. To enable speech recognition with the Kinect SDK, the installer also sets up the Speech Platform API, the Speech Platform SDK, and the Kinect for Windows Runtime Language Pack.
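As a sketch of how these DMO features surface in C#, the snippet below configures the audio source of a connected sensor. It assumes the Kinect for Windows SDK 1.x; property names such as EchoCancellationMode and NoiseSuppression are from that version of the API and may differ in others, and it requires an attached sensor to run.

```csharp
using System.IO;
using System.Linq;
using Microsoft.Kinect;

class AudioSetup
{
    static void Main()
    {
        // Find the first connected Kinect sensor (requires hardware).
        KinectSensor sensor = KinectSensor.KinectSensors
            .FirstOrDefault(s => s.Status == KinectStatus.Connected);
        sensor.Start();

        KinectAudioSource audio = sensor.AudioSource;

        // Enable the DMO enhancement features discussed above.
        audio.EchoCancellationMode =
            EchoCancellationMode.CancellationAndSuppression; // AEC + AES
        audio.NoiseSuppression = true;                       // NS
        audio.AutomaticGainControlEnabled = false;           // AGC is usually left
                                                             // off for recognition
        // Start capturing; the returned stream delivers PCM audio.
        Stream audioStream = audio.Start();
    }
}
```

Turning AGC off for speech recognition is a common recommendation, since gain changes can confuse the recognizer.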
The Speech Recognition API simplifies access to the class libraries behind the operating system's built-in speech recognition, and it can be used with or without the Kinect SDK. For example, if you want to add voice commands to a desktop application using an ordinary microphone instead of the Kinect microphone array, you can use SAPI directly.
The Kinect for Windows Runtime Language Pack is a set of language models that lets the Kinect SDK and the speech recognition API components interoperate. Just as Kinect skeleton recognition relies on a large body of trained decision-tree models to infer joint positions, the speech recognition API needs sophisticated models to help interpret the audio received from the Kinect microphone array. The Kinect Language Pack provides these models, optimized for recognizing voice commands.
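A minimal sketch of wiring these pieces together follows, assuming the Microsoft.Speech assembly from the Speech Platform SDK; the "Kinect" recognizer lookup matches the SDK 1.x language pack, and the grammar words are hypothetical placeholders.

```csharp
using System;
using System.Linq;
using Microsoft.Speech.Recognition;

class VoiceCommands
{
    static void Main()
    {
        // Pick the recognizer installed by the Kinect language pack
        // (its AdditionalInfo carries a "Kinect" = "True" entry in SDK 1.x).
        RecognizerInfo info = SpeechRecognitionEngine.InstalledRecognizers()
            .FirstOrDefault(r =>
            {
                string value;
                return r.AdditionalInfo.TryGetValue("Kinect", out value) &&
                       "True".Equals(value, StringComparison.OrdinalIgnoreCase);
            });

        var engine = new SpeechRecognitionEngine(info.Id);

        // A tiny command grammar; the words are placeholder examples.
        var commands = new Choices("select", "cancel", "menu");
        engine.LoadGrammar(new Grammar(new GrammarBuilder(commands)));

        engine.SpeechRecognized += (s, e) =>
            Console.WriteLine("Heard: {0} ({1:0.00})",
                              e.Result.Text, e.Result.Confidence);

        // Feed it the Kinect audio stream (from KinectAudioSource.Start())
        // via engine.SetInputToAudioStream(...), then call RecognizeAsync.
    }
}
```

Keeping the grammar to a short list of command words is what makes this style of recognition robust at a distance.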
1.1 MSR Kinect Audio
Audio processing in the Kinect is driven primarily by the KinectAudioSource object. The main job of the KinectAudioSource class is to extract the raw or processed audio stream from the microphone array. The audio stream can be run through a series of algorithms that improve its quality, including noise suppression, automatic gain control, and echo cancellation. KinectAudioSource can be configured to run the microphone array in different modes. It can also report the direction from which sound is currently arriving, and it can force the microphone array to listen in a specified direction.
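For example, the beam-related members of KinectAudioSource look roughly like this; the property and event names below are from the Kinect for Windows SDK 1.x and are a sketch, not a complete program.

```csharp
using Microsoft.Kinect;

class BeamControl
{
    static void Configure(KinectSensor sensor)
    {
        KinectAudioSource audio = sensor.AudioSource;

        // Let the array steer its beam toward the loudest source automatically...
        audio.BeamAngleMode = BeamAngleMode.Adaptive;

        // ...or pin it to a fixed direction instead (angles are in degrees,
        // relative to the center of the sensor).
        // audio.BeamAngleMode = BeamAngleMode.Manual;
        // audio.ManualBeamAngle = 0;

        // Report where sound is coming from as the estimate changes.
        audio.SoundSourceAngleChanged += (s, e) =>
            System.Console.WriteLine("Source at {0:0.0} degrees, confidence {1:0.00}",
                                     e.Angle, e.ConfidenceLevel);

        audio.Start();
    }
}
```

The adaptive mode corresponds to the beamforming behavior described below, while forcing a manual angle corresponds to accepting audio only from a specified direction.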
This section deliberately avoids the low-level details of audio processing. However, to use KinectAudioSource effectively, it helps to be familiar with some terms from speech capture and transmission, since they appear throughout KinectAudioSource's properties and methods.
Acoustic echo cancellation (AEC): an echo is produced when the user's voice, played back through a speaker, is picked up again by the microphone. The simplest example is hearing your own voice on a phone call, repeated with a slight delay. Echo cancellation removes the echo by extracting the speaker's voice pattern and then, using that pattern as a reference, filtering the matching audio out of the signal received by the microphone.
Acoustic echo suppression (AES): a set of algorithms that further remove the residual echo left over after AEC processing.
Automatic gain control (AGC): algorithms that keep the amplitude of the user's voice consistent over time. For example, as a user moves toward or away from the microphone, the voice becomes louder or softer; the AGC algorithm smooths out this variation.
Beamforming: an algorithmic technique that simulates a directional microphone. Rather than relying on a single microphone, beamforming uses a microphone array (such as the one on the Kinect sensor) to achieve the same effect as multiple fixed directional microphones.
Center clipping: used to remove the small echoes that remain in a one-way transmission after AEC processing.
Frame size: the AEC algorithm processes PCM audio frame by frame; the frame size is the size of one audio frame in samples.
Gain bounding: a technique that keeps the microphone at a correct gain level. If the gain is too high, the captured signal may saturate and be clipped; this clipping is a nonlinear effect that makes the AEC algorithm fail. If the gain is too low, the signal-to-noise ratio drops, which makes the AEC algorithm fail or perform poorly.
Noise filling: adds a small amount of noise to the portions of the signal from which center clipping has removed residual echo. This produces a better listening experience than leaving behind gaps of complete silence.
Noise suppression (NS): removes non-speech sounds from the audio signal received by the microphone. By removing background noise, the actual speaker's voice is captured more cleanly and clearly.
Optibeam: the Kinect sensor can form eleven beams from its four microphones. The eleven beams are logical constructs, while the four channels are physical ones. Optibeam is the system mode that enables beamforming.
Signal-to-noise ratio (SNR): measures the ratio of the speech signal to the overall background noise, commonly expressed in decibels as 10·log10(Psignal/Pnoise); the higher the SNR, the better.
Single channel: the Kinect sensor has four microphones, so it supports four channels; single channel is the system mode that turns beamforming off.