Speech Commands dataset: http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
Audio recognition tutorial: https://www.tensorflow.org/versions/master/tutorials/audio_recognition
At Google, we are often asked how to get started using deep learning for speech recognition and other audio recognition problems, such as detecting keywords or commands. While there are already many large open-source speech recognition systems, such as Kaldi, that can use neural networks as a component, their complexity makes them hard to use as a guide for simpler tasks. Just as importantly, there are not many free, open datasets suitable for beginners (many require preprocessing before you can build a model on them) or aimed at simple keyword-detection tasks.
To address these issues, the TensorFlow and AIY teams created the Speech Commands dataset and used it to add training and inference sample code to TensorFlow. The dataset contains 65,000 one-second utterances of 30 short words, contributed by thousands of people through the AIY website. It is released under a Creative Commons BY 4.0 license, and new versions will continue to be released as more audio is contributed. The dataset is designed to help build basic but useful voice interfaces for applications, covering common words such as "yes" and "no", digits, and direction words. We have also open-sourced the infrastructure used to create the dataset, and we hope more people will use it to build their own versions, especially for underserved languages and applications.
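The extracted archive is organized as one folder per word, with each folder holding one-second WAV clips of that word being spoken. The sketch below, a hypothetical helper, counts the clips per label; it builds a tiny mock directory tree so it runs without the real download, and the file names it creates are invented for illustration.

```python
import os
import tempfile
from collections import Counter

def count_clips(root):
    """Return {word: number_of_wav_clips} for a Speech Commands style tree."""
    counts = Counter()
    for word in sorted(os.listdir(root)):
        folder = os.path.join(root, word)
        if os.path.isdir(folder):
            counts[word] = sum(f.endswith(".wav") for f in os.listdir(folder))
    return dict(counts)

# Build a tiny mock tree (two labels, a few empty files) standing in for the
# real extracted dataset, which has 30 word folders and ~65,000 clips.
root = tempfile.mkdtemp()
for word, n in [("yes", 3), ("no", 2)]:
    os.makedirs(os.path.join(root, word))
    for i in range(n):
        open(os.path.join(root, word, f"clip_{i}.wav"), "w").close()

print(count_clips(root))  # {'no': 2, 'yes': 3}
```

Pointing `count_clips` at the real extracted archive instead of the mock tree gives a quick sanity check that the download unpacked correctly.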
To try it yourself, download the prebuilt TensorFlow Android demo app (http://ci.tensorflow.org/view/Nightly/job/nightly-android/lastsuccessfulbuild/artifact/out/tensorflow_demo.apk) and open "TF Speech". You will be asked for permission to access your microphone, and then you'll see a list of ten words; each word should light up as you say it.
How well recognition works depends on whether your speech patterns are covered by the dataset, so it won't be perfect, and commercial speech recognition systems are much more complex than this teaching example. But we hope that as more accents and variations are added to the dataset, and as the community contributes improved models to TensorFlow, we will see the results keep improving and the dataset keep expanding.
You can also learn how to train your own model in the new audio recognition tutorial on tensorflow.org. With the latest development version of the framework (https://hub.docker.com/r/tensorflow/tensorflow/) and a modern desktop, you can download the dataset and train a model within a few hours. You also have a variety of options for customizing the neural network for different problems, producing different trade-offs among latency, size, and accuracy to suit different platforms.
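Models like the ones in the tutorial don't consume raw waveforms directly; each one-second clip is first sliced into short overlapping windows and turned into a spectrogram-style image. The sketch below shows that framing step with plain NumPy, using a 30 ms window and 10 ms stride (values in the spirit of the tutorial's defaults; the exact feature pipeline, including the MFCC step, is defined by the tutorial code itself). A synthetic 440 Hz tone stands in for a recorded command clip.

```python
import numpy as np

SAMPLE_RATE = 16000          # samples per second for the dataset's clips
WINDOW_MS, STRIDE_MS = 30, 10

def spectrogram(signal, sample_rate=SAMPLE_RATE,
                window_ms=WINDOW_MS, stride_ms=STRIDE_MS):
    """Slice a 1-D signal into overlapping windows and return |FFT| per window."""
    window = int(sample_rate * window_ms / 1000)   # 480 samples
    stride = int(sample_rate * stride_ms / 1000)   # 160 samples
    n_frames = (len(signal) - window) // stride + 1
    frames = np.stack(
        [signal[i * stride:i * stride + window] * np.hanning(window)
         for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# Synthetic one-second 440 Hz tone in place of a real "yes"/"no" recording.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 241): 98 time frames, 241 frequency bins
```

The resulting 2-D array of time frames by frequency bins is what makes convolutional image-style models a natural fit for this task.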
We look forward to seeing the new applications built with the help of this dataset and tutorial, and we hope you have the opportunity to take advantage of these resources and start on an audio recognition task of your own.
The network architecture is described in "Convolutional Neural Networks for Small-footprint Keyword Spotting", presented at Interspeech 2015 (http://www.isca-speech.org/archive/interspeech_2015/papers/i15_1478.pdf).
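A quick back-of-the-envelope comparison shows why that paper's convolutional approach keeps the model footprint small: a convolutional filter shares its weights across the whole input, while a fully-connected layer needs a weight for every input-unit pair. The layer sizes below are illustrative, not the paper's exact configuration.

```python
# Input shaped like a spectrogram: 98 time frames x 40 frequency features.
frames, bins = 98, 40
hidden_units = 128

# Dense layer: every input connects to every hidden unit.
fc_params = frames * bins * hidden_units      # 98 * 40 * 128

# Conv layer: 64 filters of size 8x20, weights shared across all positions.
conv_params = 8 * 20 * 64

print(fc_params, conv_params)  # 501760 10240
```

Roughly a 50x reduction in parameters for the first layer, which is what makes these models practical on phones and other small devices.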