Speech recognition is not a new topic, but in the last two years there has been a sudden influx of applications: Siri on the iPhone, voice search on search engines and shopping sites, and so on. This wave of large-scale applications is rooted in the rise of the mobile Internet; the popularity of smartphones and the network connectivity provided by 3G/4G are important preconditions.
Applications currently on the market, including Google's and Apple's, generally upload the voice and analyze it on the server side, and their recognition speed and accuracy are indeed impressive. I have tried the Google Speech API (https://www.google.com/speech-api/v2/recognize?), and its recognition rate is equally amazing.
Server-side parsing has its drawbacks. Some time ago a TV's product description told users: all your voice will be uploaded to the cloud. That scared some of them. But at this stage it is the only feasible approach, because high-quality speech recognition really does require a lot of storage space and computing power.
CMU Sphinx is the only successful open-source speech recognition project I know of. Kai-Fu Lee was the author of its first version. When I first downloaded CMU Sphinx, I was very disappointed with its recognition accuracy. After a few searches I realized that speech recognition is very complex: it involves acoustic models, language models, and so on. Building a reasonably general acoustic model requires recordings from hundreds of speakers, dozens of hours of speech per person. The language model is also complex: as the vocabulary grows, the difficulty of guessing which text corresponds to a given sound increases rapidly. In other words, recognizing different speakers and a large vocabulary at the same time is very hard. It also involves word segmentation and context, not to mention that we often mix English into our everyday speech. Google's and Apple's systems likely use several terabytes or more of storage for their acoustic and language models, plus the computing power to run them at or near real time.
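To make the "increases rapidly" concrete, here is a back-of-the-envelope sketch. The numbers are my own illustrative assumptions, not Sphinx internals: if any word may follow any other, the number of candidate transcriptions grows exponentially with both vocabulary size and utterance length.

```python
# Illustrative only: in the worst case (no language model at all),
# every word in a V-word vocabulary can follow every other word,
# so an n-word utterance has V**n candidate transcriptions.

def hypothesis_count(vocab_size: int, sentence_length: int) -> int:
    """Number of possible word sequences of the given length."""
    return vocab_size ** sentence_length

# A 5-command vocabulary vs. a 60,000-word general vocabulary,
# for a 5-word utterance:
small = hypothesis_count(5, 5)        # 3,125 candidates
large = hypothesis_count(60_000, 5)   # ~7.8e23 candidates

print(small, large)
```

A real language model prunes this space by scoring likely word sequences, but the sketch shows why a tiny, fixed vocabulary is so much easier than open dictation.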
Therefore, it is unrealistic to achieve high-quality, large-vocabulary speech recognition on a mobile device at this stage. Perhaps we will have to wait until our phones have several terabytes of storage; by then, the whole field of AI may have had a huge impact on our lives.
But this does not mean CMU Sphinx has no practical use. The applications described above need full-vocabulary recognition, but many occasions do not need such a large vocabulary. For example, if we use a Raspberry Pi to build a small car, it only needs to support a few words such as "forward", "back", "left", "right" and "stop". The language model is then greatly simplified, and the recognition rate rises correspondingly. As for acoustic models, Sphinx ships with some; in addition, Sphinx provides tools to train your own acoustic model, and you only need to record yourself reading the words shown on the screen.
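For a command set this small, Sphinx can work from a JSGF grammar instead of a statistical language model. A minimal sketch for the five car commands above (the grammar name is my own choice):

```jsgf
#JSGF V1.0;

grammar car;

// Exactly one of the five commands per utterance.
public <command> = forward | back | left | right | stop;
```

With the hypothesis space reduced to five alternatives, even a generic acoustic model can tell the commands apart reliably.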
CMU Sphinx also offers a simple online language model generation tool (http://www.speech.cs.cmu.edu/tools/lmtool-new.html) that can generate a language model and pronunciation dictionary for you.
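The tool takes a plain-text corpus, one sentence or phrase per line, and returns a matching language model (.lm) and pronunciation dictionary (.dic). A corpus for the car example might be as simple as:

```text
forward
back
left
right
stop
```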
Detailed getting-started instructions are on the official website. Java programmers can look here: http://cmusphinx.sourceforge.net/wiki/tutorialsphinx4
Recently, CMU Sphinx published an article on the current state of offline recognition, with detailed notes on offline use:
http://cmusphinx.sourceforge.net/2015/02/current-state-of-offline-speech-recognition-and-smarttv/
So: how can we use speech recognition in a simple way?