Speech recognition and speaker recognition: a short encounter
This article mainly summarizes the experience of learning speech recognition...
1. First encounter
As a graduate student I focused on low-bit-rate speech coding, but I had already heard that speech recognition was a remarkable technology. It is easy to imagine how exciting it would be to converse freely with a machine. At the time I only picked up some terms and their explanations, and admired Rabiner, Kai-Fu Lee, and CMU (Carnegie Mellon University).
2. Learning
After I had been working for a while, the company felt it had done about all it could in speech coding, so it planned to make speech recognition its next direction of development. I was excited by this for quite a while, took an active part in the plan, and served as the main developer. The first step was to search the Internet for learning materials on speech recognition, including books, papers, and source code. Within a few days we had settled on the direction: speaker-independent, small-vocabulary, isolated-word recognition. We were, after all, starting essentially from scratch. There were many concepts and algorithms to understand, and a great deal of experience to accumulate.
First, we chose HTK, a speech recognition toolkit provided by Cambridge University, to learn the basic principles and procedures of speech recognition. The learning materials and source code that come with HTK are very good. It took us three months to learn the basic concepts and the development process, and to experiment with feature extraction, model matching, and the hidden Markov model (HMM). The urn-and-ball experiment in particular explains the concept of an HMM very clearly. Over the following three months, we developed a small-vocabulary, speaker-independent, isolated-word recognition solution for toys on our own embedded platform, together with supporting tools. Processing the corpus turned out to be very time-consuming.
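The urn-and-ball picture of an HMM can be sketched in a few lines. In this minimal illustration the hidden state is the urn being drawn from and the observation is the color of the ball; the forward algorithm then gives the probability of a color sequence. The two urns, the two colors, and every probability value are assumptions chosen for illustration, not values from our project or from HTK.

```python
# Hypothetical two-urn model; all names and probabilities are made-up examples.
URNS = ["urn_a", "urn_b"]

# Probability of starting at each urn.
START = {"urn_a": 0.6, "urn_b": 0.4}

# Probability of moving from one urn to the next between draws.
TRANS = {
    "urn_a": {"urn_a": 0.7, "urn_b": 0.3},
    "urn_b": {"urn_a": 0.4, "urn_b": 0.6},
}

# Probability of drawing each ball color from each urn.
EMIT = {
    "urn_a": {"red": 0.8, "blue": 0.2},
    "urn_b": {"red": 0.1, "blue": 0.9},
}


def forward_probability(observations):
    """Return P(observations | model) using the forward algorithm."""
    if not observations:
        return 1.0
    # alpha[u]: probability of the colors seen so far, ending at urn u.
    alpha = {u: START[u] * EMIT[u][observations[0]] for u in URNS}
    for ball in observations[1:]:
        alpha = {
            u: EMIT[u][ball] * sum(alpha[p] * TRANS[p][u] for p in URNS)
            for u in URNS
        }
    return sum(alpha.values())


if __name__ == "__main__":
    print(forward_probability(["red", "red", "blue"]))
```

In isolated-word recognition the same computation is typically run once per word model, and the word whose model gives the highest probability is chosen.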
3. Application
Next, we obtained several third-party speech recognition solutions for comparison, and from them we learned what accumulated technology and real engineering look like. A comprehensive evaluation showed that our own solution still lagged behind, as we had expected. The third parties had been working in this field for more than a decade, and objectively our investment of resources could not be compared with theirs. Later, the company planned to turn to speaker recognition. With the earlier foundation, speaker recognition was not too difficult to learn. The main difference lies in the extraction of feature parameters: the features used for speech recognition emphasize the speech content and suppress the differences between speakers, whereas speaker recognition focuses on the differences between speakers and does not care about the speech content. For model matching we used a Gaussian mixture model (GMM), and we used a support vector machine (SVM) to make a binary decision for rejecting speakers outside the training set.
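GMM-based matching can be illustrated roughly as follows: each enrolled speaker has a mixture model, and a test utterance is assigned to the speaker whose model gives the highest average log-likelihood over its feature frames. This is a minimal sketch only. The speaker names, the two-dimensional features, and all mixture parameters are hypothetical, the models are assumed to be already trained, and the SVM rejection step is not shown.

```python
import math

# Hypothetical, already-trained models: each speaker maps to a list of
# (weight, mean vector, diagonal variance vector) mixture components.
SPEAKER_MODELS = {
    "alice": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])],
    "bob": [(0.5, [6.0, 6.0], [1.0, 1.0]), (0.5, [8.0, 8.0], [1.0, 1.0])],
}


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of the frames under one GMM."""
    total = 0.0
    for x in frames:
        parts = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(parts)
        # Log-sum-exp keeps the mixture sum numerically stable.
        total += peak + math.log(sum(math.exp(p - peak) for p in parts))
    return total / len(frames)


def identify_speaker(frames, speaker_models):
    """Return the best-scoring speaker name and the score of every model."""
    scores = {
        name: gmm_log_likelihood(frames, gmm)
        for name, gmm in speaker_models.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    test_frames = [[0.1, -0.2], [1.9, 2.1]]
    name, scores = identify_speaker(test_frames, SPEAKER_MODELS)
    print(name, scores)
```

Averaging over frames keeps scores comparable between utterances of different lengths, which matters once a separate rejection decision is made on those scores.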
4. Shelved
The company changed faster than the plan did. Because of resource constraints, the whole plan was shelved.
Summary
Speech recognition is not an easy subject to fall in love with. It helps to have an experienced person guide you so that you take fewer detours, but such people are hard to find. For self-study you need a solid mathematical foundation (especially in probability and statistics), determination and perseverance, and passion and interest.
Some lessons from learning and developing speech recognition:
1. You need a high-quality corpus that covers the range of speakers to be recognized.
Speech recognition is essentially pattern recognition. The system first learns from examples to extract key feature parameters, and then uses those parameters for matching during recognition. The training corpus is therefore very important. For example, to recognize the speech of northerners, you need to record northern men, women, elders, and children in proportion for training. If the speakers to be recognized include both northerners and southerners, the training corpus must include both. The reason is simple: a northerner may not understand a southerner's speech on first hearing, but after several conversations he gets used to it and understands. In the same way, a machine that is to understand a new group of speakers must first adapt to and learn from their speech. The more the target speakers differ from one another, the more complex the corpus has to be.
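Recording "in proportion" can be planned with a small helper. The sketch below splits a recording budget across speaker groups according to target weights, using the largest-remainder method so that the counts always add up to the budget. The group names, weights, and budget are hypothetical examples, not figures from our project.

```python
def allocate_recordings(total, weights):
    """Split `total` recordings across groups in proportion to `weights`."""
    weight_sum = sum(weights.values())
    exact = {g: total * w / weight_sum for g, w in weights.items()}
    # Start from the rounded-down share of each group.
    counts = {g: int(share) for g, share in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining recordings to the groups with the largest remainders.
    by_remainder = sorted(exact, key=lambda g: exact[g] - counts[g], reverse=True)
    for g in by_remainder[:leftover]:
        counts[g] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical target mix for one regional accent.
    plan = allocate_recordings(
        1000, {"men": 3, "women": 3, "elders": 2, "children": 2}
    )
    print(plan)
```

If several accents are targeted, the same split can be applied per accent so that no group is left out of training.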
2. Engineering applications are more complex than general study and research.
The complexity does not lie in any isolated piece of knowledge, but in building a system that can cope with different environments and speakers. How do you develop a solution that meets the customer's needs as far as possible within the limited resources of an embedded system? How do you improve the recognition rate in a noisy environment? How should the system behave when a speaker deliberately tries to make recognition fail? Knowledge from books cannot simply be carried over into real projects. Knowledge is not the same as skill, and skill is not the same as a customer's application.
Although the speech recognition project was suspended, I learned a great deal from it. Many difficulties still limit the application of speech recognition, but many scholars and enthusiasts keep pushing the technology forward. In any case, there is no perfect technology, only application solutions that meet the customer's needs.
Finally, I still think that speech recognition is a very practical technology, because direct speech communication is one of the most natural ways of human-computer interaction.