GMM-UBM System Framework
The raw input is the acoustic feature MFCC (Mel-frequency cepstral coefficients). I was not very familiar with it at the time; all I knew was that it is extracted directly from WAV or other audio files.
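As a quick illustration (not part of the original post), MFCCs can be extracted from a WAV file with an audio library such as librosa; the file name and the choice of 13 coefficients below are illustrative assumptions.

```python
# Minimal MFCC extraction sketch, assuming the librosa library is installed
# and a file named "speech.wav" exists; 13 coefficients is a common choice.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)        # waveform and sample rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, num_frames)
print(mfcc.shape)
```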
With the features in hand we can build the model, which here is the Gaussian mixture model (GMM). The differences between speakers show up mainly in the short-time speech spectrum, which can be measured by the probability density function of each speaker's short-time spectral features. A GMM fits that density with a weighted sum of Gaussian probability density functions; it can smoothly approximate a density function of arbitrary shape and is an easy-to-handle parametric model. Concretely, the mean vectors of all Gaussian components of a speaker's GMM are concatenated into one long vector that serves as that speaker's model, known as the mean supervector.
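In formulas (the standard textbook formulation, added here for concreteness), a GMM with C components models a spectral feature vector x as

```latex
% GMM density: a weighted sum of C Gaussian components
p(x) = \sum_{c=1}^{C} w_c \, \mathcal{N}(x;\, \mu_c, \Sigma_c),
\qquad \sum_{c=1}^{C} w_c = 1
```

and the mean supervector stacks the component means into one long vector:

```latex
% Mean supervector: concatenation of all C component means
M = \begin{bmatrix} \mu_1^{\top} & \mu_2^{\top} & \cdots & \mu_C^{\top} \end{bmatrix}^{\top}
```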
In practice, however, each speaker provides very little speech, while training a Gaussian mixture model requires a large amount of data. What to do? This is why the universal background model (UBM) was proposed. Because a speaker's enrollment data are sparse, a UBM is first trained on a large pool of speech from many speakers, and the target speaker's model is then derived from the UBM and the small amount of speaker data through an adaptation algorithm such as maximum a posteriori (MAP) estimation or maximum likelihood linear regression (MLLR).
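For reference, the widely used MAP rule for adapting the UBM means (quoted from the general literature, not derived in this post) interpolates between the speaker's enrollment statistics and the UBM:

```latex
% MAP adaptation of the c-th mean toward the enrollment data
\hat{\mu}_c = \alpha_c \, E_c(x) + (1 - \alpha_c)\, \mu_c^{\mathrm{UBM}},
\qquad \alpha_c = \frac{n_c}{n_c + r}
```

where n_c is the soft count of enrollment frames assigned to component c, E_c(x) is their posterior-weighted mean, and r is a relevance factor.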
OK, the features and models are all set up; how do we test? A log-likelihood ratio is used as the evaluation measure: the test data are scored against the speaker model and against the UBM, the two likelihoods are divided, the logarithm is taken, and the resulting value is the score that tells whether a test utterance matches the model.
How should this score be understood? The UBM represents the most common, generic speech characteristics, while the speaker model represents the characteristics belonging to that particular speaker. The log-likelihood ratio therefore measures whether the test data are closer to the speaker model or to the UBM. Finally, a threshold is set for the final accept/reject decision.
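Written out, for a test utterance X the score is the standard log-likelihood ratio, compared against a threshold theta:

```latex
\Lambda(X) = \log p(X \mid \lambda_{\mathrm{speaker}})
           - \log p(X \mid \lambda_{\mathrm{UBM}})
\;\gtrless\; \theta
```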
Joint Factor Analysis
The GMM-UBM system above is a classic and long served as the baseline system for speaker recognition. However, it does not solve the most troublesome problem in the field: channel robustness (the literature on channel variability covers this in depth). This is why factor analysis was brought into the speaker recognition field.
Joint factor analysis (JFA) proposes that the GMM mean supervector of the GMM-UBM system can be decomposed into a linear superposition of a component tied to the speaker and a component tied to the channel and other variabilities. In other words, the speaker's GMM supervector space is split into an eigenvoice (speaker) space, an eigenchannel space, and a residual space. If we can extract the speaker-related part and discard the channel-related part, we can largely overcome the channel's influence on recognition. The idea proved sound: system performance improved markedly once joint factor analysis was adopted.
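Concretely, the standard JFA decomposition of the mean supervector (Kenny's formulation, stated here for reference) is

```latex
% JFA: speaker, channel, and residual components of the supervector M
M = m + V y + U x + D z
```

where m is the UBM mean supervector, V and y are the eigenvoice matrix and speaker factors, U and x are the eigenchannel matrix and channel factors, and Dz is the speaker-specific residual.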
Speaker Recognition Based on the i-vector Feature
Traditional joint factor analysis models two distinct spaces: the speaker space, defined by the eigenvoice matrix, and the channel space, defined by the eigenchannel matrix. Inspired by JFA, Dehak proposed extracting a more compact vector, called the i-vector, from the GMM mean supervector. The "i" stands for identity: intuitively, the i-vector serves as the speaker's identity.
The i-vector method replaces those two spaces with a single one, called the total variability space, which contains both the differences between speakers and the differences between channels. Thus i-vector modeling does not strictly separate the speaker's influence from the channel's influence within the GMM mean supervector. The motivation comes from another study by Dehak: the channel factors estimated by JFA contain not only channel effects but also speaker information.
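The resulting model (the standard total variability formulation) collapses JFA's two spaces into one:

```latex
% Total variability model: one space for speaker + channel variability
M = m + T w
```

where T is the low-rank total variability matrix and w is the i-vector.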
So the main feature we use now is the i-vector. It is derived from the Gaussian mean supervector via factor analysis. It is a single-space, cross-channel representation that contains both speaker-space and channel-space information; in effect, factor analysis projects the speech from a high-dimensional space down to a low-dimensional one.
You can think of the i-vector either as a feature or as a simple model. In the test phase, we just compute the cosine distance between the i-vector of the test utterance and the i-vector of the model and use it as the final score. This i-vector system is also commonly used as the baseline system for speaker recognition.
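The cosine score between a test i-vector w_test and a model i-vector w_model is simply

```latex
\mathrm{score}(w_{\mathrm{test}}, w_{\mathrm{model}})
= \frac{\langle w_{\mathrm{test}},\, w_{\mathrm{model}} \rangle}
       {\lVert w_{\mathrm{test}} \rVert \, \lVert w_{\mathrm{model}} \rVert}
\;\gtrless\; \theta
```

and the comparison with the threshold theta gives the accept/reject decision.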
Channel Compensation Algorithms
In fact, channel compensation has been studied ever since the speaker recognition field began, including in the GMM-UBM and joint factor analysis systems above. Channel compensation operates at three levels: feature-based, model-based, and score-based. Since my research is built on i-vector features, the emphasis here is on channel compensation algorithms for i-vectors.
Why do we need channel compensation? As noted in the i-vector section, the i-vector contains both speaker information and channel information, yet we only care about the speaker information. The presence of channel information interferes with speaker recognition and can seriously degrade the system's accuracy, so we try to minimize its effect. This is what channel compensation means.
Linear Discriminant Analysis (LDA)
There are many channel compensation algorithms; let us start with LDA. Plenty of material on LDA itself is available, so here I only briefly explain why LDA is applicable to speaker recognition and how it performs channel compensation.
When a speaker has many utterances, their representations cluster together in the speaker space. If those utterances are affected by the channel, the speaker's cluster shows a large variance. LDA therefore tries to find a new space and project all the original data into it, such that in this space each speaker's data have the smallest possible within-class variance while the distances between different speakers are as large as possible. This reduces the differences introduced by the channel.
LDA is in essence a dimensionality reduction method. It tries to remove the unwanted directions and minimize the within-class variance; that is, LDA looks for a new space in which different classes are better separated. This makes LDA well suited as a channel compensation algorithm for speaker recognition systems.
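Formally (the standard Fisher criterion, stated here for completeness), LDA seeks the projection A that maximizes the ratio of between-class to within-class scatter:

```latex
A^{*} = \arg\max_{A}
\frac{\lvert A^{\top} S_b \, A \rvert}{\lvert A^{\top} S_w \, A \rvert}
```

where S_b is the between-speaker scatter matrix and S_w the within-speaker scatter matrix; the columns of the optimal A are the leading eigenvectors of S_w^{-1} S_b.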
Once the i-vectors of the test data and the model have been re-projected with LDA, the cosine distance between them is computed as the final score.
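A minimal sketch of this LDA-plus-cosine pipeline, using scikit-learn and synthetic data in place of real i-vectors (all names, dimensions, and the toy data are illustrative assumptions, not from the original post):

```python
# LDA channel compensation + cosine scoring on i-vectors: a sketch with
# synthetic data standing in for real i-vectors.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_speakers, utts_per_spk, dim = 20, 10, 100
labels = np.repeat(np.arange(n_speakers), utts_per_spk)
# Toy "i-vectors": a per-speaker offset plus channel-like noise.
ivecs = (rng.normal(size=(n_speakers, dim))[labels]
         + 0.5 * rng.normal(size=(len(labels), dim)))

lda = LinearDiscriminantAnalysis(n_components=n_speakers - 1)
lda.fit(ivecs, labels)  # learn the channel-compensating projection

def cosine_score(w1, w2):
    """Cosine similarity between two LDA-projected i-vectors."""
    p1, p2 = lda.transform([w1])[0], lda.transform([w2])[0]
    return float(np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2)))

# Score an enrollment/test pair from the same (synthetic) speaker;
# in a real system this score would be compared against a tuned threshold.
print(cosine_score(ivecs[0], ivecs[1]))
```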