Feature Extraction
MFCC
compute-mfcc-feats.cc
Create MFCC feature files.
Usage: compute-mfcc-feats [Options ...] <wav-rspecifier> <feats-wspecifier>
Here the rspecifier is used to read the .wav files and the wspecifier is used to write the resulting MFCC features. In a typical setup the features are written to one large "archive" (ark) file, together with an "scp" file that allows random access. The program does not extract delta features (see add-deltas.cc for that).
Its --channel option selects the channel for stereo input (--channel=0, --channel=1).
compute-mfcc-feats --config=conf/mfcc.conf \
scp:exp/make_mfcc/train/wav1.scp \
ark:/data/mfcc/raw_mfcc_train.1.ark;
The first argument, "scp:...", reads the files listed in exp/make_mfcc/train/wav1.scp. The second argument, "ark:...", indicates that the computed features are written to the archive file /data/mfcc/raw_mfcc_train.1.ark. Each utterance in the archive is a feature matrix of size N(frames) x N(MFCC coefficients).
The MFCC features are computed by the Compute method of the Mfcc object, and the computation proceeds as follows:
1. Iterate over the frames (typically 25 ms per frame, with a 10 ms shift)
2. For each frame:
A. Extract the samples, apply optional dithering, pre-emphasis and DC-offset removal, and multiply by a window function
B. Compute the energy of the frame at this point (used when log-energy rather than C0 is requested)
C. Compute the FFT and the power spectrum
D. Compute the energy in each mel bin; there are 23 overlapping triangular bins whose center frequencies are spaced evenly on the mel scale
E. Take the log of the energies and apply the discrete cosine transform, keeping the specified number of coefficients
F. Lifter the cepstral coefficients so that they lie in a reasonable range
The lower and upper limits of the mel bin frequencies are determined by --low-freq and --high-freq, which are usually set close to 0 and to the Nyquist frequency respectively; for example, for 16 kHz speech, --low-freq=20 and --high-freq=7800.
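For reference, the triangular bins are placed at equal spacing on the mel scale; the HTK-style mel formula (which is also what Kaldi's code uses) and the standard DCT applied to the log mel energies are

$$\text{Mel}(f) = 1127\,\ln\left(1 + \frac{f}{700}\right), \qquad c_i = \sqrt{\frac{2}{M}}\sum_{m=1}^{M}\log(E_m)\,\cos\left(\frac{\pi\, i\,(m-0.5)}{M}\right),$$

where $M = 23$ is the number of mel bins and $E_m$ is the energy in bin $m$. (The $i = 0$ term uses a different normalization; with the default settings it is replaced by the log-energy in any case.)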
You can use copy-feats.cc to convert the features to other formats.
Cepstral mean and variance normalization
This normalization is usually done per speaker, or per utterance, to obtain zero-mean, unit-variance cepstral features. However, this method is not particularly recommended; it is better to use model-based mean and variance normalization, such as linear VTLN (LVTLN). A small, phone-based language model can be used for fast normalization. The feature-extraction programs compute-mfcc-feats.cc / compute-plp-feats.cc also provide a --subtract-mean option to obtain zero-mean features. If you want per-speaker or per-utterance mean- and variance-normalized features, use the compute-cmvn-stats.cc and apply-cmvn.cc programs.
compute-cmvn-stats.cc computes all the statistics needed for the mean and variance normalization and writes them to a table as a matrix.
compute-cmvn-stats \
--spk2utt=ark:data/train/train.1k/spk2utt \
scp:data/train.1k/feats.scp \
ark:exp/mono/cmvn.ark;
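For reference, per-speaker (or per-utterance) CMVN can be described as follows (a standard textbook formulation, not quoted from the Kaldi documentation): the accumulated statistics are the frame count $T$, the per-dimension sum $\sum_t x_{t,d}$ and the sum of squares $\sum_t x_{t,d}^2$, and apply-cmvn then normalizes each frame as

$$\hat{x}_{t,d} = \frac{x_{t,d} - \mu_d}{\sigma_d}, \qquad \mu_d = \frac{1}{T}\sum_{t=1}^{T} x_{t,d}, \qquad \sigma_d^2 = \frac{1}{T}\sum_{t=1}^{T} x_{t,d}^2 - \mu_d^2,$$

with the variance part applied only when variance normalization is enabled (the --norm-vars option of apply-cmvn).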
Monophone Training
Initializing the monophone model
-gmm-init-mono.cc
The program has two inputs and two outputs. As input it needs the topology file describing the HMM structure of the acoustic model (e.g., data/lang/topo) and the dimension of the Gaussians in the mixture model (i.e., the feature dimension). As output it writes the initial model (0.mdl) and the tree.
For example:
gmm-init-mono data/lang/topo \
exp/mono/0.mdl \
exp/mono/tree;
Here, tree is the context-dependency decision tree.
-compile-train-graphs.cc
Suppose we already have the decision tree and the model. The next command creates the archive of training graphs (HCLG) for the training set; the program compiles an FST for each training utterance.
compile-train-graphs exp/mono/tree \
exp/mono/0.mdl \
data/L.fst \
ark:data/train.tra \
ark:exp/mono/graphs.fsts;
train.tra is the transcript file for the training set; the first field on each line is the utterance id. The output of the program is graphs.fsts; it builds a binary-format FST for every utterance in train.tra. These FSTs correspond to HCLG, except that they contain no transition probabilities. This is because the graphs are used more than once during training and the transition probabilities change, so they are added in later. The FSTs in the archive do, however, contain the silence probabilities (these are encoded into L.fst). The decoding graph is $HCLG = H \circ C \circ L \circ G$, where (see the composition sketch after the list below):
1. H contains the HMM definitions; its output symbols are context-dependent phones and its input symbols are transition-ids (which encode the pdf-id and other information).
2. C represents the context dependency; its output symbols are phones and its input symbols are context-dependent phones.
3. L is the lexicon; its output symbols are words and its input symbols are phones.
4. G is the grammar: a finite-state acceptor encoding the language model.
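As an aside, composition itself is a generic OpenFst operation. The following is only a minimal sketch of composing a lexicon with a grammar ($L \circ G$) using the OpenFst library; it is not the code that compile-train-graphs actually runs, and the file names are placeholders.

// Minimal OpenFst composition sketch (illustration only, not Kaldi's
// graph-building code). The file names are hypothetical.
#include <fst/fstlib.h>

int main() {
  // Read the lexicon (phones -> words) and grammar (words -> words) FSTs.
  fst::StdVectorFst *l = fst::StdVectorFst::Read("L.fst");
  fst::StdVectorFst *g = fst::StdVectorFst::Read("G.fst");
  if (l == nullptr || g == nullptr) return 1;

  // Composition needs one side sorted on the matching labels:
  // sort L on its output labels (words).
  fst::ArcSort(l, fst::OLabelCompare<fst::StdArc>());

  // LG = L o G: a transducer from phone sequences to word sequences.
  fst::StdVectorFst lg;
  fst::Compose(*l, *g, &lg);

  // Determinization, epsilon removal and minimization would normally follow.
  lg.Write("LG.fst");
  delete l;
  delete g;
  return 0;
}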
Since there is no symbolic disambiguation in the training, it is easier to create a diagram than a test, and the training and testing uses the same HCLG form, except that the G includes only a linear receiver associated with the training corpus.
It is also desirable for HCLG to be stochastic; the traditional approach to this uses the "weight-pushing" operation.
Training the monophone model
-align-equal-compiled.cc
Given the acoustic model, the rspecifier of the graphs and the rspecifier of the features, this program writes an equally spaced alignment to the wspecifier. This is the E-step of the EM algorithm (see http://blog.csdn.net/shichaog/article/details/78415473 for the EM algorithm). An alignment is a vector of integers.
align-equal-compiled 0.mdl \
graphs.fsts \
scp:train.scp \
ark:equal.ali;
If you want to see the alignment results, you can use the show-alignments.cc program to view them.
-gmm-acc-stats-ali.cc
This program has three inputs: 1. the acoustic model (0.mdl), 2. the acoustic features of the training audio (MFCCs, train.scp), 3. the previously computed alignment of the hidden states (equal.ali). The output file (0.acc) contains the accumulated statistics for GMM training.
gmm-acc-stats-ali 0.mdl \
scp:train.scp \
ark:equal.ali \
0.acc
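For reference, the accumulators are the standard GMM sufficient statistics: for every Gaussian $m$ of every state $j$, the occupation count $\sum_t \gamma_{jm}(t)$, the weighted feature sum $\sum_t \gamma_{jm}(t)\,x_t$ and the weighted sum of squares, where $\gamma_{jm}(t)$ is the posterior probability of that Gaussian at frame $t$ under the given alignment. The M-step (the next program) re-estimates the parameters from them in the usual textbook way; since Kaldi's default models are diagonal-covariance, only the diagonal of $\hat{\Sigma}_{jm}$ is kept:

$$\hat{\mu}_{jm} = \frac{\sum_t \gamma_{jm}(t)\,x_t}{\sum_t \gamma_{jm}(t)}, \qquad \hat{\Sigma}_{jm} = \frac{\sum_t \gamma_{jm}(t)\,x_t x_t^{\top}}{\sum_t \gamma_{jm}(t)} - \hat{\mu}_{jm}\hat{\mu}_{jm}^{\top}.$$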
-gmm-est.cc
This is the M-step of the EM algorithm: given 1. the acoustic model and 2. the accumulated GMM training statistics, this program outputs a new acoustic model (updated by maximum-likelihood estimation).
gmm-est --min-gaussian-occupancy=3 \
--mix-up=250 \
exp/mono/0.mdl \
exp/mono/0.acc \
exp/mono/1.mdl
The --mix-up parameter specifies the number of Gaussian components in the new mixture model.
When the amount of training data is small (only a few hours), --min-gaussian-occupancy needs to be specified so that rare phones are handled properly.
Phoneme-to-data alignment
-gmm-align-compiled.cc
Given 1. the acoustic model, 2. the rspecifier of the graphs, 3. the rspecifier of the features, the program writes the alignments to the wspecifier. This is the E-step of the EM algorithm. Alignment refers to the correspondence between HMM states and the extracted acoustic feature vectors; each HMM state has a Gaussian output distribution, and the feature vectors aligned to a state are used to update its Gaussian parameters ($\mu$ and $\Sigma$).
gmm-align-compiled 1.mdl \
ark:graphs.fsts \
scp:train.scp \
ark:1.ali;
Triphone training
Building the context-dependent phone decision tree
CART (Classification and Regression Trees) is used.
-acc-tree-stats.cc
The program has three inputs: 1. the acoustic model, 2. the rspecifier of the acoustic features, 3. the rspecifier of the previous alignments. Its output is the accumulated tree statistics.
The program handles not only monophone alignments but also context-dependent alignments (e.g., triphones). The statistics needed to build the tree are written to disk as a BuildTreeStatsType, and the function AccumulateTreeStats() takes N and P as arguments. The command-line program sets N and P to 3 and 1 by default, but they can be changed with the --context-width and --central-position options. acc-tree-stats.cc also accepts a list of context-independent phones (for example, silence), which reduces the number of statistics.
acc-tree-stats final.mdl \
scp:train.scp \
ark:JOB.ali \
JOB.treeacc;
-sum-tree-stats.cc
The program sums the statistics for building the phonetic decision tree; it takes multiple *.treeacc files as input and outputs a single accumulated statistics file (e.g., treeacc).
sum-tree-stats treeacc \
*.treeacc;
-compile-questions.cc
The inputs to the program are 1. the HMM topology (e.g., topo) and 2. a list of phone questions (e.g., questions.int); it outputs the questions compiled into C++ objects that correspond to the "keys" in EventMap (e.g., questions.qst).
compile-questions data/lang/topo \
exp/triphones/questions.int \
exp/triphones/questions.qst;
-build-tree.cc
Once the statistics have been accumulated, you can use build-tree.cc to build the tree. It has three inputs: 1. the accumulated tree statistics (treeacc), 2. the question configuration (questions.qst), 3. the roots file (roots.int).
The tree statistics are obtained with acc-tree-stats.cc, the question configuration with compile-questions.cc, and cluster-phones.cc produces the list of phone questions.
build-tree builds a set of decision trees; the maximum number of leaf nodes (e.g., 2000) is a global limit across all the trees. After the initial splitting, post-clustering is applied within each tree; leaves can only be shared within a single tree.
build-tree treeacc \
roots.int \
questions.qst \
topo \
tree;
You can use the program draw-tree.cc to view the decision tree.
draw-tree data/lang/phones.txt \
exp/mono/tree | \
dot -Tps -Gsize=8,10.5 | \
ps2pdf - ~/tree.pdf
-gmm-init-model.cc
Initializes the GMM acoustic model (1.mdl) from 1. the decision tree, 2. the accumulated tree statistics (treeacc), 3. the HMM topology (topo).
gmm-init-model tree \
treeacc \
topo \
1.mdl;
-gmm-mixup.cc
Given 1. the GMM acoustic model (1.mdl) and 2. the occupation counts of each state (1.occs), performs Gaussian splitting ("mixing up"); the returned acoustic model (2.mdl) has an increased number of Gaussian components.
gmm-mixup --mix-up=$numgauss \
1.mdl \
1.occs \
2.mdl
-convert-ali.cc
Given 1. the old GMM model (monophones_aligned/final.mdl), 2. the new GMM model (triphones_del/2.mdl), 3. the new decision tree (triphones_del/tree), 4. the rspecifier of the old alignments (monophones_aligned/ali.*.gz), the program writes the new alignments (triphones_del/ali.*.gz).
convert-ali monophones_aligned/final.mdl \
triphones_del/2.mdl \
triphones_del/tree \
monophones_aligned/ali.*.gz \
triphones_del/ali.*.gz
-compile-train-graphs.cc
Given 1. the decision tree, 2. the acoustic model (2.mdl), 3. the finite-state transducer of the lexicon (L.fst), 4. the rspecifier of the training transcripts (text), the program writes the wspecifier of the training graphs (fsts.*.gz).
compile-train-graphs tree \
1.mdl \
L.fst \
text \
fsts.*.gz;
Bayes' rule
$$P(\text{sentence} \mid \text{speech}) = \frac{P(\text{speech} \mid \text{sentence}) \cdot P(\text{sentence})}{P(\text{speech})}$$
where the probability $P(\text{sentence})$ comes from a language model (e.g., an n-gram),
and $P(\text{speech} \mid \text{sentence})$ is a statistical (acoustic) model estimated from the training data.
Recognizing the sentence that corresponds to the speech means finding the sentence that maximizes $P(\text{speech} \mid \text{sentence}) \cdot P(\text{sentence})$.
WFST key points
1. determinization
2. minimization
3. composition
4. equivalence
5. epsilon-free
6. functional
7. on-demand algorithms
8. weight pushing
9. epsilon removal
HMM key points
1. Markov chain
2. Hidden Markov Model
3. forward-backward algorithm
4. Viterbi algorithm
5. EM for mixtures of Gaussians
L.fst: the pronunciation lexicon FST
L maps a sequence of phones to a word sequence.
The file L.fst is a finite-state transducer that produces a word sequence from a sequence of phone symbols.
Clustering mechanism
The class GaussClusterable (Gaussian statistics) inherits from the pure virtual class Clusterable. More classes derived from Clusterable will be added in the future. The purpose of Clusterable is to allow generic clustering algorithms to be used.
The core idea of Clusterable is that statistics can be added together and an objective function evaluated on them; the distance between two objects is the decrease in the objective function caused by combining their statistics rather than keeping them separate.
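In other words, if $f(\cdot)$ denotes the objective function evaluated on a set of accumulated statistics (this is a paraphrase of the idea above, not a formula quoted from the Kaldi documentation), the distance between objects $x$ and $y$ is

$$d(x, y) = f(x) + f(y) - f(x + y),$$

where $x + y$ denotes the summed statistics.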
Examples of Clusterable classes include mixture-of-Gaussian statistics and counts of discrete observations.
An example of creating a Clusterable* object is as follows:
Vector<BaseFloat> x_stats(10), x2_stats(10);
BaseFloat count = 100.0, var_floor = 0.01;
// initialize x_stats and x2_stats, e.g. as
// x_stats = count * mu_i, x2_stats = count * (mu_i*mu_i + sigma^2_i)
Clusterable *cl = new GaussClusterable(x_stats, x2_stats, var_floor, count);
Clustering Algorithm
The cluster functions are as follows:
- ClusterBottomUp
- ClusterBottomUpCompartmentalized
- RefineClusters
- ClusterKMeans
- TreeCluster
- ClusterTopDown
The data types that are commonly used are:
std::vector<Clusterable*> to_be_clustered;
K-means and algorithms with a similar interface
The clustering code is typically invoked as follows:
std::vector<Clusterable*> to_be_clustered;
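The text breaks off here; the following sketch shows how such a call might continue, using the ClusterKMeans interface from Kaldi's tree/cluster-utils.h (the statistics, sizes and number of clusters are made-up example values; check that header for the exact signature and options):

// Sketch only: clustering a set of GaussClusterable objects with ClusterKMeans.
#include "base/kaldi-common.h"
#include "util/stl-utils.h"
#include "matrix/matrix-lib.h"
#include "tree/clusterable-classes.h"
#include "tree/cluster-utils.h"

void ClusterExample() {
  using namespace kaldi;
  std::vector<Clusterable*> to_be_clustered;
  // Fill "to_be_clustered" with Gaussian statistics; normally these would be
  // accumulated from data, here they are just schematic values.
  for (int32 i = 0; i < 100; i++) {
    Vector<BaseFloat> x_stats(10), x2_stats(10);
    x_stats.Set(0.1 * i);              // pretend count * mu_i
    x2_stats.Set(0.01 * i * i + 1.0);  // pretend count * (mu_i^2 + sigma^2_i)
    to_be_clustered.push_back(
        new GaussClusterable(x_stats, x2_stats, 0.01 /*var_floor*/, 1.0 /*count*/));
  }
  std::vector<Clusterable*> clusters;  // cluster centers, as summed statistics
  std::vector<int32> assignments;      // which cluster each input point went to
  ClusterKMeans(to_be_clustered, 10 /*num clusters*/, &clusters, &assignments);
  DeletePointers(&to_be_clustered);    // free the Clusterable* objects
  DeletePointers(&clusters);
}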