Speech Signal Processing (2) pitch estimation (pitch detection)


Zouxy09@qq.com

http://blog.csdn.net/zouxy09

 

This semester I am taking a speech signal processing course, and the exam is coming up, so I need to get more familiar with the material. I rarely attend lectures, but now I have to knuckle down. While I am at it, I would like to organize my knowledge more clearly and share it with you. The second topic is summarized below: pitch estimation (pitch detection). We implement pitch detection based on the autocorrelation function method in C++, and use OpenCV to display the speech waveform. Because this was put together in a short time, there may be mistakes; I hope you will point them out. Thank you.

 

I. Overview

1.1 pitch and pitch estimation

When a person speaks, speech sounds can be divided into two types according to whether the vocal cords vibrate: voiced and unvoiced. Voiced sounds carry most of the energy in speech and show obvious periodicity in the time domain, while unvoiced sounds resemble white noise and have no obvious periodicity. When a voiced sound is produced, the airflow through the glottis sets the vocal cords into a relaxation oscillation, producing a quasi-periodic train of excitation pulses. The frequency of this vocal-cord vibration is called the pitch frequency (fundamental frequency), and its reciprocal is the pitch period.

Generally, the pitch frequency is related to the length, thickness, toughness, strength, and pronunciation habits of the individual vocal cords, and reflects the personal characteristics to a large extent. In addition, the pitch frequency varies with gender and age. Generally, male speakers have low pitch frequencies, while female and children have relatively high pitch frequencies.

The ultimate goal of pitch detection is to find a trajectory curve that agrees with, or comes as close as possible to, the vocal-cord vibration frequency.

As one of the important parameters describing the excitation source in speech signal processing, the pitch period has wide and important applications in speech synthesis, speech compression coding, speech recognition, speaker verification, and other fields, and this is especially true for Chinese. Chinese is a tonal language, and the variation of the pitch period over time is perceived as tone. Tone is essential for understanding spoken Chinese: in conversation, words are identified not only by their vowels and consonants but are also distinguished by their tones. In other words, tone carries meaning. In addition, Chinese has polyphonic characters: the same character read with different tones can have different meanings. Accurate and reliable pitch detection is therefore particularly important for processing Chinese speech signals.

 

1.2. Existing methods for pitch estimation

The pitch detection methods proposed so far can be divided into three categories:

1) Time-domain methods: the pitch period is estimated directly from the speech waveform. Common examples include the autocorrelation method, parallel processing, the average magnitude difference function (AMDF), and data-reduction methods;

2) Transform methods: the speech signal is transformed into the frequency or cepstral domain to estimate the pitch period. First, the influence of the vocal tract is removed by homomorphic analysis; then the information of the excitation source is recovered, from which the pitch period is obtained. The most common example is the cepstrum method. Its disadvantage is that the algorithm is complicated, but the pitch estimates are very good;

3) Hybrid methods: first extract the vocal-tract model parameters, use them to inverse-filter the signal to obtain the excitation source sequence, and then apply the autocorrelation method or the AMDF to find the pitch period.

 

III. Autocorrelation-Based Pitch Detection

3.1 The autocorrelation function

The autocorrelation function of the energy-limited speech signal x(n) is defined as:

R(m) = Σn x(n) · x(n + m),   summed over all n

This formula measures the similarity between a signal and a copy of itself delayed by m samples. If the signal x(n) is periodic, its autocorrelation function is also periodic, with the same period as x(n). This gives a way to find the period of a periodic signal: the autocorrelation function reaches a maximum at integer multiples of the period, so, ignoring the starting time, the pitch period can be estimated from the position of the first maximum of the autocorrelation function beyond lag zero. This makes the autocorrelation function a tool for estimating the pitch period of a signal.

 

3.2 The short-term autocorrelation method

A speech signal is non-stationary: its characteristics change over time. Over a short enough interval, however, it can be regarded as having relatively stable characteristics; this is short-term stationarity, and it is why speech exhibits short-term autocorrelation. The interval is roughly 5 ms to 50 ms, and statistical and spectral characteristics are defined over such short intervals. The speech signal is therefore split into short frames for digital processing; each frame can be treated as stationary, so short-term correlation analysis can be applied to it.

The short-term autocorrelation function of the energy-limited speech signal s(n), for a frame of length N starting at sample n, is defined as:

Rn(k) = Σ s(n + m) · s(n + m + k),   m = 0, 1, …, N − 1 − k

In general, a frame should contain at least two pitch periods. The fundamental frequency is usually at least 50 Hz, so the pitch period is at most 20 ms. In addition, adjacent frames must overlap sufficiently. The frame length in samples is then determined from the sampling rate.

In the autocorrelation function of the frame, apart from the maximum at lag 0, the largest peak is at kmax = 114, so the fundamental frequency of this frame is 16 kHz / 114 ≈ 140 Hz.
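The frame splitting with 50% overlap described above can be sketched as follows (an illustrative helper; the name `splitFrames` is mine, and the comment's numbers follow the 16 kHz setup of this experiment):

```cpp
#include <vector>
#include <cstddef>

// Split a signal into frames of frameLen samples with 50% overlap
// (hop = frameLen / 2). A frame should span at least two pitch periods:
// at 16 kHz a 50 Hz fundamental is 320 samples per period, though the
// experiment below settles on a 400-sample frame by inspecting a real
// male voice whose period is much shorter.
std::vector<std::vector<short>> splitFrames(const std::vector<short> &data,
                                            std::size_t frameLen)
{
    std::vector<std::vector<short>> frames;
    std::size_t hop = frameLen / 2;
    for (std::size_t start = 0; start + frameLen <= data.size(); start += hop)
        frames.emplace_back(data.begin() + start,
                            data.begin() + start + frameLen);
    return frames;
}
```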

 

IV. Implementation of the Autocorrelation-Based Pitch Detection Algorithm

The course requires the implementation to be in C++. To draw the waveforms I used OpenCV, which I am familiar with. The waveforms drawn with OpenCV look quite good, and a dynamically scrolling waveform looks especially nice, rather like an ECG trace.

The experiment uses a WAV file of a male voice reading the word "play" (bofang.wav), sampled at 16 kHz with 16-bit quantization. The whole utterance lasts 656.7 ms and contains 10508 samples.

First, determine the frame length. The figure below shows how many pitch periods are contained when the frame length is 200, 320, and 400 samples respectively: 200 samples contain only one period, while 400 samples contain three. We therefore adopt a frame length of 400 samples.

Voiced and unvoiced segments are distinguished by computing the short-term energy. The short-term average energy En of one frame of the speech signal {x(n)} is defined as:

En = Σ x²(n),   summed over the samples of the frame

The short-term average energy of voiced segments is much greater than that of unvoiced segments, so the short-term average energy gives a basis for separating the two: En(voiced) > En(unvoiced).
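A minimal sketch of this energy-based voiced/unvoiced decision (illustrative only; the function names and the threshold value are my assumptions, not the ones used in the program below):

```cpp
#include <vector>
#include <cstddef>

// Short-term average energy of one frame: En = sum of x(n)^2 over the frame.
double frameEnergy(const std::vector<short> &frame)
{
    double sum = 0.0;  // double avoids overflow when squaring 16-bit samples
    for (std::size_t i = 0; i < frame.size(); ++i)
        sum += static_cast<double>(frame[i]) * frame[i];
    return sum;
}

// Voiced frames have much higher short-term energy than unvoiced ones,
// so a simple threshold on En separates the two.
bool isVoiced(const std::vector<short> &frame, double threshold)
{
    return frameEnergy(frame) > threshold;
}
```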

While each frame is being processed, its position in the original waveform is marked and the pitch of the frame is displayed in real time; the waveform of the current frame is also shown in a separate window.

The waveform of the current frame (the two frames below are from different times; the display changes dynamically):

The left figure below shows the pitch estimates for all frames of the utterance. There are many outliers ("wild points"), so further processing is needed to remove them. Here a median filter is used to remove the outliers; the filtered result is shown in the right figure.
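The median smoothing of the pitch track can be sketched as follows (illustrative; `medianSmooth` is my name, and simply copying the edge values through at the boundaries is one common choice):

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// k-point median smoothing of a pitch-frequency track. Isolated outliers
// ("wild points") are replaced by the median of their neighborhood;
// the first and last k/2 samples are copied through unchanged.
std::vector<short> medianSmooth(const std::vector<short> &f0, int k = 5)
{
    std::vector<short> out(f0);
    int half = k / 2;
    for (int i = half; i + half < static_cast<int>(f0.size()); ++i) {
        std::vector<short> win(f0.begin() + i - half,
                               f0.begin() + i + half + 1);
        // nth_element partially sorts so that win[half] is the median
        std::nth_element(win.begin(), win.begin() + half, win.end());
        out[i] = win[half];
    }
    return out;
}
```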

The C++ program is as follows (press the space bar to step to the next stage):

// Description : Pitch detection
// Author      : Zou Xiaoyi
// HomePage    : http://blog.csdn.net/zouxy09
// Date        : 2013/06/08
// Rev.        : 0.1

#include <iostream>
#include <fstream>
#include <string>
#include "opencv2/opencv.hpp"
#include "ReadWriteWav.h"

using namespace std;
using namespace cv;

#define MAXLENGTH 1000

void wav2image(Mat &img, vector<short> wavData, int wav_start, int width, int max_amplitude)
{
    short max(0), min(0);
    for (int i = 0; i < wavData.size(); i++)
    {
        if (wavData[i] > max) max = wavData[i];
        if (wavData[i] < min) min = wavData[i];
    }
    cout << max << '\t' << min << endl;
    max_amplitude = max_amplitude > 480 ? 480 : max_amplitude;

    // normalize
    for (int i = 0; i < wavData.size(); i++)
    {
        wavData[i] = (wavData[i] - min) * max_amplitude / (max - min);
    }

    int j = 0;
    Point prePoint, curPoint;
    if (width >= 400)
    {
        img.create(max_amplitude, width, CV_8UC3);
        img.setTo(Scalar(0, 0, 0));
        for (int i = wav_start; i < wav_start + width; i++)
        {
            prePoint = Point(j, img.rows - (int)wavData[i]);
            if (j) line(img, prePoint, curPoint, Scalar(0, 255, 0), 2);
            curPoint = prePoint;
            j++;
        }
        if (width > MAXLENGTH)
        {
            cout << "The wav is too long to show, and it will be resized to 1000" << endl;
            resize(img, img, Size(MAXLENGTH, img.rows));
        }
    }
    else
    {
        img.create(max_amplitude, 400, CV_8UC3);
        img.setTo(Scalar(0, 0, 0));
        for (int i = wav_start; i < wav_start + width; i++)
        {
            prePoint = Point(j * 400 / width, img.rows - (int)wavData[i]);
            circle(img, prePoint, 3, Scalar(0, 0, 255), CV_FILLED);
            j++;
        }
        cout << "The wav is too small to show, and it will be resized to 400" << endl;
    }
}

short calOneFrameACF(vector<short> wavFrame, int sampleRate)
{
    vector<float> acf;
    acf.clear();

    // calculate the ACF of this frame
    for (int k = 0; k < wavFrame.size(); k++)
    {
        float sum = 0.0;
        for (int i = 0; i < wavFrame.size() - k; i++)
        {
            sum = sum + wavFrame[i] * wavFrame[i + k];
        }
        acf.push_back(sum);
    }

    // find the largest peak, skipping the main lobe around lag 0
    float max(-999);
    int index = 0;
    for (int k = 0; k < wavFrame.size(); k++)
    {
        if (k > 25 && acf[k] > max)
        {
            max = acf[k];
            index = k;
        }
    }
    return (short)(sampleRate / index);
}

int main()
{
    const char *wavFile = "bofang.wav";
    vector<short> data;
    int nodesPerFrame = 400;

    /************* Write data to file part Start ***************/
    fstream writeFile;
    writeFile.open("statistics.txt", ios::out);
    /************* Write data to file part End ***************/

    /************* Read and show the input wave part Start ***************/
    int sampleRate;
    int dataLength = wav2allsample(wavFile, data, sampleRate);
    if (!dataLength)
    {
        cout << "Reading wav file error!" << endl;
        return -1;
    }

    Mat originalWave;
    wav2image(originalWave, data, 0, dataLength, 400);
    line(originalWave, Point(0, originalWave.rows * 0.5), Point(originalWave.cols, originalWave.rows * 0.5), Scalar(0, 0, 255), 2);
    imshow("originalWave", originalWave);

    // write data
    writeFile << "Filename: " << wavFile << endl << "SampleRate: " << sampleRate << "Hz" << endl << "dataLength: " << dataLength << endl;
    cout << "Press space key to continue" << endl;
    while (waitKey(30) != ' ');
    /************* Read and show the input wave part End ***************/

    /******** Calculate energy to separate voice and unvoice part Start *********/
    int nodeCount = 0;
    // The sum must be double type
    vector<double> energyTmp;
    double maxEnergy(0);
    while (nodeCount < (dataLength - nodesPerFrame))
    {
        double sum(0);
        for (int i = nodeCount; i < (nodeCount + nodesPerFrame); i++)
        {
            sum += (double)data[i] * data[i];
        }
        if (sum > maxEnergy)
        {
            maxEnergy = sum;
        }
        energyTmp.push_back(sum);
        nodeCount++;
    }

    // Transform to short type for show
    vector<short> energy;
    // Fill element of boundary
    short tmp = (short)(energyTmp[0] * 400 / maxEnergy);
    for (int i = 0; i < nodesPerFrame * 0.5; i++)
    {
        energy.push_back(tmp);
    }
    for (int i = 0; i < energyTmp.size(); i++)
    {
        energy.push_back((short)(energyTmp[i] * 400 / maxEnergy));
    }
    // Fill element of boundary
    tmp = (short)(energyTmp[energyTmp.size() - 1] * 400 / maxEnergy);
    for (int i = 0; i < nodesPerFrame * 0.5; i++)
    {
        energy.push_back(tmp);
    }

    // show
    Mat showEnergy;
    wav2image(showEnergy, energy, 0, energy.size(), 400);
    line(showEnergy, Point(0, showEnergy.rows - 1), Point(showEnergy.cols, showEnergy.rows - 1), Scalar(0, 0, 255), 2);
    imshow("showEnergy", showEnergy);
    while (waitKey(30) != ' ');

    // separate voice and unvoice
    float thresVoice = 400 * 0.15;
    line(showEnergy, Point(0, showEnergy.rows - thresVoice), Point(showEnergy.cols, showEnergy.rows - thresVoice), Scalar(0, 255, 255), 2);
    imshow("showEnergy", showEnergy);
    while (waitKey(30) != ' ');

    // Find the transition points and draw them
    bool high = false;
    vector<int> separateNode;
    for (int i = 0; i < energy.size(); i++)
    {
        if (!high && energy[i] > thresVoice)
        {
            separateNode.push_back(i);
            high = true;
            writeFile << "UnVoice to Voice: " << i << endl;
            line(showEnergy, Point(i * MAXLENGTH / dataLength, 0), Point(i * MAXLENGTH / dataLength, showEnergy.rows), Scalar(255, 255, 255), 2);
            putText(showEnergy, "Voice", Point(i * MAXLENGTH / dataLength, showEnergy.rows * 0.5 + 40), FONT_HERSHEY_SIMPLEX, 1, Scalar(255, 255, 255), 2);
            imshow("showEnergy", showEnergy);
            while (waitKey(30) != ' ');
        }
        if (high && energy[i] < thresVoice)
        {
            separateNode.push_back(i);
            high = false;
            writeFile << "Voice to UnVoice: " << i << endl;
            line(showEnergy, Point(i * MAXLENGTH / dataLength, 0), Point(i * MAXLENGTH / dataLength, showEnergy.rows), Scalar(255, 0, 0), 2);
            putText(showEnergy, "UnVoice", Point(i * MAXLENGTH / dataLength, showEnergy.rows * 0.5 + 40), FONT_HERSHEY_SIMPLEX, 1, Scalar(255, 0, 0), 2);
            imshow("showEnergy", showEnergy);
            while (waitKey(30) != ' ');
        }
    }
    /******** Calculate energy to separate voice and unvoice part End ***********/

    /******************* Calculate all frame part Start ***************/
    int frames = 0;
    vector<short> allPitchFre;
    writeFile << "The pitch frequency is:" << endl;
    // stop before the last (partial) frame would run past the end of data
    while (frames * nodesPerFrame / 2 + nodesPerFrame <= dataLength)
    {
        vector<short> wavFrame;
        // get one frame: 400 nodes per frame, shifted by 200 nodes (200 nodes overlap)
        int start = frames * nodesPerFrame / 2;
        for (int i = start; i < start + nodesPerFrame; i++)
            wavFrame.push_back(data[i]);

        // calculate the ACF of this frame
        float pitchFreqency = calOneFrameACF(wavFrame, sampleRate);
        allPitchFre.push_back(pitchFreqency);
        cout << "The pitch frequency is: " << pitchFreqency << " Hz" << endl;
        writeFile << pitchFreqency << endl;

        // show current frame in the whole wave
        Mat originalWave;
        wav2image(originalWave, data, 0, dataLength, 400);
        line(originalWave, Point(0, originalWave.rows * 0.5), Point(originalWave.cols, originalWave.rows * 0.5), Scalar(0, 0, 255), 2);
        line(originalWave, Point(start * MAXLENGTH / dataLength, 0), Point(start * MAXLENGTH / dataLength, originalWave.rows), Scalar(0, 0, 255), 2);
        line(originalWave, Point((start + nodesPerFrame) * MAXLENGTH / dataLength, 0), Point((start + nodesPerFrame) * MAXLENGTH / dataLength, originalWave.rows), Scalar(0, 0, 255), 2);

        // put the pitchFreqency of this frame in the whole wave
        stringstream buf;
        buf << pitchFreqency;
        string num = buf.str();
        putText(originalWave, num, Point(start * MAXLENGTH / dataLength, 30), FONT_HERSHEY_SIMPLEX, 0.7, Scalar(0, 0, 255), 2);
        imshow("originalWave", originalWave);

        // show current frame in zoom out model
        Mat oneSelectFrame;
        wav2image(oneSelectFrame, wavFrame, 0, wavFrame.size(), 400);
        imshow("oneSelectFrame", oneSelectFrame);

        if (!frames)
            while (waitKey(30) != ' ');
        frames++;
        waitKey(50);
    }
    cout << "Num of frames is: " << frames << endl;
    /******************* Calculate all frame part End ***************/

    // show all pitch frequency before smooth
    Mat showAllPitchFre;
    wav2image(showAllPitchFre, allPitchFre, 0, allPitchFre.size(), 400);
    putText(showAllPitchFre, "Before smooth", Point(10, showAllPitchFre.rows - 20), FONT_HERSHEY_SIMPLEX, 1, Scalar(60, 200, 255), 1);
    imshow("showAllPitchFre", showAllPitchFre);

    /******************* Smooth by medium filter part Start **************/
    int kernelSize = 5;
    int half = kernelSize / 2;
    vector<short> afterMedFilter;
    int sum(0);  // int, so the running sum of frequencies cannot overflow
    afterMedFilter.assign(allPitchFre.size(), allPitchFre[0]);
    // keep the kernel inside the track at both ends
    for (int k = half; k < (int)allPitchFre.size() - half; k++)
    {
        vector<short> kernelData;
        for (int i = -half; i <= half; i++)
            kernelData.push_back(allPitchFre[k + i]);
        nth_element(kernelData.begin(), kernelData.begin() + half, kernelData.end());
        afterMedFilter[k] = kernelData[half];
        sum += afterMedFilter[k];
        cout << afterMedFilter[k] << endl;
    }

    // show all pitch frequency and mean pitch frequency after smooth
    Mat showAfterMedFilter;
    wav2image(showAfterMedFilter, afterMedFilter, 0, afterMedFilter.size(), 400);
    putText(showAfterMedFilter, "After smooth", Point(10, showAfterMedFilter.rows - 20), FONT_HERSHEY_SIMPLEX, 1, Scalar(60, 200, 255), 1);
    short mean = sum / (afterMedFilter.size() - 2 * half);
    writeFile << "The mean pitch frequency is: " << mean << endl;
    stringstream buf;
    buf << mean;
    string num = "Mean: " + buf.str() + "Hz";
    putText(showAfterMedFilter, num, Point(10, 40), FONT_HERSHEY_SIMPLEX, 1, Scalar(255, 200, 255), 2);
    imshow("showAfterMedFilter", showAfterMedFilter);
    /******************* Smooth by medium filter part End ***************/

    while (waitKey(30) != 27);  // press Esc to quit
    return 0;
}

 
