Implementing a small Siri in Python in under 100 lines of code

This article describes how to implement a small Siri-style voice command recognizer in Python in under 100 lines of code. The approach is explained in detail below and should be a useful reference for anyone who wants to try it.

Objective

If you want an easier time understanding the core of the feature calculation, I recommend first reading my earlier article on song recognition ("listen and identify the song"): http://www.jb51.net/article/97305.htm

This article implements a simple command-word recognition program. The core of the algorithm is, first, extracting audio features and, second, matching them with the DTW algorithm. Of course, code like this is nowhere near commercial quality; it is just for fun.

Design ideas

Even for a small project, we should clarify the idea before writing code. Audio recognition is not easy; the difficulty of feature extraction is covered in the song-recognition article mentioned above. Speech recognition is harder still, because a song is always the same while human speech varies every time. For example, with the phrase "open sesame", one person may say it quickly while another drags it out. The timing within the recording also differs: a speaker may blurt the phrase out the instant recording starts, or speak so leisurely that the last word barely fits before the recording ends. That makes matching much harder.

Algorithm flow: record a command, split the audio into blocks and extract FFT features, compute the DTW distance to each stored command template, then pick the closest template and run the corresponding action.


Feature Extraction

As in the song-recognition article, the audio is split into 40 blocks per second, each block is run through a Fourier transform, and the magnitude is taken. The only difference is that instead of extracting peaks as in the song-recognition case, the magnitude spectrum of each block is used directly as the feature vector.

If you are not sure what I mean, look at the source code below, or read the song-recognition article first.
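To make this concrete, here is a minimal sketch of that feature extraction with numpy; the function name extract_features and the assumption of a mono sample array are mine, not part of the original source code below.

import numpy as np

def extract_features(samples, framerate, blocks_per_second=40):
    # Cut the audio into 40 blocks per second and keep the FFT magnitude of each block.
    block_size = framerate // blocks_per_second
    features = []
    for start in range(0, len(samples) - block_size, block_size):
        block = samples[start:start + block_size]
        features.append(np.abs(np.fft.fft(block)))  # magnitude spectrum used directly as the feature
    return features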

DTW algorithm

DTW, dynamic time warping. The algorithm solves the problem of matching two utterances whose lengths and word positions differ, finding the best possible alignment between them.

The algorithm takes two sequences of audio feature vectors as input:

A: [fp1, fp2, fp3, ..., fpM1]
B: [fp1, fp2, fp3, ..., fpM2]

Sequence A has M1 feature vectors and sequence B has M2. Each feature vector is the FFT magnitude spectrum of one of the blocks cut at 40 blocks per second, as described above. The cost between two feature vectors is their Euclidean distance.

Let d(fpA, fpB) be the distance cost between two feature vectors.
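In code this is a one-liner; it is essentially the compute_distance_vec helper that appears in the source code below:

import numpy as np

def compute_distance_vec(vec1, vec2):
    # Euclidean distance between two FFT-magnitude feature vectors
    return np.linalg.norm(vec1 - vec2)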

We can then picture a grid with A's feature vectors along one axis and B's along the other.

We need to walk from point (1, 1) to point (M1, M2). There are many possible paths, and each path corresponds to one way of aligning positions in the two recordings. Our goal is to find the path with the smallest total cost; the alignment along that path is the best match the two recordings can achieve.

We walk like this: for points along the two axes (the first row and the first column), the cumulative cost is computed directly by summing the costs along that axis. For every interior point,

D(i, j) = min{ D(i-1, j) + d(fpi, fpj), D(i, j-1) + d(fpi, fpj), D(i-1, j-1) + 2 * d(fpi, fpj) }

where fpi is the i-th feature vector of A and fpj is the j-th feature vector of B.

Why is the cost doubled when stepping diagonally from (i-1, j-1) straight to (i, j)? Because the other paths walk the two right-angle sides of the unit square, paying a cost at each step, while the diagonal step crosses the square in one move; charging it twice keeps it comparable with walking the two sides.

Choosing moves according to this rule and filling in the table all the way up to D(M1, M2) gives the final value, which is the distance between the two recordings.
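Here is a minimal sketch of that recurrence in plain numpy, just to make it concrete; the function name dtw_distance is my own, and the source code below relies on an external dtw module instead of this sketch.

import numpy as np

def dtw_distance(feats_a, feats_b):
    # Cumulative DTW cost between two feature sequences, using the recurrence above.
    m1, m2 = len(feats_a), len(feats_b)
    D = np.full((m1, m2), np.inf)
    D[0][0] = np.linalg.norm(feats_a[0] - feats_b[0])
    # Points on the two axes: cumulative cost is a straight running sum.
    for i in range(1, m1):
        D[i][0] = D[i - 1][0] + np.linalg.norm(feats_a[i] - feats_b[0])
    for j in range(1, m2):
        D[0][j] = D[0][j - 1] + np.linalg.norm(feats_a[0] - feats_b[j])
    # Interior points: vertical/horizontal steps cost d, the diagonal step costs 2 * d.
    for i in range(1, m1):
        for j in range(1, m2):
            d = np.linalg.norm(feats_a[i] - feats_b[j])
            D[i][j] = min(D[i - 1][j] + d, D[i][j - 1] + d, D[i - 1][j - 1] + 2 * d)
    return D[m1 - 1][m2 - 1]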

Source Code and comments

# coding=utf8
import os
import wave

import dtw
import numpy as np
import pyaudio


def compute_distance_vec(vec1, vec2):
    # Euclidean distance between two feature vectors
    return np.linalg.norm(vec1 - vec2)


class record():
    def record(self, CHUNK=44100, FORMAT=pyaudio.paInt16, CHANNELS=2,
               RATE=44100, RECORD_SECONDS=200,
               WAVE_OUTPUT_FILENAME="record.wav"):
        # recording method
        p = pyaudio.PyAudio()
        stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                        input=True, frames_per_buffer=CHUNK)
        frames = []
        for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
            data = stream.read(CHUNK)
            frames.append(data)
        stream.stop_stream()
        stream.close()
        p.terminate()
        wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(''.join(frames))
        wf.close()


class voice():
    def loaddata(self, filepath):
        try:
            f = wave.open(filepath, 'rb')
            params = f.getparams()
            self.nchannels, self.sampwidth, self.framerate, self.nframes = params[:4]
            str_data = f.readframes(self.nframes)
            self.wave_data = np.fromstring(str_data, dtype=np.short)
            self.wave_data.shape = -1, self.sampwidth
            self.wave_data = self.wave_data.T  # store the raw sample array
            f.close()
            self.name = os.path.basename(filepath)  # remember the file name
            return True
        except:
            raise IOError, 'File Error'

    def fft(self, frames=40):
        # split the audio into 40 blocks per second, then Fourier-transform each block
        self.fft_blocks = []
        blocks_size = self.framerate / frames
        for i in xrange(0, len(self.wave_data[0]) - blocks_size, blocks_size):
            self.fft_blocks.append(np.abs(np.fft.fft(self.wave_data[0][i:i + blocks_size])))

    @staticmethod
    def play(filepath):
        # playback method
        chunk = 1024
        wf = wave.open(filepath, 'rb')
        p = pyaudio.PyAudio()
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(),
                        rate=wf.getframerate(),
                        output=True)
        while True:
            data = wf.readframes(chunk)
            if data == "":
                break
            stream.write(data)
        stream.close()
        p.terminate()


if __name__ == '__main__':
    r = record()
    r.record(RECORD_SECONDS=3, WAVE_OUTPUT_FILENAME='record.wav')
    v = voice()
    v.loaddata('record.wav')
    v.fft()
    file_list = os.listdir(os.getcwd())
    res = []
    for i in file_list:
        # compare the new recording against every stored .wav command template
        if i.split('.')[1] == 'wav' and i.split('.')[0] != 'record':
            temp = voice()
            temp.loaddata(i)
            temp.fft()
            res.append((dtw.dtw(v.fft_blocks, temp.fft_blocks, compute_distance_vec)[0], i))
    res.sort()
    print res
    if res[0][1].find('open_qq') != -1:
        os.system('C:\program\Tencent\QQ\Bin\qqsclauncher.exe')  # my QQ path
    elif res[0][1].find('zhimakaimen') != -1:
        os.system('chrome.exe')  # browser path, already added to PATH
    elif res[0][1].find('play_music') != -1:
        voice.play('C:\data\music\\audio\\audio\\ (9).wav')  # play a piece of music
    # r = record()
    # r.record(RECORD_SECONDS=3, WAVE_OUTPUT_FILENAME='zhimakaimen_09.wav')

You can use the record method here to record several samples of each command word; try saying them in different tones and at different speeds, which improves accuracy. Then name the files so that, from the file name of the closest match, you know which command was spoken and can execute the corresponding task, as in the sketch below.
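For example, templates could be recorded with the same record class, roughly like this; the file names such as zhimakaimen_01.wav are just an illustration of the naming scheme, not part of the original code.

# record three samples of the "zhimakaimen" (open sesame) command as templates
r = record()
for n in range(1, 4):
    print 'speak the command now...'
    r.record(RECORD_SECONDS=3, WAVE_OUTPUT_FILENAME='zhimakaimen_%02d.wav' % n)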

This is a demo video: http://www.iqiyi.com/w_19ruisynsd.html
