Song recognition, as the name implies, means the device "listens to" a song and then tells you which song it is, and can even play that song for you. Such a feature has long existed in QQ Music and other applications. Today we are going to build our own song recognition system.
The overall flowchart we designed is simple:
-----
Recording section
-----
If we want to "listen", we first need a recording process. In our experiment the music library is also built with our own recording code: we record the songs and then extract their features into the database. We use the following approach to record:
# coding=utf8
import wave
import pyaudio


class recode():
    def recode(self, chunk=44100, format=pyaudio.paInt16, channels=2, rate=44100,
               record_seconds=200, wave_output_filename="record.wav"):
        '''
        :param chunk: buffer size
        :param format: sample size
        :param channels: number of channels
        :param rate: sample rate
        :param record_seconds: recording time
        :param wave_output_filename: output file path
        :return:
        '''
        p = pyaudio.PyAudio()
        stream = p.open(format=format,
                        channels=channels,
                        rate=rate,
                        input=True,
                        frames_per_buffer=chunk)
        frames = []
        for i in range(0, int(rate / chunk * record_seconds)):
            data = stream.read(chunk)
            frames.append(data)
        stream.stop_stream()
        stream.close()
        p.terminate()
        wf = wave.open(wave_output_filename, 'wb')
        wf.setnchannels(channels)
        wf.setsampwidth(p.get_sample_size(format))
        wf.setframerate(rate)
        wf.writeframes(''.join(frames))
        wf.close()


if __name__ == '__main__':
    a = recode()
    a.recode(record_seconds=30, wave_output_filename='record_pianai.wav')
What is the form of the songs we've recorded?
If you look at only one channel, it is a one-dimensional array, probably like this.
If we plot it with the array index on the horizontal axis, we get the waveform view of audio we are used to seeing.
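As a minimal sketch of what this looks like in code, the recorded file can be loaded and one channel plotted roughly as follows (the file name record_pianai.wav comes from the recording example above; the use of matplotlib for the plot is my own assumption):

# Sketch: load the recorded WAV and look at the first channel as a 1-D array.
import wave
import numpy as np
import matplotlib.pyplot as plt

f = wave.open('record_pianai.wav', 'rb')
nchannels, sampwidth, framerate, nframes = f.getparams()[:4]
data = np.frombuffer(f.readframes(nframes), dtype=np.int16)
f.close()

data = data.reshape(-1, nchannels)  # one row per frame, one column per channel
channel0 = data.T[0]                # the one-dimensional array of the first channel

plt.plot(channel0)                  # index on the x axis, amplitude on the y axis
plt.show()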
-----
Audio processing section
-----
Here we write our core code, the key to "how to identify a song". Think about how we humans recognize songs. Do we rely on a one-dimensional array like the one above? Do we rely on the loudness of the song? No.
We remember songs by the sequence of distinctive frequencies our ears pick up, so if we want to identify songs in code, we have to work with the frequency sequence of the audio.
Let's review what the Fourier transform is. I admit I slacked off in my "Signals and Systems" class, and although I did not memorize the specific transform formulas, the intuitive understanding stayed with me.
The essence of the Fourier transform is to convert a time-domain signal into a frequency-domain signal. In other words, where the x and y axes used to be the array index and the sample value, they now become frequency (this is not strictly accurate, but close enough here) and the magnitude of the component at that frequency.
How should we understand the frequency domain? For those of us who are not signal processing experts, the most important change is in how we think about what audio is made of. We used to think of audio as the waveform we started with: at each moment there is an amplitude, and a particular sequence of amplitudes forms a particular sound. Now we think of sound as a mixture of signals at different frequencies, each of which is present the whole time and contributes according to its projected component.
Let's see what it's like to convert a song to the frequency domain.
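As a rough sketch of that conversion (channel0 and framerate are assumed to come from the loading sketch above), the transform itself is just numpy's FFT, and what we inspect is the magnitude of each component:

# Sketch: convert one second of samples to the frequency domain.
import numpy as np
import matplotlib.pyplot as plt

one_second = channel0[:framerate]            # one second of samples from the first channel
spectrum = np.abs(np.fft.fft(one_second))    # magnitude of each frequency component

plt.plot(spectrum[:len(spectrum) // 2])      # only the first half is meaningful for a real signal
plt.show()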
We can observe that the frequency components are far from uniform; the differences between them are very large. To some extent we can regard the obvious peaks in the graph as frequency signals carrying large energy, which means those signals hold a prominent position in this piece of audio. So we pick these signals to extract the song's features.
But don't forget that what we talked about earlier was a frequency sequence. If we apply the Fourier transform to the whole song at once, we only learn the frequency content of the entire song and lose all relationship with time, so the "sequence" we spoke of no longer exists. Therefore we take a compromise approach: divide the audio into small blocks of time. Here I split it into 40 blocks per second.
A question to leave for the reader: why use small blocks rather than larger ones, say one block per second?
We apply a Fourier transform to each block and take the modulus, which gives us one array per block. Within the index intervals (0,40), (40,80), (80,120), and (120,180) we take the index with the largest modulus in each interval, and combine the four indices into a 4-tuple. This is our core audio "fingerprint".
The "fingerprints" we extracted are similar to the following
(39, 65, 110, 131), (15, 66, 108, 161), (3, 63, 118, 146), (11, 62, 82, 158), (15, 41, 95, 140), (2, 71, 106, 143), (15, 44, 80, 133), (36, 43, 80, 135), (22, 58, 80, 120), (29, 52, 89, 126), (15, 59, 89, 126), (37, 59, 89, 126), (37, 59, 89, 126), (37, 67, 119, 126)
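To make the step concrete, here is a condensed sketch of the fingerprint extraction described above (again assuming channel0 and framerate from the earlier sketches; the full class below does the same thing with a few extras):

# Sketch: 40 blocks per second, one 4-tuple fingerprint per block.
import numpy as np

block_size = framerate // 40                 # number of samples per block
fingerprints = []
for i in range(0, len(channel0) - block_size, block_size):
    spectrum = np.abs(np.fft.fft(channel0[i:i + block_size]))
    fingerprints.append((
        np.argmax(spectrum[0:40]),           # strongest index in (0, 40)
        40 + np.argmax(spectrum[40:80]),     # strongest index in (40, 80)
        80 + np.argmax(spectrum[80:120]),    # strongest index in (80, 120)
        120 + np.argmax(spectrum[120:180]),  # strongest index in (120, 180)
    ))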
The audio processing class has three methods: Load data, Fourier transform, play music.
As follows:
# coding=utf8
import os
import re
import wave
import numpy as np
import pyaudio


class voice():
    def loaddata(self, filepath):
        '''
        :param filepath: path to the wav file
        :return: True if no exception occurs; if there is an exception, exit and return False.
        self.wave_data stores the multi-channel audio data, where self.wave_data[0]
        is the first channel; see self.nchannels for the number of channels.
        '''
        if type(filepath) != str:
            print 'the type of filepath must be string'
            return False
        p1 = re.compile('\.wav')
        if p1.findall(filepath) is None:
            print 'the suffix of file must be .wav'
            return False
        try:
            f = wave.open(filepath, 'rb')
            params = f.getparams()
            self.nchannels, self.sampwidth, self.framerate, self.nframes = params[:4]
            str_data = f.readframes(self.nframes)
            self.wave_data = np.fromstring(str_data, dtype=np.short)
            self.wave_data.shape = -1, self.sampwidth
            self.wave_data = self.wave_data.T
            f.close()
            self.name = os.path.basename(filepath)  # record the file name
            return True
        except:
            print 'file error!'

    def fft(self, frames=40):
        '''
        :param frames: number of blocks per second
        :return:
        '''
        block = []
        fft_blocks = []
        self.high_point = []
        blocks_size = self.framerate / frames  # blocks_size: number of frames per block
        blocks_num = self.nframes / blocks_size  # number of blocks in the audio
        for i in xrange(0, len(self.wave_data[0]) - blocks_size, blocks_size):
            block.append(self.wave_data[0][i:i + blocks_size])
            fft_blocks.append(np.abs(np.fft.fft(self.wave_data[0][i:i + blocks_size])))
            self.high_point.append((np.argmax(fft_blocks[-1][:40]),
                                    np.argmax(fft_blocks[-1][40:80]) + 40,
                                    np.argmax(fft_blocks[-1][80:120]) + 80,
                                    np.argmax(fft_blocks[-1][120:180]) + 120,
                                    # np.argmax(fft_blocks[-1][180:300]) + 180,
                                    ))
            # the key fingerprint-extraction step; the last interval is commented out
            # but kept here -- you can think about why it was removed

    def play(self, filepath):
        '''
        Method used for audio playback
        :param filepath: file path
        :return:
        '''
        chunk = 1024
        wf = wave.open(filepath, 'rb')
        p = pyaudio.PyAudio()
        # open the sound output stream
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(),
                        rate=wf.getframerate(),
                        output=True)
        # write to the output stream to play
        while True:
            data = wf.readframes(chunk)
            if data == "":
                break
            stream.write(data)
        stream.close()
        p.terminate()


if __name__ == '__main__':
    p = voice()
    p.loaddata('record_beiyiwang.wav')
    p.fft()
self.high_point is the core data for later use. It is a list whose elements are fingerprints in the form explained above.
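For example, after loading a file and running the transform, the fingerprint list can be inspected directly (a small usage sketch based on the class above; the file name is the one used in its __main__ block):

v = voice()
if v.loaddata('record_beiyiwang.wav'):
    v.fft()
    print(v.high_point[:5])  # the first few 4-tuple fingerprints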
-----
Data storage and retrieval section
-----
Since we have built a music library to search against, we need a corresponding persistence method. I simply use a MySQL database to store the fingerprints of our songs, which has one benefit: it saves time writing code.
We save fingerprints and songs in this form:
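The exact table layout is up to you. A minimal sketch of a schema that matches the retrieval code below might look like the following; only the database name fingerprint, the table name musicdata, and the column song_name appear in the queries, so the second column's name and the column types here are my own assumptions:

# Sketch: create the table the storage class below expects.
# Column order matters: the code reads row[0] as the song name and row[1] as the fingerprint string.
import MySQLdb

conn = MySQLdb.connect(host='localhost', port=3306, user='root',
                       passwd='root', db='fingerprint', charset='utf8')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS musicdata (
        song_name   VARCHAR(255) NOT NULL,
        fingerprint LONGTEXT     NOT NULL
    )
""")
conn.commit()
cur.close()
conn.close()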
By the way: why are the first few fingerprints of each song the same? (The later ones, of course, differ a lot.) In fact, the period before the music actually starts has no strong energy at any particular frequency, and because our 44100 Hz sample rate is fairly high, this produces a lot of repeated fingerprints. Don't worry about it.
How do we match? We could simply count how many identical fingerprints two pieces of audio share, but then we would lose the sequence we talked about earlier, and we have to use the time sequence. Otherwise the longer a song is, the easier it would be to match, and that song would, like a weed, madly occupy first place in the results of every search. Besides, in theory the information in audio is carried by its sequence, just as a sentence expresses its meaning through phrases and words arranged in a particular order; merely counting how many words two sentences have in common cannot tell you whether they say the same thing. We use the following algorithm, but this is only experimental code; the algorithm design is very simple and inefficient. Students who want better results can try an improved DTW algorithm.
During matching we slide the fingerprint sequence, each time comparing the pattern string with the corresponding substring of the source string. If the fingerprints at corresponding positions are equal, the similarity value of this comparison is increased by one, and we take the maximum similarity value obtained over the whole sliding process as the similarity of the two songs.
Example:
The fingerprint sequence of a song in the music library: [fp13, fp20, fp10, fp29, fp14, fp25, fp13, fp13, fp20, fp33, fp14]
The fingerprint sequence of the music being retrieved: [fp14, fp25, fp13, fp17]
The comparison process:
The final match similarity value is 3
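A toy sketch of this sliding comparison, using the two sequences above (the fp_compare method in the storage code below does essentially the same thing; the string labels here just stand in for fingerprint tuples):

# Sketch: slide the query over the library sequence, count matches at aligned
# positions, and keep the best count.
def compare(search_fp, match_fp):
    if len(search_fp) > len(match_fp):
        return 0
    max_similar = 0
    for i in range(len(match_fp) - len(search_fp) + 1):
        same = sum(1 for j in range(len(search_fp)) if match_fp[i + j] == search_fp[j])
        if same > max_similar:
            max_similar = same
    return max_similar

library = ['fp13', 'fp20', 'fp10', 'fp29', 'fp14', 'fp25', 'fp13',
           'fp13', 'fp20', 'fp33', 'fp14']
query = ['fp14', 'fp25', 'fp13', 'fp17']
print(compare(query, library))  # 3: fp14, fp25 and fp13 line up at offset 4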
Implementation code for the storage and retrieval section:
# coding=utf-8
import os
import MySQLdb
import my_audio


class memory():
    def __init__(self, host, port, user, passwd, db):
        '''
        Initialize the storage class
        :param host: host address
        :param port: port
        :param user: user name
        :param passwd: password
        :param db: database name
        '''
        self.host = host
        self.port = port
        self.user = user
        self.passwd = passwd
        self.db = db

    def addsong(self, path):
        '''
        Add a song: extract the fingerprint of the song at the given path and store it in the database
        :param path: path of the song
        :return:
        '''
        if type(path) != str:
            print 'path need string'
            return None
        basename = os.path.basename(path)
        try:
            conn = MySQLdb.connect(host=self.host, port=self.port, user=self.user,
                                   passwd=self.passwd, db=self.db, charset='utf8')
            # create a connection to the database
        except:
            print 'DataBase error'
            return None
        cur = conn.cursor()
        namecount = cur.execute("select * from fingerprint.musicdata where song_name = '%s'" % basename)
        # check whether the song to be added is already in the library
        if namecount > 0:
            print 'the song has been recorded!'
            return None
        v = my_audio.voice()
        v.loaddata(path)
        v.fft()
        cur.execute("insert into fingerprint.musicdata VALUES ('%s', '%s')" % (basename, v.high_point.__str__()))
        # save the name and fingerprint of the new song in the database
        conn.commit()
        cur.close()
        conn.close()

    def fp_compare(self, search_fp, match_fp):
        '''
        Fingerprint comparison
        :param search_fp: query fingerprint
        :param match_fp: library fingerprint
        :return: maximum similarity value
        '''
        if len(search_fp) > len(match_fp):
            return 0
        max_similar = 0
        search_fp_len = len(search_fp)
        match_fp_len = len(match_fp)
        for i in range(match_fp_len - search_fp_len):
            temp = 0
            for j in range(search_fp_len):
                if match_fp[i + j] == search_fp[j]:
                    temp += 1
            if temp > max_similar:
                max_similar = temp
        return max_similar

    def search(self, path):
        '''
        Retrieve from the database
        :param path: path of the audio to retrieve
        :return: a list whose elements are 2-tuples: (similarity value, song name)
        '''
        v = my_audio.voice()
        v.loaddata(path)
        v.fft()
        try:
            conn = MySQLdb.connect(host=self.host, port=self.port, user=self.user,
                                   passwd=self.passwd, db=self.db, charset='utf8')
        except:
            print 'DataBase error'
            return None
        cur = conn.cursor()
        cur.execute("select * from fingerprint.musicdata")
        result = cur.fetchall()
        compare_res = []
        for i in result:
            compare_res.append((self.fp_compare(v.high_point[:-1], eval(i[1])), i[0]))
        compare_res.sort(reverse=True)
        cur.close()
        conn.close()
        print compare_res
        return compare_res

    def search_and_play(self, path):
        '''
        Same as the previous method, but also plays the best search result directly
        :param path: path of the song to retrieve
        :return:
        '''
        v = my_audio.voice()
        v.loaddata(path)
        v.fft()
        try:
            conn = MySQLdb.connect(host=self.host, port=self.port, user=self.user,
                                   passwd=self.passwd, db=self.db, charset='utf8')
        except:
            print 'DataBase error'
            return None
        cur = conn.cursor()
        cur.execute("select * from fingerprint.musicdata")
        result = cur.fetchall()
        compare_res = []
        for i in result:
            compare_res.append((self.fp_compare(v.high_point[:-1], eval(i[1])), i[0]))
        compare_res.sort(reverse=True)
        cur.close()
        conn.close()
        print compare_res
        v.play(compare_res[0][1])
        return compare_res


if __name__ == '__main__':
    sss = memory('localhost', 3306, 'root', 'root', 'fingerprint')
    sss.addsong('taiyangzhaochangshengqi.wav')
    sss.addsong('beiyiwangdeshiguang.wav')
    sss.addsong('xiaozezhenger.wav')
    sss.addsong('nverqing.wav')
    sss.addsong('the_mess.wav')
    sss.addsong('windmill.wav')
    sss.addsong('end_of_world.wav')
    sss.addsong('pianai.wav')
    sss.search_and_play('record_beiyiwang.wav')
-----
Summary
-----
Many parts of our experiment are very rough. The core algorithm borrows the "fingerprint" idea from the algorithm used by the company Shazam. I hope readers can offer valuable suggestions.