Music recognition means, as the name suggests, having the device "listen" to a song and then tell you which song it is, and then play it for you. Such a feature has already appeared in QQ Music and other applications. Today we are going to build our own music recognizer.
The overall flow we designed is simple: record audio, extract a fingerprint from it, store the fingerprints of the library songs in a database, and match a query recording against them.
-----
Recording Section
-----
If we want to "listen", we first need a recording step. In our experiment, the songs in the music library are also recorded with our own recording code, and their features are then extracted into the database. We record along the following lines.
# coding=utf8
import wave
import pyaudio


class recode():
    def recode(self, CHUNK=44100, FORMAT=pyaudio.paInt16, CHANNELS=2, RATE=44100,
               RECORD_SECONDS=200, WAVE_OUTPUT_FILENAME="record.wav"):
        '''
        :param CHUNK: buffer size
        :param FORMAT: sample size
        :param CHANNELS: number of channels
        :param RATE: sampling rate
        :param RECORD_SECONDS: recording time
        :param WAVE_OUTPUT_FILENAME: output file path
        :return:
        '''
        p = pyaudio.PyAudio()
        stream = p.open(format=FORMAT,
                        channels=CHANNELS,
                        rate=RATE,
                        input=True,
                        frames_per_buffer=CHUNK)
        frames = []
        for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
            data = stream.read(CHUNK)
            frames.append(data)
        stream.stop_stream()
        stream.close()
        p.terminate()
        wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(''.join(frames))
        wf.close()

if __name__ == '__main__':
    a = recode()
    a.recode(RECORD_SECONDS=30, WAVE_OUTPUT_FILENAME='record_pianai.wav')
What form does a song take once we have recorded it?

If we look at just one channel, it is a one-dimensional array, something like this.

If we plot it against the array index, we get the waveform view of audio that we often see.
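As a concrete illustration, here is a minimal sketch (assuming a 16-bit WAV file named record_pianai.wav recorded as above, and that matplotlib is available) that loads one channel and plots it against the sample index:

# coding=utf8
# minimal sketch: load one channel of a 16-bit WAV and plot it against the sample index
import wave
import numpy as np
import matplotlib.pyplot as plt

f = wave.open('record_pianai.wav', 'rb')
nchannels, sampwidth, framerate, nframes = f.getparams()[:4]
data = np.fromstring(f.readframes(nframes), dtype=np.short)
f.close()

channel0 = data[0::nchannels]  # samples are interleaved, so keep only the first channel
plt.plot(channel0)             # the x axis is simply the array index
plt.xlabel('sample index')
plt.ylabel('amplitude')
plt.show()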
-----
Audio Processing Section
-----
Now we come to the core of our code: how to recognize a song. Think about how we humans tell songs apart. Do we memorize a one-dimensional array like the one above? Do we go by the loudness of the song? Not at all.

We remember songs by the sequence of frequencies we hear, so if we want a program to recognize music, it has to work with the frequency sequence of the audio.
Let's review what the Fourier transform is. I spent my "Signals and Systems" lectures slacking off, so although I never wrote down the exact form of the transform, I kept some intuitive understanding of it.

The essence of the Fourier transform is to convert a time-domain signal into a frequency-domain signal. That is, where the x and y axes used to be our array index and array element, they now become frequency (this phrasing is imprecise, but it is true enough here) and the magnitude of the component at that frequency.
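For reference, the transform we will later compute with numpy's np.fft.fft is the discrete Fourier transform; for a block of $N$ samples $x[n]$ it is

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-2\pi i k n / N}, \qquad k = 0, 1, \ldots, N-1,$$

and $|X[k]|$ is the magnitude of the component at frequency bin $k$.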
How should we understand the frequency domain? For those of us without much background in signal processing, the most important thing is to change our picture of what audio is made of. We used to think of audio only as a waveform like the one we started with: an amplitude at every instant, with different amplitude sequences making up different sounds. Now we instead think of a sound as a mixture of signals of different frequencies, each of them a pure single-frequency signal, each contributing according to its projected component.

Let's see what a song looks like after it is transformed into the frequency domain.
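Continuing the plotting sketch above (channel0 is the channel loaded earlier), the frequency-domain view can be produced like this:

# frequency-domain view of the channel loaded in the earlier sketch
spectrum = np.abs(np.fft.fft(channel0))  # magnitude of each frequency component
plt.plot(spectrum[:len(spectrum) // 2])  # for real signals the second half mirrors the first
plt.xlabel('frequency bin')
plt.ylabel('magnitude')
plt.show()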
We can observe that these frequency components are far from uniform; the differences between them are very large. To some extent we can say that the peaks in the figure are the frequency signals carrying the most energy, which means those signals occupy a prominent position in this audio. So we choose these signals to extract the song's features.

But don't forget, what we talked about earlier is a frequency "sequence". A single Fourier transform over the whole song only tells us the frequency content of the song as a whole; we lose the relationship with time, and our "sequence" is out of the question. So we take a compromise approach: we divide the audio into small blocks of time. Here I split off 40 blocks per second.

Here is a question to think about: why use small blocks rather than larger ones, say one block per second?

We do a Fourier transform on each block and take its modulus, obtaining an array of magnitudes. In the index intervals (0, 40), (40, 80), (80, 120) and (120, 180) we take the index of the largest modulus in each interval, and combine the four indices into a 4-tuple. This 4-tuple is the core of our audio "fingerprint".
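As a sketch of this step for a single block (assuming block holds one block's worth of samples as a numpy array, with the numpy import from the earlier sketches):

# fingerprint of one block: strongest bin in each of four index intervals
fft_block = np.abs(np.fft.fft(block))                # modulus of the block's Fourier transform
fingerprint = (np.argmax(fft_block[:40]),            # strongest index in (0, 40)
               np.argmax(fft_block[40:80]) + 40,     # strongest index in (40, 80)
               np.argmax(fft_block[80:120]) + 80,    # strongest index in (80, 120)
               np.argmax(fft_block[120:180]) + 120)  # strongest index in (120, 180)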
The "fingerprints" we extracted were similar to the following.
(39, 65, 110, 131), (15, 66, 108, 161), (3, 63, 118, 146), (11, 62, 82, 158), (15, 41, 95, 140), (2, 71, 106, 143), (15, 44, 80, 133), (36, 43, 80, 135), (22, 58, 80, 120), (29, 52, 89, 126), (15, 59, 89, 126), (37, 59, 89, 126), (37, 59, 89, 126), (37, 67, 119, 126)
The audio processing class has three methods: loading data, the Fourier transform, and playing music.
As follows:
# coding=utf8
import os
import re
import wave
import numpy as np
import pyaudio


class voice():
    def loaddata(self, filepath):
        '''
        :param filepath: path to a WAV file
        :return: True if no exception occurs; on exception, print a message and return False.
        self.wave_data stores the multi-channel audio data, where self.wave_data[0]
        is the first channel; see self.nchannels for the number of channels.
        '''
        if type(filepath) != str:
            print 'the type of filepath must be string'
            return False
        p1 = re.compile('\.wav')
        if not p1.findall(filepath):
            print 'the suffix of the file must be .wav'
            return False
        try:
            f = wave.open(filepath, 'rb')
            params = f.getparams()
            self.nchannels, self.sampwidth, self.framerate, self.nframes = params[:4]
            str_data = f.readframes(self.nframes)
            self.wave_data = np.fromstring(str_data, dtype=np.short)
            self.wave_data.shape = -1, self.sampwidth
            self.wave_data = self.wave_data.T  # transpose so each row is one channel
            f.close()
            self.name = os.path.basename(filepath)  # remember the file name
            return True
        except:
            print 'file error!'
            return False

    def fft(self, frames=40):
        '''
        :param frames: number of blocks per second
        :return:
        '''
        block = []
        fft_blocks = []
        self.high_point = []
        blocks_size = self.framerate / frames  # blocks_size: number of frames per block
        blocks_num = self.nframes / blocks_size  # number of blocks in the audio
        for i in xrange(0, len(self.wave_data[0]) - blocks_size, blocks_size):
            block.append(self.wave_data[0][i:i + blocks_size])
            fft_blocks.append(np.abs(np.fft.fft(self.wave_data[0][i:i + blocks_size])))
            # the key fingerprint-extraction step; the last band is commented out
            # but kept here. Can you think of why it was removed?
            self.high_point.append((np.argmax(fft_blocks[-1][:40]),
                                    np.argmax(fft_blocks[-1][40:80]) + 40,
                                    np.argmax(fft_blocks[-1][80:120]) + 80,
                                    np.argmax(fft_blocks[-1][120:180]) + 120,
                                    # np.argmax(fft_blocks[-1][180:300]) + 180,
                                    ))

    def play(self, filepath):
        '''
        Play an audio file.
        :param filepath: file path
        :return:
        '''
        chunk = 1024
        wf = wave.open(filepath, 'rb')
        p = pyaudio.PyAudio()
        # open the sound output stream
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(),
                        rate=wf.getframerate(),
                        output=True)
        # write to the output stream to play the sound
        while True:
            data = wf.readframes(chunk)
            if data == "":
                break
            stream.write(data)
        stream.close()
        p.terminate()

if __name__ == '__main__':
    p = voice()
    p.loaddata('record_beiyiwang.wav')
    p.fft()
This self.high_point is the core data for everything that follows. It is a list whose elements are fingerprints in the form explained above.
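A short usage sketch to inspect it (assuming record_beiyiwang.wav exists):

# inspect the first few fingerprints of a recording
p = voice()
p.loaddata('record_beiyiwang.wav')
p.fft()
print p.high_point[:5]  # the first five 4-tuple fingerprints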
-----
Data Storage and Retrieval Section
-----
Because we fingerprint the music library ahead of time to wait for retrieval, we need a corresponding persistence method. I used a MySQL database directly to store the fingerprints corresponding to our songs, which has one benefit: it saves time writing code.

We save the fingerprints and songs in the following form:
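A schema consistent with the queries used below would look roughly like this. The column names here are my assumptions; only the table name fingerprint.musicdata and its two-column layout are implied by the code:

# a guess at the table layout implied by the queries below
import MySQLdb

conn = MySQLdb.connect(host='localhost', port=3306, user='root',
                       passwd='root', charset='utf8')
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS fingerprint")
cur.execute("""CREATE TABLE IF NOT EXISTS fingerprint.musicdata (
                   song_name VARCHAR(255),  -- file name of the song
                   high_point TEXT          -- str() of the fingerprint list
               )""")
conn.commit()
cur.close()
conn.close()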
By the way: why are the first few fingerprints of every song the same? (Further in, of course, they differ a great deal.) Before the music proper begins there is a stretch with no strong energy at any frequency, and because our 44100 Hz sample rate is high, that stretch yields many repeated fingerprints. Don't worry about it.

How do we match? We could simply count how many fingerprints two pieces of audio share, but that throws away the sequence we talked about earlier; we have to use the time sequence. Otherwise, the longer a song is, the more easily it would match, and some sprawling weed of a song would occupy first place in the results of every search. Theoretically, the information in audio is embodied in its sequence, just as a sentence expresses its meaning through phrases and words arranged in a particular order; simply counting the overlapping words of two sentences tells you nothing about whether they are similar. We use the following algorithm, but it is only experimental code; the algorithm design is very simple and inefficient. Students who want better results could try an improved DTW (dynamic time warping) algorithm.

During matching we slide the query fingerprint sequence along the source sequence, comparing the pattern with the corresponding substring of the source at each offset. Whenever the fingerprints at corresponding positions are equal, the similarity score for that offset increases by one. We take the maximum score over all offsets as the similarity of the two songs (see the sketch after the example below).
Example:
The fingerprint sequence of a song in the music library: [fp13, fp20, fp10, fp29, fp14, fp25, fp13, fp13, fp20, fp33, fp14]

The fingerprint sequence of the query recording: [fp14, fp25, fp13, fp17]

The comparison process: slide the query along the library sequence. The best alignment begins at the fifth library fingerprint, where fp14, fp25 and fp13 all line up and only fp17 differs.

The final match has a similarity value of 3.
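A standalone sketch of this sliding comparison, with the fingerprints abbreviated to strings for illustration:

# slide the query over the library sequence, counting positional matches at each offset
def fp_compare(search_fp, match_fp):
    max_similar = 0
    for offset in range(len(match_fp) - len(search_fp) + 1):
        score = sum(1 for j in range(len(search_fp))
                    if match_fp[offset + j] == search_fp[j])
        max_similar = max(max_similar, score)
    return max_similar

library = ['fp13', 'fp20', 'fp10', 'fp29', 'fp14', 'fp25', 'fp13',
           'fp13', 'fp20', 'fp33', 'fp14']
query = ['fp14', 'fp25', 'fp13', 'fp17']
print fp_compare(query, library)  # prints 3: at offset 4, three positions agree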
Implementation code for the storage and retrieval part:
# coding=utf-8
import os
import MySQLdb
import my_audio


class memory():
    def __init__(self, host, port, user, passwd, db):
        '''
        Initialize the storage class
        :param host: host address
        :param port: port
        :param user: user name
        :param passwd: password
        :param db: database name
        '''
        self.host = host
        self.port = port
        self.user = user
        self.passwd = passwd
        self.db = db

    def addsong(self, path):
        '''
        Add a song: extract the fingerprint of the song at the given path and put it in the database
        :param path: song path
        :return:
        '''
        if type(path) != str:
            print 'path need string'
            return None
        basename = os.path.basename(path)
        try:
            # create a connection to the database
            conn = MySQLdb.connect(host=self.host, port=self.port, user=self.user,
                                   passwd=self.passwd, db=self.db, charset='utf8')
        except:
            print 'DB error'
            return None
        cur = conn.cursor()
        # check whether the newly added song is already in the library
        namecount = cur.execute("select * from fingerprint.musicdata WHERE song_name = '%s'" % basename)
        if namecount > 0:
            print 'the song has been recorded!'
            return None
        v = my_audio.voice()
        v.loaddata(path)
        v.fft()
        # add the new song's name and fingerprint to the database
        cur.execute("INSERT INTO fingerprint.musicdata VALUES ('%s', '%s')" % (basename, v.high_point.__str__()))
        conn.commit()
        cur.close()
        conn.close()

    def fp_compare(self, search_fp, match_fp):
        '''
        Fingerprint comparison
        :param search_fp: query fingerprint
        :param match_fp: library fingerprint
        :return: maximum similarity value
        '''
        if len(search_fp) > len(match_fp):
            return 0
        max_similar = 0
        search_fp_len = len(search_fp)
        match_fp_len = len(match_fp)
        for i in range(match_fp_len - search_fp_len + 1):
            temp = 0
            for j in range(search_fp_len):
                if match_fp[i + j] == search_fp[j]:
                    temp += 1
            if temp > max_similar:
                max_similar = temp
        return max_similar

    def search(self, path):
        '''
        Search the database
        :param path: path of the audio to search for
        :return: a list of 2-tuples; the first item is the match similarity, the second the song name
        '''
        v = my_audio.voice()
        v.loaddata(path)
        v.fft()
        try:
            conn = MySQLdb.connect(host=self.host, port=self.port, user=self.user,
                                   passwd=self.passwd, db=self.db, charset='utf8')
        except:
            print 'DataBase error'
            return None
        cur = conn.cursor()
        cur.execute("SELECT * FROM fingerprint.musicdata")
        result = cur.fetchall()
        compare_res = []
        for i in result:
            compare_res.append((self.fp_compare(v.high_point[:-1], eval(i[1])), i[0]))
        compare_res.sort(reverse=True)
        cur.close()
        conn.close()
        print compare_res
        return compare_res

    def search_and_play(self, path):
        '''
        Same as the previous method, but also plays the best search result directly
        :param path: path of the song to search for
        :return:
        '''
        v = my_audio.voice()
        v.loaddata(path)
        v.fft()
        try:
            conn = MySQLdb.connect(host=self.host, port=self.port, user=self.user,
                                   passwd=self.passwd, db=self.db, charset='utf8')
        except:
            print 'DataBase error'
            return None
        cur = conn.cursor()
        cur.execute("SELECT * FROM fingerprint.musicdata")
        result = cur.fetchall()
        compare_res = []
        for i in result:
            compare_res.append((self.fp_compare(v.high_point[:-1], eval(i[1])), i[0]))
        compare_res.sort(reverse=True)
        cur.close()
        conn.close()
        print compare_res
        v.play(compare_res[0][1])
        return compare_res

if __name__ == '__main__':
    sss = memory('localhost', 3306, 'root', 'root', 'fingerprint')
    sss.addsong('taiyangzhaochangshengqi.wav')
    sss.addsong('beiyiwangdeshiguang.wav')
    sss.addsong('xiaozezhenger.wav')
    sss.addsong('nverqing.wav')
    sss.addsong('the_mess.wav')
    sss.addsong('windmill.wav')
    sss.addsong('end_of_world.wav')
    sss.addsong('pianai.wav')
    sss.search_and_play('record_beiyiwang.wav')
-----
Summary
-----
Our experiment is rough in many places. The core algorithm, the "fingerprint" idea, is drawn from the algorithm proposed by Shazam. I hope readers can offer valuable suggestions.

This article is reproduced from: http://www.cnblogs.com/chuxiuhong/p/6063602.html