Listen to songs and recognize them: using Python to implement a music recognizer
Song recognition, as the name suggests, means using a device to "listen" to a song and tell you which song it is, and it can even play that song back for you. This feature has long been available in QQ Music and other applications. Today we are going to build our own song recognizer.
The overall flow we designed is very simple: record audio, extract fingerprints and store them in a database, then match a new recording against the library.
-----
Recording part
-----
To "listen", we must first start the recording process. In our experiment, our library also uses our recording code for recording, and then extracts features and stores them in the database. We use the following ideas for recording
# coding=utf8
import wave
import pyaudio

class recode():
    def recode(self, CHUNK=44100, FORMAT=pyaudio.paInt16, CHANNELS=2, RATE=44100,
               RECORD_SECONDS=200, WAVE_OUTPUT_FILENAME="record.wav"):
        '''
        :param CHUNK: buffer size
        :param FORMAT: sample format
        :param CHANNELS: number of channels
        :param RATE: sampling rate
        :param RECORD_SECONDS: recording time in seconds
        :param WAVE_OUTPUT_FILENAME: output file path
        :return:
        '''
        p = pyaudio.PyAudio()
        stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                        input=True, frames_per_buffer=CHUNK)
        frames = []
        for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
            data = stream.read(CHUNK)
            frames.append(data)
        stream.stop_stream()
        stream.close()
        p.terminate()
        wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(''.join(frames))
        wf.close()

if __name__ == '__main__':
    a = recode().recode(RECORD_SECONDS=30, WAVE_OUTPUT_FILENAME='record_pianai.wav')
What does the recorded song look like?
If you look at just one channel, it is a one-dimensional array, roughly like this:
If we plot it with the array index on the horizontal axis, we get the familiar waveform view of audio.
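As a quick illustration, the following sketch pulls one channel of the recording out as a one-dimensional NumPy array; it assumes the 16-bit stereo file record_pianai.wav produced by the recording code above:

# Sketch: view one channel of the recording as a one-dimensional array.
# Assumes a 16-bit stereo wav such as the one produced by recode() above.
import wave
import numpy as np

f = wave.open('record_pianai.wav', 'rb')
nchannels, sampwidth, framerate, nframes = f.getparams()[:4]
raw = f.readframes(nframes)
f.close()

samples = np.frombuffer(raw, dtype=np.short)   # interleaved samples of all channels
samples = samples.reshape(-1, nchannels).T     # one row per channel
print(samples[0][:10])                         # first channel, first ten amplitudes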
Audio Processing
Here we write the core code, the key question of "how to recognize a song". Think about how we humans tell songs apart. Do we rely on a one-dimensional array like the one above? Do we rely on how loud a song is? Neither.
We remember a song by the distinctive sequence of frequencies our ears pick up. So if we want to write a program that recognizes songs, we have to work with the sequence of frequencies in the audio.
Let's review what the Fourier transform is. Although the author rather slacked off in the "Signals and Systems" course and cannot write down the exact form of the transform, an intuitive understanding remains.
The essence of the Fourier transform is to convert a time-domain signal into a frequency-domain signal. In other words, where the x and y axes used to be the array index and the array element, they now become the frequency (strictly speaking this is imprecise, but here it is close enough) and the magnitude of the component at that frequency.
How should we understand the frequency domain? For readers unfamiliar with signal processing, the most important thing is to change how you think about the structure of audio. We used to think of audio as being exactly like the waveform shown at the beginning: an amplitude at each moment in time, with different sequences of amplitudes making up different sounds. Now we think of a sound as a mixture of signals at different frequencies, each of which is present from start to finish and contributes according to the size of its component (its projection onto that frequency).
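To make this concrete, here is a small illustrative sketch (the two frequencies are arbitrary choices): it builds one second of signal out of a 440 Hz and an 880 Hz sine wave and uses NumPy's FFT to recover exactly those two frequencies from the mixture.

# Sketch: the FFT turns a time-domain array into frequency components.
import numpy as np

rate = 44100                                    # samples per second
t = np.arange(rate) / float(rate)               # one second of time stamps
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.abs(np.fft.fft(signal))           # magnitude of each frequency component
peaks = np.argsort(spectrum[:rate // 2])[-2:]   # indices of the two strongest components
print(sorted(peaks.tolist()))                   # [440, 880]; with 1 s of data, bin index equals Hz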
Let's take a look at what a song looks like after it is converted to the frequency domain.
We can see that the frequency components are far from uniform; the differences between them are large. To some extent, we can regard the prominent peaks in the figure as frequency components with high output energy, which stand out in the audio. So we choose these signals to extract the song's features.
But don't forget that what we talked about earlier is a frequency sequence. A Fourier transform of the whole song only gives us the frequency content of the entire piece; the time relationship is lost, and there is no "sequence" to speak of. So we take a compromise and split the audio into small blocks along the time axis. Here I split it into 40 blocks per second.
Here is a question to think about: why use such small blocks rather than, say, one block per second?
We apply the Fourier transform to each block and take the modulus, which gives an array of magnitudes. From the index ranges (0, 40), (40, 80), (80, 120), and (120, 180) we take the index of the largest magnitude in each range and combine these four indices into a quadruple. That quadruple is our core "audio fingerprint".
The extracted "fingerprint" is similar to the following:
(39, 65, 110, 131), (15, 66, 108, 161), (3, 63, 118, 146), (11, 62, 82, 158), (15, 41, 95, 140), (2, 71, 106, 143), (15, 44, 80, 133), (36, 43, 80, 135), (22, 58, 80, 120), (29, 52, 89, 126), (15, 59, 89, 126), (37, 59, 89, 126), (37, 59, 89, 126), (37, 67, 119, 126)
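As a standalone illustration of how one such quadruple comes about, the following sketch runs the band-wise argmax on a single made-up block of samples; the block is synthetic noise plus one strong component, chosen only so that one band has an obvious peak.

# Sketch: extract one fingerprint quadruple from a single block of samples.
import numpy as np

np.random.seed(0)
n = 1102                                            # about 44100 / 40 samples per block
block = np.random.randn(n)                          # background noise
block += 10 * np.cos(2 * np.pi * 100 * np.arange(n) / float(n))  # strong component in bin 100

spectrum = np.abs(np.fft.fft(block))
fingerprint = (np.argmax(spectrum[0:40]),           # strongest bin in (0, 40)
               np.argmax(spectrum[40:80]) + 40,     # strongest bin in (40, 80)
               np.argmax(spectrum[80:120]) + 80,    # strongest bin in (80, 120)
               np.argmax(spectrum[120:180]) + 120)  # strongest bin in (120, 180)
print(fingerprint)                                  # the third value is 100, the component we injected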
The audio-processing class has three methods: loading data, the Fourier transform, and playing music.
As follows:
# coding=utf8
import os
import re
import wave
import numpy as np
import pyaudio

class voice():
    def loaddata(self, filepath):
        '''
        :param filepath: file path of a wav file
        :return: True if no exception occurs; on an exception a message is printed and False is returned.
                 self.wave_data stores the audio data of every channel; self.wave_data[0] is the data
                 of the first channel. The number of channels is in self.nchannels.
        '''
        if type(filepath) != str:
            print 'the type of filepath must be string'
            return False
        p1 = re.compile(r'\.wav')
        if not p1.findall(filepath):
            print 'the suffix of file must be .wav'
            return False
        try:
            f = wave.open(filepath, 'rb')
            params = f.getparams()
            self.nchannels, self.sampwidth, self.framerate, self.nframes = params[:4]
            str_data = f.readframes(self.nframes)
            self.wave_data = np.fromstring(str_data, dtype=np.short)
            # for 16-bit stereo, sampwidth == 2 == number of channels, so this reshape separates the channels
            self.wave_data.shape = -1, self.sampwidth
            self.wave_data = self.wave_data.T
            f.close()
            self.name = os.path.basename(filepath)  # record the file name
            return True
        except:
            print 'file error!'

    def fft(self, frames=40):
        '''
        :param frames: number of blocks per second
        :return:
        '''
        block = []
        fft_blocks = []
        self.high_point = []
        blocks_size = self.framerate / frames     # blocks_size: number of frames per block
        blocks_num = self.nframes / blocks_size   # number of blocks in the audio
        for i in xrange(0, len(self.wave_data[0]) - blocks_size, blocks_size):
            block.append(self.wave_data[0][i:i + blocks_size])
            fft_blocks.append(np.abs(np.fft.fft(self.wave_data[0][i:i + blocks_size])))
            self.high_point.append((np.argmax(fft_blocks[-1][:40]),
                                    np.argmax(fft_blocks[-1][40:80]) + 40,
                                    np.argmax(fft_blocks[-1][80:120]) + 80,
                                    np.argmax(fft_blocks[-1][120:180]) + 120,
                                    # np.argmax(fft_blocks[-1][180:300]) + 180,
                                    ))
            # Key step of fingerprint extraction: the last band is not taken here
            # but is kept (commented out); can you think about why?

    def play(self, filepath):
        '''
        Play an audio file.
        :param filepath: file path
        :return:
        '''
        chunk = 1024
        wf = wave.open(filepath, 'rb')
        p = pyaudio.PyAudio()
        # open the audio output stream
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(),
                        rate=wf.getframerate(),
                        output=True)
        # write to the output stream to play
        while True:
            data = wf.readframes(chunk)
            if data == "":
                break
            stream.write(data)
        stream.close()
        p.terminate()

if __name__ == '__main__':
    p = voice()
    p.loaddata('record_beiyiwang.wav')
    p.fft()
self.high_point is the core data for what follows. It is a list whose elements are fingerprints in the format described above.
Data storage and retrieval
Because the database is prepared in advance and then searched against, we need a corresponding persistence method. I use a MySQL database to store the fingerprints of our songs, which has the advantage of saving coding time.
We store fingerprints and songs in the following format:
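Roughly speaking, each song takes one row holding its file name and the string form of its fingerprint list. Below is a sketch of what the table might look like, created through MySQLdb so the example stays in Python; only song_name appears in the queries later, so the second column's name and both column types are assumptions.

# Sketch of an assumed schema for the fingerprint.musicdata table used below.
import MySQLdb

conn = MySQLdb.connect(host='localhost', port=3306, user='root',
                       passwd='root', db='fingerprint', charset='utf8')
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS musicdata (
                   song_name   VARCHAR(255),
                   fingerprint LONGTEXT
               )""")
conn.commit()
cur.close()
conn.close()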
By the way: why do the fingerprints at the beginning of each song look the same? (Of course, they diverge a great deal later on.) The reason is that there is no strong energy point in the quiet stretch before the music actually starts, and since the sampling rate is as high as 44100, this produces many identical fingerprints at the beginning. No need to worry about it.
How do we match? A naive approach would be to directly count how many audio fingerprints two recordings have in common, but that loses the sequence we talked about earlier. We must use the time sequence; otherwise the longer a song is, the more likely it is to be matched, and such a song would, like a weed, occupy the top of the ranking for every query. Moreover, in theory the information in audio is carried by the sequence, just as a sentence expresses its meaning through phrases and words arranged in a particular order; merely counting how many words two sentences have in common says very little about whether the sentences are similar. We use the following algorithm, but it is only experimental code; the design is simple and inefficient. For better results, an improved DTW algorithm is recommended.
During matching we slide the query fingerprint sequence along the library sequence. At each offset we compare the query (the pattern string) with the corresponding substring of the library sequence (the source string): whenever the fingerprints at corresponding positions are equal, the similarity count for that offset is increased by one. The maximum count obtained while sliding is taken as the similarity between the two songs.
Example:
Fingerprint sequence of a piece in the library: [fp13, fp20, fp10, fp29, fp14, fp25, fp13, fp13, fp20, fp33, fp14]
Fingerprint sequence of the query audio: [fp14, fp25, fp13, fp17]
Comparison process:
The final matching similarity value is 3
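A small self-contained sketch of this sliding comparison, run on the example above (the fingerprints are written as plain strings here just to keep the demo readable):

# Sketch: slide the query fingerprints along the library fingerprints and
# count matches at each offset; the best offset gives the similarity value.
def fp_compare(search_fp, match_fp):
    if len(search_fp) > len(match_fp):
        return 0
    max_similar = 0
    for i in range(len(match_fp) - len(search_fp)):          # each alignment offset
        temp = sum(1 for j in range(len(search_fp))
                   if match_fp[i + j] == search_fp[j])       # matches at this offset
        max_similar = max(max_similar, temp)
    return max_similar

library = ['fp13', 'fp20', 'fp10', 'fp29', 'fp14', 'fp25',
           'fp13', 'fp13', 'fp20', 'fp33', 'fp14']
query = ['fp14', 'fp25', 'fp13', 'fp17']
print(fp_compare(query, library))                            # prints 3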
Code for the storage and retrieval part:
# coding=utf-8
import os
import MySQLdb
import my_audio

class memory():
    def __init__(self, host, port, user, passwd, db):
        '''
        Initialize the storage class.
        :param host: host address
        :param port: port
        :param user: user name
        :param passwd: password
        :param db: database name
        '''
        self.host = host
        self.port = port
        self.user = user
        self.passwd = passwd
        self.db = db

    def addsong(self, path):
        '''
        Add a song: extract the fingerprint of the song at the given path and store it in the database.
        :param path: song path
        :return:
        '''
        if type(path) != str:
            print 'path need string'
            return None
        basename = os.path.basename(path)
        try:
            conn = MySQLdb.connect(host=self.host, port=self.port, user=self.user,
                                   passwd=self.passwd, db=self.db, charset='utf8')
            # create a connection to the database
        except:
            print 'database error'
            return None
        cur = conn.cursor()
        namecount = cur.execute("select * from fingerprint.musicdata WHERE song_name = '%s'" % basename)
        # check whether the newly added song is already in the library
        if namecount > 0:
            print 'the song has been recorded!'
            return None
        v = my_audio.voice()
        v.loaddata(path)
        v.fft()
        cur.execute("insert into fingerprint.musicdata VALUES ('%s', '%s')" % (basename, v.high_point.__str__()))
        # store the name and fingerprint of the new song in the database
        conn.commit()
        cur.close()
        conn.close()

    def fp_compare(self, search_fp, match_fp):
        '''
        Fingerprint comparison.
        :param search_fp: query fingerprint
        :param match_fp: fingerprint in the database
        :return: maximum similarity value
        '''
        if len(search_fp) > len(match_fp):
            return 0
        max_similar = 0
        search_fp_len = len(search_fp)
        match_fp_len = len(match_fp)
        for i in range(match_fp_len - search_fp_len):
            temp = 0
            for j in range(search_fp_len):
                if match_fp[i + j] == search_fp[j]:
                    temp += 1
            if temp > max_similar:
                max_similar = temp
        return max_similar

    def search(self, path):
        '''
        Retrieve from the database.
        :param path: path of the audio to be retrieved
        :return: a list of pairs; the first item of each pair is the similarity value,
                 the second is the song name
        '''
        v = my_audio.voice()
        v.loaddata(path)
        v.fft()
        try:
            conn = MySQLdb.connect(host=self.host, port=self.port, user=self.user,
                                   passwd=self.passwd, db=self.db, charset='utf8')
        except:
            print 'database error'
            return None
        cur = conn.cursor()
        cur.execute("SELECT * FROM fingerprint.musicdata")
        result = cur.fetchall()
        compare_res = []
        for i in result:
            compare_res.append((self.fp_compare(v.high_point[:-1], eval(i[1])), i[0]))
        compare_res.sort(reverse=True)
        cur.close()
        conn.close()
        print compare_res
        return compare_res

    def search_and_play(self, path):
        '''
        Same as search(), but also plays the best match directly.
        :param path: path of the audio to be retrieved
        :return:
        '''
        v = my_audio.voice()
        v.loaddata(path)
        v.fft()
        try:
            conn = MySQLdb.connect(host=self.host, port=self.port, user=self.user,
                                   passwd=self.passwd, db=self.db, charset='utf8')
        except:
            print 'database error'
            return None
        cur = conn.cursor()
        cur.execute("SELECT * FROM fingerprint.musicdata")
        result = cur.fetchall()
        compare_res = []
        for i in result:
            compare_res.append((self.fp_compare(v.high_point[:-1], eval(i[1])), i[0]))
        compare_res.sort(reverse=True)
        cur.close()
        conn.close()
        print compare_res
        v.play(compare_res[0][1])
        return compare_res

if __name__ == '__main__':
    sss = memory('localhost', 3306, 'root', 'root', 'fingerprint')
    sss.addsong('the_mess.wav')
    sss.addsong('windmill.wav')
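For retrieval, a minimal usage sketch might look like the following; the query file name record_query.wav is hypothetical, while the connection settings are the ones used in the main block above.

# Hypothetical retrieval example; 'record_query.wav' stands for a short query recording.
sss = memory('localhost', 3306, 'root', 'root', 'fingerprint')
results = sss.search('record_query.wav')          # list of (similarity, song_name) pairs
if results:
    print(results[0])                             # best match
# sss.search_and_play('record_query.wav')         # or play the best match directly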
Summary
Our experiment is rough in many places. The core algorithmic idea, the "fingerprint", is derived from the algorithm proposed by Shazam. I hope readers can offer valuable suggestions.
This article is reproduced from: http://www.cnblogs.com/chuxiuhong/p/6063602.html