A simple tutorial on visualizing a music collection using the wave module in the Python standard library
In this article, we will explore a simple way to visualize a collection of MP3 music. The end result will be a hexagonal grid map of all your songs, where similar-sounding tracks sit next to each other. Differently colored regions correspond to different genres of music (such as classical, hip hop, and heavy rock). As an example, below is a map of three albums from my music collection: Paganini's Violin Caprices, Eminem's The Eminem Show, and Coldplay's X&Y.
To make things more interesting (and in some cases simpler), I imposed a few restrictions. First, the solution should not rely on any existing ID3 tags (Artist, Genre, etc.) in the MP3 files; it should use only the statistical properties of the sound to compute song similarity. Many of my MP3 files are badly tagged anyway, and I wanted the solution to work on any music collection, no matter how poor its metadata. Second, no other external information should be used to create the visualization; the only input is the set of MP3 files themselves. In practice, a large database of songs already labeled with genres could improve the results, but for simplicity I wanted this solution to be completely self-contained. Finally, although digital music comes in many formats (MP3, WMA, M4A, OGG, etc.), I only looked at MP3 files to keep things simple. The algorithm developed in this article should work just as well for audio in any other format, as long as it can be converted to a WAV file.
Creating a music map is an interesting exercise that touches on audio processing, machine learning, and visualization techniques. The basic steps are as follows:
Convert each MP3 file to a low-bitrate WAV file.
Extract statistical features from the raw WAV data.
Find an optimal subset of these features, so that songs that are adjacent in this feature space also sound similar.
Use dimensionality reduction to map the feature vectors down to two dimensions so they can be plotted on an XY plane.
Generate a hexagonal grid of points, then use a nearest-neighbor technique to map each song from the XY plane onto a point of the hexagonal grid.
Going back to the original high-dimensional feature space, cluster the songs into a user-defined number of groups (k = 10 gives good visualizations). For each group, find the song closest to the group center.
On the hexagonal grid, color the k group-center songs with k different colors.
Interpolate colors for all the other songs based on their distances on the XY plane to each group center.
Next, let's look at some of these steps in more detail.
Convert MP3 files to WAV format
The main advantage of converting our music files to WAV format is that we can use the "wave" module from the Python standard library to read the data easily and then manipulate it with NumPy. We also downsample the audio to a single channel at a 10 kHz sampling rate, to reduce the computational cost of extracting the statistical features. To handle both the conversion and the downsampling I used the well-known MPG123, a free command-line MP3 player that can easily be called from Python. The code below recursively searches a music folder for all MP3 files and calls MPG123 to convert each one to a temporary 10 kHz WAV file. The feature computation (discussed in the next section) is then run on these WAV files.
import os
import subprocess
import wave
import struct
import numpy
import csv
import sys

def read_wav(wav_file):
    """Returns two chunks of sound data from wave file."""
    w = wave.open(wav_file)
    n = 60 * 10000
    if w.getnframes() < n * 2:
        raise ValueError('Wave file too short')
    frames = w.readframes(n)
    wav_data1 = struct.unpack('%dh' % n, frames)
    frames = w.readframes(n)
    wav_data2 = struct.unpack('%dh' % n, frames)
    return wav_data1, wav_data2

def compute_chunk_features(mp3_file):
    """Return feature vectors for two chunks of an MP3 file."""
    # Extract MP3 file to a mono, 10kHz WAV file
    mpg123_command = '..\\mpg123-1.12.3-x86-64\\mpg123.exe -w "%s" -r 10000 -m "%s"'
    out_file = 'temp.wav'
    cmd = mpg123_command % (out_file, mp3_file)
    temp = subprocess.call(cmd)
    # Read in chunks of data from WAV file
    wav_data1, wav_data2 = read_wav(out_file)
    # We'll cover how the features are computed in the next section!
    return features(wav_data1), features(wav_data2)

# Main script starts here
# =======================

for path, dirs, files in os.walk('C:/Users/Christian/Music/'):
    for f in files:
        if not f.endswith('.mp3'):
            # Skip any non-MP3 files
            continue
        mp3_file = os.path.join(path, f)
        # Extract the track name (i.e. the file name) plus the names
        # of the two preceding directories. This will be useful
        # later for plotting.
        tail, track = os.path.split(mp3_file)
        tail, dir1 = os.path.split(tail)
        tail, dir2 = os.path.split(tail)
        # Compute features. feature_vec1 and feature_vec2 are lists of floating
        # point numbers representing the statistical features we have extracted
        # from the raw sound data.
        try:
            feature_vec1, feature_vec2 = compute_chunk_features(mp3_file)
        except:
            continue
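The script above imports the csv module but stops right after computing feature_vec1 and feature_vec2. One plausible way to finish it is to write one CSV row per one-minute clip, so that the later steps (feature selection, dimensionality reduction, clustering) can work from a single table. This is only a sketch of mine, not the original code; the file name features.csv and the column layout are my own choices:

# Hypothetical continuation: write two rows per song (one per one-minute clip)
# to a CSV file for the later analysis steps.
with open('features.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for path, dirs, files in os.walk('C:/Users/Christian/Music/'):
        for f in files:
            if not f.endswith('.mp3'):
                continue
            mp3_file = os.path.join(path, f)
            tail, track = os.path.split(mp3_file)
            tail, dir1 = os.path.split(tail)
            tail, dir2 = os.path.split(tail)
            try:
                feature_vec1, feature_vec2 = compute_chunk_features(mp3_file)
            except Exception:
                continue
            # Keep the artist/album directories and track name so the points
            # can be labeled on the final map.
            writer.writerow([dir2, dir1, track, 1] + list(feature_vec1))
            writer.writerow([dir2, dir1, track, 2] + list(feature_vec2))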
Feature Extraction
In Python, a mono 10 kHz WAV file is represented as a list of signed integers (the code above unpacks them as 16-bit samples), with 10,000 values for each second of sound. Each integer represents the relative amplitude of the song at that point in time. We extract two 60-second clips from each song, so each clip is represented by 600,000 integers. The read_wav function in the code above returns these lists of integers. Below is a plot of 10 seconds of waveform from some of the songs on The Eminem Show:
For comparison, here is a section of waveform from one of Paganini's Violin Caprices:
From the two figures above, we can see that the waveform structure of these clips is quite different, but in general the Eminem waveforms all look somewhat alike, as do those of the Violin Caprices. Next, we will extract some statistical features from these waveforms that capture those differences, and then use machine learning techniques to group the songs by how similar they sound.
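The original figures are not reproduced here, but waveform plots like these are easy to regenerate. Below is a minimal sketch using matplotlib (my choice of plotting library, not necessarily what the original figures were made with), assuming the read_wav helper from the script above and a temp.wav left over from the conversion step:

# Plot the first 10 seconds (100,000 samples at 10 kHz) of a converted clip.
import matplotlib.pyplot as plt
import numpy

wav_data1, wav_data2 = read_wav('temp.wav')   # clips from the last converted song
samples = numpy.array(wav_data1[:10 * 10000])
t = numpy.arange(samples.size) / 10000.0      # time axis in seconds

plt.figure(figsize=(10, 3))
plt.plot(t, samples, linewidth=0.5)
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('First 10 seconds of a 10 kHz mono clip')
plt.tight_layout()
plt.show()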
The first set of features we extract are the statistical moments of the waveform (mean, standard deviation, skewness, and kurtosis). In addition to computing these on the raw amplitudes, we also compute them on increasingly smoothed versions of the amplitudes, to capture properties of the music at different time scales. I used smoothing windows of 1, 10, 100, and 1000 samples; other values would probably also produce good results.
The amplitudes were smoothed with windows of each of the sizes above. To capture the short-term variation of the signal, I also computed the same statistics on the first-order differences of the (smoothed) amplitudes.
These features give a fairly complete statistical summary of the waveform in the time domain, but it is also useful to compute some frequency-domain features. Bass-heavy music such as hip hop has more energy in the low-frequency bands, while classical music has a larger share of its power at higher frequencies.
Combining all of these gives 42 different features for each song. The following Python code computes them from a list of amplitude values:
def moments(x):
    mean = x.mean()
    std = x.var()**0.5
    skewness = ((x - mean)**3).mean() / std**3
    kurtosis = ((x - mean)**4).mean() / std**4
    return [mean, std, skewness, kurtosis]

def fftfeatures(wavdata):
    f = numpy.fft.fft(wavdata)
    f = f[2:(f.size // 2 + 1)]
    f = abs(f)
    total_power = f.sum()
    f = numpy.array_split(f, 10)
    return [e.sum() / total_power for e in f]

def features(x):
    x = numpy.array(x)
    f = []

    xs = x
    diff = xs[1:] - xs[:-1]
    f.extend(moments(xs))
    f.extend(moments(diff))

    xs = x.reshape(-1, 10).mean(1)
    diff = xs[1:] - xs[:-1]
    f.extend(moments(xs))
    f.extend(moments(diff))

    xs = x.reshape(-1, 100).mean(1)
    diff = xs[1:] - xs[:-1]
    f.extend(moments(xs))
    f.extend(moments(diff))

    xs = x.reshape(-1, 1000).mean(1)
    diff = xs[1:] - xs[:-1]
    f.extend(moments(xs))
    f.extend(moments(diff))

    f.extend(fftfeatures(x))
    return f

# f will be a list of 42 floating point features with the following
# names:
#
# amp1mean
# amp1std
# amp1skew
# amp1kurt
# amp1dmean
# amp1dstd
# amp1dskew
# amp1dkurt
# amp10mean
# amp10std
# amp10skew
# amp10kurt
# amp10dmean
# amp10dstd
# amp10dskew
# amp10dkurt
# amp100mean
# amp100std
# amp100skew
# amp100kurt
# amp100dmean
# amp100dstd
# amp100dskew
# amp100dkurt
# amp1000mean
# amp1000std
# amp1000skew
# amp1000kurt
# amp1000dmean
# amp1000dstd
# amp1000dskew
# amp1000dkurt
# power1
# power2
# power3
# power4
# power5
# power6
# power7
# power8
# power9
# power10
Select an optimal feature subset
We have computed 42 different features, but not all of them help determine whether two songs sound alike. The next step is to find an optimal subset of these features so that the Euclidean distance between two feature vectors in the reduced feature space corresponds well to how similar the two songs sound.
Variable selection like this is a supervised machine learning problem, so we need some training data that can guide the algorithm toward the best subset of variables. Rather than building a training set by hand, listening through a music collection and marking which songs sound alike, I used a simpler trick: take the two one-minute samples extracted from each song and look for the feature subset under which the two clips from the same song match each other best.
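The objective being optimized is easy to sketch in Python, even though the actual selection (described next) was run with a genetic algorithm in R. Assuming the feature vectors are stacked into a matrix with two consecutive rows per song (one per one-minute clip), the score below counts how many clips fail to pick their sibling clip as nearest neighbor under a candidate feature mask; the function and variable names here are mine, not the original code:

import numpy

def matching_errors(features, mask):
    """Count clips whose nearest neighbor (Euclidean distance, using only
    the features selected by the boolean mask) is NOT the other clip from
    the same song. `features` has two consecutive rows per song:
    rows 2*i and 2*i + 1 are the two one-minute clips of song i."""
    x = features[:, mask]
    errors = 0
    for i in range(x.shape[0]):
        d = numpy.sqrt(((x - x[i]) ** 2).sum(axis=1))
        d[i] = numpy.inf                      # ignore the clip itself
        nearest = int(d.argmin())
        sibling = i + 1 if i % 2 == 0 else i - 1
        if nearest != sibling:
            errors += 1
    return errors

Any optimizer over the 42-element binary mask can then try to minimize this mismatch count.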
To find the feature set that achieves the best matching on average across all songs, I used a genetic algorithm (the genalg package in R) to switch each of the 42 variables on or off. The figure below shows the improvement in the objective function (i.e., how consistently a song's two sample clips were matched to each other by a nearest-neighbor classifier) over 100 generations of the genetic algorithm:
If we force the matching to use all 42 features, the objective function has a value of 275. With the feature variables selected by the genetic algorithm, the objective function (i.e., the matching error) drops to 90, which is a significant improvement. The final optimal feature set includes:
amp10mean
amp10std
amp10skew
amp10dstd
amp10dskew
amp10dkurt
amp100mean
amp100std
amp100dstd
amp1000mean
power2
power3
power4
power5
power6
power7
power8
power9
Visualize data in two-dimensional space
Our optimal feature set uses 18 variables to compare song similarity, but we ultimately want to visualize the music collection on a 2-dimensional plane, so we need to reduce this 18-dimensional space down to 2 dimensions for plotting. To do this, I simply used the first two principal components as the X and Y coordinates. This naturally introduces some error into the visualization: songs that are close together in the 18-dimensional space may no longer be close together on the 2-dimensional plane. These errors are unavoidable, but fortunately they do not distort the relationships too badly; similar songs still end up roughly clustered together on the 2-dimensional plane.
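A minimal sketch of this projection using plain NumPy is shown below. Standardizing each feature before taking the principal components is my assumption, since the article does not say how the features were scaled:

import numpy

def first_two_principal_components(features):
    """Project an (n_songs, 18) feature matrix onto its first two
    principal components for plotting."""
    x = numpy.array(features, dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0)      # standardize each feature
    u, s, vt = numpy.linalg.svd(x, full_matrices=False)
    xy = x.dot(vt[:2].T)                          # scores on PC1 and PC2
    return xy[:, 0], xy[:, 1]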
Map points to a hexagonal grid
The 2D points produced by the principal components are irregularly spaced on the plane. Although this irregular spacing is the most "accurate" 2D arrangement of the 18-dimensional feature vectors, I was willing to give up a little accuracy to map them onto a cooler-looking, regularly spaced hexagonal grid. The procedure is as follows (a rough Python sketch follows the list):
Embed the points on the XY plane inside a larger hexagonal grid of points.
Starting from the outermost ring of hexagonal points and working inward, assign each hexagonal grid point the nearest unassigned, irregularly spaced principal-component point.
This stretches the 2D points outward so that they fill the hexagonal grid and form a visually striking figure.
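Here is a rough Python sketch of that procedure, assuming a hexagonal grid with at least as many points as there are songs, and assuming the XY points have been shifted and scaled to roughly cover the grid's extent. The helper names and the exact grid construction are mine, not the original code:

import numpy

def hex_grid(n_cols, n_rows, spacing=1.0):
    """Return an (n_cols * n_rows, 2) array of hexagonally packed points:
    every other row is offset by half a spacing, rows are sqrt(3)/2 apart."""
    pts = []
    for row in range(n_rows):
        offset = 0.5 * spacing if row % 2 else 0.0
        y = row * spacing * numpy.sqrt(3) / 2
        for col in range(n_cols):
            pts.append((col * spacing + offset, y))
    return numpy.array(pts)

def assign_songs_to_hexes(song_xy, hexes):
    """song_xy: (n_songs, 2) array of principal-component coordinates.
    hexes: (n_hex, 2) array of hexagonal grid points, n_hex >= n_songs.
    Working from the outermost hexagonal points inward, give each hex
    point the nearest not-yet-assigned song. Returns {hex_index: song_index}."""
    center = hexes.mean(axis=0)
    order = numpy.argsort(-numpy.linalg.norm(hexes - center, axis=1))  # outermost first
    unassigned = set(range(len(song_xy)))
    assignment = {}
    for h in order:
        if not unassigned:
            break
        remaining = list(unassigned)
        d = numpy.linalg.norm(song_xy[remaining] - hexes[h], axis=1)
        song = remaining[int(d.argmin())]
        assignment[int(h)] = song
        unassigned.remove(song)
    return assignment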
Color the map
One of my main goals for this exercise was to avoid making any assumptions about the contents of the music collection. That means I did not want to assign predefined colors to particular genres. Instead, I clustered the feature vectors in the 18-dimensional space to find groups of songs that sound similar, and assigned a color to each group center. The result is an adaptive coloring algorithm that picks out as much detail as you ask for (since the number of groups, and hence the number of colors, is user-defined). As mentioned earlier, I found that using k = 10 groups usually gives good results.
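A rough sketch of this step is shown below, using SciPy's k-means on whitened 18-dimensional feature vectors, picking the song nearest each group center, and blending the group colors for every other song by inverse squared distance. The whitening and the particular interpolation scheme are my assumptions, not necessarily the author's exact method:

import numpy
from scipy.cluster.vq import whiten, kmeans

def color_songs(features18, palette, k=10):
    """features18: (n_songs, 18) array of the selected features.
    palette: list of k RGB triples, one per group.
    Returns blended per-song colors plus the song nearest each group center."""
    obs = whiten(numpy.asarray(features18, dtype=float))
    centers, _ = kmeans(obs, k)
    pal = numpy.asarray(palette, dtype=float)[:len(centers)]

    # The song closest to each group center (useful for labeling the groups).
    nearest_song = [int(numpy.linalg.norm(obs - c, axis=1).argmin()) for c in centers]

    # Blend the group colors for every song by inverse squared distance
    # to each group center (the exact interpolation scheme is a guess).
    colors = numpy.zeros((obs.shape[0], 3))
    for i, song in enumerate(obs):
        d = numpy.linalg.norm(centers - song, axis=1)
        w = 1.0 / (d + 1e-9) ** 2
        w /= w.sum()
        colors[i] = (w[:, None] * pal).sum(axis=0)
    return colors, nearest_song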
Final output
Just for fun, here is a visualization of the 3,668 songs in my music collection. A full-resolution image can be obtained here. If you zoom in on the image, you will see that the algorithm works quite well: the colored regions correspond to tracks of the same genre, and often the same artist, just as we hoped.