In this article, we'll explore a neat way to visualize your MP3 music collection. The end result will be a regular hexagonal grid map of all your songs, with similar tracks placed next to each other. The colors of the different regions correspond to different genres of music (e.g. classical, hip-hop, hard rock). As an example, here is a map of three albums from my collection: Paganini's "Violin Caprices", Eminem's "The Eminem Show", and Coldplay's "X&Y".
To make things more interesting (and in some cases simpler), I imposed a few restrictions. First, the solution should not rely on any existing ID3 tags in the MP3 files (artist, genre, etc.); it should use only the statistical properties of the sound to compute song similarity. Many of my MP3 files are badly tagged anyway, and I wanted the solution to work for any music collection, no matter how poor its metadata. Second, no other external information should be used to build the visualization; the only input is the user's set of MP3 files. In practice, the solution could be made more effective by leveraging a large database of songs already labeled by genre, but for simplicity I wanted to keep it completely self-contained. Finally, although digital music comes in many formats (MP3, WMA, M4A, OGG, etc.), I focused only on MP3 files to keep things simple. The algorithm developed here works just as well for audio in other formats, as long as it can be converted to a WAV file.
Creating a music map is an interesting exercise that includes audio processing, machine learning, and visualization techniques. The basic steps are as follows:
Convert the MP3 files to low-bitrate WAV files.
Extract statistical features from the raw WAV sound data.
Find an optimal subset of these features, such that songs that are close together in this feature space also sound similar to each other.
Use a dimensionality-reduction technique to map the feature vectors into two dimensions so they can be drawn on an XY plane.
Generate a hexagonal grid of points, then use a nearest-neighbor technique to map each song from the XY plane onto a point of the hexagonal grid.
Back in the original high-dimensional feature space, cluster the songs into a user-defined number of groups (k=10 works well for visualization). For each group, find the song closest to the group center.
On the hexagonal grid, color the songs at the k group centers with different colors.
Interpolate the colors of every other song based on its distance on the XY plane to each group center.
Next, let's walk through some of these steps in more detail.
Converting MP3 files to WAV format
The main advantage of converting our music to WAV format is that we can easily read the data with the "wave" module in the Python standard library and then manipulate it with NumPy. We also down-sample the sound files to mono at a 10 kHz sampling rate, which reduces the computational cost of extracting the statistical features. To handle the conversion and down-sampling I used the well-known MPG123, a free command-line MP3 player that is easy to invoke from Python. The following code recursively searches a music folder for all MP3 files and calls MPG123 to convert each one to a temporary 10 kHz WAV file. Feature computation (discussed in the next section) is then performed on these WAV files.
import subprocess
import wave
import struct
import numpy
import csv
import os
import sys

def read_wav(wav_file):
    """Returns two chunks of sound data from wave file."""
    w = wave.open(wav_file)
    n = 60 * 10000
    if w.getnframes() < n * 2:
        raise ValueError('Wave file too short')
    frames = w.readframes(n)
    wav_data1 = struct.unpack('%dh' % n, frames)
    frames = w.readframes(n)
    wav_data2 = struct.unpack('%dh' % n, frames)
    return wav_data1, wav_data2

def compute_chunk_features(mp3_file):
    """Return feature vectors for two chunks of an MP3 file."""
    # Extract MP3 file to a mono, 10kHz WAV file
    mpg123_command = 'mpg123-1.12.3-x86-64\\mpg123.exe -w "%s" -r 10000 -m "%s"'
    out_file = 'temp.wav'
    cmd = mpg123_command % (out_file, mp3_file)
    temp = subprocess.call(cmd)
    # Read in chunks of data from WAV file
    wav_data1, wav_data2 = read_wav(out_file)
    # We'll cover how the features are computed in the next section!
    return features(wav_data1), features(wav_data2)

# Main script starts here
# =======================

for path, dirs, files in os.walk('C:/users/christian/music/'):
    for f in files:
        if not f.endswith('.mp3'):
            # Skip any non-MP3 files
            continue
        mp3_file = os.path.join(path, f)
        # Extract the track name (i.e. the file name) plus the names
        # of the two preceding directories. This is useful later for plotting.
        tail, track = os.path.split(mp3_file)
        tail, dir1 = os.path.split(tail)
        tail, dir2 = os.path.split(tail)
        # Compute features. feature_vec1 and feature_vec2 are lists of
        # floating point numbers representing the statistical features
        # we have extracted from the raw sound data.
        try:
            feature_vec1, feature_vec2 = compute_chunk_features(mp3_file)
        except:
            continue
Feature Extraction
In Python, a mono 10 kHz WAV file is represented as a list of signed 16-bit integers (ranging from -32768 to 32767), with 10,000 integers per second of sound. Each integer represents the song's relative amplitude at that point in time. We will extract two 60-second chunks from each song, so each chunk is represented by 600,000 integers. The function read_wav in the code above returns the lists of these integers. Here are 10 seconds of waveform from some songs on Eminem's "The Eminem Show":
For comparison, here are 10-second waveforms from some of the pieces in Paganini's "Violin Caprices":
As the two figures show, the waveform structure of these fragments is quite distinctive, yet overall the waveforms of Eminem's songs look similar to one another, and the same holds for the "Violin Caprices". Next, we will extract some statistical features from these waveforms that capture the differences between songs, and then use machine learning techniques to group the songs by how similar they sound.
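As an aside, waveform figures like the ones above can be reproduced with a few lines of matplotlib. This plotting snippet is mine, not from the original post; it assumes a temp.wav file produced by the conversion code above.
import numpy as np
import matplotlib.pyplot as plt

wav_data1, wav_data2 = read_wav('temp.wav')
chunk = np.array(wav_data1[:10 * 10000])   # first 10 seconds at 10 kHz
t = np.arange(len(chunk)) / 10000.0        # time axis in seconds
plt.plot(t, chunk)
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.show()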
The first set of features we extract consists of the statistical moments of the waveform (mean, standard deviation, skewness, and kurtosis). In addition to computing these on the raw amplitudes, we also compute them on increasingly smoothed versions of the amplitudes, in order to capture the character of the music at different time scales. I used smoothing windows of 1, 10, 100, and 1000 samples; other values would probably work well too.
All of these statistics were computed on the amplitudes for each smoothing window size. To capture the short-term variation of the signal, I also computed the same statistics on the first-order differences of the (smoothed) amplitudes.
The features above give a fairly comprehensive statistical summary of the waveform in the time domain, but it is also helpful to compute some frequency-domain features. Bass-heavy music like hip-hop has more energy in the low-frequency part of the spectrum, while classical music has a larger share of its energy in the higher frequencies.
Putting all of this together gives 42 different features per song. The following Python code computes these features from a list of amplitude values:
def moments(x):
    mean = x.mean()
    std = x.var()**0.5
    skewness = ((x - mean)**3).mean() / std**3
    kurtosis = ((x - mean)**4).mean() / std**4
    return [mean, std, skewness, kurtosis]

def fftfeatures(wavdata):
    f = numpy.fft.fft(wavdata)
    f = f[2:(f.size // 2 + 1)]
    f = abs(f)
    total_power = f.sum()
    f = numpy.array_split(f, 10)
    return [e.sum() / total_power for e in f]

def features(x):
    x = numpy.array(x)
    f = []

    # Moments of the raw amplitudes and their first differences
    xs = x
    diff = xs[1:] - xs[:-1]
    f.extend(moments(xs))
    f.extend(moments(diff))

    # Same statistics on 10-, 100-, and 1000-sample smoothed amplitudes
    xs = x.reshape(-1, 10).mean(1)
    diff = xs[1:] - xs[:-1]
    f.extend(moments(xs))
    f.extend(moments(diff))

    xs = x.reshape(-1, 100).mean(1)
    diff = xs[1:] - xs[:-1]
    f.extend(moments(xs))
    f.extend(moments(diff))

    xs = x.reshape(-1, 1000).mean(1)
    diff = xs[1:] - xs[:-1]
    f.extend(moments(xs))
    f.extend(moments(diff))

    # Frequency-domain features
    f.extend(fftfeatures(x))
    return f

# f will be a list of 42 floating point features with the following names:
#
# amp1mean, amp1std, amp1skew, amp1kurt, amp1dmean, amp1dstd, amp1dskew, amp1dkurt
# amp10mean, amp10std, amp10skew, amp10kurt, amp10dmean, amp10dstd, amp10dskew, amp10dkurt
# amp100mean, amp100std, amp100skew, amp100kurt, amp100dmean, amp100dstd, amp100dskew, amp100dkurt
# amp1000mean, amp1000std, amp1000skew, amp1000kurt, amp1000dmean, amp1000dstd, amp1000dskew, amp1000dkurt
# power1, power2, power3, power4, power5, power6, power7, power8, power9, power10
Select an optimal subset of features
We have computed 42 different features, but not all of them are helpful for determining whether two songs sound the same. The next step is to find an optimal subset of these features, so that the Euclidean distance between two feature vectors in this reduced feature space corresponds well to how similar the two songs sound.
This process of variable selection is a supervised machine learning problem, so we need a training data set that can guide the algorithm toward the best subset of variables. Rather than building a training set by going through my music collection by hand and labeling which songs sound similar, I used a simpler approach: take two 1-minute samples from each song and try to find the feature subset that best matches the two samples of the same song to each other.
To find the feature set that gives the best average match across all songs, I used a genetic algorithm (the genalg package in R) that toggles each of the 42 variables on or off. The figure below shows how the objective function (i.e., how consistently the two sample chunks of each song are matched to each other by a nearest-neighbor classifier) improves over 100 generations of the genetic algorithm.
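To make the objective concrete, here is a minimal Python sketch of the kind of score the genetic algorithm minimizes; the original selection was done in R with genalg, so this is an illustration rather than the author's code. It assumes features1 and features2 are NumPy arrays of shape (n_songs, 42) holding the feature vectors of each song's first and second chunk, and mask is a boolean array selecting a candidate feature subset (all three names are mine). In practice the feature columns would probably be normalized first.
import numpy as np

def mismatch_count(features1, features2, mask):
    # Number of songs whose first chunk is NOT matched to its own second
    # chunk by a nearest-neighbor search in the selected feature subspace.
    a = features1[:, mask]
    b = features2[:, mask]
    # Squared Euclidean distances between every chunk-1 and every chunk-2,
    # computed without building a large 3-D intermediate array
    d2 = (a * a).sum(1)[:, None] + (b * b).sum(1)[None, :] - 2.0 * a.dot(b.T)
    nearest = d2.argmin(axis=1)   # index of the closest chunk-2 for each chunk-1
    return int((nearest != np.arange(len(a))).sum())
A binary genetic algorithm then searches over the 42-element masks for the one that minimizes this count.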
If we force the distance function to use all 42 features, the objective function has a value of 275. By judiciously selecting feature variables with the genetic algorithm, we reduce the objective function (i.e., the error rate) to 90, a very significant improvement. The final optimal feature set includes:
amp10mean
amp10std
amp10skew
amp10dstd
amp10dskew
amp10dkurt
amp100mean
amp100std
amp100dstd
amp1000mean
power2
power3
power4
power5
power6
power7
power8
power9
Visualization of data in two-dimensional space
Our optimal feature set uses 18 feature variables to compare song similarity, but we ultimately want to visualize the music collection on a 2-D plane, so we need to reduce this 18-dimensional space to two dimensions for plotting. To do this, I simply used the first two principal components as the x and y coordinates. This of course introduces some error into the visualization: songs that are far apart in the 18-D space may end up close together on the 2-D plane. These errors are unavoidable, but fortunately they do not distort the relationships too much; similar-sounding songs still cluster roughly together on the 2-D plane.
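The original post doesn't show code for this projection, but here is a minimal sketch of the idea, assuming X is an (n_songs x 18) NumPy array of the selected features; standardizing the columns first is my own choice, not something specified in the article.
import numpy as np

def first_two_principal_components(X):
    # Standardize each feature, then project onto the first two
    # principal components to get x, y plotting coordinates.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # rows of Vt are the principal directions
    return Z.dot(Vt[:2].T)                             # shape (n_songs, 2)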
Mapping points onto a hexagonal grid
The 2-D points produced by the principal components are irregularly spaced on the plane. Although this irregular spacing is the most "accurate" placement of the 18-D feature vectors in two dimensions, I was willing to sacrifice some accuracy to map them onto a cooler-looking, regularly spaced hexagonal grid. This was done as follows (a rough sketch of the idea appears after the list):
Embed the XY points inside a much larger hexagonal lattice of grid points.
Starting with the outermost points of the hexagon, assign to each hexagonal grid point its nearest irregularly spaced principal-component point.
This stretches the 2-D points so that they completely fill the hexagonal grid and form an appealing figure.
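The article doesn't include code for this step, and the exact assignment order leaves room for interpretation, so the following is only one plausible sketch. It assumes xy is the (n_songs x 2) array of principal-component coordinates; hex_lattice and snap_to_grid are hypothetical helper names. It builds a hexagonally packed lattice with one point per song, then walks the lattice from the outside in, greedily giving each grid point the nearest song that has not been placed yet.
import numpy as np

def hex_lattice(n_points, spacing=1.0):
    # A roughly hexagonal patch of a hexagonally packed lattice,
    # keeping the n_points lattice points closest to the center.
    side = int(np.ceil(np.sqrt(n_points))) + 2
    pts = []
    for row in range(-side, side + 1):
        for col in range(-side, side + 1):
            x = col * spacing + (0.5 * spacing if row % 2 else 0.0)
            y = row * spacing * np.sqrt(3) / 2
            pts.append((x, y))
    pts = np.array(pts)
    keep = np.argsort(np.linalg.norm(pts, axis=1))[:n_points]
    return pts[keep]

def snap_to_grid(xy):
    # Map each song's (x, y) coordinates onto a hexagonal grid point,
    # working from the outermost grid points inward.
    n = len(xy)
    grid = hex_lattice(n)
    # Scale the song coordinates so they roughly cover the grid
    xy = (xy - xy.mean(axis=0)) / xy.std(axis=0) * grid.std(axis=0)
    order = np.argsort(-np.linalg.norm(grid, axis=1))   # outside -> in
    unplaced = list(range(n))                            # songs not yet placed
    song_to_grid = np.zeros((n, 2))
    for g in order:
        d = np.linalg.norm(xy[unplaced] - grid[g], axis=1)
        song = unplaced.pop(int(d.argmin()))             # nearest remaining song
        song_to_grid[song] = grid[g]
    return song_to_grid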
Coloring the map
One of the main goals of this exercise was to make no assumptions about the content of the music collection. That means I did not want to assign predefined colors to particular musical genres. Instead, I clustered the feature vectors in the 18-D space to find pockets of similar-sounding music, and assigned a color to each cluster center. The result is an adaptive coloring algorithm that finds as much or as little detail as you ask for (since the user defines the number of clusters, i.e., the number of colors). As mentioned earlier, I found that k=10 clusters tends to give good results.
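The post doesn't show code for this step either, so the sketch below is just one way it might be done, using SciPy's k-means on the 18-D features and simple inverse-distance color blending on the XY plane; the palette, the blending weights, and the function name color_songs are my own choices rather than the article's.
import numpy as np
from scipy.cluster.vq import kmeans2, whiten

def color_songs(X, xy, k=10):
    # Cluster the 18-D features into k groups, give each cluster a base
    # color, then blend colors for every song according to its distance
    # on the XY plane from each cluster's central song.
    Xw = whiten(X)                                  # scale each feature to unit variance
    centroids, _ = kmeans2(Xw, k, minit='++')
    # The song closest to each centroid acts as that cluster's color anchor
    anchors = np.array([np.linalg.norm(Xw - c, axis=1).argmin() for c in centroids])
    base_colors = np.random.RandomState(0).rand(k, 3)   # k arbitrary RGB colors
    # Weight each base color by inverse distance to its anchor on the map
    d = np.linalg.norm(xy[:, None, :] - xy[anchors][None, :, :], axis=2)
    w = 1.0 / (d + 1e-9)
    w /= w.sum(axis=1, keepdims=True)
    return w.dot(base_colors)                       # (n_songs, 3) RGB values in [0, 1]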
Final Output
Just for fun, here is the visualization of the 3,668 songs in my music collection. A full-resolution image is available here. If you zoom in, you'll see that the algorithm works quite well: contiguous colored regions generally correspond to a single genre, and often to a single artist, just as we hoped.