A simple tutorial for drawing scores using the Wave module in the Python standard library

Source: Internet
Author: User
In this article, we'll explore a neat way to visualize your MP3 music collection. The end result is a regular hexagonal grid map of all your songs, in which similar-sounding tracks sit in adjacent cells. Differently colored regions correspond to different genres of music (for example classical, hip-hop, or heavy rock). As an example, here is a map of three albums from my music collection: Paganini's Violin Caprices, Eminem's The Eminem Show, and Coldplay's X&Y.

To make things more interesting (and in some cases simpler), I imposed a few restrictions. First, the solution should not rely on any existing ID3 tags in the MP3 files (for example, artist or genre); it should use only the statistical characteristics of the sound to calculate the similarity of songs. Many of my own MP3 files are badly tagged anyway, and I want the solution to work for any music collection, no matter how poor its metadata. Second, no other external information should be used to create the visualization: the only input is the user's set of MP3 files. The solution could probably be improved by using a large database of songs already labeled with genres, but I want to keep it completely self-contained for the sake of simplicity. Finally, although there are many digital music formats (MP3, WMA, M4A, OGG, etc.), to keep things simple I focus only on MP3 files here. The algorithm developed in this article should work just as well for other formats, as long as the audio can be converted to WAV.

Creating a music atlas is an interesting exercise that combines audio processing, machine learning, and visualization techniques. The basic steps are as follows:

Convert the MP3 files to low-bit-rate WAV files.
Extract statistical features from the raw WAV data.
Find an optimal subset of these features, so that songs that are adjacent in this feature space sound similar to each other.
Map the feature vectors to two dimensions using a dimensionality reduction technique, so that they can be plotted on an XY plane.
Generate a hexagonal grid of points, then use nearest neighbors to map each song on the XY plane to a point on the hexagonal grid.
Back in the original high-dimensional feature space, cluster the songs into a user-defined number of groups (k=10 works well for visualization purposes). For each group, find the song that is closest to the group center.
On the hexagonal grid, color the song at each of the k group centers with a different color.
Interpolate the colors of the other songs based on their distance on the XY plane from each group center.

Let's look at some of these steps in more detail.

Convert MP3 files to WAV format

The main advantage of converting our music files to WAV format is that we can easily read the data using the "wave" module in the Python standard library and then manipulate it with NumPy. In addition, we downsample the audio to mono at a 10 kHz sampling rate, which reduces the computational cost of extracting the statistical features. To handle the conversion and downsampling, I used the well-known mpg123, a free command-line MP3 player that can easily be invoked from Python. The following code recursively searches a music folder for all MP3 files and calls mpg123 to convert each one to a temporary 10 kHz WAV file. Features are then computed from these WAV files (discussed in the next section).

import os
import subprocess
import wave
import struct
import numpy
import csv
import sys

def read_wav(wav_file):
    """Returns two chunks of sound data from wave file."""
    w = wave.open(wav_file)
    n = 60 * 10000  # 60 seconds at 10,000 samples per second
    if w.getnframes() < n * 2:
        raise ValueError('Wave file too short')
    frames = w.readframes(n)
    wav_data1 = struct.unpack('%dh' % n, frames)
    frames = w.readframes(n)
    wav_data2 = struct.unpack('%dh' % n, frames)
    return wav_data1, wav_data2

def compute_chunk_features(mp3_file):
    """Return feature vectors for two chunks of an MP3 file."""
    # Extract MP3 file to a mono, 10kHz WAV file.
    # Adjust the path to mpg123.exe for your own system.
    mpg123_command = '..\\mpg123-1.12.3-x86-64\\mpg123.exe -w "%s" -r 10000 -m "%s"'
    out_file = 'temp.wav'
    cmd = mpg123_command % (out_file, mp3_file)
    temp = subprocess.call(cmd)
    # Read in chunks of data from WAV file
    wav_data1, wav_data2 = read_wav(out_file)
    # We'll cover how the features are computed in the next section!
    return features(wav_data1), features(wav_data2)

# Main script starts here
# =======================

for path, dirs, files in os.walk('C:/Users/Christian/Music/'):
    for f in files:
        if not f.endswith('.mp3'):
            # Skip any non-MP3 files
            continue
        mp3_file = os.path.join(path, f)
        # Extract the track name (i.e. the file name) plus the names
        # of the two preceding directories. This will be useful
        # later for plotting.
        tail, track = os.path.split(mp3_file)
        tail, dir1 = os.path.split(tail)
        tail, dir2 = os.path.split(tail)
        # Compute features. feature_vec1 and feature_vec2 are lists of floating
        # point numbers representing the statistical features we have extracted
        # from the raw sound data.
        try:
            feature_vec1, feature_vec2 = compute_chunk_features(mp3_file)
        except Exception:
            # Skip files that cannot be converted or are too short
            continue
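
The listing above passes the whole command to mpg123 as one formatted string. As a hedged alternative, the sketch below builds the command as an argument list, which avoids quoting problems with unusual file names. The helper names build_mpg123_command and convert_to_wav are hypothetical, and the sketch assumes an mpg123 executable is reachable on the PATH; only the -w, -r and -m options already used in the article are relied on.

```python
import subprocess

def build_mpg123_command(mp3_file, out_file, rate=10000, mpg123="mpg123"):
    """Return the argument list for converting an MP3 to a mono WAV file.

    The `mpg123` argument is an assumption: pass the full path to the
    executable if it is not on the PATH.
    """
    return [
        mpg123,
        "-w", out_file,     # write WAV output to this file
        "-r", str(rate),    # resample to this rate in Hz
        "-m",               # mix down to mono
        mp3_file,
    ]

def convert_to_wav(mp3_file, out_file="temp.wav"):
    """Run mpg123 and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_mpg123_command(mp3_file, out_file), check=True)
    return out_file

print(build_mpg123_command("My Song.mp3", "temp.wav"))
```

Because check=True raises an exception on failure, a caller can skip broken files with the same try/except pattern the main loop already uses.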

Feature Extraction

In Python, a mono 10 kHz waveform is represented as a sequence of integers, with 10,000 integers per second of sound. Because the code above unpacks each sample as a signed 16-bit value (the 'h' format), each integer lies between -32768 and 32767. Each integer represents the relative amplitude of the song at the corresponding point in time. We extract two 60-second fragments from each song, so each fragment is represented by 600,000 integers. The function "read_wav" in the code above returns these integers as two tuples. Here is a 10-second waveform drawn from some of the songs on Eminem's The Eminem Show:
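
To make the sample representation concrete, here is a small self-contained sketch. It is not the article's read_wav: it writes one second of a synthetic 440 Hz tone to a temporary WAV file and reads it back with numpy.frombuffer instead of struct.unpack. The function name read_wav_array and the test tone are assumptions made for illustration.

```python
import os
import tempfile
import wave

import numpy

def read_wav_array(wav_file, n_frames):
    """Read the first n_frames of a mono 16-bit WAV file as an integer array."""
    with wave.open(wav_file, "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit audio")
        frames = w.readframes(n_frames)
    # "<i2" means little-endian signed 16-bit, the sample format of PCM WAV.
    return numpy.frombuffer(frames, dtype="<i2")

# Write one second of a 440 Hz tone sampled at 10 kHz, then read it back.
rate = 10000
t = numpy.arange(rate) / rate
tone = (10000 * numpy.sin(2 * numpy.pi * 440 * t)).astype("<i2")

path = os.path.join(tempfile.mkdtemp(), "tone.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 2 bytes per sample = 16 bits
    w.setframerate(rate)
    w.writeframes(tone.tobytes())

samples = read_wav_array(path, rate)
print(len(samples), samples.min(), samples.max())
```

One second of audio yields 10,000 samples, so a 60-second fragment read the same way would contain 600,000.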

For comparison, here are the waveforms of some fragments from Paganini's "Violin Caprices":


As the two figures above show, the waveforms of these fragments have a clearly visible structure; in general the Eminem waveforms resemble one another, and the same is true of the "Violin Caprices" waveforms. Next, we'll extract some statistical features from these waveforms that capture the differences between songs, and then use machine learning techniques to group the songs by how similar they sound.

The first set of features we will extract are the statistical moments (mean, standard deviation, skewness, and kurtosis) of the waveform. In addition to computing these on the raw amplitude, we also compute them on increasingly smoothed versions of the amplitude, to capture characteristics of the music at different time scales. I used smoothing windows of 1, 10, 100, and 1000 samples; other values would probably have given good results as well.

The moments are computed on the amplitude smoothed with each of these window sizes in turn. To capture short-term changes in the signal, I also computed the same statistics on the first-order differences of the (smoothed) amplitude.
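
The smoothing and differencing steps can be illustrated on a tiny array. This is a minimal sketch, assuming that smoothing means averaging over non-overlapping windows (which is what the reshape-and-mean calls in the article's code do); the helper name block_average and the choice to trim leftover samples are my own.

```python
import numpy

def block_average(x, window):
    """Average x over non-overlapping windows of `window` samples.

    Any leftover samples at the end that do not fill a whole window are
    dropped (an assumption of this sketch).
    """
    x = numpy.asarray(x, dtype=float)
    usable = (x.size // window) * window
    return x[:usable].reshape(-1, window).mean(axis=1)

x = [1, 3, 5, 7, 2, 4, 6, 8]

smoothed = block_average(x, 4)        # one value per window of 4 samples
diff = smoothed[1:] - smoothed[:-1]   # first-order differences

print(smoothed)
print(diff)
```

With 600,000 samples per fragment, every window size used in the article (1, 10, 100, 1000) divides the length evenly, so nothing is trimmed there.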

The features above give a fairly comprehensive summary of the waveform statistics in the time domain, but it is also helpful to compute some frequency-domain features. Bass-heavy music such as hip-hop has more energy in the low frequencies, while classical music has a larger share of its energy in the high frequencies.
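
As a rough illustration of the frequency-domain idea, the sketch below measures what share of the spectrum falls in each of 10 equal-width frequency bands, using two synthetic tones in place of real music. It is a simplified variant and not the article's exact function: it uses numpy.fft.rfft and drops only the DC term, and the name band_power_shares and the test tones are assumptions.

```python
import numpy

def band_power_shares(samples, n_bands=10):
    """Share of total spectral magnitude in each of n_bands frequency bands."""
    spectrum = numpy.abs(numpy.fft.rfft(samples))[1:]   # drop the DC term
    bands = numpy.array_split(spectrum, n_bands)
    total = spectrum.sum()
    return [band.sum() / total for band in bands]

rate = 10000                                   # samples per second
t = numpy.arange(rate) / rate                  # one second of audio
low = numpy.sin(2 * numpy.pi * 100 * t)        # 100 Hz tone ("bass")
high = numpy.sin(2 * numpy.pi * 4200 * t)      # 4200 Hz tone ("treble")

low_shares = band_power_shares(low)
high_shares = band_power_shares(high)

# Index of the band holding most of the magnitude for each tone.
print(int(numpy.argmax(low_shares)), int(numpy.argmax(high_shares)))
```

At a 10 kHz sampling rate the spectrum covers 0 to 5 kHz, so each of the 10 bands in this sketch is about 500 Hz wide.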

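Putting these features together gives 42 different features for each song (4 moments for each of the 4 window sizes, computed on both the amplitude and its differences, plus 10 frequency-band features). The following Python code computes them from a sequence of amplitude values: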

import numpy

def moments(x):
    """Return the mean, standard deviation, skewness and kurtosis of x."""
    mean = x.mean()
    std = x.var() ** 0.5
    skewness = ((x - mean) ** 3).mean() / std ** 3
    kurtosis = ((x - mean) ** 4).mean() / std ** 4
    return [mean, std, skewness, kurtosis]

def fftfeatures(wavdata):
    """Return the share of spectral magnitude in 10 frequency bands."""
    f = numpy.fft.fft(wavdata)
    # Keep the positive frequencies; // keeps the slice index an integer.
    f = f[2:(f.size // 2 + 1)]
    f = abs(f)
    total_power = f.sum()
    f = numpy.array_split(f, 10)
    return [e.sum() / total_power for e in f]

def features(x):
    """Return the 42 features of one chunk of amplitude values."""
    x = numpy.array(x)
    f = []
    # Smoothing windows of 1, 10, 100 and 1000 samples
    for window in (1, 10, 100, 1000):
        xs = x.reshape(-1, window).mean(1)
        diff = xs[1:] - xs[:-1]
        f.extend(moments(xs))
        f.extend(moments(diff))
    f.extend(fftfeatures(x))
    return f

# f will be a list of 42 floating point features with the following names:
# amp1mean, amp1std, amp1skew, amp1kurt,
# amp1dmean, amp1dstd, amp1dskew, amp1dkurt,
# amp10mean, amp10std, amp10skew, amp10kurt,
# amp10dmean, amp10dstd, amp10dskew, amp10dkurt,
# amp100mean, amp100std, amp100skew, amp100kurt,
# amp100dmean, amp100dstd, amp100dskew, amp100dkurt,
# amp1000mean, amp1000std, amp1000skew, amp1000kurt,
# amp1000dmean, amp1000dstd, amp1000dskew, amp1000dkurt,
# power1, power2, power3, power4, power5,
# power6, power7, power8, power9, power10
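
Since later sections refer to features by name, it can help to generate the names in the same order in which the feature values are produced. The following is a small sketch; the helper feature_names is hypothetical and simply mirrors the naming scheme listed above.

```python
def feature_names():
    """Return the 42 feature names in the order the values are computed."""
    names = []
    for window in (1, 10, 100, 1000):
        # Amplitude moments first, then moments of the differences ("d").
        for prefix in ("amp%d" % window, "amp%dd" % window):
            for stat in ("mean", "std", "skew", "kurt"):
                names.append(prefix + stat)
    names.extend("power%d" % i for i in range(1, 11))
    return names

names = feature_names()
print(len(names))
print(names[:8])
```

Pairing the two with dict(zip(feature_names(), feature_vector)) gives a name-to-value mapping for one fragment.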

Select an optimal subset of features

We've computed 42 different features, but not all of them help determine whether two songs sound alike. The next step is to find an optimal subset of these features, such that the Euclidean distance between two feature vectors in the reduced feature space corresponds well to how similar the two songs sound.

Variable selection is a supervised machine learning problem, so we need a training data set to guide the algorithm toward the best subset of variables. Rather than building a training set by going through the collection by hand and tagging which songs sound similar, I used an easier approach: extract two 1-minute samples from each song, and then look for the feature subset that best matches the two fragments of the same song to each other.

To find the feature set that achieves the best average match across all songs, I used a genetic algorithm (the genalg package for R) to switch each of the 42 variables on or off. The objective function measures how reliably a nearest-neighbor classifier matches the two sample fragments of each song to each other; the accompanying plot shows how it improved over 100 generations of the genetic algorithm.
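
One plausible form of such an objective function is sketched below: for every song, find the nearest second fragment to its first fragment using only the selected features, and count how often that neighbor belongs to a different song. The article does not give the exact definition, so the function count_mismatches, its arguments and the toy data are all assumptions.

```python
import numpy

def count_mismatches(chunks1, chunks2, selected):
    """Count songs whose first fragment is not matched to its own second one.

    chunks1, chunks2: arrays of shape (n_songs, n_features), one row per song.
    selected: list of feature column indices to use in the distance.
    """
    a = numpy.asarray(chunks1, dtype=float)[:, selected]
    b = numpy.asarray(chunks2, dtype=float)[:, selected]
    # Pairwise squared Euclidean distances, shape (n_songs, n_songs).
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    nearest = d.argmin(axis=1)
    return int((nearest != numpy.arange(len(a))).sum())

# Toy data: feature 0 is stable within a song, feature 1 is mostly noise.
chunks1 = [[0, 5], [10, 5], [20, 5]]
chunks2 = [[1, 90], [11, -40], [19, 7]]

print(count_mismatches(chunks1, chunks2, [0, 1]))   # both features
print(count_mismatches(chunks1, chunks2, [0]))      # stable feature only
```

In the toy data, dropping the noisy feature removes the mismatches, which is the effect the genetic algorithm searches for across all 42 features. The pairwise distance matrix grows with the square of the number of songs, so a large collection would need a more memory-efficient nearest-neighbor search.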

If we force the distance function to use all 42 features, the objective function has a value of 275. By using the genetic algorithm to select the feature variables, we reduce the objective function (that is, the error rate) to 90, which is a significant improvement. The final optimal feature set consists of:

amp10mean
amp10std
amp10skew
amp10dstd
amp10dskew
amp10dkurt
amp100mean
amp100std
amp100dstd
amp1000mean
power2
power3
power4
power5
power6
power7
power8
power9

Visualize data in two-dimensional space

Our optimal feature set uses 18 feature variables to compare the similarity of songs, but in the end we want to visualize the music collection on a 2-dimensional plane, so we need to reduce this 18-dimensional space to 2 dimensions before we can plot it. To do this, I simply used the first two principal components as the X and Y coordinates. Of course, this introduces some errors into the visualization: some songs that are close together in the 18-dimensional space may no longer be close on the 2-dimensional plane. These errors are unavoidable, but fortunately they do not distort the relationships too much, and songs that sound similar still end up roughly clustered together on the 2-dimensional plane.
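
The projection onto the first two principal components can be sketched with NumPy's singular value decomposition. This is a minimal illustration rather than the article's code: the function name first_two_components is hypothetical, the data is random, and the sketch only centers the features (whether the original also scaled them is not stated).

```python
import numpy

def first_two_components(features):
    """Project rows of `features` onto their first two principal components."""
    x = numpy.asarray(features, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    u, s, vt = numpy.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random stand-in for 50 songs with 18 selected features each.
rng = numpy.random.default_rng(0)
fake_features = rng.normal(size=(50, 18)) * numpy.linspace(5.0, 0.5, 18)

xy = first_two_components(fake_features)
print(xy.shape)
```

Each row of xy is the (X, Y) position of one song on the plane.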

Map points to a hexagonal grid

The 2D points generated from the principal components are irregularly spaced on the plane. Although this irregular spacing is the most "accurate" placement of the 18-dimensional feature vectors on a 2-dimensional plane, I was willing to sacrifice some accuracy to get a nicer-looking picture: a hexagonal grid with regular spacing. This was done as follows:

Embed the points of the XY plane in a larger hexagonal lattice.
Starting at the outermost points of the hexagon, assign to each hexagonal grid point the closest of the irregularly spaced principal-component points.
This stretches the 2D points so that they completely fill the hexagonal grid, producing a more striking picture.
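
The steps above can be sketched as follows. This is one possible reading of the procedure, not the article's implementation: the grid is a hexagon-shaped patch of a hexagonal lattice built from axial coordinates, the songs are rescaled to the extent of the grid, and grid points pick their nearest unassigned song in order from the outside in. The names hex_grid and assign_to_grid are hypothetical.

```python
import math

import numpy

def hex_grid(radius):
    """Centers of a hexagon-shaped patch of a hexagonal lattice."""
    points = []
    for q in range(-radius, radius + 1):
        for r in range(-radius, radius + 1):
            if abs(q + r) <= radius:
                points.append((q + r / 2.0, r * math.sqrt(3) / 2.0))
    return numpy.array(points)

def assign_to_grid(xy, grid):
    """Greedy assignment; returns a list mapping song index -> grid index."""
    xy = numpy.asarray(xy, dtype=float)
    if len(grid) < len(xy):
        raise ValueError("grid has fewer points than songs")
    # Rescale the songs so they span roughly the same extent as the grid.
    centered = xy - xy.mean(axis=0)
    scale = numpy.abs(grid).max() / max(numpy.abs(centered).max(), 1e-12)
    songs = centered * scale
    # Visit grid points from the outermost to the innermost.
    order = numpy.argsort(-numpy.hypot(grid[:, 0], grid[:, 1]))
    free = list(range(len(songs)))          # songs not yet placed
    placement = [None] * len(songs)
    for g in order:
        if not free:
            break
        d = numpy.hypot(*(songs[free] - grid[g]).T)
        nearest = free.pop(int(d.argmin()))
        placement[nearest] = int(g)
    return placement

grid = hex_grid(2)                          # 19 grid points
rng = numpy.random.default_rng(1)
placement = assign_to_grid(rng.normal(size=(19, 2)), grid)
print(len(grid), placement)
```

A grid of radius R has 3R(R+1)+1 points, so the radius can be chosen as the smallest value that gives at least one point per song. If the grid has more points than songs, this greedy order leaves the innermost cells empty.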

Coloring the map

One of the main goals of this exercise is to make no assumptions about the content of the music collection. This means I don't want to assign predefined colors to particular genres. Instead, I cluster the feature vectors in the 18-dimensional space to find pockets of similar-sounding music, and assign colors to the centers of these groups. The result is an adaptive coloring algorithm that shows as much detail as you ask for (because the user can define the number of groups, and hence the number of colors). As mentioned earlier, I find that k=10 groups usually gives good results.
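
A minimal sketch of this kind of adaptive coloring is shown below, under several assumptions: it uses plain k-means (Lloyd's algorithm) for the clustering, and it colors every song by inverse-distance weighting of the group colors measured in feature space, whereas the article interpolates on the XY plane. The names kmeans and blend_colors, the palette and the toy data are all hypothetical.

```python
import numpy

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns the (k, n_features) cluster centers."""
    x = numpy.asarray(x, dtype=float)
    rng = numpy.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):        # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def blend_colors(x, centers, palette, eps=1e-9):
    """Color each row of x by inverse-distance weighting of center colors."""
    x = numpy.asarray(x, dtype=float)
    d = numpy.sqrt(((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    w = 1.0 / (d + eps)
    w /= w.sum(axis=1, keepdims=True)   # weights of each song sum to 1
    return w @ numpy.asarray(palette, dtype=float)

# Toy data: two groups of 20 "songs" with 3 features each.
rng = numpy.random.default_rng(0)
songs = numpy.concatenate([rng.normal(0, 0.1, (20, 3)),
                           rng.normal(5, 0.1, (20, 3))])
centers = kmeans(songs, k=2)
palette = [[255, 0, 0], [0, 0, 255]]    # one RGB color per group
colors = blend_colors(songs, centers, palette)
print(colors.shape)
```

Songs close to a group center take almost exactly that group's color, while songs between groups get a mixture, which is what produces smooth transitions between regions.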

Final output

For fun, here's a visualization of the 3,668 songs in my music collection. A full-resolution image can be obtained from here. If you zoom in on the image, you'll see that the algorithm works pretty well: the colored regions correspond to tracks of the same genre, and often the same artist, as we hoped.
