This article is based on the Chinese translation at http://blog.csdn.net/zdy0_2004/article/details/43896015 of the original post "Recommending music on Spotify with deep learning" by Sander Dieleman.
The original post is by Sander Dieleman, a researcher in the Reservoir Lab at Ghent University in Belgium. His research focuses on hierarchical representation learning for the classification and recommendation of music audio signals, with a specialization in deep learning and feature learning.
1. What is Spotify?
Spotify is one of the world's largest legal music streaming services, officially launched in October 2008 in Stockholm, Sweden. Spotify offers two tiers of service, free and paid: free users are interrupted by a certain amount of advertising, while paying users get no ads, better sound quality, and access to all features on their mobile devices. As of January 2015, Spotify had more than 60 million users, 15 million of whom were paying subscribers.
This article summarizes the work from the following angles:
- Collaborative filtering: a brief introduction, including its pros and cons.
- Content-based recommendation: what to do when usage data is not available.
- Predicting listening preferences with deep learning: recommending music based on the audio signal.
- Scaling up: some details of the convolutional neural networks I am training at Spotify.
- Analysis: what is it learning? A look at what the convolutional networks learn about music, with several audio examples.
- What is this being used for? Some potential applications of the results of my work.
- Future work.
- Conclusion.
2. Approaches to recommendation
Spotify has traditionally relied on collaborative filtering to drive its music recommendations.
(1) Collaborative filtering: the main idea is to infer users' preferences from historical usage data. If two users listen to largely the same songs, their tastes are probably similar; conversely, if two songs are listened to by the same group of users, the songs are probably similar. This means collaborative filtering requires no knowledge of the content being recommended: it is content-agnostic, relying only on consumption patterns and not on any information about the items themselves, so the same algorithm can be used to recommend books, movies or music. One consequence, however, is that popular items (for which more usage data is available) are always easier to recommend than unpopular ones (which may actually carry more useful information), and this is usually not what we want.
Another problem, specific to music, is the heterogeneity of content with similar usage patterns. For example, a listener may play an entire album in one sitting, and that album may contain intro tracks, outros, interludes, cover songs and remixes. These are not necessarily representative of the artist's work, so they are not good recommendations, yet the collaborative filtering algorithm cannot tell them apart.
The biggest problem, however, may be the inability to recommend new or unpopular songs: if there is no usage data to analyze, the collaborative filtering approach breaks down. This is the so-called cold-start problem, which collaborative filtering by itself cannot solve.
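To make the idea concrete, here is a minimal toy sketch of item-item collaborative filtering on a binary play matrix. It is purely illustrative and is not Spotify's actual implementation; the matrix, the cosine-similarity measure and the `recommend` helper are all assumptions made for the example.

```python
import numpy as np

# Toy binary play matrix: rows are users, columns are songs.
# plays[u, s] == 1 means user u has listened to song s.
plays = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1],
], dtype=float)

def cosine_similarity(a, b):
    """Cosine similarity between two vectors (0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom > 0 else 0.0

n_songs = plays.shape[1]

# Two songs are considered similar if the same users listen to them:
# compare the columns of the play matrix.
song_sim = np.array([[cosine_similarity(plays[:, i], plays[:, j])
                      for j in range(n_songs)] for i in range(n_songs)])

def recommend(user, k=2):
    """Score unheard songs by their similarity to the songs the user has played."""
    heard = plays[user] > 0
    scores = song_sim[:, heard].sum(axis=1)
    scores[heard] = -np.inf          # don't recommend songs already heard
    return np.argsort(scores)[::-1][:k]

print(recommend(user=0))             # songs most similar to user 0's listening history
```

Note how the toy example also exhibits the popularity bias described above: a song with many listeners accumulates similarity mass from more columns and is therefore easier to surface.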
(2) Content-based recommendations
Spotify has recently started looking into incorporating other sources of information into its recommendation pipeline to mitigate these problems, following its acquisition of the music intelligence platform The Echo Nest a few months ago. Many kinds of information could aid music recommendation: tags, artist and album information, lyrics, text mined from the web (reviews, interviews, ...), and the audio signal itself.
Of these sources, the audio signal is probably the hardest to use effectively. This is partly because the semantic gap between the audio signal and the factors that affect listener preference is large, and partly because those factors are so varied. Some information is relatively easy to extract from audio, such as the genre of the music and the instruments being played; other information is more challenging, such as the mood of the music or the year (or era) of release; and some is practically impossible to obtain from audio alone, such as the artist's location or the lyrical themes.
Despite these challenges, it is clear that the actual sound of a song strongly affects whether a listener will enjoy it. So analyzing the audio signal to predict who might appreciate a song seems like a reasonable idea.
Predicting listening preferences with deep learning
Last December, my colleague Aäron van den Oord and I published a paper on this topic at NIPS, titled "Deep content-based music recommendation". We tried to tackle the problem of predicting listening preferences from audio signals by training a regression model to predict the latent representations of songs obtained from a collaborative filtering model. This way we can predict the representation of a song in the collaborative filtering space even when no usage data is available. (As the paper title suggests, the regression model in question is a deep neural network.)
The basic idea behind this approach is that many collaborative filtering models work by projecting both listeners and songs into a shared, low-dimensional latent space. The position of a song in this space encodes all kinds of information that affects listener preferences. If two songs are close together in this space, they are probably similar. If a song is close to a listener, it is probably a good recommendation for that listener (if they have not heard it yet). If we can predict the position of a song in this space from the audio signal alone, we can recommend it to the right audience without needing historical usage data.
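As a rough illustration of such a shared latent space, here is a toy sketch that factorizes a play matrix with a plain truncated SVD and then recommends by proximity between user and song vectors. This is only a stand-in: the models actually used at Spotify (such as the vector_exp algorithm mentioned later) are weighted implicit-feedback factorizations, and the random matrix and helper functions below are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
plays = (rng.random((100, 500)) < 0.05).astype(float)   # toy users x songs matrix

# Factorize the play matrix into user and song vectors in a shared latent space.
k = 40                                   # number of latent factors, as in the post
U, s, Vt = np.linalg.svd(plays, full_matrices=False)
user_factors = U[:, :k] * s[:k]          # one k-dimensional vector per user
song_factors = Vt[:k].T                  # one k-dimensional vector per song

# A song close to a user in this space is a candidate recommendation.
def recommend(user, n=10):
    scores = song_factors @ user_factors[user]
    scores[plays[user] > 0] = -np.inf    # skip songs the user already played
    return np.argsort(scores)[::-1][:n]

# Two songs close together in this space are likely to be similar.
def similar_songs(song, n=10):
    sims = song_factors @ song_factors[song]
    sims /= np.linalg.norm(song_factors, axis=1) * np.linalg.norm(song_factors[song])
    return np.argsort(sims)[::-1][:n]
```

The point of the paper is then to learn a mapping from audio straight into the `song_factors` space, so that a brand-new song can be placed in it without any play data.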
In the paper, we visualize this by projecting the model's predictions in the latent space down to two dimensions using the t-SNE algorithm. As can be seen in the resulting map below, similar songs cluster together: rap music appears mostly in the top-left corner, while electronic artists gather at the bottom of the figure.
t-SNE visualization of the latent space (center). A few close-ups show artists whose songs are projected into specific areas. Taken from "Deep content-based music recommendation", Aäron van den Oord, Sander Dieleman and Benjamin Schrauwen, NIPS 2013.
Scaling up
The deep neural network we trained for the paper consists of two convolutional layers and two fully connected layers. Its input is the spectrogram of a 3-second audio fragment. To predict from a longer audio clip, we simply split it into consecutive 3-second windows and average the predictions over those windows.
At Spotify I have access to a much larger dataset of songs, as well as latent factor representations obtained from several different collaborative filtering models. I also have a high-end GPU to run experiments on. Both of these allow me to scale things up considerably: I am now training convolutional neural networks (convnets) with seven or eight layers in total, using much larger intermediate representations and many more parameters.
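A minimal sketch of this windowing scheme, assuming a hypothetical `model` that maps one 3-second spectrogram window to a latent factor vector (the function name and window handling are illustrative, not the paper's code):

```python
import numpy as np

def predict_long_clip(model, spectrogram, frames_per_window):
    """Predict latent factors for a long clip by averaging over short windows.

    `spectrogram` has shape (n_frames, n_mel_bins); `model` is assumed to map a
    single (frames_per_window, n_mel_bins) window to a latent factor vector.
    """
    n_frames = spectrogram.shape[0]
    predictions = []
    for start in range(0, n_frames - frames_per_window + 1, frames_per_window):
        window = spectrogram[start:start + frames_per_window]
        predictions.append(model(window))
    return np.mean(predictions, axis=0)   # average the per-window predictions
```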
Architecture
Below I describe one of the many architectures I have experimented with in detail. It has four convolutional layers and three dense (fully connected) layers. As you will see, a convnet designed for audio signals differs in some important ways from the convnets traditionally used for computer vision tasks.
Warning: gory details ahead! If you are not particularly interested in things like ReLUs, max-pooling and minibatch gradient descent, feel free to skip ahead to the "Analysis" section.
One of the convnet architectures I experimented with for latent factor prediction. The time axis (over which the convolutions run) is vertical.
The input to the network consists of mel-spectrograms, with 599 frames and 128 frequency bins. A mel-spectrogram is a time-frequency representation, obtained by taking the Fourier transform of short, overlapping windows of the audio signal. Each Fourier transform constitutes one frame, and successive frames are stacked into a matrix to form the spectrogram. Finally, the frequency axis is converted from a linear scale to the mel scale to reduce its dimensionality, and the magnitudes are put on a logarithmic scale.
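For reference, a mel-spectrogram along these lines can be computed with librosa. The sample rate, FFT size and hop length below are illustrative guesses; the post does not state the exact analysis parameters that produce the 599 x 128 input.

```python
import librosa

def mel_spectrogram(path, sr=22050, n_fft=2048, hop_length=1024, n_mels=128):
    """Log-scaled mel-spectrogram: time frames on one axis, 128 mel bins on the other."""
    y, sr = librosa.load(path, sr=sr, duration=30.0)        # 30-second excerpt
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    S = librosa.power_to_db(S)             # logarithmic magnitude scale
    return S.T                             # shape: (n_frames, n_mels)
```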
The convolutional layers are shown as red rectangles, indicating the shape of the filters that slide over their input. They use rectified linear units (ReLUs, with activation function max(0, x)). Note that all of these convolutions are one-dimensional: the convolution happens only along the time axis, not along the frequency axis. Although it is technically possible to convolve along both axes of a spectrogram, I do not currently do this. It is important to realize that, unlike for images, the two axes of a spectrogram have different meanings (time versus frequency), so the square filters typically used on image data make little sense here.
Max-pooling operations between the convolutional layers downsample the intermediate representations in time and add some time invariance in the process; they are indicated with "MP". As can be seen, every convolutional layer uses filters that are 4 frames wide; max-pooling with a pool size of 4 is used after the first and second convolutional layers (mainly for performance reasons), and a pool size of 2 between the other layers.
After the last convolutional layer, I add a global temporal pooling layer. This layer pools across the entire time axis, effectively computing summary statistics of the learned features over time. I use three different pooling functions: the mean, the maximum, and the L2 norm.
The reason for doing this is that the absolute location of a detected feature in the audio signal is not particularly relevant to the task at hand. This is different from image classification, where the approximate position of a feature can matter: a feature that detects clouds is likely to activate in the upper half of an image, and if it activates in the lower half, it may be detecting a sheep instead. For music recommendation, we are usually only interested in whether certain features are present in the music at all, so pooling across time makes sense.
Another way to achieve this is to train the network on short audio clips and average the network outputs across windows of a longer clip, as we did in the NIPS paper. However, incorporating the pooling into the model seems better, because it then becomes part of the learning process.
The globally pooled features are fed into a series of fully connected layers with 2048 rectified linear units; this network has two of them. The final layer of the network is the output layer, which predicts 40 latent factors obtained from the vector_exp algorithm, one of the several collaborative filtering algorithms used at Spotify.
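The sketch below puts the pieces described above together in PyTorch (the original was implemented in Theano, so this is a reconstruction, not the author's code). The layer sizes follow the description: four 1-D convolutions over time with 4-frame filters, max-pooling of 4, 4 and 2, global temporal pooling with mean, max and L2 norm, two dense layers of 2048 ReLUs, and a 40-factor output. The number of filters in the later convolutional layers and the dropout rate are assumptions.

```python
import torch
import torch.nn as nn

class AudioConvNet(nn.Module):
    """Convnet mapping a mel-spectrogram to 40 latent factors (illustrative sizes)."""

    def __init__(self, n_mels=128, n_filters=256, n_factors=40):
        super().__init__()
        # 1-D convolutions over time only; the 128 mel bins act as input channels.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, n_filters, kernel_size=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(n_filters, n_filters, kernel_size=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(n_filters, n_filters, kernel_size=4), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(n_filters, n_filters, kernel_size=4), nn.ReLU(),
        )
        # Two fully connected layers of 2048 ReLUs, then the 40-factor output layer.
        self.dense = nn.Sequential(
            nn.Linear(3 * n_filters, 2048), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(2048, 2048), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(2048, n_factors),
        )

    def forward(self, x):
        # x: (batch, n_frames, n_mels) -> (batch, n_mels, n_frames) for Conv1d
        h = self.conv(x.transpose(1, 2))
        # Global temporal pooling: mean, max and L2 norm over the time axis.
        pooled = torch.cat([h.mean(dim=2), h.amax(dim=2), h.norm(dim=2)], dim=1)
        return self.dense(pooled)

model = AudioConvNet()
factors = model(torch.randn(8, 599, 128))   # batch of 8 spectrograms -> (8, 40)
```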
Training
The network is trained to minimize the mean squared error (MSE) between its audio-based predictions and the latent factor vectors produced by the collaborative filtering model. These vectors are first normalized to unit norm; this reduces the influence of song popularity (the norms of the latent factor vectors of many collaborative filtering models tend to be correlated with how popular a song is). Dropout is used in the dense layers for regularization.
The dataset I am currently using consists of mel-spectrograms of 30-second excerpts from the 1 million most popular tracks on Spotify. I use about half of these tracks (0.5M) for training, about 5,000 for validation during training, and the rest for testing. During training, the data is augmented by applying slight random offsets to the spectrograms along the time axis.
The network is implemented in Theano and trained on an NVIDIA GeForce GTX 780Ti GPU, using minibatch gradient descent with Nesterov momentum. Data loading and augmentation run in a separate process, so the next batch of data can be loaded in parallel while the GPU trains on the current batch. In total, roughly 750,000 gradient updates were performed. I don't remember exactly how long training this particular architecture took, but it was somewhere between 18 and 36 hours in total.
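A sketch of this training objective, assuming the `model` from the architecture sketch above and a batch of latent factor vectors taken from a collaborative filtering model:

```python
import torch.nn.functional as F

def latent_factor_loss(model, spectrograms, latent_factors):
    """MSE between audio-based predictions and unit-normalized CF factor vectors."""
    # Normalize the targets to unit norm to reduce the influence of song popularity.
    targets = F.normalize(latent_factors, p=2, dim=1)
    predictions = model(spectrograms)
    return F.mse_loss(predictions, targets)
```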
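A minimal training-loop sketch combining the random time offsets and the Nesterov-momentum minibatch SGD described above, written in PyTorch rather than the original Theano. It reuses `model` and `latent_factor_loss` from the earlier sketches and assumes a `loader` that yields (spectrogram, latent factor) minibatches; the learning rate and the idea of storing slightly longer spectrograms to crop from are assumptions. In this setup a `DataLoader` with worker processes would play the role of the separate data-loading process.

```python
import torch

N_FRAMES = 599   # length of the window actually fed to the network

def random_time_crop(spectrogram, n_frames=N_FRAMES):
    """Data augmentation: take the window at a slight random offset along time."""
    max_offset = spectrogram.shape[0] - n_frames
    offset = torch.randint(0, max_offset + 1, (1,)).item() if max_offset > 0 else 0
    return spectrogram[offset:offset + n_frames]

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

for spectrograms, latent_factors in loader:           # minibatches from a DataLoader
    batch = torch.stack([random_time_crop(s) for s in spectrograms])
    loss = latent_factor_loss(model, batch, latent_factors)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```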
Variations
As I mentioned before, this is just one example of an architecture I have tried. Other things I have tried, or am planning to try, include:
- More layers!
- Using maxout units instead of rectified linear units.
- Using stochastic pooling instead of max-pooling.
- Incorporating L2 normalization into the output layer of the network.
- Data augmentation by stretching or compressing the spectrograms along the time axis.
- Concatenating the latent factor vectors obtained from several different collaborative filtering models.
Here are some things that did not work as well as expected:
- Adding "bypass" connections from each convolutional layer directly to the fully connected part of the network, with global temporal pooling applied to them. The underlying assumption was that statistics of the lower-level features could also be useful for recommendation, but unfortunately this slowed down training too much.
- Predicting the conditional variance of the factors, as is done in mixture density networks, to get confidence estimates for the predictions and to identify songs whose latent factors are hard to predict. Unfortunately this seemed to make training considerably harder, and the confidence estimates did not behave as expected.
Analysis: What is it learning?
Now for the cool part: what are these networks actually learning? What do the features look like? The main reason I chose convolutional networks for this problem is that I believe recommending music from audio signals is a fairly complex problem spanning many levels of abstraction. My hope was that successive layers of the network would learn progressively more complex and more invariant features, as they do in image classification problems.
This indeed seems to be the case. First, let's take a look at the first convolutional layer, which learns a set of filters applied directly to the input spectrograms. These filters are easy to visualize; they are shown in the image below. Click for the high-resolution version (5584x562, ~600kB). Negative values are red, positive values are blue, and white is zero. Note that each filter is only four frames wide. The individual filters are separated by dark red vertical lines.
Visualization of the filters learned by the first convolutional layer. The time axis is horizontal, the frequency axis is vertical (frequency increases from top to bottom). Click for the high-resolution version (5584x562, ~600kB).
From this representation we can see that many of the filters pick up harmonic content, which manifests itself as parallel red and blue bands at different frequencies. Sometimes these bands slope upwards or downwards, indicating rising or falling pitch. It turns out that these filters tend to detect human voices.
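A sketch of how such a visualization can be produced from the first convolutional layer of the PyTorch model sketched earlier (the figure layout and colormap choice are mine; a diverging "RdBu" colormap gives negative weights in red and positive weights in blue, as in the figure):

```python
import matplotlib.pyplot as plt

# Weights of the first Conv1d layer: shape (n_filters, n_mel_bins, filter_width),
# i.e. (256, 128, 4) for the model sketched above.
weights = model.conv[0].weight.detach().cpu().numpy()

n_show = 16                                    # show only the first few filters
fig, axes = plt.subplots(1, n_show, figsize=(2 * n_show, 4))
vmax = abs(weights[:n_show]).max()
for i, ax in enumerate(axes):
    # Frequency on the vertical axis, the 4 time frames on the horizontal axis.
    ax.imshow(weights[i], cmap="RdBu", vmin=-vmax, vmax=vmax, aspect="auto")
    ax.set_title(f"filter {i}")
    ax.axis("off")
plt.show()
```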
Low-level feature playlists: maximal activation
To get a better idea of what the filters are learning, I made some playlists of the songs from the test set that activate them most strongly. Below are a few examples. The first layer of the network has 256 filters, numbered from 0 to 255. Note that this numbering is arbitrary, since the filters are unordered.
These four playlists were obtained by finding the songs that maximally activate a given filter within the 30 seconds that were analyzed. I selected a few filters from the first convolutional layer that seemed to pick up something interesting, computed the feature representations for all songs, and then looked for the maximal activations across the whole test set. Note that if you want to hear what a filter is picking up, you should listen to the middle of the tracks, because that is the part of the audio signal that was analyzed.
Each of the Spotify playlists below has 10 tracks. Some tracks may not be available in all countries due to licensing issues.
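A sketch of how such playlists can be generated with the model above: compute the first-layer feature maps for every test track and rank the tracks by the activation of the chosen filter, taking either the peak over time (these playlists) or the time-averaged activation (the "average activation" playlists in the next subsection). The `test_spectrograms` tensor and the helper's name are assumptions.

```python
import torch

def filter_playlist(model, test_spectrograms, filter_index, n_tracks=10, mode="max"):
    """Indices of the tracks that most strongly activate one first-layer filter."""
    with torch.no_grad():
        # First convolution + ReLU only: (n_total_tracks, n_filters, n_frames)
        feature_maps = torch.relu(model.conv[0](test_spectrograms.transpose(1, 2)))
    activations = feature_maps[:, filter_index, :]    # activation over time
    if mode == "max":
        scores = activations.amax(dim=1)              # peak activation
    else:
        scores = activations.mean(dim=1)              # sustained activation
    return torch.argsort(scores, descending=True)[:n_tracks]

# e.g. the tracks that maximally activate filter 14 (the "vibrato singing" filter):
# playlist = filter_playlist(model, test_spectrograms, filter_index=14)
```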
- Filter 14: vibrato singing
- Filter 242: ambience
- Filter 250: vocal thirds
- Filter 253: bass drum
Close-ups of filters 14, 242, 250 and 253.
- Filter 14 seems to detect vibrato singing.
- Filter 242 detects some kind of ringing ambience.
- Filter 250 detects vocal thirds, i.e. multiple singers singing the same melody, but with the notes a third (about four semitones) apart.
- Filter 253 detects various types of bass drum sounds.
The genres of the tracks in these playlists are quite diverse, which indicates that these filters pick up mostly low-level properties of the audio signal.
Low-level feature playlists: average activation
The following four playlists were obtained in a slightly different way: I first computed the average activation of each feature across time for each track, and then found the tracks that maximize this average. This means that for the tracks in these playlists, the filter in question is active throughout the entire 30 seconds that were analyzed (i.e. not just as a short "peak"). This is more useful for detecting harmonic patterns.
- Filter 1: noise, distortion
- Filter 2: pitch (A, Bb)
- Filter 4: drones
- Filter 28: chord (A, Am)
Close-ups of filters 1, 2, 4 and 28.
- Filter 1 detects noise and (guitar) distortion.
- Filter 2 seems to detect a specific pitch: a low B flat. It also sometimes detects the pitch A, one semitone lower, because the frequency resolution of the mel-spectrogram is not high enough to distinguish between the two.
- Filter 4 detects various low-pitched drones.
- Filter 28 detects the chord of A. It seems to detect both the minor and the major version, so it may actually just be detecting the pitches A and E (a fifth apart).
I find it very interesting that the network has learned to detect specific pitches and chords. I used to think that the particular pitches and chords occurring in a song would not really affect how much listeners enjoy it. I have two hypotheses as to why this is happening:
- With different filters tuned to different pitches, the network may really just be learning to detect harmonicity. At higher levels, these could then be pooled together to detect harmonicity across all pitches.
- The network may have learned that certain chords and chord progressions are more common in some genres of music than in others.
I have not verified either of these hypotheses, but the latter seems like more of a challenge for the network to learn, so I think the former is more likely.
High-level feature playlists
Each layer of the network takes the feature representation of the layer below and extracts a set of higher-level features from it. At the topmost fully connected layer of the network, just before the output layer, the learned filters turn out to be very selective for certain subgenres. Obviously, it is not straightforward to visualize what these filters respond to at the spectrogram level. Below are six playlists of test-set songs that maximally activate some of these high-level filters.
- Filter 3: Christian rock
- Filter 15: choirs / a cappella singing + smooth jazz
- Filter 26: gospel
- Filter 37: Chinese pop
- Filter 49: synthesized electronic music, 8-bit
- Filter 1024: deep house
Each of these filters clearly identifies a particular genre. Interestingly, some filters, like number 15, seem to be multimodal: they are strongly activated by two or more styles of music, and those styles are often completely unrelated. Presumably the output of these filters is disambiguated when it is combined with the activations of all the other filters.
Filter 37 is interesting because it seems to recognize the Chinese language. This is not entirely implausible, since the phonetics of Chinese are quite distinctive compared to other languages. A few other filters also seem to have picked up on particular languages: for example, there is one that detects rap music in Spanish. It is also possible that Chinese pop has other distinguishing characteristics, and that this is what the filter actually detects.
I spent some time analyzing the first 50 or so filters in detail. Some other kinds of music I found filters for include: lounge music, reggae, darkwave, country, metalcore, salsa, Dutch and German carnival music, children's songs, vocal trance, punk, Turkish pop, and my favorite, "exclusively Armin van Buuren". Apparently he has so many tracks that he gets his own filter.
The filters learned by Alex Krizhevsky's ImageNet network have been reused for a wide range of other computer vision tasks with great success. Given the diversity and invariance properties of the filters learned here from audio signals, they could probably also be reused for other music information retrieval tasks, besides predicting latent factors.
Similarity-based playlists
The predicted latent factor vectors can also be used to find songs that sound similar. Below are a few playlists that were generated by predicting the factor vector for a given song and then finding the songs in the test set whose predicted factor vectors are closest to it in terms of cosine distance. As a result, the first track in each playlist is always the query track itself.
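A sketch of the nearest-neighbour lookup behind these playlists, assuming a matrix `predicted_factors` holding the predicted 40-dimensional factor vector for each test-set track (the function name and matrix are assumptions):

```python
import numpy as np

def similar_tracks(predicted_factors, query_index, n_tracks=10):
    """Rank test-set tracks by cosine distance to the query's predicted factors."""
    normed = predicted_factors / np.linalg.norm(predicted_factors, axis=1, keepdims=True)
    cosine_sim = normed @ normed[query_index]
    # Smallest cosine distance = largest cosine similarity; the query itself comes first.
    return np.argsort(cosine_sim)[::-1][:n_tracks]
```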
- The Notorious B.I.G. – Juicy (hip-hop)
- Cloudkicker – He would be riding on the subway... (post-rock, avant-garde metal)
- Architects – Numbers Count for Nothing (metalcore, hardcore)
- Neophyte – Army of Hardcore (hardcore electronic music, gabber)
- Fleet Foxes – Sun It Rises (indie folk)
- John Coltrane – My Favorite Things (jazz)
Most of the similar tracks are probably good recommendations for fans of the query track. Of course these lists are far from perfect, but considering that they were obtained based only on the audio signal, the results are pretty decent. One example where things go wrong is the playlist for John Coltrane's "My Favorite Things", which contains a few outliers, most notably Elvis Presley's "Crawfish". The reason is probably that the analyzed section of the audio signal (from 8:40 to 9:10) contains a crazy saxophone solo; analyzing the whole song would probably give better results.
What is this being used for?
Spotify already uses a whole bunch of different information sources and algorithms in its recommendation pipeline, so the most obvious application of my work is simply to add it as another signal. It could also be used to filter out outliers in the recommendations produced by other algorithms. As I pointed out earlier, collaborative filtering tends to include intros, outros, cover songs and remixes in its recommendations; these could be filtered out effectively with an audio-based approach.
One of my main goals with this work is to make it possible to recommend new and unpopular music. I hope this will help lesser-known and up-and-coming bands by levelling the playing field somewhat, allowing Spotify to recommend their music to the right audience. (Promoting up-and-coming bands also happens to be a main goal of my non-profit website, got-djent.com.)
Hopefully some of these features will be A/B-tested on real users soon, so we can find out whether audio-based recommendation actually makes a difference in practice. This is something I am very excited about, since it is not something you can easily do in academia.
Future work
Another form of user feedback that Spotify collects are the thumbs-up and thumbs-down that users give to tracks played on the radio feature. This kind of information is very useful for determining which tracks are similar; unfortunately, it is also very noisy. I am currently trying to use this data in a "learning to rank" setting. I am also experimenting with various distance metric learning methods, such as DrLIM. If any of this leads to cool results, I may write another post about it.
Conclusion
In this post I gave an overview of my work so far as a machine learning intern at Spotify. I explained my approach to audio-based music recommendation using convolutional networks and shared some insights into what the networks actually learn. For more details about the approach, please refer to the NIPS 2013 paper "Deep content-based music recommendation" with Aäron van den Oord.
If you are interested in deep learning, feature learning and their applications to music, have a look at the research section of my website for an overview of my other work in this domain. If you are interested in how Spotify approaches music recommendation, check out the presentations on SlideShare and Erik Bernhardsson's blog.
Spotify is a really cool place to work. They are very open about the methods they use (and let me write this blog post), which is not that common in industry.