This article works through Chapter 3 of Advanced Analytics with Spark, "Recommending Music and the Audioscrobbler Data Set".
The complete code is at https://github.com/libaoquan95/aasPractice/tree/master/c3/recommend
1. Getting the data set
The example in this chapter uses a data set published by Audioscrobbler. Audioscrobbler was the first music recommendation system for Last.fm, which was founded in 2002 and was one of the first Internet streaming radio sites.
The Audioscrobbler data set is unusual in that it records only plays. The main data set is in the file user_artist_data.txt, which covers about 141,000 users and 1.6 million artists. It holds about 24.2 million records of a user playing an artist, each with a play count.
The artist_data.txt file gives each artist's ID and the corresponding name. Note that when plays are recorded, the client application submits the artist's name. That name may be misspelled or nonstandard, and this may only be detected afterwards. For example, "The Smiths", "Smiths, the", and "the Smiths" may appear under different artist IDs even though they clearly refer to the same artist. The data set therefore provides an artist_alias.txt file, which maps the IDs of misspellings and variants to each artist's canonical ID.
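As a minimal sketch of how such an alias file can be used, the plain-Scala snippet below (no Spark required) builds an alias map and resolves IDs through it. The object name AliasSketch, the helper names, and the sample IDs are invented for illustration. Only the tab-separated "alias ID, canonical ID" layout comes from the way the file is parsed later in this article.

```scala
object AliasSketch {
  // Parse one "<alias id>\t<canonical id>" line; return None if it is malformed.
  def parseAlias(line: String): Option[(Int, Int)] =
    line.split('\t') match {
      case Array(alias, canonical) if alias.nonEmpty && canonical.nonEmpty =>
        try Some((alias.trim.toInt, canonical.trim.toInt))
        catch { case _: NumberFormatException => None }
      case _ => None
    }

  // Hypothetical sample lines; these IDs do not come from the real data set.
  val sampleLines: Seq[String] = Seq("101\t7", "102\t7", "\t9", "oops")

  // Malformed lines are dropped, leaving 101 -> 7 and 102 -> 7.
  val aliasMap: Map[Int, Int] = sampleLines.flatMap(line => parseAlias(line)).toMap

  // An ID with no alias entry is treated as already canonical.
  def canonicalId(id: Int): Int = aliasMap.getOrElse(id, id)

  def main(args: Array[String]): Unit = {
    println(canonicalId(101)) // alias resolved to the canonical ID
    println(canonicalId(7))   // canonical ID returned unchanged
  }
}
```

The getOrElse(id, id) lookup is the same idea the Spark code in section 2 applies to every play record through a broadcast copy of the map.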
The data set can be downloaded from:
- http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html (the address given in the book; no longer available)
- https://github.com/libaoquan95/aasPractice/tree/master/c3/profiledata_06-May-2005 (the data set exceeds the git upload limit, so it is stored as a split, multi-volume compressed archive)
2. Data processing
Load the data sets
// Note: `sc` is a SparkSession here (not a SparkContext); sc.read and sc.sparkContext are both used in this article
val dataDirBase = "profiledata_06-May-2005/"
val rawUserArtistData = sc.read.textFile(dataDirBase + "user_artist_data.txt")
val rawArtistData = sc.read.textFile(dataDirBase + "artist_data.txt")
val rawArtistAlias = sc.read.textFile(dataDirBase + "artist_alias.txt")

// Preview the first rows of each file
rawUserArtistData.show()
rawArtistData.show()
rawArtistAlias.show()
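Parse the raw lines and convert them to DataFrames
import sc.implicits._  // needed for toDF, the $"col" syntax, and the Dataset encoders

// artist_data.txt: "<id>\t<name>"; skip lines with no name or a non-numeric ID
val artistByID = rawArtistData.flatMap { line =>
  val (id, name) = line.span(_ != '\t')
  if (name.isEmpty()) {
    None
  } else {
    try {
      Some((id.toInt, name.trim))
    } catch {
      case _: NumberFormatException => None
    }
  }
}.toDF("id", "name").cache()

// artist_alias.txt: "<alias id>\t<canonical id>"; skip lines with an empty alias ID
val artistAlias = rawArtistAlias.flatMap { line =>
  val Array(artist, alias) = line.split('\t')
  if (artist.isEmpty()) {
    None
  } else {
    Some((artist.toInt, alias.toInt))
  }
}.collect().toMap

// Broadcast the alias map so that each executor holds a single copy
val bArtistAlias = sc.sparkContext.broadcast(artistAlias)

// user_artist_data.txt: "<user> <artist> <count>"; replace alias IDs with canonical IDs
val userArtistDF = rawUserArtistData.map { line =>
  val Array(userId, artistID, count) = line.split(' ').map(_.toInt)
  val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
  (userId, finalArtistID, count)
}.toDF("user", "artist", "count").cache()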
Look up an alias ID and its canonical ID to compare the artist names
val (badID, goodID) = artistAlias.head
artistByID.filter($"id" isin (badID, goodID)).show()
3. Using Spark MLlib for recommendations
Spark MLlib implements collaborative filtering with the ALS (alternating least squares) algorithm. The model is trained on (user ID, item ID, rating) triples; here the play count serves as the rating. Note that user IDs and item IDs must be integers.
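import org.apache.spark.ml.recommendation.ALS
import scala.util.Random

// Hold out about 10% of the records for cross-validation
val Array(trainData, cvData) = userArtistDF.randomSplit(Array(0.9, 0.1))

// Play counts are implicit feedback, so setImplicitPrefs(true) is used
val model = new ALS().
  setSeed(Random.nextLong()).
  setImplicitPrefs(true).
  setRank(10).
  setRegParam(0.01).
  setAlpha(1.0).
  setMaxIter(5).
  setUserCol("user").
  setItemCol("artist").
  setRatingCol("count").
  setPredictionCol("prediction").
  fit(trainData)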
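With setImplicitPrefs(true), the play counts are not treated as explicit ratings. The following plain-Scala sketch illustrates the idea, assuming the implicit-feedback formulation of Hu, Koren, and Volinsky that the Spark ALS documentation refers to: a play count becomes a binary preference plus a confidence weight scaled by alpha. The object and function names are invented for this example, and the code shows the concept only, not Spark's internal implementation.

```scala
object ImplicitFeedbackSketch {
  // Preference: 1.0 if the user played the artist at all, otherwise 0.0.
  def preference(playCount: Double): Double =
    if (playCount > 0) 1.0 else 0.0

  // Confidence in that preference grows linearly with the play count.
  // alpha (assumed to correspond to setAlpha) controls how fast it grows.
  def confidence(playCount: Double, alpha: Double): Double =
    1.0 + alpha * playCount

  def main(args: Array[String]): Unit = {
    val alpha = 1.0
    for (count <- Seq(0.0, 1.0, 55.0)) {
      println(s"count=$count preference=${preference(count)} " +
        s"confidence=${confidence(count, alpha)}")
    }
  }
}
```

Under this reading, a larger alpha gives frequently played artists more weight relative to artists the user never played.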
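The recommendation model is now built. Following the book, recommendations are computed for one user at a time by scoring every artist for that user, because the version of Spark MLlib the book targets has no method for recommending to all users at once. (Spark 2.2 and later add recommendForAllUsers to ALSModel.)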
import org.apache.spark.sql.functions.lit

val userId = 2093760
val topN = 10

// Pair every artist known to the model with the target user
val toRecommend = model.itemFactors.
  select($"id".as("artist")).
  withColumn("user", lit(userId))

// Score each (user, artist) pair and keep the topN highest predictions
val topRecommendations = model.transform(toRecommend).
  select("artist", "prediction").
  orderBy($"prediction".desc).
  limit(topN)

// Look up the names of the recommended artists
val recommendedArtistIDs = topRecommendations.select("artist").as[Int].collect()
artistByID.join(sc.createDataset(recommendedArtistIDs).toDF("id"), "id").
  select("name").show()
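The scoring step above can be sketched without Spark. In a matrix-factorization model such as ALS, the predicted score for a (user, item) pair is the dot product of the user's and the item's factor vectors, and recommending means ranking all items by that score. The snippet below is a minimal illustration under that assumption; the object name, the rank-2 factor values, and the item IDs are all invented.

```scala
object TopNSketch {
  // Predicted score: dot product of the user and item factor vectors.
  def score(userFactors: Array[Double], itemFactors: Array[Double]): Double =
    userFactors.zip(itemFactors).map { case (u, i) => u * i }.sum

  // Score every item for one user, sort by descending score, keep the first n IDs.
  def topN(userFactors: Array[Double],
           items: Map[Int, Array[Double]],
           n: Int): Seq[Int] =
    items.toSeq
      .map { case (id, factors) => (id, score(userFactors, factors)) }
      .sortBy { case (_, s) => -s }
      .take(n)
      .map { case (id, _) => id }

  // Hypothetical rank-2 factors for one user and three items.
  val user: Array[Double] = Array(1.0, 0.5)
  val items: Map[Int, Array[Double]] = Map(
    10 -> Array(1.0, 0.0), // score 1.0
    20 -> Array(0.0, 1.0), // score 0.5
    30 -> Array(0.5, 0.5)  // score 0.75
  )

  def main(args: Array[String]): Unit =
    println(topN(user, items, 2)) // the two highest-scoring item IDs
}
```

This mirrors the orderBy($"prediction".desc).limit(topN) step, with Spark doing the same ranking over every artist in model.itemFactors.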
Spark Practice-Music recommendations and Audioscrobbler datasets