Spark Practice: Music Recommendations and the Audioscrobbler Data Set

Last Update: 2018-05-26 Source: Internet

Author: User


This article works through Chapter 3 of Advanced Analytics with Spark, which covers music recommendation with the Audioscrobbler data set.
The complete code is available at https://github.com/libaoquan95/aasPractice/tree/master/c3/recommend

1. Get Data Set

The example in this chapter uses a data set published by Audioscrobbler. Audioscrobbler was Last.fm's first music recommendation system. Last.fm was founded in 2002 and is one of the earliest Internet streaming radio sites.

The Audioscrobbler data set is somewhat unusual because it records only plays. The main data is in the file user_artist_data.txt, which covers about 141,000 users and 1.6 million artists and contains roughly 24.2 million records of a user playing an artist's songs, each with a play count.

The data set lists each artist's ID and the corresponding name in the file artist_data.txt. Note that when plays are recorded, the client application submits the artist's name. A name may be misspelled or nonstandard, and this may only be discovered later. For example, "The Smiths", "Smiths, The" and "the smiths" may appear under different artist IDs even though they clearly refer to the same artist. To map misspelled or variant artist IDs to each artist's canonical ID, the data set provides the file artist_alias.txt.
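
The alias mapping described above can be sketched in plain Scala, without Spark. This is only an illustration: the object name, the helper canonicalId and all IDs below are hypothetical and are not taken from the data set.

```scala
object AliasDemo {
  // Hypothetical alias table: variant artist ID -> canonical artist ID.
  // These numbers are made up for illustration.
  val aliasToCanonical: Map[Int, Int] = Map(
    1001 -> 42, // stands in for a misspelled variant
    1002 -> 42  // stands in for a "Name, The" variant
  )

  // Return the canonical ID when the ID is a known alias,
  // otherwise return the ID unchanged.
  def canonicalId(artistId: Int): Int =
    aliasToCanonical.getOrElse(artistId, artistId)

  def main(args: Array[String]): Unit =
    println(Seq(1001, 1002, 42, 7).map(canonicalId))
}
```

IDs that do not appear in the alias table pass through unchanged, so the lookup can be applied to every record without first checking whether it is an alias.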

Data set download:

http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html (address given in the original book, no longer available)
https://github.com/libaoquan95/aasPractice/tree/master/c3/profiledata_06-May-2005 (the data set exceeds the Git upload limit, so it is stored as a split compressed archive)

2. Data processing

Load the data sets

    // Note: in this article `sc` is a SparkSession, not a SparkContext.
    val dataDirBase = "profiledata_06-May-2005/"
    val rawUserArtistData = sc.read.textFile(dataDirBase + "user_artist_data.txt")
    val rawArtistData = sc.read.textFile(dataDirBase + "artist_data.txt")
    val rawArtistAlias = sc.read.textFile(dataDirBase + "artist_alias.txt")

    // Preview the first rows of each file
    rawUserArtistData.show()
    rawArtistData.show()
    rawArtistAlias.show()

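Format the data sets and convert them to DataFrames

    import sc.implicits._ // encoders, toDF and the $"col" syntax

    // artist_data.txt: "id<TAB>name"; skip lines with no name or a non-numeric ID
    val artistByID = rawArtistData.flatMap { line =>
      val (id, name) = line.span(_ != '\t')
      if (name.isEmpty()) {
        None
      } else {
        try {
          Some((id.toInt, name.trim))
        } catch {
          case _: NumberFormatException => None
        }
      }
    }.toDF("id", "name").cache()

    // artist_alias.txt: "aliasID<TAB>canonicalID"; skip lines with an empty first field
    val artistAlias = rawArtistAlias.flatMap { line =>
      val Array(artist, alias) = line.split('\t')
      if (artist.isEmpty()) {
        None
      } else {
        Some((artist.toInt, alias.toInt))
      }
    }.collect().toMap

    // Broadcast the alias map so every executor receives one copy
    val bArtistAlias = sc.sparkContext.broadcast(artistAlias)

    // user_artist_data.txt: "user artist count"; replace alias IDs with canonical IDs
    val userArtistDF = rawUserArtistData.map { line =>
      val Array(userId, artistID, count) = line.split(' ').map(_.toInt)
      val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
      (userId, finalArtistID, count)
    }.toDF("user", "artist", "count").cache()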
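
The line parsing used above can be tried outside Spark. The following is a minimal sketch, assuming the "id<TAB>name" and "user artist count" layouts that the code above relies on. The object name, the helper names and the sample lines are hypothetical. Unlike the code above, it trims the name before checking for emptiness and uses Try instead of try/catch.

```scala
import scala.util.Try

object ParseDemo {
  // Parse one "id<TAB>name" line. span splits at the first tab, so the
  // second part still starts with the tab and must be trimmed.
  // Returns None when the name is missing or the ID is not a number.
  def parseArtist(line: String): Option[(Int, String)] = {
    val (id, name) = line.span(_ != '\t')
    if (name.trim.isEmpty) None
    else Try(id.trim.toInt).toOption.map(i => (i, name.trim))
  }

  // Parse one whitespace-separated "user artist count" line.
  // Returns None when there are not exactly three integer fields.
  def parsePlay(line: String): Option[(Int, Int, Int)] =
    line.trim.split("\\s+") match {
      case Array(u, a, c) => Try((u.toInt, a.toInt, c.toInt)).toOption
      case _              => None
    }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines, not copied from the data set
    println(parseArtist("123\tThe Smiths"))
    println(parseArtist("line without a tab"))
    println(parsePlay("1 2 3"))
  }
}
```

Returning Option lets flatMap drop malformed lines silently, which is the same pattern the DataFrame code uses with Some and None.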

View an artist alias and the corresponding canonical name

    val (badID, goodID) = artistAlias.head
    artistByID.filter($"id" isin (badID, goodID)).show()

3. Using Spark MLlib for recommendations

Spark MLlib implements collaborative filtering with the ALS (alternating least squares) algorithm. The model is trained on (user ID, item ID, rating) triples. Note that user IDs and item IDs must be integers. Here the play count takes the place of the rating, and the model is trained with implicit preferences enabled.

    import org.apache.spark.ml.recommendation.ALS
    import scala.util.Random

    // Hold out 10% of the data; only trainData is used for fitting below
    val Array(trainData, cvData) = userArtistDF.randomSplit(Array(0.9, 0.1))

    val model = new ALS().
      setSeed(Random.nextLong()).
      setImplicitPrefs(true).
      setRank(10).
      setRegParam(0.01).
      setAlpha(1.0).
      setMaxIter(5).
      setUserCol("user").
      setItemCol("artist").
      setRatingCol("count").
      setPredictionCol("prediction").
      fit(trainData)
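
To relate the parameters above to the model, the implicit-feedback objective that ALS is generally described as minimizing (following Hu, Koren and Volinsky) can be written as below. This is a sketch of the standard formulation, not a statement of Spark's exact internals: Spark's handling of the regularization term may differ in scaling. Here r_ui is the play count, k is the rank (10 above), α is the value passed to setAlpha and λ is the value passed to setRegParam.

```latex
\min_{X,\,Y} \;\sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  \;+\; \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr),
\qquad x_u,\, y_i \in \mathbb{R}^{k}

p_{ui} =
\begin{cases}
1 & \text{if } r_{ui} > 0 \\
0 & \text{otherwise}
\end{cases}
\qquad
c_{ui} = 1 + \alpha\, r_{ui}
```

In this view the play count is not predicted directly. It only sets the confidence c_ui placed on the binary preference p_ui, which is why predictions from an implicit model are scores rather than estimated play counts.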

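The recommendation model is now built. The code below produces recommendations for one user at a time: it scores every artist for that user and keeps the highest-scoring ones. (Spark 2.2 and later also provide ALSModel.recommendForAllUsers for producing recommendations for all users in one call.)

    import org.apache.spark.sql.functions.lit

    val userId = 2093760
    val topN = 10

    // Pair every artist known to the model with the chosen user
    val toRecommend = model.itemFactors.
      select($"id".as("artist")).
      withColumn("user", lit(userId))

    // Score each (user, artist) pair and keep the topN highest predictions
    val topRecommendations = model.transform(toRecommend).
      select("artist", "prediction").
      orderBy($"prediction".desc).
      limit(topN)

    // View the recommendation results
    val recommendedArtistIDs = topRecommendations.select("artist").as[Int].collect()
    artistByID.join(sc.createDataset(recommendedArtistIDs).
      toDF("id"), "id").
      select("name").show()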
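
Conceptually, scoring one user against every artist amounts to taking the dot product of the user's factor vector with each artist's factor vector and keeping the largest results. The plain-Scala sketch below illustrates that idea only. The object name, the helpers dot and topN, and the factor values are hypothetical, and it does not use or reproduce Spark's implementation.

```scala
object TopNDemo {
  // Dot product of two factor vectors of the same rank.
  def dot(a: Array[Float], b: Array[Float]): Float = {
    require(a.length == b.length, "factor vectors must have the same rank")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Score every item for one user and keep the n best (item ID, score) pairs,
  // ordered from highest to lowest score.
  def topN(user: Array[Float], items: Map[Int, Array[Float]], n: Int): Seq[(Int, Float)] =
    items.toSeq
      .map { case (id, factors) => (id, dot(user, factors)) }
      .sortBy { case (_, score) => -score }
      .take(n)

  def main(args: Array[String]): Unit = {
    // Made-up rank-2 factors for one user and three items
    val user = Array(1.0f, 0.0f)
    val items = Map(
      1 -> Array(0.9f, 0.1f),
      2 -> Array(0.2f, 0.8f),
      3 -> Array(0.5f, 0.5f)
    )
    println(topN(user, items, 2).map(_._1))
  }
}
```

Sorting all scores is fine for a sketch. With millions of artists, a bounded priority queue or Spark's own orderBy with limit, as used above, avoids holding a fully sorted list.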


This article is an English version of an article originally published in Chinese on aliyun.com.
