Spark Practice: Music Recommendations and the Audioscrobbler Dataset

Source: Internet
Author: User

This article works through the music recommendation example with the Audioscrobbler data from Chapter 3 of Advanced Analytics with Spark.
The complete code is available at https://github.com/libaoquan95/aasPractice/tree/master/c3/recommend

1. Get the dataset

This chapter's example uses a dataset published by Audioscrobbler. Audioscrobbler was Last.fm's first music recommendation system. Last.fm, founded in 2002, was one of the earliest Internet streaming radio sites.

The Audioscrobbler dataset is somewhat unusual because it records only plays. The main data is in the file user_artist_data.txt, which covers about 141,000 users and 1.6 million artists. It holds about 24.2 million records of a user playing an artist, each with the number of plays.
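
Each record in user_artist_data.txt is a line of three space-separated integers: user ID, artist ID and play count. As a rough illustration, here is a minimal Spark-free sketch of parsing such a line. The names PlayRecords, Play and parsePlay are made up for this example, and the sample lines are illustrative values, not rows from the dataset.

```scala
import scala.util.Try

object PlayRecords {
  // One play record: who played which artist, and how many times.
  final case class Play(user: Int, artist: Int, count: Int)

  // Parse a space-separated "userID artistID playCount" line.
  // Returns None when the line does not have exactly three integer fields.
  def parsePlay(line: String): Option[Play] =
    line.trim.split(' ') match {
      case Array(u, a, c) => Try(Play(u.toInt, a.toInt, c.toInt)).toOption
      case _              => None
    }

  def main(args: Array[String]): Unit = {
    // Illustrative input only (assumption: not taken from the real file).
    val sample = Seq("1 10 5", "1 11 2", "not a record")
    sample.map(parsePlay).foreach(println)
  }
}
```

Returning an Option lets a caller drop malformed lines with flatMap instead of failing the whole job on one bad record.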

The file artist_data.txt gives each artist's ID and the corresponding name. Note that when a play is recorded, the client application submits the artist's name. If that name is misspelled or nonstandard, the problem may only be detected afterwards. For example, "The Smiths", "Smiths, The" and "the smiths" may appear under different artist IDs even though they clearly refer to the same artist. The dataset therefore provides the file artist_alias.txt, which maps the IDs of misspelled or variant names to the artist's canonical ID.
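
The alias lookup amounts to a map from variant ID to canonical ID, with unknown IDs passed through unchanged. The following is a small sketch of that idea in plain Scala. The object name ArtistAliases, the helper canonicalId and all IDs are hypothetical and chosen only for illustration.

```scala
object ArtistAliases {
  // Map a possibly non-canonical artist ID to its canonical ID.
  // IDs that have no alias entry are returned unchanged.
  def canonicalId(aliases: Map[Int, Int], artistId: Int): Int =
    aliases.getOrElse(artistId, artistId)

  def main(args: Array[String]): Unit = {
    // Hypothetical IDs: 101 and 102 are variants of canonical artist 100.
    val aliases = Map(101 -> 100, 102 -> 100)
    // (user, artist, count) triples, also made up.
    val plays = Seq((1, 101, 3), (1, 100, 2), (2, 102, 7), (2, 200, 1))
    val canonical = plays.map { case (user, artist, count) =>
      (user, canonicalId(aliases, artist), count)
    }
    canonical.foreach(println)
  }
}
```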

The dataset can be downloaded from:

    1. http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html (the address given in the book, no longer available)
    2. https://github.com/libaoquan95/aasPractice/tree/master/c3/profiledata_06-May-2005 (the dataset exceeds the Git upload limit, so it is stored as a split compressed archive)

2. Data processing

Load the dataset

    // Note: `sc` must be a SparkSession here (not a SparkContext),
    // so read.textFile returns a Dataset[String]
    val dataDirBase = "profiledata_06-May-2005/"
    val rawUserArtistData = sc.read.textFile(dataDirBase + "user_artist_data.txt")
    val rawArtistData = sc.read.textFile(dataDirBase + "artist_data.txt")
    val rawArtistAlias = sc.read.textFile(dataDirBase + "artist_alias.txt")

    rawUserArtistData.show()
    rawArtistData.show()
    rawArtistAlias.show()



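Parse the data and convert it to DataFrames

    // Needed for toDF, the $"col" syntax and Dataset encoders (sc is the SparkSession)
    import sc.implicits._

    // artist_data.txt: artist ID and name separated by a tab; skip lines that do not parse
    val artistByID = rawArtistData.flatMap { line =>
      val (id, name) = line.span(_ != '\t')
      if (name.isEmpty()) {
        None
      } else {
        try {
          Some((id.toInt, name.trim))
        } catch {
          case _: NumberFormatException => None
        }
      }
    }.toDF("id", "name").cache()

    // artist_alias.txt: maps a variant artist ID to the canonical ID
    val artistAlias = rawArtistAlias.flatMap { line =>
      val Array(artist, alias) = line.split('\t')
      if (artist.isEmpty()) {
        None
      } else {
        Some((artist.toInt, alias.toInt))
      }
    }.collect().toMap

    val bArtistAlias = sc.sparkContext.broadcast(artistAlias)

    // user_artist_data.txt: user ID, artist ID and play count separated by spaces
    val userArtistDF = rawUserArtistData.map { line =>
      val Array(userId, artistID, count) = line.split(' ').map(_.toInt)
      // Replace a variant ID with the canonical artist ID when an alias exists
      val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
      (userId, finalArtistID, count)
    }.toDF("user", "artist", "count").cache()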



Look up an artist alias and its canonical name

    val (badID, goodID) = artistAlias.head
    artistByID.filter($"id" isin (badID, goodID)).show()

3. Using Spark MLlib for recommendations

Spark MLlib implements collaborative filtering with the ALS (alternating least squares) algorithm. The model is trained on triples of (user ID, item ID, rating); in this example the play count takes the place of the rating. Note that user IDs and item IDs must be integers.
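
Because the IDs have to fit in the integer range, it can be worth checking them before building the training triples. The sketch below shows one way to do that in plain Scala. The object name IdCheck and the helpers toIntId and toTriple are hypothetical and are not part of Spark.

```scala
object IdCheck {
  // Convert a raw ID to Int only when it fits in the 32-bit signed range.
  def toIntId(raw: Long): Option[Int] =
    if (raw >= Int.MinValue && raw <= Int.MaxValue) Some(raw.toInt) else None

  // Build a (user, item, rating) triple, or None if either ID is out of range.
  def toTriple(user: Long, item: Long, rating: Float): Option[(Int, Int, Float)] =
    for {
      u <- toIntId(user)
      i <- toIntId(item)
    } yield (u, i, rating)

  def main(args: Array[String]): Unit = {
    println(toTriple(2093760L, 1L, 5.0f))             // both IDs fit in an Int
    println(toTriple(Int.MaxValue.toLong + 1, 1L, 5.0f)) // user ID is too large
  }
}
```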

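    import org.apache.spark.ml.recommendation.ALS
    import scala.util.Random

    // Hold out 10% of the data; cvData is not used further in this article
    val Array(trainData, cvData) = userArtistDF.randomSplit(Array(0.9, 0.1))

    val model = new ALS().
      setSeed(Random.nextLong()).
      setImplicitPrefs(true).   // play counts are implicit feedback, not explicit ratings
      setRank(10).
      setRegParam(0.01).
      setAlpha(1.0).
      setMaxIter(5).
      setUserCol("user").
      setItemCol("artist").
      setRatingCol("count").
      setPredictionCol("prediction").
      fit(trainData)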

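The recommendation model is now built. The code below produces recommendations for one user at a time: it scores every artist known to the model for that user and keeps the top N. (Spark 2.2 and later also provide ALSModel.recommendForAllUsers for computing recommendations for all users in one call.)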

    import org.apache.spark.sql.functions.lit

    val userId = 2093760
    val topN = 10

    // Pair every artist known to the model with the target user
    val toRecommend = model.itemFactors.
      select($"id".as("artist")).
      withColumn("user", lit(userId))

    // Score each (user, artist) pair and keep the topN highest predictions
    val topRecommendations = model.transform(toRecommend).
      select("artist", "prediction").
      orderBy($"prediction".desc).
      limit(topN)

    // View the recommendation results
    val recommendedArtistIDs = topRecommendations.select("artist").as[Int].collect()
    artistByID.join(sc.createDataset(recommendedArtistIDs).
      toDF("id"), "id").
      select("name").show()
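
The code above scores every artist for one user and keeps the highest predictions. The idea can be pictured without Spark: a factorization model predicts a preference as the dot product of a user factor vector and an item factor vector. The sketch below illustrates this with made-up rank-2 factors. The object name TopN and its functions are hypothetical, and this shows the concept only, not Spark's internal implementation.

```scala
object TopN {
  // Predicted preference: dot product of a user vector and an item vector.
  def score(user: Array[Float], item: Array[Float]): Float = {
    require(user.length == item.length, "factor vectors must have the same rank")
    user.zip(item).map { case (u, i) => u * i }.sum
  }

  // Score every item for one user and keep the n highest-scoring (id, score) pairs.
  def recommend(user: Array[Float], items: Map[Int, Array[Float]], n: Int): Seq[(Int, Float)] =
    items.toSeq
      .map { case (id, factors) => (id, score(user, factors)) }
      .sortWith(_._2 > _._2)
      .take(n)

  def main(args: Array[String]): Unit = {
    // Made-up factors (assumption); a real model learns these during fit().
    val user = Array(1.0f, 0.5f)
    val items = Map(
      10 -> Array(0.2f, 0.1f),
      20 -> Array(0.9f, 0.4f),
      30 -> Array(0.5f, 0.0f)
    )
    recommend(user, items, 2).foreach(println)
  }
}
```

Scoring all items for a single user is simple but scales with the number of artists, which is why it is done here for one user rather than for every user.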
