Spark maprlab-auction Data analysis

Source: Internet
Author: User
Tags: mapr

First, environment installation

1. Installing Hadoop

http://my.oschina.net/u/204498/blog/519789

2. Install Spark


3. Start Hadoop

4. Start Spark

Second, data analysis

1. Data preparation

Download the data archive dev360data.zip from the MapR website and upload it to the server.

[hadoop@&lt;host&gt; spark-1.5.1-bin-hadoop2.6]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6
[hadoop@&lt;host&gt; spark-1.5.1-bin-hadoop2.6]$ cd test-data/
[hadoop@&lt;host&gt; test-data]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6/test-data/dev360data
[hadoop@&lt;host&gt; dev360data]$ ll
total 337940
-rwxr-xr-x 1 hadoop root    575014 Jun 24 16:18 auctiondata.csv        => test data
-rw-r--r-- 1 hadoop root  57772855 Aug 18 20:11 sfpd.csv
-rwxrwxrwx 1 hadoop root 287692676 Jul 26 20:39 sfpd.json
[hadoop@&lt;host&gt; dev360data]$ more auctiondata.csv
8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3
8213034705,115,2.943484,davidbresler2,1,95,117.5,xbox,3
8213034705,100,2.951285,gladimacowgirl,58,95,117.5,xbox,3
8213034705,117.5,2.998947,daysrus,10,95,117.5,xbox,3
8213060420,2,0.065266,donnie4814,5,1,120,xbox,3
8213060420,15.25,0.123218,myreeceyboy,52,1,120,xbox,3
...
# The data structure is as follows:
# auctionid,bid,bidtime,bidder,bidrate,openbid,price,itemtype,daystolive
# Upload the data to HDFS
[hadoop@&lt;host&gt; dev360data]$ hdfs dfs -mkdir -p /spark/exer/mapr
[hadoop@&lt;host&gt; dev360data]$ hdfs dfs -put auctiondata.csv /spark/exer/mapr
[hadoop@&lt;host&gt; dev360data]$ hdfs dfs -ls /spark/exer/mapr
Found 1 items
-rw-r--r--   2 hadoop supergroup     575014 2015-10-29 06:17 /spark/exer/mapr/auctiondata.csv
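To make the column layout explicit, here is a small Scala sketch of the record schema described above. The Auction case class and the parseAuction helper are illustrative names that are not part of the original lab, and the parser assumes clean, unquoted CSV lines.

// Illustrative schema for one row of auctiondata.csv (names are assumptions,
// following the column list in the comment above).
case class Auction(
  auctionid: String,   // auction identifier
  bid: Double,         // bid amount
  bidtime: Double,     // time of the bid, in days from the auction start
  bidder: String,      // bidder user name
  bidrate: Int,        // bidder rating
  openbid: Double,     // opening bid
  price: Double,       // final sale price
  itemtype: String,    // item category: xbox, cartier, palm
  daystolive: Int      // auction length in days
)

// Parse one CSV line into an Auction (no quoting/escaping handling).
def parseAuction(line: String): Auction = {
  val f = line.split(",")
  Auction(f(0), f(1).toDouble, f(2).toDouble, f(3), f(4).toInt,
          f(5).toDouble, f(6).toDouble, f(7), f(8).toInt)
}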

2. Run spark-shell (I'm using Scala) and work through the following analysis tasks.

Tasks

A. How many items were sold?

B. How many bids per item type?

C. How many different kinds of item type are there?

D. What was the minimum number of bids?

E. What was the maximum number of bids?

F. What was the average number of bids?

[hadoop@&lt;host&gt; spark-1.5.1-bin-hadoop2.6]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6
[hadoop@&lt;host&gt; spark-1.5.1-bin-hadoop2.6]$ ./bin/spark-shell
......
# First, generate an RDD by loading the data from HDFS
scala> val originalRDD = sc.textFile("/spark/exer/mapr/auctiondata.csv")
......
scala> originalRDD      ==> let's look at the type of originalRDD: RDD[String], which can be seen as an array of strings, Array[String]
res26: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
## Split each row on "," with map
scala> val auctionRDD = originalRDD.map(_.split(","))
scala> auctionRDD       ==> let's look at the type of auctionRDD: RDD[Array[String]], an array whose elements are themselves arrays, i.e. Array[Array[String]]
res17: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:23
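For readers who prefer a standalone program over spark-shell, the same two steps could be written roughly as below. This is only a sketch assuming the Spark 1.x RDD API; the object name AuctionAnalysis is a placeholder, and the master URL is left to spark-submit.

import org.apache.spark.{SparkConf, SparkContext}

object AuctionAnalysis {
  def main(args: Array[String]): Unit = {
    // Master is expected to be supplied by spark-submit.
    val conf = new SparkConf().setAppName("AuctionAnalysis")
    val sc = new SparkContext(conf)

    // Same two steps as in the shell session: load the CSV, then split each line on ","
    val originalRDD = sc.textFile("/spark/exer/mapr/auctiondata.csv")
    val auctionRDD = originalRDD.map(_.split(","))

    println(s"records: ${auctionRDD.count()}")
    sc.stop()
  }
}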

A. How many items were sold?

==> val count = auctionRDD.map(bid => bid(0)).distinct().count()

Deduplicating by auctionid is enough: split each record on ",", then deduplicate, then count.

# Get the first column, i.e. the auctionid, again with map
# The next line can be read as follows: since auctionRDD can be seen as Array[Array[String]], the parameter of map is of type Array[String].
# auctionid is the first element of that array, so we take element (0); note that Scala uses () rather than []
scala> val auctionidRDD = auctionRDD.map(_(0))
......
scala> auctionidRDD      ==> let's look at the type of auctionidRDD: RDD[String], understood as Array[String], i.e. the array of all auctionids
res27: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at map at <console>:26
# Deduplicate auctionidRDD
scala> val auctionidDistinctRDD = auctionidRDD.distinct()
# Count
scala> auctionidDistinctRDD.count()
......

B. How many bids per item type?

===> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y).collect()

# map each row: take column 7 (counting from 0), i.e. the itemtype column, and output (itemtype, 1)
# The output type can be seen as an array of (String, Int)
scala> auctionRDD.map(bid => (bid(7), 1))
res30: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[26] at map at <console>:26
# reduceByKey reduces by key
# For a given key, reduceByKey works like this:
# (xbox,1) (xbox,1) (xbox,1) (xbox,1) ... (xbox,1)  ==> reduceByKey ==>  (xbox, (((1 + 1) + 1) + ... + 1))
scala> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y)
# The type is still (String, Int): the String is the itemtype, and the Int is now the total bid count for that itemtype
res31: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[28] at reduceByKey at <console>:26
# collect() converts the result into an Array
scala> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y).collect()
res32: Array[(String, Int)] = Array((palm,5917), (cartier,1953), (xbox,2784))
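Tasks C through F are not walked through above, but they can be answered in the same style. Below is a minimal sketch, assuming the same auctionRDD as before; the variable names (itemTypes, bidsPerAuction, minBids, maxBids, avgBids) are mine, and the actual numbers depend on the data set.

// C. How many different kinds of item type? (distinct values of column 7)
val itemTypes = auctionRDD.map(bid => bid(7)).distinct().count()

// D/E/F. Number of bids per auction: count rows per auctionid, keep only the counts,
// then take the minimum, maximum, and mean.
val bidsPerAuction = auctionRDD
  .map(bid => (bid(0), 1))
  .reduceByKey((x, y) => x + y)
  .map { case (_, n) => n }

val minBids = bidsPerAuction.min()
val maxBids = bidsPerAuction.max()
val avgBids = bidsPerAuction.mean()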

