Part One: Environment installation
1. Install Hadoop
http://my.oschina.net/u/204498/blog/519789
2. Install Spark
3. Start Hadoop
4. Start Spark
Part Two: Data analysis
1. Data preparation
Download the dataset dev360data.zip from the MapR website and upload it to the server.
[[email protected] spark-1.5.1-bin-hadoop2.6]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6
[[email protected] spark-1.5.1-bin-hadoop2.6]$ cd test-data/
[[email protected] test-data]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6/test-data/dev360data
[[email protected] dev360data]$ ll
total 337940
-rwxr-xr-x 1 hadoop root    575014 Jun 24 16:18 auctiondata.csv    => the test data used below
-rw-r--r-- 1 hadoop root  57772855 Aug 18 20:11 sfpd.csv
-rwxrwxrwx 1 hadoop root 287692676 Jul 26 20:39 sfpd.json
[[email protected] dev360data]$ more auctiondata.csv
8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3
8213034705,115,2.943484,davidbresler2,1,95,117.5,xbox,3
8213034705,100,2.951285,gladimacowgirl,58,95,117.5,xbox,3
8213034705,117.5,2.998947,daysrus,10,95,117.5,xbox,3
8213060420,2,0.065266,donnie4814,5,1,120,xbox,3
8213060420,15.25,0.123218,myreeceyboy,52,1,120,xbox,3

# The data structure is as follows:
# auctionid,bid,bidtime,bidder,bidrate,openbid,price,itemtype,daystolive

# Upload the data to HDFS
[[email protected] dev360data]$ hdfs dfs -mkdir -p /spark/exer/mapr
[[email protected] dev360data]$ hdfs dfs -put auctiondata.csv /spark/exer/mapr
[[email protected] dev360data]$ hdfs dfs -ls /spark/exer/mapr
Found 1 items
-rw-r--r-- 2 hadoop supergroup 575014 2015-10-29 06:17 /spark/exer/mapr/auctiondata.csv
2. Run spark-shell (I'm using Scala) and work through the following analysis tasks.
Tasks
A. How many items were sold?
B. How many bids per item type?
C. How many different kinds of item type are there?
D. What was the minimum number of bids?
E. What was the maximum number of bids?
F. What was the average number of bids?
[[email protected] spark-1.5.1-bin-hadoop2.6]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6
[[email protected] spark-1.5.1-bin-hadoop2.6]$ ./bin/spark-shell
......
# First, load the data from HDFS to generate an RDD
scala> val originalRDD = sc.textFile("/spark/exer/mapr/auctiondata.csv")
......
# Let's look at the type of originalRDD: RDD[String], which can be seen as an array of strings, Array[String]
scala> originalRDD
res26: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

# Split each row on "," with map
scala> val auctionRDD = originalRDD.map(_.split(","))
# The type of auctionRDD is RDD[Array[String]]; each element is itself an array, so it can be seen as Array[Array[String]]
scala> auctionRDD
res17: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:23
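As an optional aside (not part of the original post), the same split rows could also be modeled with a small case class so that columns are referred to by name instead of by index. This is only a sketch that assumes the spark-shell SparkContext sc and the HDFS path used above; the class and field names are my own:

// Hypothetical schema class mirroring the columns listed earlier
case class Auction(auctionid: String, bid: Double, bidtime: Double, bidder: String,
                   bidrate: Int, openbid: Double, price: Double, itemtype: String, daystolive: Int)

// Parse each CSV row (assuming 9 well-formed fields) into an Auction
val auctionObjRDD = sc.textFile("/spark/exer/mapr/auctiondata.csv")
  .map(_.split(","))
  .map(a => Auction(a(0), a(1).toDouble, a(2).toDouble, a(3),
                    a(4).toInt, a(5).toDouble, a(6).toDouble, a(7), a(8).toInt))

// For example, the distinct item types, accessed by field name rather than index 7
auctionObjRDD.map(_.itemtype).distinct().collect()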
A. How many items were sold?
==> val count = auctionRDD.map(bid => bid(0)).distinct().count()
Deduplicating by auctionid gives the answer: split each record on ",", take the auctionid, remove duplicates, then count.
# Get the first column, i.e. the auctionid, again with map.
# The following line can be understood like this: auctionRDD is Array[Array[String]], so the parameter of map has type Array[String];
# auctionid is the first element of that array, so we take element (0) -- note that it is () and not []
scala> val auctionidRDD = auctionRDD.map(_(0))
......
# Let's look at the type of auctionidRDD: RDD[String], understood as Array[String], i.e. the array of all auctionids
scala> auctionidRDD
res27: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at map at <console>:26

# Deduplicate auctionidRDD
scala> val auctionidDistinctRDD = auctionidRDD.distinct()

# Count
scala> auctionidDistinctRDD.count()
......
B. How many bids per item type?
===> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y).collect()
# Map each row and take column 7, the itemtype column, outputting (itemtype, 1)
# The output type can be seen as an array of (String, Int)
scala> auctionRDD.map(bid => (bid(7), 1))
res30: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[26] at map at <console>:26

# reduceByKey reduces by key; for the same key:
# (xbox,1)(xbox,1)(xbox,1)(xbox,1)...(xbox,1) ==> reduceByKey ==> (xbox, (((1 + 1) + 1) + ... + 1))
scala> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y)
# The type is still (String, Int): the String is the itemtype, the Int is now the total bid count for that itemtype
res31: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[28] at reduceByKey at <console>:26

# collect() converts the result to an Array
scala> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y).collect()
res32: Array[(String, Int)] = Array((palm,5917), (cartier,1953), (xbox,2784))
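The original write-up stops after task B. As a rough sketch of my own (not from the original post), tasks C through F could be answered from the same auctionRDD along these lines:

// C. How many different kinds of item type? -- distinct values of column 7
auctionRDD.map(bid => bid(7)).distinct().count()

// For D/E/F, first count the bids per auction: each CSV row is one bid,
// so count the rows per auctionid and keep only the counts
val bidsPerAuctionRDD = auctionRDD.map(bid => (bid(0), 1)).reduceByKey((x, y) => x + y).map(_._2)

// D. Minimum number of bids placed on a single item
bidsPerAuctionRDD.min()

// E. Maximum number of bids placed on a single item
bidsPerAuctionRDD.max()

// F. Average number of bids = total bids / number of distinct items
auctionRDD.count().toDouble / auctionRDD.map(bid => bid(0)).distinct().count()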
Spark MapR Lab: auction data analysis