Part One: Environment installation
1. Install Hadoop
http://my.oschina.net/u/204498/blog/519789
2. Install Spark
3. Start Hadoop
4. Start Spark
Part Two: Data analysis
1. Data preparation
Download the dataset dev360data.zip from the MapR website and upload it to the server.
[[email protected] spark-1.5.1-bin-hadoop2.6]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6
[[email protected] spark-1.5.1-bin-hadoop2.6]$ cd test-data/
[[email protected] test-data]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6/test-data/dev360data
[[email protected] dev360data]$ ll
total 337940
-rwxr-xr-x 1 hadoop root    575014 Jun 24 16:18 auctiondata.csv    => the test data used below
-rw-r--r-- 1 hadoop root  57772855 Aug 18 20:11 sfpd.csv
-rwxrwxrwx 1 hadoop root 287692676 Jul 26 20:39 sfpd.json
[[email protected] dev360data]$ more auctiondata.csv
8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3
8213034705,115,2.943484,davidbresler2,1,95,117.5,xbox,3
8213034705,100,2.951285,gladimacowgirl,58,95,117.5,xbox,3
8213034705,117.5,2.998947,daysrus,10,95,117.5,xbox,3
8213060420,2,0.065266,donnie4814,5,1,120,xbox,3
8213060420,15.25,0.123218,myreeceyboy,52,1,120,xbox,3

# The data structure is as follows:
# auctionid,bid,bidtime,bidder,bidrate,openbid,price,itemtype,daystolive

# Upload the data to HDFS
[[email protected] dev360data]$ hdfs dfs -mkdir -p /spark/exer/mapr
[[email protected] dev360data]$ hdfs dfs -put auctiondata.csv /spark/exer/mapr
[[email protected] dev360data]$ hdfs dfs -ls /spark/exer/mapr
Found 1 items
-rw-r--r-- 2 hadoop supergroup 575014 2015-10-29 06:17 /spark/exer/mapr/auctiondata.csv
2. Run spark-shell (I'm using Scala) and work through the following analysis tasks.
Tasks
A. How many items were sold?
B. How many bids per item type?
C. How many different kinds of item type are there?
D. What was the minimum number of bids?
E. What was the maximum number of bids?
F. What was the average number of bids?
[[email protected] spark-1.5.1-bin-hadoop2.6]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6
[[email protected] spark-1.5.1-bin-hadoop2.6]$ ./bin/spark-shell
......
# First, load the data from HDFS to generate an RDD
scala> val originalRDD = sc.textFile("/spark/exer/mapr/auctiondata.csv")
......
# Let's look at the type of originalRDD: RDD[String], which can be seen as an array of strings, Array[String]
scala> originalRDD
res26: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

# Split each row on "," with map
scala> val auctionRDD = originalRDD.map(_.split(","))
# The type of auctionRDD is RDD[Array[String]]; each element is itself an array, so it can be seen as Array[Array[String]]
scala> auctionRDD
res17: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:23
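As an optional aside (not part of the original post), the same split rows could also be modeled with a small case class so that columns are referred to by name instead of by index. This is only a sketch that assumes the spark-shell SparkContext sc and the HDFS path used above; the class and field names are my own:

// Hypothetical schema class mirroring the columns listed earlier
case class Auction(auctionid: String, bid: Double, bidtime: Double, bidder: String,
                   bidrate: Int, openbid: Double, price: Double, itemtype: String, daystolive: Int)

// Parse each CSV row (assuming 9 well-formed fields) into an Auction
val auctionObjRDD = sc.textFile("/spark/exer/mapr/auctiondata.csv")
  .map(_.split(","))
  .map(a => Auction(a(0), a(1).toDouble, a(2).toDouble, a(3),
                    a(4).toInt, a(5).toDouble, a(6).toDouble, a(7), a(8).toInt))

// For example, the distinct item types, accessed by field name rather than index 7
auctionObjRDD.map(_.itemtype).distinct().collect()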
A. How many items were sold?
==> val count = auctionRDD.map(bid => bid(0)).distinct().count()
Deduplicating by auctionid gives the answer: split each record on ",", take the auctionid, remove duplicates, then count.
# Get the first column, i.e. the auctionid, again with map.
# The following line can be understood like this: auctionRDD is Array[Array[String]], so the parameter of map has type Array[String];
# auctionid is the first element of that array, so we take element (0) -- note that it is () and not []
scala> val auctionidRDD = auctionRDD.map(_(0))
......
# Let's look at the type of auctionidRDD: RDD[String], understood as Array[String], i.e. the array of all auctionids
scala> auctionidRDD
res27: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at map at <console>:26

# Deduplicate auctionidRDD
scala> val auctionidDistinctRDD = auctionidRDD.distinct()

# Count
scala> auctionidDistinctRDD.count()
......
B. How many bids per item type?
===> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y).collect()
# Map each row and take column 7, the itemtype column, outputting (itemtype, 1)
# The output type can be seen as an array of (String, Int)
scala> auctionRDD.map(bid => (bid(7), 1))
res30: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[26] at map at <console>:26

# reduceByKey reduces by key; for the same key:
# (xbox,1)(xbox,1)(xbox,1)(xbox,1)...(xbox,1) ==> reduceByKey ==> (xbox, (((1 + 1) + 1) + ... + 1))
scala> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y)
# The type is still (String, Int): the String is the itemtype, the Int is now the total bid count for that itemtype
res31: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[28] at reduceByKey at <console>:26

# collect() converts the result to an Array
scala> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y).collect()
res32: Array[(String, Int)] = Array((palm,5917), (cartier,1953), (xbox,2784))
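The original write-up stops after task B. As a rough sketch of my own (not from the original post), tasks C through F could be answered from the same auctionRDD along these lines:

// C. How many different kinds of item type? -- distinct values of column 7
auctionRDD.map(bid => bid(7)).distinct().count()

// For D/E/F, first count the bids per auction: each CSV row is one bid,
// so count the rows per auctionid and keep only the counts
val bidsPerAuctionRDD = auctionRDD.map(bid => (bid(0), 1)).reduceByKey((x, y) => x + y).map(_._2)

// D. Minimum number of bids placed on a single item
bidsPerAuctionRDD.min()

// E. Maximum number of bids placed on a single item
bidsPerAuctionRDD.max()

// F. Average number of bids = total bids / number of distinct items
auctionRDD.count().toDouble / auctionRDD.map(bid => bid(0)).distinct().count()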
Spark MapR Lab: auction data analysis