Notes on Spark's WordCount example. Prerequisite: HDFS is already running.
Create a file named wc.input, containing a few lines of space-separated words, and upload it to HDFS under /user/hadoop/spark/:
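For example, wc.input might look like this (hypothetical sample content; any lines of space-separated words will do):
hadoop spark hdfs
spark mapreduce spark
hadoop hdfs spark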
[hadoop-2.6.0-cdh5.4.0]# bin/hdfs dfs -put wc.input /user/hadoop/spark/       # upload the file
[hadoop-2.6.0-cdh5.4.0]# bin/hdfs dfs -ls /user/hadoop/spark/                 # list the directory
[hadoop-2.6.0-cdh5.4.0]# bin/hdfs dfs -text /user/hadoop/spark/wc.input       # view the file contents
[spark-1.3.1]# bin/spark-shell      # start the Spark interactive shell; it creates a SparkContext named sc automatically
scala> val rdd = sc.textFile("hdfs://spark00:8020/user/hadoop/spark/wc.input")    // read wc.input from HDFS
scala> val wordcount = rdd.flatMap(x => x.split(" ")).map(x => (x, 1)).reduceByKey((a, b) => a + b)    // the MapReduce-style word count
scala> wordcount.collect    // view all results
scala> rdd.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b).map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1)).collect    // the same count, sorted by frequency in descending order
sc.textfile ("Hdfs:spark00:8020/user/hadoop/spark/wc.input"). FlatMap (Line=>line.split ("")). Map (word=> (word,1)). Reducebykey ((b) =>a+b). Collect
scala> sc.textFile(...).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect    // the same thing written with Scala placeholder syntax
Sorting by key:
scala> val wordsort = wordcount.sortByKey(true)     // ascending
scala> val wordsort = wordcount.sortByKey(false)    // descending
scala> wordsort.collect
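The same word count can also be packaged as a standalone application rather than typed into the shell. The following is a minimal sketch against the Spark 1.x SparkContext API used above; the object name and app name are illustrative, and the job would be submitted with bin/spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // app name is illustrative; the master URL comes from spark-submit
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // same pipeline as the shell session above
    val rdd = sc.textFile("hdfs://spark00:8020/user/hadoop/spark/wc.input")
    val wordcount = rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    wordcount.collect().foreach(println)
    sc.stop()
  }
}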
Understanding RDDs:
In Spark, one application can contain multiple jobs; in MapReduce, each job is its own application.
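As a small illustration, every action run in one spark-shell session launches its own job inside the same application (wordcount is the RDD from the session above):

scala> wordcount.collect    // first action: triggers job 1
scala> wordcount.count      // second action: triggers job 2, still inside the same application (same SparkContext)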
An RDD has three basic features:
1. Partitions (it is split into partitions)
2. A compute function (how each partition is computed)
3. Dependencies (on other RDDs)
Translation of, and notes on, the official list of RDD properties (a shell sketch for inspecting them follows the list):
1. A list of partitions: the RDD is made up of a series of partitions, e.g. 64 MB blocks, similar to input splits in Hadoop.
2. A function for computing each split: each partition has a function that is applied to it to compute it.
3. A list of dependencies on other RDDs: a chain of dependencies. If RDD A is transformed into RDD B, and RDD B into RDD C, then RDD C depends on RDD B and RDD B depends on RDD A.
4. Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned): a key-value RDD can be given a partitioner that tells it how to distribute records across partitions; Hash and Range partitioners are the common choices.
5. Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file): which machines are best suited to compute each partition, i.e. data locality.
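Most of these properties can be inspected directly in spark-shell. A minimal sketch follows; the pair RDD and the partition count of 4 are illustrative assumptions, and the compute function of property 2 is internal to each RDD implementation:

scala> import org.apache.spark.HashPartitioner
scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)      // illustrative key-value RDD with 4 partitions
scala> val byKey = pairs.partitionBy(new HashPartitioner(4))                 // property 4: a hash partitioner for a key-value RDD
scala> byKey.partitions.length                                               // property 1: the list of partitions
scala> byKey.dependencies                                                    // property 3: dependencies on the parent RDD
scala> byKey.partitioner                                                     // Some(HashPartitioner) for this RDD
scala> byKey.preferredLocations(byKey.partitions(0))                         // property 5: preferred locations (empty for in-memory data)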
Why can one partition have several preferred locations?
For example, HDFS keeps three replicas of each block by default, and when Spark caches an RDD in memory it can also keep multiple replicas via StorageLevel, so a single partition may have more than one best location.
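A minimal sketch of caching with a replicated storage level (wordcount is the RDD from the session above; the _2 suffix requests two in-memory copies of each partition):

scala> import org.apache.spark.storage.StorageLevel
scala> wordcount.persist(StorageLevel.MEMORY_ONLY_2)    // keep two in-memory replicas of each cached partition
scala> wordcount.count                                  // an action, to actually materialize the cache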