Spark's interactive shell is a simple way to learn the API and a powerful tool for analyzing data interactively.
Spark's abstraction for a dataset distributed across a cluster is called the Resilient Distributed Dataset (RDD).
There are two ways to create an RDD:
(1) load it from a Hadoop-supported file system (e.g. HDFS);
(2) derive a new RDD by transforming an existing RDD.
Here's a simple test:
1. Enter SPARK_HOME/bin and run the command:
$ ./spark-shell
The shell starts with a SparkContext already bound to the variable sc.
2. Create a new RDD from a text file on HDFS:
scala> val textFile = sc.textFile("hdfs://localhost:50040/input/wordcount/text1")
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
3. An RDD supports two types of operations: actions (which return a value) and transformations (which return a new RDD).
(1) An action performs a computation and returns a result to the driver:
scala> textFile.count() // number of rows in the RDD
Output:
14/11/11 22:59:07 INFO spark.SparkContext: Job finished: count at <console>:15, took 5.654325469 s
res1: Long = 2
scala> textFile.first() // contents of the first line of the RDD
Output:
14/11/11 23:01:25 INFO spark.SparkContext: Job finished: first at <console>:15, took 0.049004829 s
res3: String = Hello World
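The two actions above have direct analogues on an ordinary Scala collection, which is a handy way to remember what each one returns. This is a local sketch only; the two sample lines are an assumption chosen to match the results shown above:

```scala
// Local stand-in for the two-line HDFS file used in the transcript.
val lines = Seq("Hello World", "Hello Spark")

val total = lines.length // plays the role of textFile.count(): number of rows
val first = lines.head   // plays the role of textFile.first(): the first row

println(total) // 2
println(first) // Hello World
```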
(2) A transformation converts an RDD and returns a new RDD:
scala> textFile.filter(line => line.contains("Hello")).count() // number of rows containing "Hello"
Output:
14/11/11 23:06:33 INFO spark.SparkContext: Job finished: count at <console>:15, took 0.867975549 s
res4: Long = 2
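One detail the transcript does not show: transformations such as filter are lazy. Calling filter only records the operation, and the work actually happens when an action such as count() runs. Scala collection views give a small runnable analogy to this behaviour (the counter and sample lines below are illustrative assumptions, not Spark API):

```scala
// A view defers work the way an RDD transformation does.
var evaluated = 0
val lines = Seq("Hello World", "Hello Spark", "goodbye")
val filtered = lines.view.filter { line =>
  evaluated += 1               // count how many times the predicate runs
  line.contains("Hello")
}
println(evaluated)             // still 0: filter has only been recorded
val hits = filtered.toList.length // forcing the view (the "action") runs it
println(hits)                  // 2, and evaluated is now 3
```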
4. Word count in the Spark shell
scala> val file = sc.textFile("hdfs://localhost:50040/input")
scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> count.collect()
Output:
14/11/11 23:11:46 INFO spark.SparkContext: Job finished: collect at <console>:17, took 1.624248037 s
res5: Array[(String, Int)] = Array((hello,2), (world,1), (my,1), (is,1), (love,1), (i,1), (urey,1), (hadoop,1), (name,1), (programming,1))
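The same pipeline can be traced on a plain Scala collection, which makes each step easy to inspect locally. Ordinary collections have no reduceByKey, so groupBy plus a sum stands in for it here; the two input lines are an assumption for illustration:

```scala
val lines = Seq("hello world", "hello spark")      // stand-in for the HDFS input

val counts = lines
  .flatMap(line => line.split(" "))                // one element per word
  .map(word => (word, 1))                          // pair each word with 1
  .groupBy(_._1)                                   // collect the pairs per word
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the 1s

println(counts.toSeq.sortBy(_._1))
```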
Of course, there are many more complex operations built on RDDs; I will summarize them as I encounter them in the course of learning.