With the Hadoop, ZooKeeper, HBase and Spark cluster environment already set up, the tools are in place; as the saying goes, to do a good job one must first sharpen one's tools. Now that the tools are ready, it is time to put them to work, starting with spark-shell to lift the veil on Spark.
spark-shell is Spark's command-line interface, where we can type commands interactively, much like the Windows CMD prompt. Go into the Spark installation directory and run the following command to start spark-shell:
bin/spark-shell --master spark://hxf:7077 --executor-memory 1024m --driver-memory 1024m --total-executor-cores 4
--executor-memory is the memory given to each executor on the worker (slave) nodes, --driver-memory is the memory for the driver on the master side, and --total-executor-cores is the total number of CPU cores the application may use across all executors.
The terminal output shows that spark-shell has already initialized two variables for us: sc and spark. sc is the SparkContext and spark is the SparkSession. Even if you have never used them, names like "context" and "session" should look familiar; do not dismiss them as unimportant, because everything Spark does goes through these two variables. For now, just get acquainted with them; we will come back to them later.
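You can confirm this yourself by typing the two names at the prompt; the shell prints their types. The output below is illustrative (the part after @ will differ on your machine), and the configuration lookups simply read back the flags we passed when launching the shell:
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@...
scala> spark
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@...
scala> sc.getConf.get("spark.executor.memory")   // set by --executor-memory
res2: String = 1024m
scala> sc.getConf.get("spark.cores.max")         // set by --total-executor-cores
res3: String = 4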
The Spark management web page will also show the running spark-shell application.
OK, now let's type in our first example: counting how many times each word appears in the README.md file under the Spark directory.
First, here is the complete code, so that you can get an overall picture:
val textFile = sc.textFile("file:/data/install/spark-2.0.0-bin-hadoop2.7/README.md")
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
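Once this runs, wordCounts holds one (word, count) pair per distinct word. As a small optional extra, you can sort the result to see the most frequent words first; sortBy and take are standard RDD methods, and the cutoff of 10 is arbitrary:
// Sort by the count (second element of each pair) in descending order and keep the top 10.
wordCounts.sortBy(pair => pair._2, ascending = false).take(10)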
The code is short, but it may not be easy to understand on a first reading, so here is an explanation.
1. How Spark reads raw data
First, read README.md:
val textFile = sc.textFile("README.md")
This line reads the raw data into Spark's own data structure, the RDD. In general, Spark reads raw data in one of two ways:
1. For testing: call SparkContext's parallelize method
val rdd = sc.parallelize(1 to 10)
This creates an RDD containing the numbers 1 to 10. It is handy for testing a program, but is not something you would use in real development; a quick way to play with it is shown in the sketch after this list.
2. For real work: any data source that Hadoop can read, Spark can read as well. The method we use most often is SparkContext's textFile, for example to read a file on HDFS:
val rdd = sc.textFile("hdfs://hxf:9000/test/test.log")
2. Spark's basic data type: the RDD
The result returned by textFile is an RDD, Spark's basic data type.
RDD stands for Resilient Distributed Dataset, that is, an elastic distributed data set. The name alone is not very intuitive, but we can take it literally: an RDD is distributed, and it is a collection of data. Suppose a distributed file system holds several files, each with many lines; the RDD is then the collection of all the lines of all those files, not any single line. So every operation we perform on an RDD applies to the whole collection. Moreover, Spark keeps the RDD in memory instead of spilling to disk the way MapReduce does, which is why Spark can run much faster than MapReduce.
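One place where this in-memory idea shows up explicitly in the API is cache(): if an RDD will be reused by several actions, you can ask Spark to keep it in memory after it is first computed. A minimal sketch, not part of the word-count example:
// Mark the RDD to be kept in memory once it has been computed.
val cached = textFile.cache()
cached.count()   // first action: reads the file, computes the RDD, and fills the cache
cached.count()   // later actions reuse the in-memory copy instead of re-reading the file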
Now let's continue with the rest of the code:
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
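The whole pipeline is chained into a single expression above, which can be hard to read at first, so here is the same computation split into intermediate steps. The variable names and the sample words in the comments are just for illustration:
// flatMap: split each line into words and flatten everything into one RDD of words.
val words = textFile.flatMap(line => line.split(" "))   // e.g. "Apache", "Spark", "is", ...
// map: pair every word with the number 1.
val pairs = words.map(word => (word, 1))                // e.g. ("Spark", 1), ("Spark", 1), ...
// reduceByKey: for each distinct word, add up all of its 1s into a single count.
val counts = pairs.reduceByKey((a, b) => a + b)         // one (word, total) pair per distinct word
// The three calls above are lazy transformations: nothing has actually run yet.
// collect() is an action: it triggers the job and returns the result to the driver.
counts.collect()                                        // Array[(String, Int)]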
The final result shows how many times each word appears. In this code, flatMap, map and reduceByKey are RDD transformation operations, while collect is an RDD action operation. If the difference is not clear yet, it does not matter; a later article will explain it in detail. That is all for this section; stay tuned for the next installment.