Today we come to the second chapter of our Spark learning. A lot of things have begun to change, and life doesn't simply go in the direction you want, but we still need to work hard; no more chicken soup.
Let's start today's Spark journey.
One. What is an RDD?
RDD is short for Resilient Distributed Dataset, rendered in Chinese as "elastic distributed dataset". It is an in-memory data set.
An RDD is read-only and can be partitioned, and all or part of the data set can be cached in memory and reused across computations. The so-called
"elasticity" refers to spilling to disk when memory is insufficient.
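As a small illustration of the caching and elasticity described above, here is a minimal sketch (the HDFS path is only a placeholder; in the spark-shell, sc is already available):

import org.apache.spark.storage.StorageLevel

// persist the dataset so later actions can reuse it;
// MEMORY_AND_DISK is the "elastic" level: partitions that do not fit in memory spill to disk
val lines = sc.textFile("hdfs://192.168.109.136:9000/wc/*")
lines.persist(StorageLevel.MEMORY_AND_DISK)
lines.count() // first action materializes and caches the data
lines.count() // second action reuses the cached partitions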
Two. Spark operators
Spark operators are divided into two categories: one is called transformation (conversion), the other is called action.
Transformations are executed lazily: a transformation only records metadata (the lineage), and real execution starts only when the computation reaches an action (as also described in the previous section).
For example, both map and filter are transformations, so the values are not actually computed; only when collect, an action, is called are the real values produced. A minimal sketch follows.
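Here is a small runnable sketch of this lazy behavior (the local[2] master is only for local demonstration):

import org.apache.spark.{SparkConf, SparkContext}

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LazyDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // transformations: nothing is computed yet, only the lineage is recorded
    val nums = sc.parallelize(1 to 10)
    val doubled = nums.map(_ * 2)    // lazy
    val big = doubled.filter(_ > 10) // still lazy

    // action: collect triggers the actual computation of the whole lineage
    println(big.collect().mkString(", ")) // 12, 14, 16, 18, 20

    sc.stop()
  }
}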
Three. Two ways to create an RDD
1. Create an RDD from a file system that Hadoop supports, such as HDFS. At this point there is no real data to be computed; only the metadata is recorded.
2. Create an RDD in a parallel way from a Scala collection or array.
Both ways are shown in the sketch below.
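A minimal sketch of both creation paths (assuming a spark-shell where sc already exists; the HDFS path is a placeholder):

// 1. from HDFS: no data is read here, only metadata such as the file location is recorded
val lines = sc.textFile("hdfs://192.168.109.136:9000/wc/*")

// 2. from a Scala collection or array, distributed in parallel
val nums = sc.parallelize(Array(1, 2, 3, 4, 5))
val nums2 = sc.makeRDD(1 to 5) // equivalent convenience method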
Looking at the RDD's internal implementation, the source code summarizes five main properties (a sketch for inspecting them follows the list):
Internally, each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
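Most of these properties can be inspected directly from the shell; a quick sketch, using the same placeholder HDFS path as above:

val pairs = sc.textFile("hdfs://192.168.109.136:9000/wc/*")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

println(pairs.partitions.length) // 1. the list of partitions
// 2. the compute function runs once per split when an action is triggered
println(pairs.dependencies)      // 3. dependencies on parent RDDs (the lineage)
println(pairs.partitioner)       // 4. Some(HashPartitioner) after reduceByKey
println(pairs.preferredLocations(pairs.partitions(0))) // 5. e.g. HDFS block locations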
Four. Our first Spark program in IDEA
1. First we write a Spark program in IDEA and then package it.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // very important: the entry point to the Spark cluster
    val conf = new SparkConf().setAppName("WC")
    val sc = new SparkContext(conf)
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2)
      .saveAsTextFile(args(1))
    sc.stop()
  }
}
The first thing to clarify is that our project is built with Maven, so our pom file must add the Spark dependency (e.g. spark-core).
When we package, two jar packages are generated under target; we choose the larger one, which bundles the dependency libraries.
2. Upload the jar to Linux and submit it (this is similar to executing jar packages on Hadoop):
./spark-submit --master spark://192.168.109.136:7077 --class cn.wj.spark.WordCount --executor-memory 512m --total-executor-cores 2 /tmp/hello-spark-1.0.jar hdfs://192.168.109.136:9000/wc/* hdfs://192.168.109.136:9000/wc/out
We can then view the currently running Spark application through the master's web UI at 192.168.109.136:8080.
Five. The relationship between master and worker
The master manages all the workers and schedules resources; each worker manages its own node and starts executors, which perform the actual computation.