Explanation of the Spark Scheduling Architecture

Keywords: spark, spark architecture, spark scheduling architecture

This article explains how Spark's scheduling architecture works and should serve as a useful reference.


1. Starting the Spark cluster (by executing sbin/start-all.sh) launches the master and multiple worker nodes. The master is mainly responsible for cluster management and monitoring, while the workers run the application tasks. Each worker reports its status to the master (for example, how many CPU cores and how much memory it has); this is done through a heartbeat mechanism.

2. After receiving a worker's report, the master sends a response back to that worker.

3. The driver submits the application to the Spark cluster. The driver and the master communicate through Akka actors: both the master and the driver are actors in Akka's asynchronous communication model, and the driver sends a registration message (RegisterApplication) to the master asynchronously (a minimal code sketch of this step follows the list).

4. The master evaluates the application's resource needs and allocates the work. For example, if the application needs 7 GB of memory to complete its tasks, the master might allocate 3.5 GB on each of two worker nodes. The master schedules and monitors the tasks across all workers as a whole.

5. When a worker node receives its assignment, it starts the corresponding executor process to run it. Each executor maintains a thread pool in which multiple task threads run.

6. The executor takes tasks from the thread pool and computes on the data in the RDD partitions, performing transformation and action operations.

7. The worker node reports the computation status back to the driver.
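
To make step 3 concrete, the sketch below shows how creating the SparkContext is what registers the driver's application with the master, which then asks workers to launch executors for it. This is only a minimal sketch assuming a standalone cluster; the master address spark://master-host:7077 is a placeholder.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SubmitToStandaloneCluster {
    public static void main(String[] args) {
        // Creating the JavaSparkContext sends the asynchronous registration
        // message (RegisterApplication) from the driver to the master (step 3);
        // the master then schedules executors on the workers (steps 4-5).
        SparkConf conf = new SparkConf()
                .setAppName("SubmitToStandaloneCluster")
                .setMaster("spark://master-host:7077"); // placeholder master address
        JavaSparkContext sc = new JavaSparkContext(conf);

        // ... build RDDs and run jobs here ...

        sc.stop(); // unregister the application and release its executors
    }
}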

Creating an RDD from a local parallelized collection
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

public class JavaLocalSumApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("JavaLocalSumApp");
        JavaSparkContext sc = new JavaSparkContext(conf);
        List<Integer> list = Arrays.asList(1, 3, 4, 5, 6, 7, 8);
        // Create an RDD from a local parallelized collection
        JavaRDD<Integer> listRDD = sc.parallelize(list);
        // Sum the elements
        Integer sum = listRDD.reduce(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
        System.out.println(sum);
        sc.stop();
    }
}
// The same reduction written as a Java 8 lambda (the compiler must target Java 1.8)
listRDD.reduce((v1, v2) -> v1 + v2);

Spark transformation and action operations

RDD (Resilient Distributed Dataset): a collection that supports multiple data sources, has a fault-tolerance mechanism, can be cached, and supports parallel operations. An RDD represents a partitioned data set.

RDDs support two kinds of operators:

Transformation: transformations are lazily evaluated. When one RDD is converted into another RDD, no computation happens immediately; Spark only records the logical operation on the data set.

Action: actions trigger the execution of a Spark job, which is what actually computes the recorded transformations.
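
As a small illustration (reusing the listRDD from the earlier example and Java 8 lambdas), the transformation below is only recorded; no job runs until the action is called:

// map is a transformation: it is recorded lazily and returns a new RDD without computing anything
JavaRDD<Integer> doubled = listRDD.map(v -> v * 2);

// count is an action: only now does Spark submit a job and actually compute the data
long n = doubled.count(); // 7 elements in this example
System.out.println(n);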

The role of Spark operators
An operator is a function defined on an RDD; it transforms and operates on the data in the RDD.

Input: when a Spark program runs, data is read from external data sources into Spark (for example, textFile reads from distributed storage such as HDFS, and parallelize converts a local collection). Once inside Spark's runtime data space, the data becomes blocks managed by the BlockManager.

Operation: after the input data has become an RDD, you can apply transformation operators such as filter to convert the RDD into a new RDD, and use an action operator to make Spark submit the job. If the data needs to be reused, the cache operator can keep it in memory.

Output: at the end of the program, data is written out of the Spark runtime space, either to distributed storage (for example, saveAsTextFile writes to HDFS) or back to local values or collections (collect returns a collection to the driver, count returns an Int).
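
A short sketch of this input -> operation -> output flow, reusing the JavaSparkContext sc from the first example; the HDFS paths are placeholder assumptions:

JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/data/input.txt"); // input: read from HDFS into an RDD
JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));      // operation: transformation to a new RDD
errors.cache();                                                             // keep the result in memory for reuse
long errorCount = errors.count();                                           // action: triggers job submission
errors.saveAsTextFile("hdfs://namenode:9000/data/output");                  // output: write back to distributed storage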

Overview of Transformations and Actions

Transformation

map(func): returns a new distributed data set formed by passing each element of the source through the function func
filter(func): returns a new data set formed by the elements of the source for which func returns true
flatMap(func): similar to map, but each input element can be mapped to 0 or more output elements (so func should return a Seq rather than a single element)
sample(withReplacement, frac, seed): randomly samples a fraction frac of the data, with or without replacement, using the given random seed
union(otherDataset): returns a new data set formed by the union of the source data set and the argument
groupByKey([numTasks]): called on a data set of (K, V) pairs, returns a data set of (K, Seq[V]) pairs. Note: by default 8 parallel tasks are used for grouping; you can pass the optional numTasks parameter to set a different number of tasks depending on the amount of data
reduceByKey(func, [numTasks]): called on a data set of (K, V) pairs, returns a data set of (K, V) pairs in which the values for each key are aggregated using the given reduce function (see the word-count sketch after this list). As with groupByKey, the number of tasks can be configured through the optional second parameter
join(otherDataset, [numTasks]): called on data sets of type (K, V) and (K, W), returns a data set of (K, (V, W)) pairs with all pairs of elements for each key
groupWith(otherDataset, [numTasks]): called on data sets of type (K, V) and (K, W), returns a data set of (K, Seq[V], Seq[W]) tuples. This operation is called CoGroup in other frameworks
cartesian(otherDataset): Cartesian product. When called on data sets of types T and U, returns a data set of (T, U) pairs containing all combinations of elements
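
As referenced above, here is a word-count sketch combining flatMap, mapToPair, and reduceByKey. It reuses the lines RDD from the previous sketch and assumes the Spark 1.x Java API (where flatMap returns an Iterable; on Spark 2.x return an Iterator instead) plus the scala.Tuple2 and org.apache.spark.api.java.JavaPairRDD imports:

JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split(" "))) // each line -> zero or more words
        .mapToPair(word -> new Tuple2<>(word, 1))        // build (K, V) pairs
        .reduceByKey((a, b) -> a + b);                   // aggregate the values for each key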
Actions

reduce(func): aggregates all elements of the data set using the function func. func takes 2 arguments and returns one value, and it must be associative so that it can be computed correctly in parallel
collect(): returns all elements of the data set to the driver program as an array. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data; calling collect on an entire large RDD is likely to make the driver program run out of memory (OOM)
count(): returns the number of elements in the data set
take(n): returns an array of the first n elements of the data set. Note that this operation is not currently executed in parallel on multiple nodes; instead, the machine running the driver program computes all the elements by itself (memory pressure on the driver/gateway increases, so use it with caution)
first(): returns the first element of the data set (similar to take(1))
saveAsTextFile(path): writes the elements of the data set as text files to the local file system, HDFS, or any other file system supported by Hadoop. Spark calls toString on each element and writes it as a line of text
saveAsSequenceFile(path): writes the elements of the data set in SequenceFile format to the given directory on the local file system, HDFS, or any other file system supported by Hadoop. The RDD's elements must be key-value pairs that implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types such as Int, Double, String, etc.)
foreach(func): runs the function func on each element of the data set. This is usually used to update an accumulator variable or to interact with external storage (several of these actions appear together in the sketch below)
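
Several of these actions in one sketch, reusing the listRDD (values 1, 3, 4, 5, 6, 7, 8) from the first example:

long count = listRDD.count();                        // 7 elements
Integer firstElement = listRDD.first();              // 1
List<Integer> firstThree = listRDD.take(3);          // [1, 3, 4], collected to the driver
Integer total = listRDD.reduce((v1, v2) -> v1 + v2); // 34, the same sum as the earlier example
listRDD.foreach(v -> System.out.println(v));         // runs on the executors, not on the driver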