Spark Brief and basic architecture

Source: Internet
Author: User
Tags: shuffle, hadoop, mapreduce

Spark Brief

Spark originated from AMPLab, the cluster computing platform at the University of California, Berkeley. It is based on in-memory computing and, starting from multi-iteration batch processing, takes in other computational paradigms such as data warehousing, stream processing, and graph computation.
Features:
1. Lightweight
The Spark 0.6 core is about 20,000 lines of code, compared with roughly 90,000 lines for Hadoop 1.0 and 220,000 lines for Hadoop 2.0.

2. Fast
Spark can reach sub-second latency on small datasets, which is out of reach for Hadoop MapReduce, whose heartbeat-interval mechanism alone adds a startup delay of several seconds per task.

3. Flexible
At the implementation level, it makes full use of Scala's trait-based dynamic mixin strategy (for example, a replaceable cluster scheduler and serialization library);
At the primitive level, it allows new data operators, new data sources, and new language bindings (Java and Python) to be added;
At the paradigm level, Spark supports many paradigms such as in-memory computation, multi-iteration batch processing, stream processing, and graph computation.

4. Clever
Clever at fitting in and borrowing strength: Spark integrates seamlessly with Hadoop and leverages the Hadoop ecosystem.

Why does Spark perform faster than Hadoop?

1. Hadoop's data processing model is disk-based

Data processing operations read from and write to disk, and intermediate results are stored on disk, so a MapReduce job is accompanied by a large amount of disk I/O.

2. Spark uses memory instead of HDFS to store intermediate results

The first generation of Hadoop used HDFS to store intermediate results, and the second generation added a cache to hold them; Spark keeps intermediate datasets in memory. Spark can be thought of as an upgraded version of Hadoop: it is compatible with the Hadoop APIs and can read Hadoop's data and file formats, including HDFS and HBase.
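
A minimal sketch of this compatibility, assuming a SparkContext named sc and hypothetical HDFS paths; the intermediate RDD stays in memory instead of being written back to HDFS between steps:

// Read Hadoop-format input directly from HDFS (hypothetical path).
val logs = sc.textFile("hdfs:///logs/access.log")
// The intermediate result lives in memory as an RDD instead of being written to HDFS.
val errors = logs.filter(line => line.contains("ERROR")).cache()
// Two different computations reuse the in-memory intermediate result.
println(errors.count())
errors.saveAsTextFile("hdfs:///logs/errors")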

Spark on Standalone running process (client mode)

1. SparkContext connects to the Master, registers with it, and requests resources (CPU cores and memory).

2. Based on SparkContext's resource request and the information reported in the Workers' heartbeats, the Master decides which Workers to allocate resources on, acquires the resources on those Workers, and then starts StandaloneExecutorBackend.

3. StandaloneExecutorBackend registers with SparkContext.

4. SparkContext sends the application code to StandaloneExecutorBackend. SparkContext also parses the application code, builds the DAG graph, and submits it to the DAG Scheduler, which breaks it into stages (a job is spawned whenever an action is encountered; each job contains one or more stages, and stage boundaries are typically created where external data is read or a shuffle occurs). The stages (as TaskSets) are then submitted to the Task Scheduler,
which is responsible for assigning each task to a suitable Worker and finally handing it to StandaloneExecutorBackend for execution.

5. StandaloneExecutorBackend sets up an executor thread pool, starts executing tasks, and reports back to SparkContext until the tasks are completed.

6. After all tasks are completed, SparkContext deregisters from the Master and releases the resources.
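
A minimal sketch of the driver side of this flow, assuming a hypothetical standalone master at spark://master:7077 and an application jar at /path/to/app.jar; in client mode the driver, and therefore this SparkContext, runs on the submitting machine:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("StandaloneClientExample")
  .setMaster("spark://master:7077")     // step 1: connect to the Master and request resources
  .set("spark.executor.memory", "2g")   // resources asked of the Master
  .setJars(Seq("/path/to/app.jar"))     // step 4: application code shipped to the executors
val sc = new SparkContext(conf)
// ... transformations and actions run on the executors started in steps 2-5 ...
sc.stop()                               // step 6: deregister and release resources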

Spark on YARN running process (cluster mode)

1. The user submits the application to YARN via bin/spark-submit (the application deployment tool introduced in Spark 1.0.0) or bin/spark-class.

2. The ResourceManager (RM) allocates the first container to the application and launches SparkContext in that container on the designated node.

3. SparkContext requests resources from the RM to run executors.

4. The RM allocates containers to SparkContext, which communicates with the relevant NodeManagers to start StandaloneExecutorBackend in those containers. Once started, StandaloneExecutorBackend registers with SparkContext and requests tasks.

5. SparkContext assigns tasks to StandaloneExecutorBackend, which executes them and reports their status back to SparkContext.

6. After the tasks are completed, SparkContext returns the resources to the RM and deregisters.
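
A minimal sketch of the driver code in this mode; because bin/spark-submit supplies the master and deploy-mode settings when the application is launched (step 1), the code itself typically does not call setMaster:

import org.apache.spark.{SparkConf, SparkContext}

// In YARN cluster mode the driver runs inside a YARN container (step 2), so the master
// and deploy mode are normally passed on the spark-submit command line, not hard-coded.
val conf = new SparkConf().setAppName("YarnClusterExample")
val sc = new SparkContext(conf)
// ... job logic ...
sc.stop()   // step 6: resources are handed back to the ResourceManager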

About RDDs

An RDD (Resilient Distributed Dataset) has the following features:
1. It is an immutable, partitioned collection of objects spread across cluster nodes.
2. It is created through parallel transformations such as map, filter, and join.
3. It is automatically rebuilt on failure.
4. Its storage level (memory, disk, etc.) can be controlled for reuse.
5. It must be serializable.
6. It is statically typed.

An RDD is essentially a unit of computation that knows which parent RDD(s) it was derived from.
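
A minimal sketch of these properties, assuming a SparkContext named sc is already available:

// An RDD is an immutable, partitioned collection created by parallel transformations.
val nums = sc.parallelize(1 to 1000, numSlices = 4)   // 4 partitions across the cluster
val evens = nums.filter(_ % 2 == 0)                   // transformation: builds a new RDD
val paired = evens.map(n => (n % 10, n))              // statically typed: RDD[(Int, Int)]
// Each RDD knows its parent(s); a lost partition is rebuilt from that lineage
// rather than by re-running the whole job.
println(paired.partitions.length)   // 4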

RDDs are the basic unit of parallel computation in Spark. RDDs provide four types of operators:

1. Input operators: convert raw data into RDDs, such as parallelize and textFile.

2. Transformation operators: these are the operators from which Spark builds the DAG graph.
Transformation operators are not executed immediately; they are processed by the driver only after an action operator triggers them, producing a DAG graph -> stages -> tasks -> execution on workers.
Based on how a transformation operator appears in the DAG graph, these operators fall into two kinds (see the lineage sketch after this list):
Narrow dependency operators
Operators whose input and output map one to one and whose result keeps the RDD's partition structure unchanged, mainly map and flatMap;
Operators whose input and output map one to one but whose result changes the RDD's partition structure, such as union and coalesce;
Operators that select some elements from the input, such as filter, distinct, subtract, and sample.
Wide dependency operators
Wide dependencies involve a shuffle and produce stage boundaries when the DAG graph is resolved.
Operators that regroup and reduce a single RDD by key, such as groupByKey and reduceByKey;
Operators that join and regroup two RDDs by key, such as join and cogroup.

3. Cache operators: an RDD that will be used more than once can be cached to speed up computation; important data can be cached with multiple replicas.

4. Action operators: turn the results computed over an RDD back into raw data, such as count, reduce, collect, and saveAsTextFile.
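
A minimal sketch of how the narrow/wide distinction shows up in an RDD's lineage, assuming a SparkContext named sc; toDebugString prints the dependency chain, including the shuffle introduced by the wide operator:

// map is a narrow dependency: each output partition depends on exactly one input partition.
// reduceByKey is a wide dependency: it shuffles data by key and starts a new stage.
val pairs = sc.parallelize(Seq("a", "b", "a", "c")).map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)
// The lineage shows a ShuffledRDD sitting on top of the narrow parallelize/map chain.
println(counts.toDebugString)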

WordCount Example:

1. Initialize and build the SparkContext.

val ssc = new SparkContext(args(0), "WordCount", System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))

2. Input operator

val lines = ssc.textFile(args(1))

3. Transformation operator

val words = lines.flatMap(x => x.split(" "))

4. Cache operators

words.cache() // cache the RDD for reuse

5. Transformation operators

val wordCounts = words.map(x => (x, 1))
val red = wordCounts.reduceByKey((a, b) => a + b)

6. Action operator

red.saveAsTextFile("/root/Desktop/out")

RDDs support two types of operations:

Transformations, which create a new dataset from an existing one.
Actions, which run a computation on the dataset and return a value to the driver program.

For example, map is a transformation that passes each element of the dataset through a function and returns a new distributed dataset holding the results, while reduce is an action that aggregates all the elements with some function and returns the final result to the driver program (there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy; that is, they do not compute their results right away. Instead, they only remember the transformations applied to the underlying dataset (such as a file). The transformations are actually computed only when an action requires a result to be returned to the driver. This design lets Spark run more efficiently.
For example, a dataset created with map can be consumed by reduce so that only the reduce result is returned to the driver, rather than the entire, larger intermediate dataset.
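
A minimal sketch of this behaviour, assuming a SparkContext named sc and a hypothetical input path; nothing is read until the reduce action runs:

// Both of these lines only record lineage; no data is read or transformed yet.
val lines = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
val lengths = lines.map(line => line.length)
// The action triggers the whole pipeline; only a single Int comes back to the driver,
// never the full lengths dataset.
val totalChars = lengths.reduce(_ + _)
println(totalChars)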

By default, each transformed RDD is recomputed every time you run an action on it. However, you can also use the persist (or cache) method to keep an RDD in memory, in which case Spark keeps the relevant elements on the cluster so that the next query over that RDD is much faster. Persisting datasets on disk, or replicating them across the cluster, is also supported.
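
A minimal sketch of the persistence options, reusing the lines RDD from the sketch above; MEMORY_AND_DISK and the replicated _2 levels are standard StorageLevel constants:

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); a storage level can only be
// assigned once per RDD, so pick the level that matches how the data will be reused.
val words = lines.flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk when memory is tight
// Other options include DISK_ONLY and replicated levels such as MEMORY_ONLY_2.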

RDD transformations and actions supported in Spark

Note:
Some operations, such as join, are only available on key-value pairs. In addition, the function names match the APIs of Scala and other functional languages; for example, map is a one-to-one mapping, whereas flatMap maps each input to one or more outputs (similar to the map in MapReduce).
Beyond these operations, users can ask for an RDD to be cached. Users can also obtain an RDD's partitioning through the Partitioner class and then partition another RDD in the same way. Some operations, such as groupByKey, reduceByKey, and sort, automatically produce a hash- or range-partitioned RDD.
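
A minimal sketch of reusing a partitioner, assuming two hypothetical pair RDDs named userVisits and userProfiles; co-partitioning them lets the subsequent join match partitions directly instead of reshuffling both sides:

import org.apache.spark.HashPartitioner

// Hash-partition one RDD explicitly, then reuse the same partitioner for the other.
val partitioner = new HashPartitioner(8)
val visitsByUser = userVisits.partitionBy(partitioner)
val profilesByUser = userProfiles.partitionBy(partitioner)
// Both sides now share a partitioner, so the join can pair up partitions directly.
val joined = visitsByUser.join(profilesByUser)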

Running and scheduling


In the first stage, the sequence of transformation operators is recorded and the DAG graph is built.

The second stage is triggered by an action operator: the DAGScheduler turns the DAG graph into jobs and their task sets. Spark supports running locally on a single node (useful for development and debugging) as well as on a cluster. For cluster execution, the client runs on the master node and sends the partitioned task sets to the cluster's Worker/Slave nodes through the cluster manager.

