Spark Brief
Spark originated at the AMPLab at the University of California, Berkeley, as a cluster computing platform. It is based on in-memory computing and covers several computational paradigms, from multi-iteration batch processing to ad hoc data warehousing queries, stream processing, and graph computation.
Features:
1. Lightweight
The Spark 0.6 core is about 20,000 lines of code, compared with roughly 90,000 lines for Hadoop 1.0 and 220,000 lines for Hadoop 2.0.
2. Fast
Spark can reach sub-second latency on small datasets, which is out of reach for Hadoop MapReduce: its heartbeat-based mechanism alone adds a delay of several seconds just to launch a task.
3. Flexible
At the implementation level, it makes full use of Scala's trait-based dynamic mixin strategy (for example, a replaceable cluster scheduler and serialization library);
At the primitive level, it allows new data operators, new data sources, and new language bindings (Java and Python) to be added;
At the paradigm level, Spark supports many paradigms, including in-memory computation, multi-iteration batch processing, stream processing, and graph computation.
4. Clever
It is clever in how it positions itself and borrows strength from existing systems:
Spark integrates seamlessly with Hadoop and leverages Hadoop's strengths rather than replacing them.
Why does Spark perform faster than Hadoop?
1. Hadoop's data processing model
Data processing is disk-based and intermediate results are written to disk, so a MapReduce job is accompanied by a large amount of disk I/O.
2. Spark uses memory instead of HDFS to store intermediate results
The first generation of Hadoop stored intermediate results entirely in HDFS, and the second generation added a cache to hold them; Spark, in contrast, keeps intermediate datasets in memory. Spark can be viewed as an upgraded alternative to Hadoop MapReduce: it is compatible with the Hadoop API and can read Hadoop's data sources and file formats, including HDFS and HBase.
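A minimal sketch of this difference, assuming a placeholder HDFS path (hdfs://namenode:9000/logs) and a local master used only for illustration; the filtered intermediate result stays in memory instead of being written back to disk between jobs:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
val sc = new SparkContext(new SparkConf().setAppName("CacheDemo").setMaster("local[*]"))
// Read once from HDFS, then keep the filtered intermediate result in memory
val errors = sc.textFile("hdfs://namenode:9000/logs").filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_ONLY)
val total = errors.count() // first action materializes the RDD and fills the cache
val fatal = errors.filter(_.contains("FATAL")).count() // second action reuses the in-memory data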
Spark on Standalone execution process (client mode)
1. The SparkContext connects to the Master, registers with it, and requests resources (CPU cores and memory).
2. Based on the SparkContext's resource request and the information reported in the Workers' heartbeats, the Master decides on which Workers to allocate resources, allocates them, and starts a StandaloneExecutorBackend on those Workers.
3. Each StandaloneExecutorBackend registers with the SparkContext.
4. The SparkContext sends the application code to the StandaloneExecutorBackends and, while parsing the application code, builds the DAG graph. The DAG is submitted to the DAGScheduler, which decomposes it into stages (a job is spawned whenever an action is encountered; each job contains one or more stages, and stage boundaries are usually created where external data is fetched and where a shuffle occurs). The stages are then submitted as task sets (TaskSets) to the TaskScheduler,
which assigns each task to the appropriate Worker and finally hands it to the StandaloneExecutorBackend for execution.
5. The StandaloneExecutorBackend builds an executor thread pool, starts executing tasks, and reports back to the SparkContext until the tasks are complete.
6. After all tasks have finished, the SparkContext deregisters from the Master and releases the resources.
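As an illustration of the client-mode flow above, a driver program can point its SparkContext directly at a standalone master; the master URL and resource settings below are placeholders:
import org.apache.spark.{SparkConf, SparkContext}
// The driver (SparkContext) runs in the client process and registers with the standalone Master
val conf = new SparkConf()
  .setAppName("StandaloneClientDemo")
  .setMaster("spark://master-host:7077") // placeholder standalone master URL
  .set("spark.executor.memory", "2g")    // memory requested per executor
  .set("spark.cores.max", "4")           // total CPU cores requested from the Master
val sc = new SparkContext(conf)
// Executors started on the Workers register back with this SparkContext; tasks then run on them
println(sc.parallelize(1 to 1000).map(_ * 2).sum())
sc.stop() // deregister from the Master and release the resources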
Spark on YARN execution process (cluster mode)
1. The user submits the application to YARN via bin/spark-submit (the application deployment tool introduced in Spark 1.0.0) or bin/spark-class.
2. The ResourceManager (RM) allocates the first container for the application and launches the SparkContext in that container on the designated node.
3. The SparkContext applies to the RM for resources to run executors.
4. The RM allocates containers to the SparkContext, which communicates with the corresponding NodeManagers (NMs) to start a StandaloneExecutorBackend in each container. Once started, each StandaloneExecutorBackend registers with the SparkContext and requests tasks.
5. The SparkContext assigns tasks to the StandaloneExecutorBackends, which execute the tasks and report their execution status back to the SparkContext.
6. When task execution is complete, the SparkContext returns its resources to the RM, deregisters, and exits.
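A minimal sketch of an application intended for YARN cluster mode; the object name and JAR are hypothetical, and the spark-submit invocation is shown only as a comment:
import org.apache.spark.{SparkConf, SparkContext}
object YarnClusterDemo {
  def main(args: Array[String]): Unit = {
    // Submitted with, for example:
    //   bin/spark-submit --master yarn --deploy-mode cluster --class YarnClusterDemo demo.jar
    // The master and deploy mode come from spark-submit, so they are not hard-coded here
    val sc = new SparkContext(new SparkConf().setAppName("YarnClusterDemo"))
    val counts = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println) // in cluster mode this prints into the driver's YARN container log
    sc.stop() // return resources to the ResourceManager and deregister
  }
}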
A brief introduction to RDDs
An RDD (Resilient Distributed Dataset) has the following characteristics:
1. It is an immutable, partitioned collection of objects spread across cluster nodes.
2. It is created through parallel transformations such as map, filter, and join.
3. It can automatically rebuild itself when a failure occurs.
4. Its storage level (memory, disk, etc.) can be controlled for reuse.
5. It must be serializable.
6. It is statically typed.
An RDD is essentially a unit of computation that knows its parent units of computation (its lineage).
The RDD is the basic unit of parallel computation in Spark.
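A short sketch of these properties using only core RDD APIs (the local master is just for illustration):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
val sc = new SparkContext(new SparkConf().setAppName("RddProperties").setMaster("local[2]"))
val nums = sc.parallelize(1 to 10, 4)          // an immutable collection split into 4 partitions
val doubled = nums.map(_ * 2)                  // transformations build new RDDs; nums itself is unchanged
println(doubled.partitions.length)             // 4: map preserves the partition structure
doubled.persist(StorageLevel.MEMORY_AND_DISK)  // the storage level can be controlled for reuse
println(doubled.toDebugString)                 // the lineage used to rebuild lost partitions on failure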
The RDD API provides four kinds of operators:
1. Input operators: convert raw data into RDDs, such as parallelize and textFile.
2. Transformation operators: the most basic operators, and the ones Spark uses to build the DAG graph.
Transformations are not executed immediately; they are processed by the driver only after an action operator triggers them, producing a DAG graph –> stages –> tasks –> execution on workers.
By their effect on the DAG graph, transformations can be divided into two kinds (see the sketch after this list):
Narrow-dependency operators
Operators whose input and output partitions map one-to-one and whose result keeps the RDD's partition structure unchanged, mainly map and flatMap;
Operators whose input and output map one-to-one but whose result changes the RDD's partition structure, such as union and coalesce;
Operators that select a subset of elements from the input, such as filter, distinct, subtract, and sample.
Wide-dependency operators
Wide dependencies involve shuffle-class operations, which produce stage boundaries when the DAG graph is resolved:
Operators that regroup and reduce a single RDD by key, such as groupByKey and reduceByKey;
Operators that join and regroup two RDDs by key, such as join and cogroup.
3. Cache operators: for RDDs that will be used multiple times, caching speeds up execution, and multiple cached replicas can be kept for critical data.
4. Action operators: turn the results of RDD computations back into raw data, such as count, reduce, collect, and saveAsTextFile.
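A minimal sketch contrasting a narrow-dependency operator with a wide-dependency one (the local master is just for illustration):
import org.apache.spark.{SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setAppName("DependencyDemo").setMaster("local[*]"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val narrow = pairs.mapValues(_ + 1)    // narrow: each output partition depends on exactly one input partition
val wide = narrow.reduceByKey(_ + _)   // wide: needs a shuffle, so it opens a new stage in the DAG
println(wide.toDebugString)            // the lineage shows a ShuffledRDD at the stage boundary
wide.collect().foreach(println)        // the action triggers execution of the whole DAG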
WordCount example:
1. Initialization: build the SparkContext.
val ssc=new SparkContext(args(0),"WordCount",System.getenv("SPARK_HOME"),Seq(System.getenv("SPARK_EXAMPLES_JAR")))
2. Input operator
val lines=ssc.textFile(args(1))
3. Transformation operator
val words=lines.flatMap(x=>x.split(" "))
4. Cache operator
words.cache() // cache the words RDD so it can be reused
5. Transformation operators
val wordCounts=words.map(x=>(x,1))
val red=wordCounts.reduceByKey((a,b)=>a+b)
6. Action operator
red.saveAsTextFile("/root/Desktop/out")
The RDD supports two types of operations:
Transformations, which create a new dataset from an existing one;
Actions, which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation: it passes each element of a dataset through a function and returns a new distributed dataset of the results. reduce is an action: it aggregates all of the elements with some function and returns the final result to the driver program (reduceByKey, by contrast, is a parallel transformation that returns a distributed dataset).
All transformations in Spark are lazy; that is, they do not compute their results right away. Instead, they just remember the transformations applied to a base dataset (such as a file). The transformations are only computed when an action requires a result to be returned to the driver. This design lets Spark run more efficiently.
For example, a dataset created through map can be consumed directly by reduce, so that only the result of the reduce is returned to the driver rather than the whole, larger mapped dataset.
By default, each transformed RDD is recomputed every time you run an action on it.
However, you can use the persist (or cache) method to keep an RDD in memory.
In that case, Spark keeps the relevant elements around on the cluster, so the next time you query the RDD it can be accessed much more quickly.
Persisting datasets on disk, or replicating them across the cluster, is also supported.
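A small sketch of this lazy behavior combined with persist (the input path and local master are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setAppName("LazyDemo").setMaster("local[*]"))
val lengths = sc.textFile("/tmp/input.txt").map(_.length) // nothing is computed yet
lengths.persist()                                         // only marks the RDD to be kept in memory
val total = lengths.reduce(_ + _)                         // the action triggers the computation and fills the cache
val longest = lengths.max()                               // reuses the cached data instead of re-reading the file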
RDD transformations and actions supported in Spark
Note:
Some operations are only available for key-value pairs, such as join.
In addition, the function names match the APIs in Scala and other functional languages: for example, map is a one-to-one mapping, while flatMap maps each input to one or more outputs (similar to map in MapReduce).
Beyond these operations, users can ask for an RDD to be cached. Users can also obtain an RDD's partitioning through the Partitioner class and then partition another RDD in the same way. Some operations, such as groupByKey, reduceByKey, and sort, automatically produce a hash- or range-partitioned RDD.
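For example, two pair RDDs can be given the same partitioner so that a later join does not need to re-shuffle the pre-partitioned side; HashPartitioner is one of the built-in partitioners (a sketch with made-up data):
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setAppName("PartitionerDemo").setMaster("local[*]"))
val part = new HashPartitioner(4)
val users = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(part).cache()
val events = sc.parallelize(Seq((1, "click"), (2, "view"))).partitionBy(part)
println(users.partitioner)                     // Some(HashPartitioner), exposed via the Partitioner class
users.join(events).collect().foreach(println)  // both sides share the partitioning, so users is not re-shuffled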
Execution and scheduling
In the first phase, the sequence of transformation operators is recorded and the DAG graph is built.
The second phase is triggered by an action operator: the DAGScheduler turns the DAG graph into a job and its task sets. Spark supports local single-node execution (useful for development and debugging) as well as cluster execution. For cluster execution, the client runs on the master node and, through the cluster manager, sends the partitioned task sets to the Worker/Slave nodes of the cluster.
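A small illustration of the two phases; the choice between local single-node execution and cluster execution is just the master URL (placeholders below):
import org.apache.spark.{SparkConf, SparkContext}
// "local[*]" runs everything inside the driver process (handy for development and debugging);
// "spark://master-host:7077" or "yarn" would send the task sets to cluster workers instead
val sc = new SparkContext(new SparkConf().setAppName("SchedulingDemo").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 100).map(_ * 2).filter(_ % 3 == 0) // phase 1: only the lineage/DAG is recorded
val n = rdd.count() // phase 2: the action submits a job; the DAGScheduler splits it into stages and task sets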