The following are notes from the Spark RDD Decryption lesson:
Before introducing Spark RDDs, a brief word on Hadoop MapReduce. MapReduce computes in a data-flow style: it loads data from physical storage, operates on it, and finally writes the result back to physical storage. This pattern produces a large number of intermediate results.
MapReduce is a poor fit for two kinds of workload: 1. computations with many iterations; 2. interactive queries. (Key point: data-flow based approaches cannot reuse intermediate results.)
Spark RDDs, by contrast, are based on a working set. RDD stands for Resilient Distributed Dataset.
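The difference between the two models can be sketched in plain Python. This is only a toy illustration, not Spark or Hadoop code: the `Storage` class and both `iterate_*` functions are hypothetical names made up here, and a real MapReduce chain would also write its intermediate results back to storage.

```python
# Toy comparison: data-flow style (reload on every pass) vs. working-set style
# (load once, keep in memory). `Storage` is a hypothetical stand-in for
# physical storage; it only counts how often it is read.

class Storage:
    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._data)


def iterate_dataflow(storage, iterations):
    """Every iteration goes back to storage, as a chain of separate jobs would."""
    total = 0
    for _ in range(iterations):
        data = storage.load()
        total += sum(x * x for x in data)
    return total


def iterate_working_set(storage, iterations):
    """Load once and reuse the in-memory working set on every iteration."""
    cached = storage.load()
    total = 0
    for _ in range(iterations):
        total += sum(x * x for x in cached)
    return total


dataflow_store = Storage(range(5))
working_set_store = Storage(range(5))
dataflow_total = iterate_dataflow(dataflow_store, 10)
working_set_total = iterate_working_set(working_set_store, 10)
print(dataflow_store.reads, working_set_store.reads)
```

Both functions compute the same total, but the data-flow version reads storage once per iteration while the working-set version reads it once overall.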
The resilience of an RDD shows in the following points:
1. Automatic switching of data storage between memory and disk;
2. Efficient fault tolerance based on lineage;
3. A failed task is automatically retried a set number of times;
4. A failed stage is automatically retried a set number of times, and only the failed partitions are recomputed;
5. Fault tolerance through checkpoint and persist;
6. Resilient data scheduling, independent of the DAG, the job scheduler, and so on;
7. Highly flexible partitioning: the number of partitions can be set manually. repartition uses the shuffle mechanism by default; coalesce can be used instead to change the number of partitions, and it does not shuffle by default.
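Points 2 and 4 above (lineage-based fault tolerance, recomputing only the failed partitions) can be sketched with a toy model. This is plain Python under simplifying assumptions, not Spark's implementation: `ToyRDD` and all of its attributes are hypothetical names, and the lineage here is just an ordered list of per-element functions.

```python
# Toy model of lineage-based recovery: a lost partition is rebuilt by
# replaying the recorded transformations on that partition's source data only.

class ToyRDD:
    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # one tuple per partition
        self.lineage = tuple(lineage)                # ordered per-element functions
        self.materialized = {}                       # partition index -> computed rows
        self.compute_calls = []                      # which partitions were computed, in order

    def map(self, fn):
        # A transformation returns a new ToyRDD with a longer lineage.
        return ToyRDD(self.source_partitions, self.lineage + (fn,))

    def compute_partition(self, index):
        self.compute_calls.append(index)
        rows = self.source_partitions[index]
        for fn in self.lineage:
            rows = tuple(fn(x) for x in rows)
        self.materialized[index] = rows
        return rows

    def collect(self):
        out = []
        for i in range(len(self.source_partitions)):
            if i not in self.materialized:
                self.compute_partition(i)
            out.extend(self.materialized[i])
        return out


rdd = ToyRDD(((1, 2), (3, 4), (5, 6))).map(lambda x: x * 10).map(lambda x: x + 1)
first = rdd.collect()
del rdd.materialized[1]      # simulate losing partition 1
second = rdd.collect()       # only partition 1 is recomputed
print(rdd.compute_calls)
```

After the simulated loss, the second `collect()` returns the same result as the first, and `compute_calls` shows that partitions 0 and 2 were not computed again.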
An RDD is a collection of data partitions distributed across a cluster, with the same computational logic applied to each partition.
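The difference between changing the partition count with and without a shuffle can be sketched as follows. These two functions are plain-Python toys that only borrow the names `coalesce` and `repartition`; they are not Spark's API, and the round-robin placement is an assumption chosen for illustration.

```python
# Toy model: partitions are lists of records.

def coalesce(partitions, n):
    """Merge whole neighbouring partitions into n groups (no record-level movement)."""
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i * n // len(partitions)].extend(part)
    return groups


def repartition(partitions, n):
    """Redistribute every record round-robin, as a stand-in for a full shuffle."""
    groups = [[] for _ in range(n)]
    position = 0
    for part in partitions:
        for record in part:
            groups[position % n].append(record)
            position += 1
    return groups


parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
merged = coalesce(parts, 2)
shuffled = repartition(parts, 3)
print(merged)
print(shuffled)
```

In the coalesced result every original partition stays intact inside one output partition, while in the repartitioned result records from the same original partition end up in different output partitions.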
Common approaches to fault tolerance: checkpointing the data, and recording the updates made to the data.
Why is it efficient for an RDD to record how the data is updated?
1. An RDD is immutable, and computation is lazy;
2. RDD write operations are coarse-grained, while read operations can be coarse-grained or fine-grained.
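Immutability and lazy computation can be illustrated with a small plain-Python sketch. `LazyDataset` is a hypothetical class written for this note, not part of Spark: transformations only record a step and return a new object, and nothing runs until `collect()` is called.

```python
# Toy model of an immutable, lazily evaluated dataset.

class LazyDataset:
    def __init__(self, source, steps=()):
        self._source = tuple(source)   # never modified after construction
        self._steps = tuple(steps)     # recorded (kind, function) pairs

    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # The action: only now are the recorded steps applied.
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []

def traced_double(x):
    calls.append(x)   # record that the function actually ran
    return x * 2

base = LazyDataset([1, 2, 3, 4])
doubled = base.map(traced_double)
large = doubled.filter(lambda x: x > 4)
calls_before_action = len(calls)   # nothing has been computed yet
result = large.collect()
print(calls_before_action, result, base.collect())
```

Because each dataset is only a source plus a list of recorded steps, it can be rebuilt at any time from that record, which is what makes logging updates cheaper than copying the data.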
RDD limitations: 1. Fine-grained update operations are not supported;
2. Incremental iterative computation is not supported.
Note:
Source: Dt_ Big Data DreamWorks (IMF legendary action secret course)
This article is from the "Dt_spark Big Data DreamWorks" blog; please keep this source attribution: http://18610086859.blog.51cto.com/11484530/1771134
Lesson 14: Spark RDD Decryption