Lesson 14: Spark RDD Demystified

Source: Internet
Author: User
Tags: hadoop, mapreduce, spark, rdd

The following are my notes from the Spark RDD lesson:

Before introducing the Spark RDD, a brief word on Hadoop MapReduce. MapReduce computes in a data-flow style: it loads data from physical storage, operates on it, and finally writes the result back to the physical storage device. This pattern produces a large number of intermediate results.

MapReduce is a poor fit for two kinds of workload: 1. computations with many iterations; 2. interactive queries. The key point: a data-flow-based approach cannot reuse intermediate results, so each computation has to go back through storage.
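
To make the cost of not reusing intermediate results concrete, here is a minimal sketch in plain Python. It is a toy model, not Hadoop or Spark code: the `Storage` class, the `step` function, and both driver functions are hypothetical names invented for illustration, and a dict stands in for the physical storage device.

```python
# Toy model only: a dict stands in for physical storage, and counters
# track how often it is touched. None of this is Hadoop or Spark API.

class Storage:
    def __init__(self, data):
        self._blocks = {"input": list(data)}
        self.reads = 0
        self.writes = 0

    def read(self, key):
        self.reads += 1
        return list(self._blocks[key])

    def write(self, key, values):
        self.writes += 1
        self._blocks[key] = list(values)


def step(values):
    """One iteration of some hypothetical iterative algorithm."""
    return [v * 0.5 + 1.0 for v in values]


def data_flow_style(storage, iterations):
    """Each iteration is its own job: read from storage, compute, write back."""
    key = "input"
    for i in range(iterations):
        values = storage.read(key)
        key = f"iter-{i}"
        storage.write(key, step(values))
    return storage.read(key)


def working_set_style(storage, iterations):
    """Load once, keep the working set in memory, write only the final result."""
    values = storage.read("input")
    for _ in range(iterations):
        values = step(values)
    storage.write("result", values)
    return values


flow_store = Storage([0.0, 2.0, 4.0])
flow_result = data_flow_style(flow_store, iterations=10)

ws_store = Storage([0.0, 2.0, 4.0])
ws_result = working_set_style(ws_store, iterations=10)

print("data-flow reads/writes:", flow_store.reads, flow_store.writes)
print("working-set reads/writes:", ws_store.reads, ws_store.writes)
```

Both functions produce the same values; the difference is only in how often storage is touched, and in this model that gap grows linearly with the number of iterations.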

The Spark RDD, by contrast, is based on a working set. RDD stands for Resilient Distributed Dataset.

The resilience of an RDD shows up in the following ways:

1. Data is stored in memory or on disk automatically, and switched between the two as needed;

2. Efficient fault tolerance based on lineage;

3. A failed task is automatically retried a set number of times;

4. A failed stage is automatically retried a set number of times, and only the failed partitions (shards) are recomputed;

5. Fault tolerance through checkpoint and persist;

6. Resilient data scheduling, independent of the DAG, the job scheduler, etc.;

7. Highly flexible partitioning: the number of partitions can be set manually. Of the functions that set it, repartition goes through the shuffle mechanism by default, while coalesce can be chosen to change the partition count (by default without a shuffle).
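
The difference between the two functions in point 7 can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's implementation: an "RDD" here is just a list of lists, the two functions below are hypothetical stand-ins that only borrow the names, and the round-robin placement is a simplification of what a real shuffle does. (In Spark itself, `repartition(n)` is defined as `coalesce(n, shuffle = true)`.)

```python
# Toy model: an "RDD" is just a list of partitions (lists). Not Spark API.

def coalesce(partitions, num_partitions):
    """Merge neighbouring partitions without moving individual records.

    Like coalesce without a shuffle, this can only reduce the partition count.
    """
    current = len(partitions)
    if num_partitions >= current:
        return [list(p) for p in partitions]
    merged = [[] for _ in range(num_partitions)]
    for index, part in enumerate(partitions):
        # Map contiguous groups of parent partitions onto one child each.
        target = index * num_partitions // current
        merged[target].extend(part)
    return merged


def repartition(partitions, num_partitions):
    """Redistribute every record round-robin: a stand-in for a full shuffle."""
    shuffled = [[] for _ in range(num_partitions)]
    position = 0
    for part in partitions:
        for record in part:
            shuffled[position % num_partitions].append(record)
            position += 1
    return shuffled


# Deliberately skewed input: one large partition and three small ones.
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

print("coalesce to 2:   ", coalesce(parts, 2))
print("repartition to 2:", repartition(parts, 2))
print("repartition to 6:", repartition(parts, 6))
```

In this model, coalesce is cheap because whole partitions stay together, but it inherits any skew in the input and cannot increase the partition count. Repartition touches every record, which is the price of getting evenly sized partitions or more partitions than before.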

An RDD is a collection of data partitions (shards) distributed across a cluster, and the same computational logic is applied to every partition.

Distributed datasets generally achieve fault tolerance in one of two ways: checkpointing the data, or recording how the data is updated.

Why is recording how the data is updated an efficient approach for RDDs?

1. An RDD is immutable, and its computation is lazy;

2. RDD writes (updates) are coarse-grained, while reads can be either coarse-grained or fine-grained.
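
The following sketch shows why these two properties make recorded updates cheap to replay. It is a toy model in plain Python, not Spark's classes: `ToyRDD` and all of its methods are hypothetical names, and "losing" a partition is simulated by dropping it from an in-memory cache.

```python
# Toy model of lineage-based recovery. ToyRDD and its methods are
# hypothetical names for illustration, not Spark's classes.

class ToyRDD:
    def __init__(self, num_partitions, compute_partition, parent=None):
        self.num_partitions = num_partitions
        self._compute_partition = compute_partition  # the recorded "recipe"
        self.parent = parent  # link back to the parent: the lineage
        self._cache = {}  # partition index -> materialised records

    @classmethod
    def from_partitions(cls, partitions):
        frozen = [tuple(p) for p in partitions]  # source data never changes
        return cls(len(frozen), lambda i: list(frozen[i]))

    def map(self, fn):
        # Lazy and coarse-grained: record one function for the whole dataset
        # and return a new ToyRDD; nothing is computed at this point.
        return ToyRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in self.partition(i)],
            parent=self,
        )

    def partition(self, i):
        # Compute a partition on first use; tuples keep the result immutable.
        if i not in self._cache:
            self._cache[i] = tuple(self._compute_partition(i))
        return self._cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

    def lose_partition(self, i):
        # Simulate a failure that wipes out one materialised partition.
        self._cache.pop(i, None)


calls = []  # records every element the user function is applied to


def square(x):
    calls.append(x)
    return x * x


base = ToyRDD.from_partitions([[1, 2], [3, 4], [5, 6]])
squared = base.map(square)
calls_before_action = list(calls)  # still empty: map was lazy

first = squared.collect()  # the action triggers the computation

squared.lose_partition(1)
calls.clear()
second = squared.collect()  # only partition 1 is rebuilt from its recipe

print("recomputed elements:", calls)
```

Because the input never changes and the update is a single function applied to the whole dataset, the only thing that needs to be remembered is that function and a pointer to the parent. Rebuilding a lost partition means re-running the function on that partition alone, with no need to keep a copy of the data or a log of individual record changes.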


RDD limitations: 1. Fine-grained update operations are not supported;

2. Incremental iterative computation is not supported.

Note:

Material from: DT Big Data DreamWorks (IMF Legendary Action secret course)



This article is from "Dt_spark Big Data DreamWorks" blog, please make sure to keep this source http://18610086859.blog.51cto.com/11484530/1771134
