Introduction to the Basic Concepts and Features of Spark


1. What is Spark?

- High scalability

- High fault tolerance

- Memory-based computing

2. Spark's Ecosystem (BDAS: the UC Berkeley Data Analytics Stack)

- MapReduce is part of the Hadoop ecosystem; Spark is part of the BDAS ecosystem.

- Hadoop includes MapReduce, HDFS, HBase, Hive, ZooKeeper, Pig, Sqoop, etc.

- BDAS includes Spark, Shark (the Hive equivalent on Spark), BlinkDB, Spark Streaming (a real-time stream processing framework, comparable to Storm), etc.

- BDAS ecosystem diagram: (image not included in this copy)

3. Spark and MapReduce

Advantages of Spark over MapReduce:

- MapReduce usually writes intermediate results to HDFS, while Spark is a memory-based parallel big data framework that keeps intermediate results in memory, so Spark is far more efficient for iterative computation.

- MapReduce always spends a lot of time sorting, even in scenarios that do not need it; Spark avoids the overhead of unnecessary sorting.

- Spark models a job as a directed acyclic graph (DAG: a topology in which no path ever leads from a vertex back to that same vertex) and optimizes execution over the whole graph.
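The DAG property mentioned above is what lets a scheduler order work into stages. A minimal sketch in plain Python (stage names are illustrative, not Spark's API): Kahn's algorithm produces an execution order for an acyclic graph, and comes up short if a cycle exists.

```python
# Sketch: a DAG of stages, as described above. Kahn's algorithm yields an
# execution order; if the graph had a cycle, some nodes would never reach
# in-degree zero and the returned order would be shorter than the graph.
from collections import deque

edges = {"load": ["map"], "map": ["filter"], "filter": ["collect"], "collect": []}

def topo_order(graph):
    indeg = {n: 0 for n in graph}
    for outs in graph.values():
        for n in outs:
            indeg[n] += 1
    queue = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in graph[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order

print(topo_order(edges))  # ['load', 'map', 'filter', 'collect']
```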

4. Spark-supported APIs

Scala, Python, Java, etc.

5. Operation mode

- Local (for testing and development)

- Standalone (Spark's standalone cluster mode)

- Spark on YARN (Spark running on YARN)

- Spark on Mesos (Spark running on Mesos)

6. Spark at runtime

The driver program launches multiple workers. Each worker loads data from the file system into an RDD (a data structure that holds the data, split into partitions) and caches the partitions in memory.

7. RDD

- Full name: Resilient Distributed Dataset

- What is an RDD? An RDD is a read-only, partitioned collection of records; you can think of it as the data structure Spark stores data in. Everything in Spark is built on RDDs.

An RDD can be created in three ways:

1. By parallelizing an existing collection

2. By reading from a file system (local files, HDFS, HBase)

3. By transforming a parent RDD (why a parent RDD? For fault tolerance, discussed below)
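Creation path 1 above can be sketched in plain Python (no Spark required; the function name is illustrative, loosely analogous to SparkContext.parallelize): a collection is split into a fixed number of partitions that could then be distributed across workers.

```python
# Plain-Python sketch of parallelizing a collection into partitions.
def parallelize(data, num_partitions):
    """Split a collection into roughly equal round-robin partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for i, item in enumerate(data):
        partitions[i % num_partitions].append(item)
    return partitions

print(parallelize(range(7), 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```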

RDD operations come in two types:

1. Transformation: lazily evaluated. A transformation produces a new RDD but does not execute immediately; nothing runs until an action is invoked.

2. Action: submits a Spark job. When an action is called, all of the pending transformations are actually computed and the final result is produced.

Note that Hadoop only provides map and reduce as its data-processing interface, while Spark offers many more operators beyond map and reduce.
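The transformation/action split above can be sketched in plain Python, with no Spark required. The class and method names here are illustrative, not Spark's API: "transformations" only record work in a lineage, and nothing executes until an "action" is called.

```python
# Minimal sketch of lazy evaluation: map/filter only record operations;
# collect (the "action") is what actually runs the pipeline.
class TinyRDD:
    def __init__(self, data, ops=None):
        self.data = list(data)
        self.ops = ops or []          # recorded operations, not yet executed

    def map(self, f):                 # transformation: deferred
        return TinyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):              # transformation: deferred
        return TinyRDD(self.data, self.ops + [("filter", p)])

    def collect(self):                # action: runs the whole pipeline now
        out = self.data
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = TinyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# At this point nothing has been computed; only the operation lineage exists.
print(rdd.collect())  # [20, 30, 40]
```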

8. Fault Tolerance: Lineage

8.1. Basic concept of fault tolerance

- Each RDD records the parent RDD(s) it depends on; if some partitions of an RDD are lost, they can be quickly recomputed in parallel from those parents.

8.2. Narrow Dependencies and Wide Dependencies

- RDD dependencies are divided into narrow dependencies and wide dependencies.

- Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD. With no fan-out, a partition can be processed on a single node in one pass, and a lost or damaged partition can be quickly recomputed from its parent partition alone.

- Wide dependency: a partition of the parent RDD may be used by multiple child partitions. Because of these many-to-many dependencies, processing cannot continue until all of the required data has arrived at the node; if data is lost or damaged at that point, all of the upstream work would have to be redone. To make recovery feasible, the output of the preceding nodes is materialized (written to disk) before the wide dependency is processed.

- Example diagram of wide and narrow dependencies: (image not included in this copy)
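Why narrow dependencies make recovery cheap can be sketched in plain Python (names are illustrative): with a one-to-one partition mapping, a single lost child partition is rebuilt from just its corresponding parent partition, without touching the others.

```python
# Sketch of lineage-based recovery under a narrow dependency.
parent_partitions = [[1, 2], [3, 4], [5, 6]]

def transform(partition):            # the narrow (map-like) dependency
    return [x * x for x in partition]

child_partitions = [transform(p) for p in parent_partitions]

# Simulate losing child partition 1, then recover it from parent 1 alone:
child_partitions[1] = None
child_partitions[1] = transform(parent_partitions[1])
print(child_partitions)  # [[1, 4], [9, 16], [25, 36]]
```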

9. Cache Policy

Spark provides 11 built-in cache (storage) levels, all of which are combinations of four parameters: useDisk, useMemory, deserialized, and replication.

useDisk: cache data on disk (Boolean)

useMemory: cache data in memory (Boolean)

deserialized: store data in deserialized form (serialization exists so objects can be sent over the network; true = store deserialized, false = store serialized) (Boolean)

replication: number of replicas (Int)

The StorageLevel class combines these parameters through its private constructor, as follows:

class StorageLevel private (useDisk: Boolean, useMemory: Boolean, deserialized: Boolean, replication: Int)
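To see how the named levels fall out of those four flags, here is a plain-Python sketch mirroring the Scala signature above (the flag combinations shown follow the common Spark levels, but treat the exact values as an approximation):

```python
# Sketch: named storage levels as combinations of the four constructor flags.
from collections import namedtuple

StorageLevel = namedtuple(
    "StorageLevel", ["use_disk", "use_memory", "deserialized", "replication"])

MEMORY_ONLY     = StorageLevel(False, True,  True,  1)
MEMORY_ONLY_2   = StorageLevel(False, True,  True,  2)   # two replicas
MEMORY_AND_DISK = StorageLevel(True,  True,  True,  1)
DISK_ONLY       = StorageLevel(True,  False, False, 1)

print(MEMORY_AND_DISK)
```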

10. Submission Methods

- spark-submit (officially recommended)

- sbt run

- java -jar

Various parameters can be specified at submission time:


./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

For example:
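A hypothetical invocation might look like the following (the class name, master URL, jar, and options are illustrative, not from the original article; the snippet only assembles and prints the command, since actually running it requires a Spark installation):

```shell
# Assemble an illustrative spark-submit command and display it.
CMD="./bin/spark-submit \
  --class com.example.WordCount \
  --master local[4] \
  --deploy-mode client \
  --conf spark.executor.memory=2g \
  target/wordcount.jar input.txt"
echo "$CMD"
```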

For more detail on submitting applications with spark-submit, see the official documentation: http://spark.apache.org/docs/latest/submitting-applications.html

