Introduction to the Basic Concepts and Features of Spark


1. What is Spark?

- High scalability

- High fault tolerance

- Memory-based computing

2. Spark's Ecosystem (BDAS: the UC Berkeley Data Analytics Stack)

- MapReduce is part of the Hadoop ecosystem; Spark is part of the BDAS ecosystem.

- Hadoop includes MapReduce, HDFS, HBase, Hive, ZooKeeper, Pig, Sqoop, etc.

- BDAS includes Spark, Shark (the Hive equivalent on Spark), BlinkDB, Spark Streaming (a real-time stream processing framework, comparable to Storm), etc.

- BDAS ecosystem diagram: (image not included in this copy)

3. Spark and MapReduce

Advantages of Spark over MapReduce:

- MapReduce usually writes intermediate results to HDFS, while Spark is a memory-based parallel big data framework that keeps intermediate results in memory, so Spark is far more efficient for iterative computation.

- MapReduce always spends a lot of time sorting, even in scenarios that do not need it; Spark avoids the overhead of unnecessary sorting.

- Spark models a job as a directed acyclic graph (DAG: a topology in which no path ever leads from a vertex back to that same vertex) and optimizes execution over the whole graph.
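The DAG property mentioned above is what lets a scheduler order work into stages. A minimal sketch in plain Python (stage names are illustrative, not Spark's API): Kahn's algorithm produces an execution order for an acyclic graph, and comes up short if a cycle exists.

```python
# Sketch: a DAG of stages, as described above. Kahn's algorithm yields an
# execution order; if the graph had a cycle, some nodes would never reach
# in-degree zero and the returned order would be shorter than the graph.
from collections import deque

edges = {"load": ["map"], "map": ["filter"], "filter": ["collect"], "collect": []}

def topo_order(graph):
    indeg = {n: 0 for n in graph}
    for outs in graph.values():
        for n in outs:
            indeg[n] += 1
    queue = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in graph[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order

print(topo_order(edges))  # ['load', 'map', 'filter', 'collect']
```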

4. Spark-supported APIs

Scala, Python, Java, etc.

5. Operation mode

- Local (for testing and development)

- Standalone (Spark's standalone cluster mode)

- Spark on YARN (Spark running on YARN)

- Spark on Mesos (Spark running on Mesos)

6. Spark at runtime

The driver program launches multiple workers. Each worker loads data from the file system into an RDD (a data structure that holds the data, split into partitions) and caches the partitions in memory.

7. RDD

- Full name: Resilient Distributed Dataset

- What is an RDD? An RDD is a read-only, partitioned collection of records; you can think of it as the data structure Spark stores data in. Everything in Spark is built on RDDs.

An RDD can be created in three ways:

1. By parallelizing an existing collection

2. By reading from a file system (local files, HDFS, HBase)

3. By transforming a parent RDD (why a parent RDD? For fault tolerance, discussed below)
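Creation path 1 above can be sketched in plain Python (no Spark required; the function name is illustrative, loosely analogous to SparkContext.parallelize): a collection is split into a fixed number of partitions that could then be distributed across workers.

```python
# Plain-Python sketch of parallelizing a collection into partitions.
def parallelize(data, num_partitions):
    """Split a collection into roughly equal round-robin partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for i, item in enumerate(data):
        partitions[i % num_partitions].append(item)
    return partitions

print(parallelize(range(7), 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```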

RDD operations come in two types:

1. Transformation: lazily evaluated. A transformation produces a new RDD but does not execute immediately; nothing runs until an action is invoked.

2. Action: submits a Spark job. When an action is called, all of the pending transformations are actually computed and the final result is produced.

Note that Hadoop only provides map and reduce as its data-processing interface, while Spark offers many more operators beyond map and reduce.
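The transformation/action split above can be sketched in plain Python, with no Spark required. The class and method names here are illustrative, not Spark's API: "transformations" only record work in a lineage, and nothing executes until an "action" is called.

```python
# Minimal sketch of lazy evaluation: map/filter only record operations;
# collect (the "action") is what actually runs the pipeline.
class TinyRDD:
    def __init__(self, data, ops=None):
        self.data = list(data)
        self.ops = ops or []          # recorded operations, not yet executed

    def map(self, f):                 # transformation: deferred
        return TinyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):              # transformation: deferred
        return TinyRDD(self.data, self.ops + [("filter", p)])

    def collect(self):                # action: runs the whole pipeline now
        out = self.data
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = TinyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# At this point nothing has been computed; only the operation lineage exists.
print(rdd.collect())  # [20, 30, 40]
```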

8. Fault Tolerance: Lineage

8.1. Basic concept of fault tolerance

- Each RDD records the parent RDD(s) it depends on; if some partitions of an RDD are lost, they can be quickly recomputed in parallel from those parents.

8.2. Narrow Dependencies and Wide Dependencies

- RDD dependencies are divided into narrow dependencies and wide dependencies.

- Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD. With no fan-out, a partition can be processed on a single node in one pass, and a lost or damaged partition can be quickly recomputed from its parent partition alone.

- Wide dependency: a partition of the parent RDD may be used by multiple child partitions. Because of these many-to-many dependencies, processing cannot continue until all of the required data has arrived at the node; if data is lost or damaged at that point, all of the upstream work would have to be redone. To make recovery feasible, the output of the preceding nodes is materialized (written to disk) before the wide dependency is processed.

- Example diagram of wide and narrow dependencies: (image not included in this copy)
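Why narrow dependencies make recovery cheap can be sketched in plain Python (names are illustrative): with a one-to-one partition mapping, a single lost child partition is rebuilt from just its corresponding parent partition, without touching the others.

```python
# Sketch of lineage-based recovery under a narrow dependency.
parent_partitions = [[1, 2], [3, 4], [5, 6]]

def transform(partition):            # the narrow (map-like) dependency
    return [x * x for x in partition]

child_partitions = [transform(p) for p in parent_partitions]

# Simulate losing child partition 1, then recover it from parent 1 alone:
child_partitions[1] = None
child_partitions[1] = transform(parent_partitions[1])
print(child_partitions)  # [[1, 4], [9, 16], [25, 36]]
```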

9. Cache Policy

Spark provides 11 built-in cache (storage) levels, all of which are combinations of four parameters: useDisk, useMemory, deserialized, and replication.

useDisk: cache data on disk (Boolean)

useMemory: cache data in memory (Boolean)

deserialized: store data in deserialized form (serialization exists so objects can be sent over the network; true = store deserialized, false = store serialized) (Boolean)

replication: number of replicas (Int)

The StorageLevel class combines these parameters through its private constructor, as follows:

class StorageLevel private (useDisk: Boolean, useMemory: Boolean, deserialized: Boolean, replication: Int)
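To see how the named levels fall out of those four flags, here is a plain-Python sketch mirroring the Scala signature above (the flag combinations shown follow the common Spark levels, but treat the exact values as an approximation):

```python
# Sketch: named storage levels as combinations of the four constructor flags.
from collections import namedtuple

StorageLevel = namedtuple(
    "StorageLevel", ["use_disk", "use_memory", "deserialized", "replication"])

MEMORY_ONLY     = StorageLevel(False, True,  True,  1)
MEMORY_ONLY_2   = StorageLevel(False, True,  True,  2)   # two replicas
MEMORY_AND_DISK = StorageLevel(True,  True,  True,  1)
DISK_ONLY       = StorageLevel(True,  False, False, 1)

print(MEMORY_AND_DISK)
```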

10. Submission Methods

- spark-submit (officially recommended)

- sbt run

- java -jar

Various parameters can be specified at submission time:


./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

For example:
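A hypothetical invocation might look like the following (the class name, master URL, jar, and options are illustrative, not from the original article; the snippet only assembles and prints the command, since actually running it requires a Spark installation):

```shell
# Assemble an illustrative spark-submit command and display it.
CMD="./bin/spark-submit \
  --class com.example.WordCount \
  --master local[4] \
  --deploy-mode client \
  --conf spark.executor.memory=2g \
  target/wordcount.jar input.txt"
echo "$CMD"
```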

For more detail on submitting applications with spark-submit, see the official documentation: http://spark.apache.org/docs/latest/submitting-applications.html

