Spark can read and write data directly on HDFS and also supports running on YARN, so it can live in the same cluster as MapReduce and share its storage and compute resources. The data-warehouse counterpart, Shark, borrows heavily from Hive and is almost completely compatible with it.
Spark's core concepts
1. Resilient Distributed Dataset (RDD)
The RDD is Spark's most basic abstraction: an abstraction of distributed memory that lets you manipulate a distributed data set in the same way you would manipulate a local collection. RDDs are the core of Spark. An RDD is a partitioned, immutable collection of records that can be operated on in parallel, and different data-set formats correspond to different RDD implementations. An RDD must be serializable. RDDs can be cached in memory, so the result of each operation on an RDD can be kept in memory and the next operation can read it directly from memory, eliminating the large amount of disk I/O that MapReduce incurs. For iterative workloads such as machine-learning algorithms and interactive data mining, this brings a large efficiency gain.
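To make the caching idea concrete, here is a minimal Scala sketch, assuming a SparkContext `sc` (for example from spark-shell); the HDFS path and iteration count are illustrative placeholders, not part of the original text.

```scala
// A sketch assuming a SparkContext `sc`; path and iteration count are illustrative.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()                                    // keep the parsed data set in memory

for (i <- 1 to 10) {
  val sum = points.map(_.sum).reduce(_ + _)   // each pass reads the cached partitions, not HDFS
  println(s"iteration $i: sum = $sum")
}
```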
RDD features:
- It is an immutable, partitioned collection of objects spread across cluster nodes.
- It is created through parallel transformations (map, filter, join, etc.).
- It is rebuilt automatically on failure.
- Its storage level (memory, disk, etc.) can be controlled for reuse.
- It must be serializable.
- It is statically typed.
The benefits of RDD
RDDs can only be created from persistent storage or through transformation operations, which allows more efficient fault tolerance than distributed shared memory (DSM): a lost data partition can be recomputed from its lineage without requiring a dedicated checkpoint. The immutability of RDDs enables speculative execution in the style of Hadoop MapReduce. The partitioned nature of RDDs improves performance through data locality, just as in Hadoop MapReduce. Because RDDs are serializable, they automatically degrade to disk storage when memory is insufficient; with RDDs on disk, performance drops significantly but is still no worse than MapReduce.
RDD storage and partitioning:
Users can choose different storage levels to store an RDD for reuse. By default an RDD is stored in memory, but it spills to disk when memory is insufficient. When an RDD needs to be partitioned to spread its data across the cluster, it partitions each record by key (for example, hash partitioning), which makes a later join between two data sets efficient.
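A short sketch of key-based partitioning, assuming a SparkContext `sc`; the data values and partition count are made up for illustration.

```scala
import org.apache.spark.HashPartitioner

// Made-up key-value data; assumes a SparkContext `sc`.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
val orders = sc.parallelize(Seq((1, 9.99), (1, 4.50), (3, 12.00)))

// Hash-partition by key once and cache the result for reuse.
val partitionedUsers = users.partitionBy(new HashPartitioner(4)).cache()

// A join against an already hash-partitioned RDD does not have to reshuffle it.
partitionedUsers.join(orders).collect().foreach(println)
```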
RDD's internal representation:
- A list of partitions (blocks of data)
- A function for computing each split (how this RDD is computed from its parent RDD)
- A list of dependencies on parent RDDs
- A Partitioner for key-value RDDs [optional]
- A list of preferred locations for each split (such as the locations of an HDFS file's blocks) [optional]
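Most of these pieces can be inspected directly on an RDD. A sketch assuming `sc` from spark-shell and an illustrative HDFS path:

```scala
// Assumes a SparkContext `sc`; the path is illustrative.
val file  = sc.textFile("hdfs:///data/input.txt")
val pairs = file.map(line => (line.length, line)).groupByKey()

println(pairs.partitions.length)   // the list of partitions (data blocks)
println(pairs.dependencies)        // dependencies on the parent RDD
println(pairs.partitioner)         // Some(HashPartitioner) for this key-value RDD
println(file.preferredLocations(file.partitions(0)))  // preferred hosts for a split, e.g. HDFS block locations
```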
RDD storage levels: the RDD provides 11 storage levels based on combinations of the useDisk, useMemory, deserialized, and replication parameters. RDDs define a variety of operations; different kinds of data are abstracted by different RDD classes, and different operations are implemented by those RDDs.
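For example, a sketch of persisting at different storage levels, assuming `sc`; an RDD can only hold one storage level at a time, so the alternatives are shown commented out, and the choice of levels is illustrative.

```scala
import org.apache.spark.storage.StorageLevel

// Assumes a SparkContext `sc`; the path is illustrative.
val logs = sc.textFile("hdfs:///data/logs")

logs.persist(StorageLevel.MEMORY_ONLY)            // deserialized objects in memory (what cache() uses)
// logs.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized in memory, spilling to disk when needed
// logs.persist(StorageLevel.MEMORY_ONLY_2)       // replicate each in-memory partition on two nodes
```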
There are two ways to create an RDD:
- From input in a Hadoop file system (or another Hadoop-compatible storage system), e.g. HDFS.
- By deriving a new RDD from a parent RDD through a transformation.
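Both ways in a two-line sketch, assuming a SparkContext `sc` and an illustrative path:

```scala
// Assumes a SparkContext `sc`; the path is illustrative.
val lines   = sc.textFile("hdfs:///data/input.txt")  // 1. from a Hadoop-compatible file system
val lengths = lines.map(_.length)                    // 2. a new RDD derived from a parent RDD
```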
2. Spark on Mesos
Spark supports both local execution and Mesos clusters. Programs are developed against Spark in local mode; once debugging succeeds there, the same program can run directly on a Mesos cluster, and in theory nothing needs to change in the algorithm beyond the file locations. Spark's local mode supports multi-threading and has some standalone concurrency capability, but it is not very strong. In local mode results can be saved to the local or a distributed file system, while in Mesos mode they must be stored in a distributed or shared file system.
To run on the Mesos framework according to Mesos's specification and design, Spark implements two classes: a Spark scheduler, named MesosScheduler in Spark, and a Spark executor, named Executor in Spark. With these two classes, Spark can perform distributed computing through Mesos. Spark converts the RDD operations and MapReduce-style functions into a standard job and a series of tasks, which are submitted to the SparkScheduler; the SparkScheduler submits the tasks to the Mesos master, the master assigns them to different slaves, and finally the Spark Executor on each slave executes its assigned tasks one by one and returns the results to form a new RDD, or writes them directly to the distributed file system.
3. Transformations & Actions
An RDD supports two kinds of computation: transformations (whose return value is an RDD) and actions (whose return value is not an RDD).
Transformations (such as map, filter, groupBy, join, etc.) are lazy: a transformation that produces one RDD from another is not executed immediately. Spark only records that such an operation is required and does not run it; the computation actually starts only when an action is invoked. Actions (such as count, collect, save, etc.) return a result or write RDD data to the storage system; an action is the trigger that makes Spark start computing.
The essential difference between them: a transformation returns an RDD. Transformations use a chained-call design: one RDD is computed and transformed into another RDD, which can then be transformed again, and this process runs in a distributed way. An action does not return an RDD; it returns an ordinary Scala collection, a value, or nothing, and the result is either sent back to the driver program or written to the file system. The Spark Developer's Guide describes these two kinds of operations in more detail; they are at the heart of Spark-based development. The figure, slightly modified from Spark's official slides, clarifies the difference between the two.
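A short sketch of this laziness, assuming a SparkContext `sc` and illustrative paths: the transformations below only record lineage, and nothing is read or computed until an action runs.

```scala
// Assumes a SparkContext `sc`; paths are illustrative.
val lines  = sc.textFile("hdfs:///data/app.log")          // transformation: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))            // transformation: only recorded in the lineage

val n = errors.count()                                    // action: the chain above actually runs now
errors.saveAsTextFile("hdfs:///out/errors")               // action: writes the RDD to the storage system
```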
4. Lineage
Using memory to speed up data loading is also found in many other in-memory databases and caching systems. Spark's main distinction is how it handles fault tolerance (node failure / data loss) in a distributed computing environment. To guarantee the robustness of the data in an RDD, an RDD data set remembers how it evolved from other RDDs through its so-called lineage. Compared with the fine-grained memory-level backup or logging mechanisms of other systems, an RDD's lineage records coarse-grained transformations (filter, map, join, etc.). When part of an RDD's data is lost, the lineage provides enough information to recompute and recover the lost partitions. This coarse-grained data model limits where Spark can be applied, but at the same time it provides performance benefits over a fine-grained data model.
To address fault tolerance, RDD lineage dependencies are divided into two types: narrow dependencies and wide dependencies.
Narrow dependencies mean that each partition of a parent RDD is used by at most one partition of the child RDD: one parent partition corresponds to one child partition, or partitions of several parent RDDs correspond to one child partition; a single parent partition never corresponds to multiple child partitions. Wide dependencies mean that a child RDD partition depends on several or all partitions of the parent RDD, i.e. one parent partition corresponds to multiple child partitions. For wide dependencies, the inputs and outputs of a computation live on different nodes. The lineage method works well when the input nodes are intact and only the output node goes down, since recomputation can recover the data; otherwise it is ineffective, because a simple retry is not possible and Spark must trace back through the ancestors to see whether recomputation is feasible (this is what lineage means). The recomputation cost for narrow dependencies is much lower than for wide dependencies.
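The difference shows up in the lineage Spark prints: `map` and `flatMap` create narrow dependencies within one stage, while `reduceByKey` introduces a wide (shuffle) dependency and a new stage. A sketch assuming `sc` and an illustrative path:

```scala
// Assumes a SparkContext `sc`; the path is illustrative.
val words  = sc.textFile("hdfs:///data/input.txt").flatMap(_.split(" "))  // narrow dependency
val ones   = words.map(word => (word, 1))                                 // narrow dependency
val counts = ones.reduceByKey(_ + _)                                      // wide dependency: a shuffle

println(counts.toDebugString)  // the narrow steps share a stage; the ShuffledRDD marks the wide boundary
```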
In RDD computation, fault tolerance through checkpoints can be achieved in two ways: checkpointing the data, or logging the updates. Users can control which approach is used. The default is logging the updates, which recomputes lost partition data by recording all the transformations that generated the RDD, i.e. each RDD's lineage.
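A sketch of the data-checkpoint path, assuming a SparkContext `sc` and illustrative directories; without the two checkpoint calls you get the default lineage-based recomputation.

```scala
// Assumes a SparkContext `sc`; the directories are illustrative.
sc.setCheckpointDir("hdfs:///checkpoints")

val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.cache()       // avoid recomputing the RDD when it is written out as a checkpoint
counts.checkpoint()  // ask Spark to save the data itself instead of relying only on lineage
counts.count()       // the next action triggers both the computation and the checkpoint
```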
Spark resource management and job scheduling
For resource management and job scheduling, Spark can use local mode, standalone mode, Apache Mesos, or Hadoop YARN. Spark on YARN was introduced in Spark 0.6, but it only became truly usable in the current branch-0.8 release. Spark on YARN follows YARN's official specification, and thanks to Spark's well-designed Scheduler and Executor abstractions, supporting YARN is relatively easy.
Running Spark on YARN lets it share cluster resources with Hadoop and improves resource utilization.
Programming interface
Spark exposes RDD operations through a language-integrated API, similar to DryadLINQ and FlumeJava: each data set is represented as an RDD object, and operations on the data set are represented as operations on the RDD object. Spark's main programming language is Scala, chosen for its conciseness (Scala is easy to use interactively) and performance (a statically, strongly typed language on the JVM).
Spark's structure is similar to Hadoop MapReduce: it consists of a Master (similar to MapReduce's JobTracker) and Workers (Spark's slave worker nodes). A user-written Spark program is called the driver program. The driver program connects to the master and defines the transformations and actions on each RDD; these are expressed as Scala closures (function literals). Scala represents closures as Java objects, and they are serializable so that they can be shipped to the worker nodes. Workers store the data blocks and make use of the cluster's memory; a daemon running on each worker node receives operations on an RDD, performs localized operations on its data shards to produce new shards, and returns the result or writes the RDD to the storage system.
Scala: Spark is developed in Scala and uses Scala as its default programming language. Writing a Spark program is much easier than writing a Hadoop MapReduce program. Spark provides spark-shell, in which you can test your program interactively. The general steps for writing a Spark program are: create or obtain a SparkContext instance, use the SparkContext to create RDDs, and then operate on those RDDs. Java: Spark supports Java programming, but there is no Java equivalent of the handy spark-shell; otherwise programming is the same as in Scala. Since both languages run on the JVM, Scala and Java interoperate, and the Java programming interface is in fact a wrapper around the Scala one. Python: Spark also provides a Python programming interface. Spark uses py4j to achieve interoperability between Python and Java so that Spark programs can be written in Python. Spark also provides pyspark, a Python shell for Spark, which lets you write Spark programs interactively in Python.
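A minimal standalone Scala program following those three steps; a word-count sketch in which the app name, master setting, and paths are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A word count following the three steps; app name, master and paths are illustrative.
object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. Create a SparkContext.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // 2. Use the SparkContext to create an RDD.
    val lines = sc.textFile("hdfs:///data/input.txt")

    // 3. Operate on the RDD: transformations followed by an action.
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///out/wordcount")

    sc.stop()
  }
}
```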
Spark ecosystem
Shark (Hive on Spark): Shark provides a HiveQL command interface like Hive's, built on the Spark framework. To maintain maximum compatibility with Hive, Shark uses Hive's API for query parsing and logical plan generation; only the final physical-plan execution phase replaces Hadoop MapReduce with Spark. Through configuration parameters, Shark can automatically cache specific RDDs in memory for reuse, which speeds up retrieval of particular data sets. Shark also supports specific data-analysis and learning algorithms through user-defined functions (UDFs), so SQL queries and computational analysis can be combined to maximize RDD reuse.

Spark Streaming: a framework for processing stream data built on Spark. The basic idea is to split the stream into small time windows (a few seconds) and process each small chunk of data in a batch-like manner. Spark Streaming is built on Spark because, on the one hand, Spark's low-latency execution engine (latency on the order of 100 ms) can be used for near-real-time computation, and on the other hand RDD data sets make fault-tolerant processing easier and more efficient than record-based frameworks such as Storm. In addition, the small-batch approach makes the logic and algorithms compatible across batch and real-time processing, which is convenient for applications that need joint analysis of historical and real-time data.

Bagel: Pregel on Spark, which lets Pregel-style graph computations run on Spark; a small but very useful project. Bagel comes with an example that implements Google's PageRank algorithm.
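To illustrate the micro-batch idea, a minimal Spark Streaming word-count sketch using the DStream API; the host, port, and batch interval are illustrative assumptions, not taken from the original text.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Word count over a socket stream; host, port and batch interval are illustrative.
object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(2))   // split the stream into 2-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()                                      // each batch is processed like a small RDD job

    ssc.start()
    ssc.awaitTermination()
  }
}
```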
Scenarios where Spark applies
Spark is a memory-based, iterative computing framework suited to applications that operate on a specific data set repeatedly. The more often the data is revisited and the larger the amount of data read, the greater the benefit; with small data sets and low computational intensity the benefit is relatively small. Because of the nature of RDDs, Spark is not suited to applications that perform asynchronous, fine-grained updates to state, such as the storage layer of a web service or an incremental web crawler and indexer; in other words, it does not fit applications whose model is incremental modification. Overall, though, Spark is broadly applicable and fairly general.
Use in industry
The Spark project started in 2009 and was open-sourced in 2010. Current users include Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research and others, as well as Taobao; Douban uses DPark, a Python clone of Spark.
Reference: http://spark.apache.org/