Spark Overview
Spark is a general-purpose engine for large-scale data processing. Put simply, Spark is a distributed processing framework for big data.
Spark is a distributed computing framework based on the MapReduce model, but its intermediate output and final results can be kept in memory, so there is no need to repeatedly read from and write to HDFS. This makes Spark better suited to data mining and machine learning workloads, such as algorithms that require iterative MapReduce-style computation.
Spark Ecosystem (BDAS)
Berkeley calls Spark's entire ecosystem the Berkeley Data Analytics Stack (BDAS). Its core framework is Spark, and around it BDAS includes Spark SQL, a query engine supporting SQL queries and analysis over structured data; MLbase, which provides machine learning functionality, together with the underlying distributed machine learning library MLlib; the parallel graph computation framework GraphX; the stream computing framework Spark Streaming; the sampling-based approximate query engine BlinkDB; the in-memory distributed file system Tachyon; the resource management framework Mesos; and other subprojects. These subprojects provide higher-level, richer computational paradigms on top of Spark.
(1) Spark
Spark is the core component of BDAS. It is a distributed programming framework for big data that not only implements the map and reduce operators and the computation model of MapReduce, but also provides richer operators such as filter, join, and groupByKey. Spark abstracts distributed data as Resilient Distributed Datasets (RDDs), implements task scheduling, RPC, serialization, and compression, and provides APIs for the upper-level components that run on top of it. Its lower layers are written in Scala, a functional language, and its API borrows deeply from Scala's functional programming ideas, providing a programming interface similar to Scala's. Figure 1-2 shows the Spark processing flow (the central object is the RDD).
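A minimal Scala sketch of these operators follows (a rough illustration only; the application name, local master setting, and sample data are invented for the example and do not come from the source):

import org.apache.spark.{SparkConf, SparkContext}

object RddOperatorsSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative local setup; a real cluster would use a different master URL.
    val conf = new SparkConf().setAppName("rdd-operators-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Turn a small in-memory collection into an RDD of (user, score) pairs.
    val events = sc.parallelize(Seq(("user1", 3), ("user2", 5), ("user1", 7), ("user3", 2)))

    // Operators beyond plain map/reduce: filter, groupByKey, join.
    val large   = events.filter { case (_, score) => score > 2 }            // keep scores above 2
    val grouped = large.groupByKey()                                        // collect scores per user
    val names   = sc.parallelize(Seq(("user1", "Alice"), ("user2", "Bob")))
    val joined  = grouped.join(names)                                       // (user, (scores, name))

    joined.collect().foreach(println)
    sc.stop()
  }
}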
Spark partitions data across the distributed environment, transforms the job into a DAG (directed acyclic graph), and then performs DAG scheduling and staged, distributed, parallel task processing.
(2) Shark
Shark is a data warehouse built on top of Spark and Hive. Shark has since completed its academic mission and its development has been terminated, but its structure and principles remain instructive. It provides a set of SQL interfaces for querying data stored in Hive and is compatible with existing Hive QL syntax, so users familiar with Hive QL or SQL can run fast ad-hoc queries, reports, and other SQL workloads on Shark. At the bottom, Shark reuses Hive's parser, optimizer, metadata store, and serialization interfaces, and compiles Hive QL into a set of Spark tasks for distributed execution.
(3) Spark SQL
Spark SQL provides SQL query functionality over big data and plays a role similar to Shark's in the ecosystem; both can be broadly described as SQL on Spark. Previously, Shark's query compilation and optimization relied on Hive, which forced Shark to maintain a Hive branch, whereas Spark SQL uses Catalyst for query parsing and optimization and implements the SQL operators on Spark as its underlying execution engine. Users can write SQL directly on Spark, which effectively extends Spark with a set of SQL operators and enriches its functionality. Spark SQL is also compatible with different persistent stores (such as HDFS and Hive), which gives it broad room to grow.
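As a small sketch of writing SQL directly on Spark (this uses the SparkSession entry point from later Spark releases; the table name and rows are invented for illustration):

import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")          // illustrative; a cluster master would normally be used
      .getOrCreate()
    import spark.implicits._

    // Register an in-memory DataFrame as a temporary view so it can be queried with SQL.
    val people = Seq(("Alice", 34), ("Bob", 29), ("Carol", 41)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // Catalyst parses and optimizes the query; Spark executes the resulting operators.
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}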
(4) Spark Streaming
Spark Streaming divides streaming data into RDDs by a specified time slice and then processes each RDD as a batch, enabling large-scale stream processing. Its throughput can exceed that of Storm, the existing mainstream stream processing framework, and it provides a rich API for computations over streaming data.
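A rough sketch of this micro-batch model, assuming a hypothetical text socket source on localhost:9999 and 5-second batches:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")

    // Each 5-second slice of the stream becomes one RDD that is processed as a batch.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical text source; every batch of lines is handled with RDD-style operators.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}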
(5) GraphX
GraphX is based on the BSP model. It wraps a Pregel-like interface on top of Spark for large-scale, bulk-synchronous global graph computation. The advantage of Spark's in-memory computation is especially clear when the user runs many iterations.
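For instance, an iterative algorithm such as PageRank can be run on GraphX in a few lines (a sketch only; the edge-list path below is a hypothetical placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object GraphxSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

    // Hypothetical edge list file: one "srcId dstId" pair per line.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///tmp/edges.txt")

    // Iterative computation benefits from Spark keeping the graph in memory across iterations.
    val ranks = graph.pageRank(0.001).vertices
    ranks.take(10).foreach(println)

    sc.stop()
  }
}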
(6) Tachyon
Tachyon is a distributed in-memory file system that can be thought of as an in-memory HDFS. To deliver higher performance, it moves data storage off the Java heap. Based on Tachyon, users can share RDDs or files across applications, and a high fault-tolerance mechanism guarantees the reliability of the data.
(7) Mesos
Mesos is a resource management framework that provides functionality similar to YARN. Users can run Spark, MapReduce, Tez, and other computing framework tasks on it. Mesos isolates resources and tasks from each other and implements efficient resource and task scheduling.
(8) BlinkDB
BlinkDB is an approximate query engine for interactive SQL on massive data. It lets users trade query accuracy against response time, completing approximate queries whose accuracy is kept within an allowed error range. To achieve this, the core idea of BlinkDB is to build and maintain a set of multidimensional samples of the raw data over time through an adaptive optimization framework, select samples of an appropriate size with a dynamic sample-selection strategy, and then serve user queries according to the requested accuracy and response time.
What Spark Relies On
(1) The MapReduce model
As a distributed computing framework, Spark adopts the MapReduce model. The traces of Google's MapReduce and Hadoop are obvious in it; clearly this is not a radical innovation but an incremental one. Without changing the basic idea, Spark borrows from, imitates, and relies on its predecessors, adds some improvements, and greatly increases the efficiency of MapReduce.
The biggest advantage of using the MapReduce model to solve large-scale parallel computing problems is that Spark belongs to the same family as Hadoop. Because both use the MapReduce parallel programming model rather than MPI, OpenMP, or other models, a complex algorithm that can be expressed in Java and run on Hadoop can also be expressed in Scala and run on Spark, with a severalfold increase in speed. By contrast, porting algorithms between MPI and Hadoop is much harder.
(2) Functional programming
Spark is written in Scala, and its natively supported language is Scala. One reason is that Scala supports functional programming. This makes the Spark code itself concise, and it also makes programs developed on Spark particularly concise. A complete MapReduce job in Hadoop requires creating a Mapper class and a Reducer class, while Spark only needs the corresponding map and reduce functions, which greatly reduces the amount of code.
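A minimal sketch of this contrast: a complete word count expressed as a chain of functions rather than separate Mapper and Reducer classes (the input and output paths are invented placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount-sketch").setMaster("local[*]"))

    // The whole job is a chain of functions; no Mapper or Reducer classes are needed.
    sc.textFile("hdfs:///tmp/input.txt")          // hypothetical input path
      .flatMap(_.split("\\s+"))                   // "map" side: split lines into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                         // "reduce" side: sum the counts per word
      .saveAsTextFile("hdfs:///tmp/wordcount")    // hypothetical output path

    sc.stop()
  }
}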
(3) Mesos
Spark hands the problems of running in a distributed environment over to Mesos and does not have to worry about them itself, which is one reason its code can stay streamlined.
(4) HDFS and S3
Spark supports two distributed storage systems: HDFS and S3, which should be regarded as the two most mainstream options today. The read and write functions for these file systems are implemented by Spark in a distributed fashion on top of Mesos. If you want to run a cluster test but have neither an HDFS environment nor an EC2 environment, you can set up an NFS share that every Mesos slave can access, which also works as a simulation.
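A small sketch of reading from both storage systems through the same RDD API (the URIs below are placeholders, and the exact S3 scheme, s3n or s3a, depends on the Hadoop build in use):

import org.apache.spark.{SparkConf, SparkContext}

object StorageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("storage-sketch"))

    // The storage system is selected by the URI scheme of the path.
    val fromHdfs = sc.textFile("hdfs://namenode:8020/data/events.log")
    val fromS3   = sc.textFile("s3n://my-bucket/data/events.log")

    println(s"HDFS lines: ${fromHdfs.count()}, S3 lines: ${fromS3.count()}")
    sc.stop()
  }
}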
Spark Architecture
The Spark architecture uses the master-slave model common in distributed computing. The master is the node in the cluster that runs the Master process, and the slaves are the nodes that run the Worker processes. The Master, as the controller of the whole cluster, is responsible for the normal operation of the entire cluster; Workers are the compute nodes, which receive commands from the master node and report their status; Executors are responsible for executing tasks; the Client is responsible for submitting applications; and the Driver is responsible for controlling the execution of an application.
After a Spark cluster is deployed, the Master process and Worker processes are started on the master node and the slave nodes respectively to control the whole cluster. During the execution of a Spark application, the Driver and the Workers are the two important roles. The Driver program is the starting point of the application logic and is responsible for job scheduling, that is, the distribution of tasks, while the Workers manage the compute nodes and create Executors to process tasks in parallel. During the execution phase, the Driver ships the tasks, along with the files and JARs they depend on, to the corresponding Worker machines, and the Executors process the tasks for their assigned data partitions.
Spark's overall flow is: the Client submits an application; the Master finds a Worker on which to start the Driver; the Driver requests resources from the Master or the resource manager and then converts the application into an RDD graph; the DAGScheduler converts the RDD graph into a directed acyclic graph of stages and submits them to the TaskScheduler, which submits the tasks to Executors for execution. During task execution, the other components cooperate to ensure the smooth execution of the whole application.
The basic components of the Spark architecture are detailed in the appendix to this section.
Spark Run Logic
For an RDD there are two types of operations: transformations and actions. Their essential difference is as follows (see the sketch after this list):
A transformation's return value is still an RDD. Transformations use a chained-call design pattern: one RDD is computed and converted into another RDD, which can then be converted again. This process is distributed.
An action's return value is not an RDD. It is either an ordinary Scala collection, a single value, or nothing, and in the end it is either returned to the driver program or the RDD is written to the file system.
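A minimal sketch of the difference (RDD transformations are evaluated lazily, so the chain builds up until an action runs; the sample data is made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object TransformationActionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("transformation-action-sketch").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 1000)

    // Transformations: each call returns a new RDD; nothing is computed yet.
    val evens   = numbers.filter(_ % 2 == 0)
    val squared = evens.map(n => n * n)

    // Actions: these return ordinary Scala values (or write to storage) and trigger execution.
    val total = squared.reduce(_ + _)    // a single value returned to the driver program
    val first = squared.take(5)          // an ordinary Scala collection
    println(s"sum = $total, first five = ${first.mkString(", ")}")

    sc.stop()
  }
}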
As the figure above shows, in a Spark application the whole execution flow logically forms a directed acyclic graph (DAG). After an action operator is triggered, all the accumulated operators form a DAG, and the scheduler then schedules the tasks in the graph for execution. Spark's scheduling differs from MapReduce's: Spark divides the work into stages according to the different dependency relationships between RDDs, and one stage contains a series of functions executed as a pipeline. A, B, C, D, E, and F in the figure represent RDDs, and the blocks inside each RDD represent partitions. Data is read from HDFS into Spark to form RDD A and RDD C; RDD C undergoes a map operation and is converted into RDD D; RDD B and RDD E undergo a join operation and are converted into F, with a shuffle performed when B and E are joined into F. Finally, RDD F is output through the saveAsSequenceFile function and saved to HDFS.
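A rough code sketch of the flow described above (the lineage of B and E, the input paths, and the key format are assumptions made only to mirror the figure):

import org.apache.spark.{SparkConf, SparkContext}

object DagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dag-sketch").setMaster("local[*]"))

    // Hypothetical key-value inputs read from HDFS to form RDD A and RDD C.
    val rddA = sc.textFile("hdfs:///input/a").map(line => (line.split(",")(0), line))
    val rddC = sc.textFile("hdfs:///input/c").map(line => (line.split(",")(0), line))

    // Narrow (map/filter) dependencies stay inside one stage.
    val rddD = rddC.map { case (k, v) => (k, v.toUpperCase) }
    val rddB = rddA.filter { case (k, _) => k.nonEmpty }
    val rddE = rddD

    // join needs a shuffle, so the scheduler cuts a new stage here.
    val rddF = rddB.join(rddE)

    // The action triggers the whole DAG and writes RDD F back to HDFS.
    rddF.map { case (k, (b, e)) => (k, s"$b|$e") }
        .saveAsSequenceFile("hdfs:///output/f")

    sc.stop()
  }
}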
Spark on Mesos
In order to run on the Mesos framework and follow the Mesos specification and design, Spark implements two classes: one is SparkScheduler, whose class name in Spark is MesosScheduler; the other is SparkExecutor, whose class name in Spark is Executor. With these two classes, Spark can run in a distributed fashion through Mesos.
Spark transforms RDDs and MapReduce functions into a standard job and a series of tasks and submits them to the SparkScheduler. The SparkScheduler submits the tasks to the Mesos master, which assigns them to different slaves; finally, the Spark Executor in each slave executes the assigned tasks one by one and returns the results, composing a new RDD or writing directly to the distributed file system.
Spark on YARN
Spark on YARN allows the Spark computing model to run on a YARN cluster, directly reading the data stored there and enjoying the rich computing resources of the YARN cluster.
The Spark on YARN architecture is analyzed as follows:
For a Spark job on YARN, the client first generates the job information and submits it to the ResourceManager. When a NodeManager reports in, the ResourceManager assigns the ApplicationMaster to that NodeManager, which then starts the SparkAppMaster. Once started, the SparkAppMaster initializes the job and requests resources from the ResourceManager. After the resources are granted, the SparkAppMaster asks the NodeManagers, via RPC, to start the corresponding SparkExecutors, and the SparkExecutors report to the SparkAppMaster and complete their assigned tasks. In addition, the SparkClient obtains the job's running state through the AppMaster.
Appendix: basic components in the Spark architecture
ClusterManager: In standalone mode this is the Master (master node), which controls the whole cluster and monitors the Workers. In YARN mode it is the ResourceManager.
Worker: A slave node, responsible for controlling a compute node and starting Executors or the Driver. In YARN mode this is the NodeManager, which is responsible for controlling the compute node.
Driver: Runs the application's main() function and creates the SparkContext.
Executor: The executor, a component on a Worker node that executes tasks; it starts a thread pool to run them. Each application has its own independent set of Executors.
SparkContext: The context of the whole application, controlling the application's life cycle.
RDD: Spark's basic computational unit; a group of RDDs to be operated on forms a directed acyclic graph, the RDD Graph.
DAGScheduler: Builds a stage-based DAG from a job and submits the stages to the TaskScheduler.
TaskScheduler: Distributes tasks to Executors for execution.
SparkEnv: A thread-level context that stores references to important runtime components. SparkEnv creates and contains references to the following important components.
MapOutputTracker: Responsible for storing shuffle metadata.
BroadcastManager: Responsible for controlling broadcast variables and storing their metadata.
BlockManager: Responsible for storage management, and for creating and looking up blocks.
MetricsSystem: Monitors runtime performance metrics.
SparkConf: Responsible for storing configuration information.