First, what is Spark?
1. Relationship with Hadoop
Today, Hadoop can no longer be called software in the narrow sense; it is widely regarded as a complete ecosystem that includes HDFS, MapReduce, HBase, Hive, and so on.
Spark, by contrast, is a computational framework; note that it is only a computational framework.
It can run on top of Hadoop, and in most deployments it is backed by HDFS.
It does not replace Hadoop itself; rather, it replaces MapReduce within Hadoop to address some of the problems MapReduce brings.
More specifically, Spark is a memory-based parallel computing framework for big data that can be seamlessly integrated into the Hadoop ecosystem.
Any distributed framework must address two issues:
1. Scalability
2. Fault tolerance
How Spark solves these two issues will be covered in a later post.
2. What are the advantages and disadvantages of Spark's computational model relative to the MapReduce iterative model?
Advantages:
(1) Memory-based, so computation is fast
In iterative processing, RDD operators build a DAG, so intermediate data does not need to be written to disk between iterations (see the sketch after this list)
(2) Execution strategy based on the DAG
Only an action triggers execution of a job; the execution flow of each job is recorded, forming the lineage, and the job is divided into stages, etc.
(3) Uses Akka as an event-driven framework to dispatch tasks, with little overhead
(4) Full-stack support
Disadvantages:
(1) Higher machine configuration requirements than MapReduce
(2) Trades hardware (mainly memory) for performance
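As a rough illustration of points (1) and (2) above, here is a minimal sketch, assuming a local run and a hypothetical input file: the transformations only build the DAG, the parsed data is cached in memory, and each count action reuses that cached data instead of re-reading disk between iterations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("iterative-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Hypothetical input path; the transformations below are lazy and only build the DAG.
    val points = sc.textFile("/tmp/points.txt")
      .map(_.split(",").map(_.toDouble))
      .cache()                          // keep the parsed data in memory across iterations

    var threshold = 10.0
    for (_ <- 1 to 5) {
      // Each 'count' is an action: it triggers a job over the cached in-memory data,
      // so nothing is re-read from disk between iterations (unlike MapReduce).
      val inliers = points.filter(_.sum < threshold).count()
      println(s"threshold=$threshold inliers=$inliers")
      threshold /= 2
    }

    sc.stop()
  }
}
```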
3. What can Spark bring?
(1) A full-stack, multi-paradigm computing platform: not only simple MapReduce-style operations, but also SQL queries, stream computing, machine learning, and graph algorithms
Everything you want is here ~
(2) Lightweight, fast processing: memory-based
(3) Multi-language support and rich operators, allowing interactive computing in the shell; writing a distributed program feels like writing a standalone program (this is what Spark was born for); see the word-count sketch after this list
(4) Compatible with storage layers such as HDFS; it can run standalone or under YARN and other cluster managers, and can read and use any Hadoop data
Almost too good to be true, isn't it?
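To illustrate points (3) and (4), here is a minimal word-count sketch of the kind you might type into spark-shell; the HDFS path and namenode address are assumptions for the example, not taken from this post.

```scala
// Inside spark-shell, 'sc' (the SparkContext) is already available.
// Hypothetical HDFS path; any Hadoop-readable location works the same way.
val lines  = sc.textFile("hdfs://namenode:9000/data/input.txt")
val counts = lines
  .flatMap(_.split("\\s+"))        // split lines into words
  .map(word => (word, 1))          // pair each word with a count of 1
  .reduceByKey(_ + _)              // sum counts per word (this introduces a shuffle)

counts.take(10).foreach(println)   // action: triggers the distributed job
```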
Second, the Spark ecosystem: BDAS (Berkeley Data Analytics Stack)
Spark can also exist outside of Hadoop and has its own ecosystem.
The main components are listed below:
1. Core framework: Spark
Provides a distributed programming framework
Provides rich operators and computational models beyond MapReduce
Abstracts distributed data as Resilient Distributed Datasets (RDDs)
2. SQL query and analysis engine for structured data: Spark SQL
SQL statements can be executed directly
Also provides a rich API beyond plain SQL (see the sketch after this list)
Operations are ultimately RDD-based
3. Distributed machine learning library: MLlib
4. Parallel graph computation framework: GraphX
5. Stream computing framework: Spark Streaming
Divides real-time data into small batches by a specified time slice
6. Approximate query engine: BlinkDB
Approximate queries for interactive SQL
Allows users to trade off query accuracy against query response time
7. In-memory distributed file system: Tachyon
Essentially an in-memory HDFS
8. Resource management framework: Mesos
Provides YARN-like functionality
9. SparkR, a new feature in Spark 1.4
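As a small illustration of item 2 (Spark SQL), the sketch below uses the Spark 1.x-era SQLContext that spark-shell provides; the table, column names, and data are made up for the example.

```scala
// Inside spark-shell (Spark 1.x), 'sc' and 'sqlContext' are already provided.
import sqlContext.implicits._

// A small DataFrame built from an RDD of tuples; names and ages are made up.
val people = sc.parallelize(Seq(("Alice", 29), ("Bob", 35), ("Carol", 41)))
  .toDF("name", "age")

// Either the DataFrame API ...
people.filter($"age" > 30).show()

// ... or a plain SQL statement over a registered temporary table.
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()
```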
Third, Spark architecture
1. Framework composition
Core concepts in a Spark cluster (a code sketch follows the list):
(1) Master
The node in the cluster that runs the master process
Responsible for cluster coordination and management
Does not itself participate in computational tasks
Its role corresponds to the ResourceManager when running on YARN
(2) Slaves
Nodes in the cluster that run worker processes
Accept commands from the Master and report their status
The worker process itself does not execute computational tasks (executors do)
Their role corresponds to the NodeManager when running on YARN
(3) Driver
Responsible for controlling the execution of the application submitted by the client
Executes the program's main function and creates the SparkContext
Distributes tasks to executors on specific workers
Ships the files and JAR packages required for task execution (after serialization) to the worker nodes
(4) SparkContext: the context of the entire application, controlling the application's life cycle
RDD: the basic computing unit, providing a rich set of operators; a group of RDDs can be executed as a DAG
DAGScheduler: takes the DAG as input and, based on the dependencies between RDDs, divides it into stages as output
TaskScheduler: takes stages as input, divides them into smaller tasks, and distributes the tasks to specific executors for execution
SparkEnv: stores references to important runtime components, including:
=> MapOutputTracker: responsible for storing shuffle meta-information
=> BroadcastManager: responsible for controlling broadcast variables and storing their meta-information
=> BlockManager: responsible for storage management and the creation and lookup of blocks
=> MetricsSystem: monitors runtime performance metrics
=> SparkConf: responsible for storing configuration information
(5) Client
The tool through which users submit applications
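A minimal sketch of how these pieces meet in code, assuming a standalone cluster at a hypothetical master address: the driver is just an ordinary main() that creates the SparkContext described in (4), which registers with the Master and gets executors started on the Workers.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The driver's main() creates the SparkContext, which registers the application
    // with the Master (hypothetical address) and asks it to start executors on the Workers.
    val conf = new SparkConf()
      .setAppName("driver-sketch")
      .setMaster("spark://master-host:7077")   // or a YARN master on a YARN cluster
      .set("spark.executor.memory", "2g")      // resources granted on the Workers

    val sc = new SparkContext(conf)

    // From here on, RDD operations are shipped as tasks to executors on the Workers;
    // the Master itself never runs them.
    println(sc.parallelize(1 to 1000).sum())

    sc.stop()
  }
}
```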
2. Spark task execution flow (abbreviated version)
(1) The client submits the application
(2) The Master finds a worker and starts the Driver
(3) The Driver requests resources from the Master
(4) Operations on RDDs form a DAG, which is handed to the DAGScheduler
(5) The DAGScheduler divides the DAG into stages and outputs them to the TaskScheduler
(6) The TaskScheduler divides the stages into tasks and distributes them to executors on the worker nodes (see the sketch below)
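The annotated sketch below (made-up data, assuming an existing SparkContext as in spark-shell) maps roughly onto steps (4) through (6): transformations only build the DAG, the shuffle introduced by reduceByKey is where the DAGScheduler cuts a stage boundary, and the final action hands the resulting task sets to the TaskScheduler for execution on the executors.

```scala
// Assumes an existing SparkContext 'sc' (e.g. inside spark-shell).
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

// Stage 1: narrow transformations, pipelined together within each partition.
val pairs = words.map(word => (word, 1))

// reduceByKey needs a shuffle, so the DAGScheduler places a stage boundary here.
val counts = pairs.reduceByKey(_ + _)

// The action triggers the job: DAGScheduler -> stages -> TaskScheduler -> executors.
counts.collect().foreach(println)
```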
Fourth, similarities and differences between Spark's distributed architecture and a single-machine architecture
Basic concepts:
(1) Spark is a distributed computing framework
(2) On top of it, one can write distributed programs and software
Points to note when writing distributed programs:
Memory and disk are not shared across nodes, unlike in a single-machine program (see the sketch below)
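As an illustration of that last point, a common pitfall: a local variable captured by a closure is copied to each executor, so mutating it does not behave like shared memory on a single machine; Spark's accumulators (shown here with the 1.x API) are one sanctioned way to aggregate a value across the cluster. A minimal sketch, assuming an existing SparkContext sc:

```scala
// Assumes an existing SparkContext 'sc'.
var localCounter = 0
sc.parallelize(1 to 100).foreach(_ => localCounter += 1)
// Typically still 0 on the driver: each executor mutated its own deserialized copy
// (behavior can differ in local mode, where tasks may share the driver's JVM).
println(localCounter)

// An accumulator is aggregated back to the driver, so it works across the cluster.
val acc = sc.accumulator(0)          // Spark 1.x accumulator API
sc.parallelize(1 to 100).foreach(_ => acc += 1)
println(acc.value)                   // 100
```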