Apache Spark is an open source cluster computing system that aims to make data analytics fast: both fast to run and fast to write.
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.
Spark Components | VS. | Hadoop Components
Spark Core | <------> | Apache Hadoop MR
Spark Streaming | <------> | Apache Storm
Spark SQL | <------> | Apache Hive
Spark GraphX | <------> | MPI (Taobao)
Spark MLlib | <------> | Apache Mahout
BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars.
Two key ideas:
- An adaptive optimization framework that builds and maintains a set of multi-dimensional samples from the original data over time
- A dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy and/or response time requirements.
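The sampling idea can be illustrated with a small, self-contained Scala sketch. This is not BlinkDB's algorithm: it assumes plain uniform samples, a normal-approximation error bar (1.96 standard errors), and a hypothetical selection rule that picks the smallest pre-built sample meeting an error target. The names `ApproxQuerySketch`, `approxAvg`, and `selectSample` are made up for illustration.

```scala
object ApproxQuerySketch {
  // Result of an approximate AVG: the estimate, a 95% error-bar half-width,
  // and the number of sampled rows that produced it.
  final case class Estimate(mean: Double, errorBar: Double, sampleSize: Int)

  // Estimate the mean of a column from a sample. The error bar uses the
  // normal approximation: 1.96 * (sample standard deviation / sqrt(n)).
  def approxAvg(sample: Seq[Double]): Estimate = {
    require(sample.size >= 2, "need at least two rows to estimate variance")
    val n = sample.size
    val mean = sample.sum / n
    val variance = sample.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    val stdErr = math.sqrt(variance / n)
    Estimate(mean, 1.96 * stdErr, n)
  }

  // Hypothetical selection rule: evaluate the pre-built samples from smallest
  // to largest and return the first estimate whose error bar meets the target.
  // If none qualifies, fall back to the largest sample.
  def selectSample(samples: Seq[Seq[Double]], maxError: Double): Estimate = {
    require(samples.nonEmpty, "need at least one sample")
    val ordered = samples.sortBy(_.size).map(approxAvg)
    ordered.find(_.errorBar <= maxError).getOrElse(ordered.last)
  }
}
```

A caller with a loose error target gets an answer from a small sample, while a tight target pushes the query onto a larger one; that is the accuracy versus response time trade-off in miniature.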
Why Spark is fast:
- In-memory computing
- Directed acyclic graph (DAG) engine: the engine can see the whole computation graph in advance, so it can optimize it
- Delay scheduling
Resilient Distributed Dataset (RDD). Each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
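The five properties can be written down as a small interface. The sketch below is a toy model in plain Scala, not Spark's `RDD` class: `ToyRdd`, `SeqRdd`, `MappedRdd`, and `Partition` are hypothetical names, everything runs locally, and the partitioner is simplified to an optional function from key to partition index.

```scala
object RddSketch {
  // One slice of the dataset; `index` identifies it within its RDD.
  final case class Partition(index: Int)

  // Toy stand-in for an RDD, with one member per property in the list above.
  trait ToyRdd[T] {
    def partitions: Seq[Partition]                 // 1. list of partitions
    def compute(split: Partition): Iterator[T]     // 2. function for computing each split
    def dependencies: Seq[ToyRdd[_]]               // 3. dependencies on other RDDs
    def partitioner: Option[Any => Int] = None     // 4. optional partitioner (key-value RDDs)
    def preferredLocations(split: Partition): Seq[String] = Nil // 5. optional preferred locations

    // Convenience for this sketch only: evaluate every partition locally, in order.
    def collect(): Seq[T] = partitions.flatMap(p => compute(p).toSeq)
  }

  // Source RDD: an in-memory sequence dealt round-robin into `numSlices` partitions.
  final class SeqRdd[T](data: Seq[T], numSlices: Int) extends ToyRdd[T] {
    require(numSlices > 0, "numSlices must be positive")
    val partitions: Seq[Partition] = (0 until numSlices).map(Partition(_))
    def compute(split: Partition): Iterator[T] =
      data.iterator.zipWithIndex.collect {
        case (x, i) if i % numSlices == split.index => x
      }
    val dependencies: Seq[ToyRdd[_]] = Nil // a source has no parents
  }

  // Derived RDD: reuses the parent's partitions and computes each split
  // by mapping over the parent's corresponding split.
  final class MappedRdd[A, B](parent: ToyRdd[A], f: A => B) extends ToyRdd[B] {
    def partitions: Seq[Partition] = parent.partitions
    def compute(split: Partition): Iterator[B] = parent.compute(split).map(f)
    val dependencies: Seq[ToyRdd[_]] = Seq(parent)
  }
}
```

Because `MappedRdd` keeps only a reference to its parent and a function, a lost partition can be rebuilt by calling `compute` again on that split, which is the intuition behind the "resilient" in the name.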
Storage strategy
    class StorageLevel private(
        private var useDisk_ : Boolean,
        private var useMemory_ : Boolean,
        private var deserialized_ : Boolean,
        private var replication_ : Int = 1)

    // useDisk = false, useMemory = true, deserialized = true, replication = 1 (default)
    val MEMORY_ONLY = new StorageLevel(false, true, true)
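To make the constructor flags easier to read, here is a toy mirror of them in plain Scala. `StorageLevelSketch`, `Level`, and `describe` are hypothetical names and do not exist in Spark. Only the memory-only flag combination comes from the excerpt above; the other two constants are assumptions about how the same flags would encode disk-only and memory-plus-disk storage.

```scala
object StorageLevelSketch {
  // Toy mirror of the four constructor parameters in the excerpt above.
  final case class Level(useDisk: Boolean, useMemory: Boolean,
                         deserialized: Boolean, replication: Int = 1)

  // Same flag values as the memory-only level in the excerpt.
  val MEMORY_ONLY = Level(useDisk = false, useMemory = true, deserialized = true)
  // Assumed combinations, shown only to illustrate the flags.
  val DISK_ONLY = Level(useDisk = true, useMemory = false, deserialized = false)
  val MEMORY_AND_DISK = Level(useDisk = true, useMemory = true, deserialized = true)

  // Turn a flag combination into a human-readable description.
  def describe(l: Level): String = {
    val where = (l.useMemory, l.useDisk) match {
      case (true, true)   => "memory, spilling to disk"
      case (true, false)  => "memory only"
      case (false, true)  => "disk only"
      case (false, false) => "not stored"
    }
    val form = if (l.deserialized) "deserialized objects" else "serialized bytes"
    s"$where; $form; ${l.replication}x replicated"
  }
}
```

In Spark itself a storage level is chosen per RDD, typically by passing one of the predefined `StorageLevel` constants to `persist`.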
RDD, Transformation & Action
Lazy evaluation
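The split between transformations and actions, and the lazy evaluation that follows from it, can be sketched without Spark. The toy class below is an illustration under simplifying assumptions (single machine, no partitions, no caching); `LazySketch` and `LazyData` are hypothetical names. Transformations such as `map` and `filter` only extend a stored recipe, and actions such as `collect` and `count` are what actually run it.

```scala
object LazySketch {
  // A toy dataset that stores a recipe for producing its elements,
  // not the elements themselves.
  final class LazyData[T] private (recipe: () => Iterator[T]) {
    // Transformations: return a new LazyData with an extended recipe.
    // No element is touched at this point.
    def map[U](f: T => U): LazyData[U] = new LazyData(() => recipe().map(f))
    def filter(p: T => Boolean): LazyData[T] = new LazyData(() => recipe().filter(p))

    // Actions: run the whole recipe and return a concrete result.
    def collect(): List[T] = recipe().toList
    def count(): Int = recipe().size
  }

  object LazyData {
    // Entry point: wrap an existing local collection.
    def from[T](data: Seq[T]): LazyData[T] = new LazyData(() => data.iterator)
  }
}
```

In this sketch every action reruns the full recipe from the source, so calling `collect()` twice does the work twice. Spark RDDs behave the same way unless they are persisted, which is where the storage levels come in.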
Spark Big Data Platform