Overview:
Spark is an open-source cluster computing system based on in-memory computing, designed to make data analysis faster.
Spark is small: it was developed by a small team at the AMP Lab at the University of California, Berkeley,
and the core of the project is written in Scala, in only 63 Scala files. (The AMP Lab name is a small point of interest:
AMP stands for Algorithms, Machines, People.)
Spark is an open-source cluster computing environment similar to Hadoop, but there are differences between the two,
and these differences give Spark an advantage in certain workloads. In other words,
Spark enables in-memory distributed datasets, which, in addition to supporting interactive queries, can also optimize iterative
workloads.
Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop,
Spark is tightly integrated with Scala, so Scala code can manipulate distributed datasets
as easily as local collection objects.
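To illustrate the point above, here is a conceptual sketch in plain Python (not the real Spark or Scala API): a dataset split into partitions, as if each lived on a different node, is still processed with the same familiar collection operations (map, filter, reduce) you would use on a local list.

```python
# Conceptual sketch only: simulates "operating on a distributed dataset like
# a local collection". The partition list is an assumption for illustration.
from functools import reduce

partitions = [[1, 2, 3], [4, 5, 6]]   # pretend each inner list lives on a node

# Each partition is transformed independently (as a cluster would do),
# yet the code reads like ordinary collection processing.
mapped = [[x * x for x in part] for part in partitions]        # "map"
evens = [x for part in mapped for x in part if x % 2 == 0]     # "filter"
total = reduce(lambda a, b: a + b, (sum(part) for part in mapped))  # "reduce"

print(evens)   # [4, 16, 36]
print(total)   # 91 (= 1 + 4 + 9 + 16 + 25 + 36)
```

In real Spark, the per-partition work runs on cluster nodes, but the programming model keeps this local-collection feel.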
Spark also introduces an abstraction called the RDD (Resilient Distributed Dataset). An RDD is a read-only
collection of objects partitioned across a group of nodes. These collections are resilient: if part of the dataset is lost, it can be rebuilt.
Rebuilding part of a dataset relies on a fault-tolerance mechanism that maintains the dataset's "lineage" (that is,
information about how the dataset was derived, which allows a lost partition to be recomputed). An RDD is represented as a Scala object
and can be created from a file or from a parallelized collection.
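The lineage idea described above can be sketched in a few lines of plain Python (this is a simplified illustration, not Spark's actual implementation): each derived dataset records its parent and the transformation that produced it, so if the computed data is lost it can be rebuilt.

```python
# Conceptual sketch of RDD lineage. MiniRDD and its methods are hypothetical
# names for illustration; real RDDs are far more sophisticated.
class MiniRDD:
    def __init__(self, data=None, parent=None, transform=None):
        self.parent = parent        # lineage: where this dataset came from
        self.transform = transform  # lineage: how it was derived
        self._data = data           # may be dropped (e.g. on node failure)

    def map(self, fn):
        # Only lineage is recorded here; nothing is computed yet.
        return MiniRDD(parent=self,
                       transform=lambda rows: [fn(r) for r in rows])

    def collect(self):
        if self._data is None:
            # Data lost (or never materialized): rebuild from the parent.
            self._data = self.transform(self.parent.collect())
        return self._data

base = MiniRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())   # [2, 4, 6]

doubled._data = None       # simulate losing the derived partition
print(doubled.collect())   # rebuilt from lineage: [2, 4, 6]
```

Because lineage captures "how to recompute" rather than a data copy, fault tolerance comes cheaply, without replicating every intermediate result.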
Summary:
1. Spark is a development library
2. Any library that can run on it successfully can become part of Spark
3. It is universal: it integrates seamlessly with Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX;
it is both a platform and a general-purpose development library
4. Ideas from various industries and experts can be assembled into Spark to become powerful APIs
Spark benefits:
1. First, Spark performs computation in memory
2. It provides a distributed parallel computing framework that supports DAG execution graphs, reducing intermediate-result I/O overhead between successive computations
3. It provides a cache mechanism that supports data sharing across multiple iterations, reducing I/O overhead
4. An RDD maintains its lineage: once an RDD is lost, it can be rebuilt automatically from its parent RDDs, ensuring fault tolerance
5. It moves computation rather than data: an RDD partition reads data blocks from the distributed file system into
node memory for computation
6. It uses a multi-threaded pool model to reduce task startup overhead
7. It avoids unnecessary sort operations in the shuffle process
8. It uses the fault-tolerant, highly scalable Akka as its communication framework
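Benefit 3 above (caching for iterative workloads) can be demonstrated with a minimal plain-Python sketch, assuming a hypothetical expensive dataset: the dataset is materialized once, cached, and then reused by several iterations instead of being recomputed (or re-read from disk) each time.

```python
# Conceptual sketch of caching across iterations; not the Spark API.
compute_calls = 0

def expensive_dataset():
    """Stand-in for an expensive computation or disk read."""
    global compute_calls
    compute_calls += 1
    return [x * x for x in range(10)]

_cache = None
def cached_dataset():
    global _cache
    if _cache is None:          # materialize once, like calling rdd.cache()
        _cache = expensive_dataset()
    return _cache

# Five "iterations" of an iterative algorithm, each touching the dataset.
results = [sum(cached_dataset()) for _ in range(5)]

print(results)        # [285, 285, 285, 285, 285]
print(compute_calls)  # 1 — the dataset was built only once
```

Without the cache, each iteration would pay the full cost of `expensive_dataset()`; this is exactly the I/O overhead Spark's cache mechanism is designed to avoid.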
Runtime frameworks:
1. Hadoop's MapReduce framework platform, YARN
2. The Apache Mesos framework platform
3. Spark's standalone framework platform
4. Amazon's AWS platform
Also, as with Hadoop 2.7.0, the community decided that starting with Spark 1.5, JDK 1.6 will no longer be supported; JDK 1.7 or later is required.
Reference:
http://liujunjie51072.blog.163.com/blog/static/868916212009915105633843/
(A brief overview from my Spark learning notes.)