Overview:
Spark is an open-source cluster computing system based on in-memory computing, designed to make data analysis faster.
Spark is small: it was developed by a small team at the AMP Lab at the University of California, Berkeley,
and the core of the project is written in Scala, in only 63 Scala files. (The AMP Lab name is a small point of interest:
AMP stands for Algorithms, Machines, People.)
Spark is an open-source cluster computing environment similar to Hadoop, but there are differences between the two,
and these differences give Spark an advantage in certain workloads. In other words,
Spark enables in-memory distributed datasets, which, in addition to supporting interactive queries, can also optimize iterative
workloads.
Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop,
Spark is tightly integrated with Scala, so Scala code can manipulate distributed datasets
as easily as local collection objects.
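To illustrate the point above, here is a conceptual sketch in plain Python (not the real Spark or Scala API): a dataset split into partitions, as if each lived on a different node, is still processed with the same familiar collection operations (map, filter, reduce) you would use on a local list.

```python
# Conceptual sketch only: simulates "operating on a distributed dataset like
# a local collection". The partition list is an assumption for illustration.
from functools import reduce

partitions = [[1, 2, 3], [4, 5, 6]]   # pretend each inner list lives on a node

# Each partition is transformed independently (as a cluster would do),
# yet the code reads like ordinary collection processing.
mapped = [[x * x for x in part] for part in partitions]        # "map"
evens = [x for part in mapped for x in part if x % 2 == 0]     # "filter"
total = reduce(lambda a, b: a + b, (sum(part) for part in mapped))  # "reduce"

print(evens)   # [4, 16, 36]
print(total)   # 91 (= 1 + 4 + 9 + 16 + 25 + 36)
```

In real Spark, the per-partition work runs on cluster nodes, but the programming model keeps this local-collection feel.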
Spark also introduces an abstraction called the RDD (Resilient Distributed Dataset). An RDD is a read-only
collection of objects partitioned across a group of nodes. These collections are resilient: if part of the dataset is lost, it can be rebuilt.
Rebuilding part of a dataset relies on a fault-tolerance mechanism that maintains the dataset's "lineage" (that is,
information about how the dataset was derived, which allows a lost partition to be recomputed). An RDD is represented as a Scala object
and can be created from a file or from a parallelized collection.
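The lineage idea described above can be sketched in a few lines of plain Python (this is a simplified illustration, not Spark's actual implementation): each derived dataset records its parent and the transformation that produced it, so if the computed data is lost it can be rebuilt.

```python
# Conceptual sketch of RDD lineage. MiniRDD and its methods are hypothetical
# names for illustration; real RDDs are far more sophisticated.
class MiniRDD:
    def __init__(self, data=None, parent=None, transform=None):
        self.parent = parent        # lineage: where this dataset came from
        self.transform = transform  # lineage: how it was derived
        self._data = data           # may be dropped (e.g. on node failure)

    def map(self, fn):
        # Only lineage is recorded here; nothing is computed yet.
        return MiniRDD(parent=self,
                       transform=lambda rows: [fn(r) for r in rows])

    def collect(self):
        if self._data is None:
            # Data lost (or never materialized): rebuild from the parent.
            self._data = self.transform(self.parent.collect())
        return self._data

base = MiniRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())   # [2, 4, 6]

doubled._data = None       # simulate losing the derived partition
print(doubled.collect())   # rebuilt from lineage: [2, 4, 6]
```

Because lineage captures "how to recompute" rather than a data copy, fault tolerance comes cheaply, without replicating every intermediate result.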
Summary:
1. Spark is a development library
2. Any library that can run on it successfully can become part of Spark
3. It is universal: it integrates seamlessly with Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX;
it is both a platform and a general-purpose development library
4. Ideas from various industries and experts can be assembled into Spark to become powerful APIs
Spark benefits:
1. First, Spark performs computation in memory
2. It provides a distributed parallel computing framework that supports DAG execution graphs, reducing intermediate-result I/O overhead between successive computations
3. It provides a cache mechanism that supports data sharing across multiple iterations, reducing I/O overhead
4. An RDD maintains its lineage: once an RDD is lost, it can be rebuilt automatically from its parent RDDs, ensuring fault tolerance
5. It moves computation rather than data: an RDD partition reads data blocks from the distributed file system into
node memory for computation
6. It uses a multi-threaded pool model to reduce task startup overhead
7. It avoids unnecessary sort operations in the shuffle process
8. It uses the fault-tolerant, highly scalable Akka as its communication framework
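Benefit 3 above (caching for iterative workloads) can be demonstrated with a minimal plain-Python sketch, assuming a hypothetical expensive dataset: the dataset is materialized once, cached, and then reused by several iterations instead of being recomputed (or re-read from disk) each time.

```python
# Conceptual sketch of caching across iterations; not the Spark API.
compute_calls = 0

def expensive_dataset():
    """Stand-in for an expensive computation or disk read."""
    global compute_calls
    compute_calls += 1
    return [x * x for x in range(10)]

_cache = None
def cached_dataset():
    global _cache
    if _cache is None:          # materialize once, like calling rdd.cache()
        _cache = expensive_dataset()
    return _cache

# Five "iterations" of an iterative algorithm, each touching the dataset.
results = [sum(cached_dataset()) for _ in range(5)]

print(results)        # [285, 285, 285, 285, 285]
print(compute_calls)  # 1 — the dataset was built only once
```

Without the cache, each iteration would pay the full cost of `expensive_dataset()`; this is exactly the I/O overhead Spark's cache mechanism is designed to avoid.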
Runtime frameworks:
1. Hadoop's MapReduce framework platform, YARN
2. The Apache Mesos framework platform
3. Spark's standalone framework platform
4. Amazon's AWS platform
Also, as with Hadoop 2.7.0, the community decided that starting with Spark 1.5, JDK 1.6 will no longer be supported; JDK 1.7 or later is required.
Reference:
http://liujunjie51072.blog.163.com/blog/static/868916212009915105633843/
(A brief overview from my Spark learning notes.)