Distributed computing for Spark

Source: Internet
Author: User
Tags: hadoop, mapreduce, spark

What is Spark, and how is Spark used? This article covers:

1. What algorithm is Spark's distributed computing based on? (It is very simple.)

2. How does Spark differ from MapReduce?

3. Why is Spark more flexible than Hadoop?

4. What are Spark's limitations?

5. Under what circumstances is it appropriate to use Spark?

What is Spark

Spark is an open-source parallel computing framework from UC Berkeley's AMP Lab, in the mold of Hadoop MapReduce. Spark's distributed computing is based on the map/reduce model and retains the advantages of Hadoop MapReduce. Unlike MapReduce, however, intermediate job output and results can be kept in memory, which eliminates repeated reads and writes to HDFS. This makes Spark better suited to map/reduce workloads that need to iterate, such as data mining and machine learning.

Spark vs. Hadoop

Spark keeps intermediate data in memory, which makes iterative operations more efficient.

Spark is better suited than Hadoop to ML and data-mining workloads, which tend to be iterative, because Spark provides the RDD (Resilient Distributed Dataset) abstraction.

Spark is more generic than Hadoop

Spark offers many types of data set operations, unlike Hadoop, which provides only the two operations map and reduce. Spark calls operations such as map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, partitionBy, and many others "transformations". It also provides various "actions" such as count, collect, reduce, lookup, save, and more.
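To illustrate this transformation/action style of programming, here is a toy, in-memory mimic of the RDD interface in plain Python (the `ToyRDD` class is hypothetical and for illustration only; real RDDs are distributed, lazy, and partitioned):

```python
from itertools import chain

class ToyRDD:
    """A toy, single-machine stand-in for Spark's RDD interface."""

    def __init__(self, data):
        self.data = list(data)

    # Transformations: each returns a new derived data set.
    def map(self, f):
        return ToyRDD(f(x) for x in self.data)

    def flatMap(self, f):
        return ToyRDD(chain.from_iterable(f(x) for x in self.data))

    def filter(self, f):
        return ToyRDD(x for x in self.data if f(x))

    def reduceByKey(self, f):
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return ToyRDD(acc.items())

    # Actions: each returns a result to the driver program.
    def collect(self):
        return self.data

    def count(self):
        return len(self.data)

# The classic word count, written as a chain of transformations
# ending in a single action.
lines = ToyRDD(["to be or", "not to be"])
counts = (lines.flatMap(str.split)
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The same pipeline shape (flatMap, map, reduceByKey, then collect) is how a word count is typically expressed against a real RDD.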

These rich data set operations make it convenient to develop upper-level applications. The communication model between processing nodes is no longer limited to Hadoop's single data-shuffle pattern: users can name and materialize intermediate results and control their storage and partitioning. The programming model is thus more flexible than Hadoop's.

However, because of the characteristics of the RDD, Spark is not suitable for applications that perform asynchronous, fine-grained state updates, such as the storage layer of a web service or an incremental web crawler and indexer. In short, the RDD model is a poor fit for incremental modification.

Fault tolerance

In distributed data set computing, fault tolerance is achieved through checkpointing, which can be done in two ways: checkpointing the data itself, or logging the updates (the lineage of transformations) applied to it. Users can control which method is used.

Usability

Spark improves usability by providing rich Scala, Java, and Python APIs, as well as an interactive shell.

The combination of Spark and Hadoop

Spark can read and write data directly on HDFS and also supports Spark on YARN. Spark can run in the same cluster as MapReduce, sharing storage and compute resources. Shark, a data-warehouse implementation, borrows from Hive and is almost completely compatible with it.

Spark's application scenario

Spark is a memory-based iterative computing framework, suitable for applications that operate on a particular data set many times. The more often the data set is reused, and the larger it is, the greater the benefit; for small data sets or light computation, the benefit is relatively small. (This is an important factor to consider when weighing Spark in a large data architecture.)
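The trade-off above can be made concrete with a toy sketch (illustrative only): a MapReduce-style job re-reads its input from storage on every iteration, while a Spark-style job loads it once, caches it in memory, and iterates over the cached copy.

```python
# Count how many times the "expensive" storage read happens in each style.
reads_from_disk = 0

def load_dataset():
    """Stand-in for an expensive read from HDFS."""
    global reads_from_disk
    reads_from_disk += 1
    return list(range(1_000))

ITERATIONS = 10

# MapReduce-style: each iteration re-reads its input from storage.
for _ in range(ITERATIONS):
    data = load_dataset()
    total = sum(data)
mapreduce_reads = reads_from_disk

# Spark-style: load once, cache in memory, iterate over the cached copy.
reads_from_disk = 0
cached = load_dataset()
for _ in range(ITERATIONS):
    total = sum(cached)
spark_reads = reads_from_disk

print(mapreduce_reads, spark_reads)  # 10 1
```

With one iteration the two styles cost the same, which is why the benefit grows with the number of repeated operations over the data set.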

As noted above, because of the characteristics of the RDD, Spark is not suitable for applications that perform asynchronous, fine-grained state updates, such as the storage layer of a web service or an incremental web crawler and indexer; the RDD model is a poor fit for incremental modification. Apart from that, Spark's range of applicable scenarios is fairly broad.

Operating mode

Local mode

Standalone mode

Mesos mode

Yarn Mode

Spark Eco-System

Shark (Hive on Spark): Shark provides essentially the same HiveQL command interface as Hive, built on the Spark framework. For maximum compatibility with Hive, Shark uses the Hive API for query parsing and logical-plan generation; only the physical-plan execution phase is carried out with Spark instead of Hadoop MapReduce. By configuring Shark parameters, Shark can automatically cache specific RDDs in memory, enabling data reuse and speeding up retrieval of particular data sets. Shark also supports user-defined functions (UDFs) for specific data-analysis and learning algorithms, so SQL queries and operational analysis can be combined to maximize the reuse of RDDs.

Spark Streaming: a framework for processing stream data, built on Spark. The basic principle is to divide the stream into small time slices (a few seconds) and process each small batch of data in a way similar to batch processing. Spark Streaming is built on Spark because, on the one hand, Spark's low-latency execution engine (on the order of 100 ms) can be used for near-real-time computation, and on the other hand, the RDD model makes efficient fault-tolerant processing easier than in record-at-a-time frameworks such as Storm. In addition, the micro-batch approach is compatible with both batch and real-time processing logic and algorithms, which facilitates applications that require joint analysis of historical and real-time data.
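The micro-batch idea itself can be sketched in a few lines of plain Python (illustrative only, not the Spark Streaming API): group a stream of timestamped records into fixed time slices, then run an ordinary batch computation on each slice.

```python
from collections import defaultdict

BATCH_SECONDS = 2

# A pretend stream of (arrival_time_in_seconds, value) events.
stream = [(0.5, 3), (1.2, 4), (2.1, 5), (3.9, 6), (4.0, 7)]

# Assign each record to the time slice covering its arrival time.
batches = defaultdict(list)
for t, value in stream:
    batches[int(t // BATCH_SECONDS)].append(value)

# Batch-process each slice independently, e.g. a per-window sum.
window_sums = {window: sum(values)
               for window, values in sorted(batches.items())}
print(window_sums)  # {0: 7, 1: 11, 2: 7}
```

Each slice is just a small batch, which is why the same processing logic can serve both the streaming and the batch side of an application.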

Bagel (Pregel on Spark): a very useful small project for graph computation on Spark. Bagel comes with an example that implements Google's PageRank algorithm.
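PageRank is a good example of the iterative workloads discussed earlier, since the same link data set is reused on every pass. Here is a minimal plain-Python sketch of the algorithm (illustrative only; this is not the Bagel API, and the three-page graph is a made-up example):

```python
# Adjacency list: each page and its outgoing links.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
DAMPING = 0.85
ranks = {page: 1.0 for page in links}

for _ in range(20):  # a fixed number of iterations ("supersteps")
    # Each page sends an equal share of its rank along every out-link.
    contribs = {page: 0.0 for page in links}
    for page, outgoing in links.items():
        share = ranks[page] / len(outgoing)
        for target in outgoing:
            contribs[target] += share
    # Standard PageRank update with a damping factor.
    ranks = {page: (1 - DAMPING) + DAMPING * c
             for page, c in contribs.items()}

# Page "c" is linked to by both "a" and "b", so it ranks highest.
print(max(ranks, key=ranks.get))  # c
```

On a real cluster, the `links` data set would be cached in memory across iterations, which is exactly where Spark's advantage over re-reading from disk shows up.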

End.

