Spark is fast, at the cost of the correctness of its computed results.

Yes, Spark is fast. However, it does not guarantee that computed values are correct, even when all you need is to add up simple integers.

One of Spark's most famous papers is "Spark: Cluster Computing with Working Sets". When you read it, you need to understand that the code in that paper does not guarantee that the computed result is correct. Specifically, its logistic regression code uses an accumulator in the map stage. The following explains why this is wrong.

Suppose there is a simple task:

Each row of the input file contains 100 integers, and the columns must be summed vertically.

For example:

Input

1 2 3 4 5... 100
1 2 3 4 5... 200
1 3 3 4 5... 100

Output

3 7 9 12 15... 400

Very simple, right? It could even be written as a small Pig script. On Hadoop, this problem can be solved with MapReduce: first split the input file into N equal-sized blocks; then each mapper outputs one row of 100 integers for its block, for example 2, 4, 6, 8, 10, ... 200. The reducer then receives the outputs of all the mappers and adds them together element-wise to obtain the final result.
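
To make this concrete, here is a minimal sketch of the same column sum written as a Spark map/reduce job (the file name and the space-separated row format are assumptions; the Hadoop MapReduce version is structurally the same):

    val rows = sc.textFile("input.txt")
      .map(line => line.trim.split("\\s+").map(_.toLong))                    // one Array[Long] of 100 values per row
    val sums = rows.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })  // element-wise sum of all rows
    println(sums.mkString(" "))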

Disadvantage: the path from mapper to reducer involves disk I/O and network transmission, and you need to transmit N * 100 integers. When the input has high dimensionality (each row is millions of bytes), this is wasteful.

Spark cleverly introduces the concept of the accumulator: the output of all tasks on the same machine is aggregated locally first and only then sent back to the driver. The data transferred is then no longer (number of tasks) * dimension but (number of machines) * dimension, which saves a lot. In machine learning in particular, we are very used to doing this kind of computation with accumulators.
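
For the column-sum task above, an accumulator-based version might look roughly like the sketch below. It uses the old Spark 1.x Accumulator/AccumulatorParam API (deprecated since Spark 2.0); the param object, file name, and dimension are illustrative assumptions:

    import org.apache.spark.AccumulatorParam

    // Hypothetical param that merges fixed-length arrays element-wise
    object ArraySumParam extends AccumulatorParam[Array[Long]] {
      def zero(initial: Array[Long]): Array[Long] = Array.fill(initial.length)(0L)
      def addInPlace(a: Array[Long], b: Array[Long]): Array[Long] = {
        var i = 0
        while (i < a.length) { a(i) += b(i); i += 1 }
        a
      }
    }

    val acc = sc.accumulator(Array.fill(100)(0L))(ArraySumParam)

    // Updates are aggregated locally before being shipped back, so the
    // N * 100 integer transfer is avoided. Note the update sits in an action,
    // which matters as explained further below.
    sc.textFile("input.txt").foreach { line =>
      acc += line.trim.split("\\s+").map(_.toLong)
    }

    println(acc.value.mkString(" "))   // only the driver/master may read the value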

The accumulator's design is careful in some respects. For example, only the master node can read the value of an accumulator; worker nodes cannot. In "Performance and Scalability of Broadcast in Spark" the author writes: "Accumulators can be defined for any type that has an 'add' operation and a 'zero' value. Due to their 'add-only' semantics, they are easy to make fault-tolerant." But is that true? No.

If the accumulator is not updated in the final step of the job, correctness cannot be guaranteed: the accumulator is neither the input nor the output of the map/reduce function, so its update is merely a side effect of evaluating the expression. For example:

    val acc = sc.accumulator(0)
    val mapped = data.map { x => acc += 1; f(x) }
    mapped.count()          // acc should equal data.count() here
    mapped.foreach { ... }  // now acc = 2 * data.count(), because the map() was recomputed

This issue was marked as Won't Fix by Spark creator Matei Zaharia.

So is it enough to simply write your code carefully and avoid triggering recomputation? No. A task may fail and be retried, or, because a task runs slowly, several copies of it may run at the same time (speculative execution); both can make the accumulator result wrong. Accumulators should therefore only be used in RDD actions, not in transformations: for example, they can be used in a reduce function, but not in a map function.
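
As a small contrast, with a hypothetical RDD named data, the two placements look like this:

    // Risky: the update lives in a transformation; any re-execution of the
    // map (retry, speculation, another action) counts the rows again.
    val badAcc = sc.accumulator(0)
    val mapped = data.map { x => badAcc += 1; x }

    // The placement recommended above: the update lives in an action,
    // so it runs only as part of the final step of the job.
    val goodAcc = sc.accumulator(0)
    data.foreach { _ => goodAcc += 1 }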

If you do not need accumulators but want to save network transmission, Matei says: "I would suggest creating fewer tasks. If your input file has a lot of blocks and hence a lot of parallel tasks, you can use CoalescedRDD to create an RDD with fewer blocks from it."

That is to say, you can make each task larger and reduce the number of tasks, for example down to a single task per machine. The downside is just as obvious: the workload then becomes hard to balance across tasks. A sketch of this approach follows.
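
Here is a rough sketch of that advice, assuming 10 machines and the same input file as above (CoalescedRDD is what rdd.coalesce(n) builds under the hood in later Spark versions):

    val numMachines = 10                       // assumption: one larger task per machine

    val total = sc.textFile("input.txt")
      .coalesce(numMachines)                   // fewer, larger partitions
      .mapPartitions { rows =>
        val local = Array.fill(100)(0L)        // partial sums for this partition
        rows.foreach { line =>
          val nums = line.trim.split("\\s+").map(_.toLong)
          var i = 0
          while (i < nums.length) { local(i) += nums(i); i += 1 }
        }
        Iterator(local)                        // one 100-element array per partition
      }
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })  // only numMachines arrays cross the network

    println(total.mkString(" "))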

Reference:
https://issues.apache.org/jira/browse/SPARK-732
https://issues.apache.org/jira/browse/SPARK-3628
https://issues.apache.org/jira/browse/SPARK-5490
https://github.com/apache/spark/pull/228
