Spark is fast, at the cost of the correctness of its computed results.

Yes, Spark is fast. However, it does not guarantee that computed values are correct, even when all you need is to add up simple integers.

One of Spark's most famous papers is "Spark: Cluster Computing with Working Sets". When you read it, you need to understand that the code in that paper does not guarantee that the computed result is correct. Specifically, its logistic regression code uses an accumulator in the map stage. The following explains why this is wrong.

Suppose there is a simple task:

Each row of the input file contains 100 integers, and the columns must be summed vertically.

For example:

Input

1 2 3 4 5... 100
1 2 3 4 5... 200
1 3 3 4 5... 100

Output

3 7 9 12 15... 400

Very simple, right? It could even be written as a small Pig script. On Hadoop, this problem can be solved with MapReduce: first split the input file into N equal-sized blocks; then each mapper outputs one row of 100 integers for its block, for example 2, 4, 6, 8, 10, ... 200. The reducer then receives the outputs of all the mappers and adds them together element-wise to obtain the final result.
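
To make this concrete, here is a minimal sketch of the same column sum written as a Spark map/reduce job (the file name and the space-separated row format are assumptions; the Hadoop MapReduce version is structurally the same):

    val rows = sc.textFile("input.txt")
      .map(line => line.trim.split("\\s+").map(_.toLong))                    // one Array[Long] of 100 values per row
    val sums = rows.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })  // element-wise sum of all rows
    println(sums.mkString(" "))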

Disadvantage: the path from mapper to reducer involves disk I/O and network transmission, and you need to transmit N * 100 integers. When the input has high dimensionality (each row is millions of bytes), this is wasteful.

Spark cleverly introduces the concept of the accumulator: the output of all tasks on the same machine is aggregated locally first and only then sent back to the driver. The data transferred is then no longer (number of tasks) * dimension but (number of machines) * dimension, which saves a lot. In machine learning in particular, we are very used to doing this kind of computation with accumulators.
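
For the column-sum task above, an accumulator-based version might look roughly like the sketch below. It uses the old Spark 1.x Accumulator/AccumulatorParam API (deprecated since Spark 2.0); the param object, file name, and dimension are illustrative assumptions:

    import org.apache.spark.AccumulatorParam

    // Hypothetical param that merges fixed-length arrays element-wise
    object ArraySumParam extends AccumulatorParam[Array[Long]] {
      def zero(initial: Array[Long]): Array[Long] = Array.fill(initial.length)(0L)
      def addInPlace(a: Array[Long], b: Array[Long]): Array[Long] = {
        var i = 0
        while (i < a.length) { a(i) += b(i); i += 1 }
        a
      }
    }

    val acc = sc.accumulator(Array.fill(100)(0L))(ArraySumParam)

    // Updates are aggregated locally before being shipped back, so the
    // N * 100 integer transfer is avoided. Note the update sits in an action,
    // which matters as explained further below.
    sc.textFile("input.txt").foreach { line =>
      acc += line.trim.split("\\s+").map(_.toLong)
    }

    println(acc.value.mkString(" "))   // only the driver/master may read the value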

The accumulator's design is careful in some respects. For example, only the master node can read the value of an accumulator; worker nodes cannot. In "Performance and Scalability of Broadcast in Spark" the author writes: "Accumulators can be defined for any type that has an 'add' operation and a 'zero' value. Due to their 'add-only' semantics, they are easy to make fault-tolerant." But is that true? No.

If the accumulator is not updated in the final step of the job, correctness cannot be guaranteed: the accumulator is neither the input nor the output of the map/reduce function, so its update is merely a side effect of evaluating the expression. For example:

    val acc = sc.accumulator(0)
    val mapped = data.map { x => acc += 1; f(x) }
    mapped.count()          // acc should equal data.count() here
    mapped.foreach { ... }  // now acc = 2 * data.count(), because the map() was recomputed

This issue was marked as Won't Fix by Spark creator Matei Zaharia.

So is it enough to simply write your code carefully and avoid triggering recomputation? No. A task may fail and be retried, or, because a task runs slowly, several copies of it may run at the same time (speculative execution); both can make the accumulator result wrong. Accumulators should therefore only be used in RDD actions, not in transformations: for example, they can be used in a reduce function, but not in a map function.
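
As a small contrast, with a hypothetical RDD named data, the two placements look like this:

    // Risky: the update lives in a transformation; any re-execution of the
    // map (retry, speculation, another action) counts the rows again.
    val badAcc = sc.accumulator(0)
    val mapped = data.map { x => badAcc += 1; x }

    // The placement recommended above: the update lives in an action,
    // so it runs only as part of the final step of the job.
    val goodAcc = sc.accumulator(0)
    data.foreach { _ => goodAcc += 1 }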

If you do not need accumulators but want to save network transmission, Matei says: "I would suggest creating fewer tasks. If your input file has a lot of blocks and hence a lot of parallel tasks, you can use CoalescedRDD to create an RDD with fewer blocks from it."

That is to say, you can make each task larger and reduce the number of tasks, for example down to a single task per machine. The downside is just as obvious: the workload then becomes hard to balance across tasks. A sketch of this approach follows.
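
Here is a rough sketch of that advice, assuming 10 machines and the same input file as above (CoalescedRDD is what rdd.coalesce(n) builds under the hood in later Spark versions):

    val numMachines = 10                       // assumption: one larger task per machine

    val total = sc.textFile("input.txt")
      .coalesce(numMachines)                   // fewer, larger partitions
      .mapPartitions { rows =>
        val local = Array.fill(100)(0L)        // partial sums for this partition
        rows.foreach { line =>
          val nums = line.trim.split("\\s+").map(_.toLong)
          var i = 0
          while (i < nums.length) { local(i) += nums(i); i += 1 }
        }
        Iterator(local)                        // one 100-element array per partition
      }
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })  // only numMachines arrays cross the network

    println(total.mkString(" "))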

Reference:
https://issues.apache.org/jira/browse/SPARK-732
https://issues.apache.org/jira/browse/SPARK-3628
https://issues.apache.org/jira/browse/SPARK-5490
https://github.com/apache/spark/pull/228
