Spark is fast, but that speed comes at the cost of the correctness of computed results.
Yes, Spark is fast. However, it does not guarantee that the computed value is correct, even when all you need is to accumulate simple integers.
One of Spark's most famous papers is "Spark: Cluster Computing with Working Sets". When you read it, you need to understand that the code in that paper does not guarantee correct results. Specifically, its logistic regression code uses an accumulator in the map stage. The following explains why this is wrong.
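The problematic pattern described here, updating an accumulator as a side effect inside a map stage, looks roughly like the sketch below. It is purely illustrative, not the paper's code; Point, parsePoint, dims, and the loss arithmetic are assumptions made for this illustration, and sc is a SparkContext as in the Spark shell.

// Illustrative sketch only: an accumulator updated as a side effect inside a map.
case class Point(x: Array[Double], y: Double)
def parsePoint(line: String): Point = {
  val nums = line.trim.split("\\s+").map(_.toDouble)
  Point(nums.init, nums.last)
}

val dims = 100
val w = Array.fill(dims)(scala.util.Random.nextDouble())
val points = sc.textFile("hdfs://data/points.txt").map(parsePoint).cache()

val lossAcc = sc.accumulator(0.0)                  // built-in Double accumulator
val margins = points.map { p =>
  val margin = p.y * (0 until dims).map(i => w(i) * p.x(i)).sum
  lossAcc += math.log1p(math.exp(-margin))         // side effect inside a transformation
  margin
}
margins.count()                                    // action that actually runs the map
println(s"logistic loss = ${lossAcc.value}")       // wrong if the map stage ever re-runs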
Suppose there is a simple task:
Each row of the input file contains 100 integers, and the rows must be summed column by column (vertically).
For example:
Input
1 2 3 4 5... 100
1 2 3 4 5... 200
1 3 3 4 5... 100
Output
3 7 9 12 15... 400
Very simple, right? It is the kind of job you could write in Pig. On Hadoop, this problem can be solved with MapReduce: first split the input file into N equal-sized blocks; then each mapper outputs one row of 100 integers, the column sums of its block, for example 2, 4, 6, 8, 10, ... 200.
The reducer then receives the output of every mapper and adds the rows together to obtain the final result. A sketch of this two-phase idea follows.
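The sketch below expresses the idea in plain Scala rather than the Hadoop API: each mapper produces the 100 column sums of its block, and the reducer adds those partial rows together. The names blocks, mapBlock, and reduceAll are assumptions for illustration.

// Each mapper emits 100 partial sums per block; the reducer adds the partial rows.
def addRows(a: Array[Int], b: Array[Int]): Array[Int] =
  a.zip(b).map { case (x, y) => x + y }

def mapBlock(block: Seq[Array[Int]]): Array[Int] =      // mapper: column sums of one block
  block.reduce(addRows)

def reduceAll(partials: Seq[Array[Int]]): Array[Int] =  // reducer: add the N partial rows
  partials.reduce(addRows)

// Usage: reduceAll(blocks.map(mapBlock)) yields the final 100 column totals.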
Disadvantage: the path from mapper to reducer involves disk I/O and network transfer, and you have to transmit N * 100 integers. When the input has high dimensionality (each row is millions of bytes), this is wasteful.
Spark cleverly introduces the concept of the accumulator: the output of all tasks on the same machine is aggregated locally and only then sent to the reducer. The traffic is no longer (number of tasks) * dimension but (number of machines) * dimension, which saves a lot. In machine learning in particular, we are very used to doing this kind of computation with accumulators.
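For the column-sum task, the accumulator-based version looks roughly like this sketch against the Spark 1.x API. The AccumulatorParam for Array[Int] is our own assumption (Spark only ships numeric ones), and the file path is illustrative.

// Column sums via an accumulator: updates are merged per executor before being
// sent back, instead of shuffling numTasks * 100 integers.
object VectorAccumulatorParam extends org.apache.spark.AccumulatorParam[Array[Int]] {
  def zero(initialValue: Array[Int]): Array[Int] = Array.fill(initialValue.length)(0)
  def addInPlace(a: Array[Int], b: Array[Int]): Array[Int] = {
    var i = 0
    while (i < a.length) { a(i) += b(i); i += 1 }
    a
  }
}

val colSums = sc.accumulator(Array.fill(100)(0))(VectorAccumulatorParam)
sc.textFile("hdfs://data/matrix.txt").foreach { line =>
  colSums += line.trim.split("\\s+").map(_.toInt)   // add this row to the running column sums
}
println(colSums.value.mkString(" "))                 // only the driver may read the value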
The accumulator was designed with care. For example, only the master node can read the value of an accumulator; worker nodes cannot. In "Performance and Scalability of Broadcast in Spark" the author writes: "Accumulators can be defined for any type that has an 'add' operation and a 'zero' value. Due to their 'add-only' semantics, they are easy to make fault-tolerant." But is that true? No.
Unless the accumulator update happens only at the very end of the computation, correctness cannot be guaranteed, because the accumulator is neither the input nor the output of the map/reduce function; it is a side effect of evaluating the expression. For example:
val acc = sc.accumulator(0)
data.map { x => acc += 1; f(x) }
data.count()
// acc should equal data.count() here
data.foreach { ... }
// Now acc = 2 * data.count(), because the map() was recomputed.
This issue was marked as Won't Fix by Spark creator Matei Zaharia.
So is it enough to just write your code carefully and avoid triggering recomputation? No. A task may fail and be retried, or, because a task runs slowly, multiple copies of it may run at the same time (speculative execution). Any of these can make the accumulator result wrong. Accumulators should therefore only be updated in RDD actions, not in transformations: for example, in a reduce function, but not in a map function, as the contrast below shows.
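A minimal contrast sketch (the empty-line counter is an assumed example, not from the article): the same accumulator update is unreliable inside a transformation but is the supported pattern inside an action.

val lines = sc.textFile("hdfs://data/input.txt")

// Risky: map() is a transformation. If the stage is recomputed, retried, or
// speculatively duplicated, badInMap can be incremented more than once per line.
val badInMap = sc.accumulator(0)
val cleaned = lines.map { l => if (l.trim.isEmpty) badInMap += 1; l }
cleaned.count()

// Supported: foreach() is an action; updates from restarted tasks in an
// action are not applied a second time.
val badInAction = sc.accumulator(0)
lines.foreach { l => if (l.trim.isEmpty) badInAction += 1 }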
If you do not need accumulators but want to save network transfer, Matei suggests: "I would suggest creating fewer tasks. If your input file has a lot of blocks and hence a lot of parallel tasks, you can use CoalescedRDD to create an RDD with fewer blocks from it."
That is, make each task larger and reduce the number of tasks, for example one task per machine. The downside is also obvious: task execution easily becomes unbalanced.
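A sketch of that alternative for the column-sum task: coalesce() is the RDD method that creates a CoalescedRDD, and numMachines is an assumed value chosen by the user.

// Fewer, larger tasks: after coalesce(), only numPartitions partial rows of 100
// sums are combined in the final reduce, with no accumulator involved.
val numMachines = 8                                    // assumed cluster size
val colSums = sc.textFile("hdfs://data/matrix.txt")
  .coalesce(numMachines)                               // fewer blocks -> fewer tasks
  .map(_.trim.split("\\s+").map(_.toInt))
  .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })  // element-wise column add
println(colSums.mkString(" "))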
References:
https://issues.apache.org/jira/browse/SPARK-732
https://issues.apache.org/jira/browse/SPARK-3628
https://issues.apache.org/jira/browse/SPARK-5490
https://github.com/apache/spark/pull/228