Why does Spark Streaming always use checkpointing? Because it frequently reuses results from earlier batches. An ordinary Spark job, by contrast, typically produces no mid-chain results at all: even if a stage contains 1,000 steps, it does not materialize 999 intermediate results. By default only one result is produced, at the end of the chain, whereas Hadoop MapReduce materializes an intermediate result after every step.
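The idea above can be sketched in plain Python (this is an analogy for lazy chaining, not Spark itself): many transformation steps are stacked lazily, and no intermediate collection ever exists until the single final "action" pulls data through the whole chain.

```python
from functools import reduce

# Conceptual sketch: chain 100 lazy map steps over one input partition.
# `map` returns an iterator, so none of the 99 intermediate results is
# ever materialized as a collection; only the final action computes.
data = range(10)  # stand-in for an input partition

pipeline = reduce(
    lambda it, _: map(lambda x: x + 1, it),  # each step adds 1, lazily
    range(100),
    iter(data),
)

result = sum(pipeline)  # the single "action" triggers all 100 steps
print(result)
```

Each element is incremented 100 times only when `sum` finally iterates, which mirrors how a long transformation chain yields one materialized result.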
Because a Spark RDD is itself a read-only, partitioned collection that only marks the data to be processed rather than computing it, it is lazy: each Transformation builds a new RDD, passing the parent RDD in as its first argument. This forms a chain that is only triggered when the final Action runs, which is why there is only one intermediate result. The chain unfolds from back to front, like the expansion of a function, and the source code shows the same back-to-front dependency relationship. This makes fault tolerance very cheap. The usual fault-tolerance methods are:

1. Data checkpointing: the dataset is replicated to different machines over the data-center network, and every operation must copy the whole dataset across the network. Since bandwidth is the bottleneck of a distributed system, this consumes enormous network and storage resources.
2. Recording data updates: every change to the data is logged. This is, first, complex, and second, costly in performance; recomputation is also hard to handle.

Given these drawbacks, why can Spark record data updates so efficiently?

1. RDDs are immutable, so every operation produces a new RDD, and together with laziness there is no global-modification problem: the difficulty of control drops greatly, and the lineage chain that is produced makes fault tolerance very convenient.
2. Operations are coarse-grained. In the simplest view, an RDD is like a distributed List or Array; it is an abstraction for distributed functional programming, and RDD programs are generally written with higher-order functions.
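The lineage idea can be sketched with a toy class (plain Python, not Spark's real classes): every transformation builds a new node that stores its parent, and a lost result is rebuilt by replaying the chain, so fault tolerance needs no data replication.

```python
# Conceptual sketch of RDD lineage: each node keeps its parent as the
# "first argument", forming a chain; compute() replays the chain from
# the root, so nothing needs to be checkpointed to recover.
class MiniRDD:
    def __init__(self, parent=None, fn=None, source=None):
        self.parent = parent    # the parent RDD in the lineage chain
        self.fn = fn            # the transformation to apply
        self.source = source    # only the root holds real data

    def map(self, fn):
        return MiniRDD(parent=self, fn=fn)   # lazy: no work done here

    def compute(self):
        # Walk back along the lineage, then apply transformations forward.
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.compute()]

root = MiniRDD(source=[1, 2, 3])
chain = root.map(lambda x: x * 2).map(lambda x: x + 1)
print(chain.compute())   # recomputed from the root on demand
```

Building `chain` does no work; only `compute()` (the "action") walks the chain, and it can be called again after a failure because the lineage is all that must survive.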
3. At the end of a Stage, data is written to disk. The coarse-grained mode exists for efficiency and simplicity: if the update granularity were too fine and updates too numerous, the cost of recording them would be very high and efficiency would suffer. The operations that change an RDD's data (write operations) are coarse-grained, which limits RDD usage (a web crawler, for example, is not a good fit), but RDD read operations can be either coarse-grained or fine-grained. A Partition itself is a very ordinary data structure that points to the actual data, so at compute time it is known where the data lives, and the compute logic is the same for every data shard. (From teacher Liaoliang's RDD decryption lecture.)
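"The compute logic is the same for every data shard" is the heart of coarse granularity; a minimal sketch (plain Python, hypothetical names) shows one function applied uniformly to every partition:

```python
# Conceptual sketch of coarse-grained computation: each partition is
# just a pointer to a slice of the data, and the SAME compute function
# is applied to every partition; nothing is recorded per record.
partitions = [[1, 2], [3, 4], [5, 6]]      # data shards

def compute(partition):                     # one logic for all shards
    return [x * 10 for x in partition]

results = [compute(p) for p in partitions]
print(results)
```

Recording "apply `compute` to every partition" is one log entry, no matter how many records there are, which is why the lineage approach stays cheap.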
4. Why do all RDD compute operations return iterators? The benefit is seamless integration of all frameworks and stream-style handling of results: machine learning can call SQL, SQL can call machine learning, stream processing can call graph computation or SQL, because everything is built on RDDs.
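A sketch of iterator-style composition (plain Python generators, with made-up stage names, not Spark's API): each stage wraps the previous stage's iterator, so records stream through SQL-like and ML-like steps without materializing between them.

```python
# Conceptual sketch: every stage returns an iterator over its parent's
# iterator, so heterogeneous stages compose seamlessly and data flows
# through one record at a time.
def source():
    yield from [1, 2, 3, 4]

def sql_like_filter(it):
    return (x for x in it if x % 2 == 0)   # a toy "WHERE x % 2 = 0"

def ml_like_scale(it):
    return (x * 0.5 for x in it)           # a toy feature-scaling step

stream = ml_like_scale(sql_like_filter(source()))
out = list(stream)                         # pulling the final iterator
print(out)
```

Because every stage consumes and produces the same iterator shape, any framework built on this contract can feed any other.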
In addition, returning this.type means that at runtime the concrete RDD subclass is preserved, so you can work through the base interface and still call the methods of the subclass below that interface.
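A rough analogue in Python (illustration only; Scala's `this.type` is a static-typing feature with no direct Python equivalent): a base-class method returns the receiver itself, so the runtime subclass survives the call and its extra methods remain reachable.

```python
# Conceptual analogue of Scala's `this.type`: cache() is defined on the
# base class but returns the receiver, so a chained call still sees the
# concrete runtime subclass and its additional methods.
class BaseRDD:
    def cache(self):
        # In Scala this would be declared `def cache(): this.type`,
        # making the static return type the concrete subclass too.
        return self

class PairRDD(BaseRDD):
    def reduce_by_key(self):
        return "subclass method reachable after base-class call"

r = PairRDD().cache()          # cache() comes from the base class
print(r.reduce_by_key())       # but the runtime subclass is preserved
```

In Scala the compiler additionally guarantees this at the type level, which is what lets interface-level code chain into subclass-specific operations safely.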
5. In Scala you program against the interface and can still call the subclasses under that interface. On top of this seamless integration, each capability can also be used individually, which produces a kind of nuclear fission: if I work in finance and build a finance sub-framework, that sub-framework can directly call machine learning and graph computation in code for stock prediction, behavior analysis, and pattern analysis; it can also call SQL. All the other RDD-based capabilities are available to it.
6. Because of preferredLocations, Spark can process all kinds of data and achieve the best possible data locality every time. Spark is not well suited to real-time transactional processing such as bank transfers, however: the response is not fast enough and control is very difficult. Beyond that, Spark aims to unify the world of data processing!
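Locality-aware scheduling can be sketched as follows (hypothetical names, not Spark's scheduler): a task is preferentially placed on a host that already holds the partition, so computation moves to the data rather than data moving across the network.

```python
# Conceptual sketch of preferred-location scheduling: try each of the
# partition's preferred hosts in order, and only fall back to an
# arbitrary host when no preferred host is available.
def choose_host(preferred_locations, available_hosts):
    for host in preferred_locations:       # honour data locality first
        if host in available_hosts:
            return host
    return available_hosts[0]              # no local host: any will do

print(choose_host(["node2", "node3"], ["node1", "node2"]))  # local win
print(choose_host(["node9"], ["node1", "node2"]))           # fallback
```

The fallback path is what makes the scheme best-effort: locality is a preference, not a hard constraint, so tasks never wait forever for a busy local node.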
7. The disadvantage of RDDs: they currently do not support fine-grained write operations (such as those a web crawler needs) or incremental iterative computation (where each iteration touches only part of the data). Being coarse-grained by nature, RDDs do not support incremental iteration very well.
The above content comes from study at DT Big Data Dream Factory with dream tutor Liaoliang. Please indicate the source when reproducing; thank you for your cooperation.
Video for this section: http://pan.baidu.com/s/1hsQ2vv2 (RDD decryption)