Spark RDD Secrets

Source: Internet
Author: User
Tags: spark, rdd



The various libraries in the Spark compute stack, such as Spark SQL, Spark's machine learning library, and so on, are all packaged on top of the RDD.

The RDD itself provides a generic abstraction. On top of it sit the existing Spark SQL, Spark Streaming, machine learning, and graph computation libraries as well as SparkR, and you can likewise extend Spark with private libraries for your own business domain; their common interface and cornerstone is the Spark RDD.

RDD is the application abstraction for distributed functional programming based on a working set, whereas MapReduce is based on a data set. Their common characteristics are location awareness, fault tolerance, and load balancing. Data-set-based processing works by loading data from a physical storage device, operating on the data, and finally writing back to the physical storage device, but this leaves out many scenarios:

1. It is not suitable for heavily iterative jobs.

2. It is not suitable for interactive queries: each query reads data from disk and then writes the results back, and if a complex query has multiple steps, every step must go through disk. Speed is only part of the problem; the fatal flaw of this data-flow style is that past results and intermediate results cannot be reused. For example, if thousands of people concurrently operate on a data warehouse and 100 of them issue exactly the same query, the data is reloaded and re-queried every time. Likewise, if the first ten steps of two computations are identical, a data-set-based approach does not reuse them, while a working-set-based approach does.

RDD (Resilient Distributed Dataset) is based on the working set, which gives it the following kinds of elasticity:

1. Automatic switching between memory and disk for data storage.

2. Efficient Lineage-based fault tolerance.

3. If a Task fails, it is automatically retried a specific number of times.

4. If a Stage fails, it is automatically retried a specific number of times, and only the failed shards are recomputed on retry.

5. checkpoint and persist. When the lineage chain is long and the computation cumbersome, we put the data on disk/HDFS — that is checkpoint; persist reuses data held in memory or on disk. Both are extension points for efficiency and fault tolerance (see the sketch after this list).

6. Elastic data scheduling: DAG Task scheduling is independent of resource management.

7. High elasticity of data sharding. If a computation produces many shards and each Partition is particularly small, each one still consumes a thread, which reduces processing efficiency; in that case consider merging many small Partitions into one large Partition. On the other hand, when memory is tight and a Partition's data Block is relatively large, consider splitting it into smaller shards, so that Spark processes more batches but does not hit OOM. Because sharding is elastic, we can manually raise or lower the degree of parallelism, and it operates entirely on local data. (Dt_spark Big Data Dream Factory, IMF Open Class)
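As a minimal sketch of point 5, assuming an existing SparkContext named `sc` and placeholder HDFS paths of our own choosing (not the lecture's code):

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`; the HDFS paths are placeholders.
sc.setCheckpointDir("hdfs://namenode:8020/checkpoints")

val cleaned = sc.textFile("hdfs://namenode:8020/logs/input")
  .filter(_.nonEmpty)

// persist: keep the computed partitions in memory (spilling to disk) for reuse.
cleaned.persist(StorageLevel.MEMORY_AND_DISK)

// checkpoint: truncate a long lineage by materializing to reliable storage.
cleaned.checkpoint()

println(cleaned.count()) // the first action triggers both the computation and the checkpoint
```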

Note that going from 10,000 shards up to 10 million shards generally requires a shuffle (a ShuffledRDD), as in the sketch below.
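A small sketch of point 7's two directions using the standard RDD API (the data and partition counts are illustrative):

```scala
// Assumes an existing SparkContext `sc`.
val data = sc.parallelize(1 to 1000000, numSlices = 10000)

// Merge many small partitions into fewer, larger ones.
// coalesce with the default shuffle = false is a narrow dependency: no shuffle.
val merged = data.coalesce(100)

// Increasing the number of partitions must redistribute data across the
// cluster, so repartition always performs a shuffle (a ShuffledRDD).
val split = data.repartition(100000)

println(merged.getNumPartitions) // 100
println(split.getNumPartitions)  // 100000
```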

Hint: Spark's location awareness is much better than Hadoop's. Hadoop knows location at partition granularity; Spark determines location for each partition as it schedules every Stage's operations, which is more refined.

1. Why does Spark Streaming always use checkpoint? Because it constantly reuses previous results. Assume Spark runs a single RDD lineage: it typically produces no mid-results. If there are 1,000 steps inside a Stage, it does not produce 999 intermediate results; by default it produces only the one final result, whereas Hadoop materializes intermediate results along the way.
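As an illustration (a hedged sketch, not the lecture's own code): stateful operators such as updateStateByKey are exactly the case where Spark Streaming needs previous results, and they refuse to run without a checkpoint directory. The host, port, and path below are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("CheckpointDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Stateful streaming must persist state between batches.
ssc.checkpoint("hdfs://namenode:8020/checkpoints/streaming")

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val counts = words.map((_, 1)).updateStateByKey[Int] {
  // Fold each batch's values into the running state carried across batches.
  (batch: Seq[Int], state: Option[Int]) => Some(batch.sum + state.getOrElse(0))
}
counts.print()

ssc.start()
ssc.awaitTermination()
```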

Because Spark's RDD is itself a read-only collection of partitions, and operations merely mark the data rather than compute it, the model is Lazy: every Transformation builds a new RDD, passing the parent RDD in as the first parameter and thereby forming a chain, and computation is only triggered by the final Action. So there is only one final result, and the chain constitutes a back-to-front dependency relationship, an unfolding of function composition, which can also be seen in the source code. This is why its fault-tolerance overhead is very low. The usual fault-tolerance methods are: 1. data checkpointing (the entire data set is replicated over the data-center network to different machines on every operation; since every copy goes over the network and bandwidth is the bottleneck of a distributed system, this also consumes enormous storage resources); 2. recording data updates (every time the data changes, we record the change, but this is firstly complex and secondly costs performance, and recomputation is difficult to handle). Given so many shortcomings, why is Spark, which records data updates, still so efficient? Because an RDD is immutable plus Lazy: every operation produces a new RDD, there is no problem of global state modification, the difficulty of control drops greatly, and the chain that is produced makes fault tolerance very convenient. 2. It is a coarse-grained model: in the simplest terms, think of an RDD as a List or Array. The RDD is the abstraction of distributed functional programming, and programming against RDDs generally uses higher-order functions.
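A minimal sketch of that lazy chain (the path and names are illustrative, and the comments describe the standard RDD types involved):

```scala
// Assumes an existing SparkContext `sc`. Nothing is computed yet:
// each transformation only wraps its parent RDD in a new one.
val lines   = sc.textFile("hdfs://namenode:8020/logs/input") // source RDD
val errors  = lines.filter(_.contains("ERROR"))              // new RDD, parent = lines
val lengths = errors.map(_.length)                           // new RDD, parent = errors

// Only the action walks the chain from the last RDD back to the source
// and actually runs it, producing a single final result.
println(lengths.reduce(_ + _))

// The back-to-front dependency chain can be printed directly:
println(lengths.toDebugString)
```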

3. When a Stage ends, data is written to disk. The coarse-grained model exists for efficiency and simplicity: if the update granularity were too fine and updates too numerous, the cost of recording them would be very high and efficiency would suffer. The data-changing operations on an RDD (write operations) are therefore coarse-grained, which limits its usage (a web crawler, for example, is not a good fit for RDDs), but RDD read operations can be either coarse-grained or fine-grained. A Partition itself is a very common data structure that points to our concrete data; that is, at computation time we know where the data is, and the computing logic applied to this series of data shards is the same. (From teacher Liaoliang's RDD secrets.)
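The Partition interface really is that small a data structure; paraphrased from org.apache.spark.Partition, it looks essentially like this:

```scala
// A Partition does not hold data; it only identifies a shard so that the
// computation can locate the actual records at execution time.
trait Partition extends Serializable {
  // The partition's index within its parent RDD.
  def index: Int
}
```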

4. Why do the compute operations of all RDDs return iterators? The benefit is seamless integration of all the frameworks: the results of stream processing can feed machine learning, machine learning can operate on SQL and SQL on machine learning, stream processing can drive graph computation or SQL, and so on, because everything is based on the RDD.
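Paraphrased from the core contract in org.apache.spark.rdd.RDD, the point is the return type: one shard is computed into an Iterator, a stream of records rather than a materialized set:

```scala
import org.apache.spark.{Partition, TaskContext}

// Sketch of the shape every concrete RDD implements: how it is sharded,
// and how to compute one shard as an Iterator over its records.
abstract class SketchRDD[T] {
  def compute(split: Partition, context: TaskContext): Iterator[T]
  protected def getPartitions: Array[Partition]
}
```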

Also, because such methods return this.type, the runtime type of the RDD flows through the call: you can turn around and keep manipulating the concrete object, so you can program against the interface and still call the subclass beneath the interface.
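A self-contained Scala sketch of the this.type idea (not Spark's code; the class names are made up):

```scala
class Pipeline {
  // Returns the runtime type of the receiver, not `Pipeline`.
  def cache(): this.type = { println("cached"); this }
}

class SqlPipeline extends Pipeline {
  def sql(query: String): Unit = println(s"running: $query")
}

// cache() on a SqlPipeline still yields a SqlPipeline, so the
// subclass method remains available after the base-class call.
new SqlPipeline().cache().sql("SELECT 1")
```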

5. In Scala, the interface is used while the subclass under the interface can still be called. On this basis of seamless integration, individual capabilities can be combined to produce nuclear fission: if I work in finance and develop a finance sub-framework, that sub-framework can directly invoke machine learning and graph computation in code for stock prediction, behavior analysis, and pattern analysis, and it can also call SQL; on the RDD you can use all the other features (see the sketch below).
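As a hedged sketch of that seamless integration, crossing between the RDD and Spark SQL worlds in a few lines (the data, schema, and names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Mix").master("local[*]").getOrCreate()
import spark.implicits._

// Start in the RDD world...
val trades = spark.sparkContext.parallelize(Seq(("ACME", 10.0), ("ACME", 12.5), ("XYZ", 3.2)))

// ...cross into SQL...
val df = trades.toDF("symbol", "price")
df.createOrReplaceTempView("trades")
val avg = spark.sql("SELECT symbol, AVG(price) AS avg_price FROM trades GROUP BY symbol")

// ...and come back out as an RDD for further processing.
avg.rdd.foreach(println)
```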

6. Because of preferredLocations, Spark can handle all kinds of data and achieve good data locality on every run. What Spark is not suited to is real-time transactional processing, such as bank transfers: the response is not fast enough and transactional control is very difficult. Real-time (streaming) processing, by contrast, is possible. Beyond that, Spark aims to unify the world of data processing!
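The locality information is queryable on any RDD; a sketch, assuming an existing SparkContext `sc` and a placeholder HDFS path:

```scala
// Each partition can report which hosts hold its data; the scheduler uses
// the same information (via getPreferredLocations) to place tasks near blocks.
val logs = sc.textFile("hdfs://namenode:8020/logs/input")

logs.partitions.foreach { p =>
  println(s"partition ${p.index} prefers ${logs.preferredLocations(p).mkString(", ")}")
}
```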

7. The disadvantages of RDD: it currently supports neither fine-grained write operations (such as a web crawler needs) nor incremental iterative computation (where each iteration touches only part of the data; the RDD is itself coarse-grained, so it does not support incremental iteration very well).
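The coarse-grained restriction is easy to see in code: to "update" a single record you must transform the entire dataset into a new RDD (a sketch with made-up data):

```scala
// Assumes an existing SparkContext `sc`.
val pages = sc.parallelize(Seq(("url1", 0), ("url2", 0), ("url3", 0)))

// There is no pages.update("url2", 1): writes apply to every partition, so a
// crawler wanting to touch one row still rewrites the whole collection.
val visited = pages.map { case (url, n) => if (url == "url2") (url, n + 1) else (url, n) }
```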

The above content comes from study at DT Big Data Dream Factory with dream tutor Liaoliang; when reproducing, please indicate the source. Thank you for your cooperation.

Video shared for this section: http://pan.baidu.com/s/1hsQ2vv2 (RDD decryption)
