Common operations for RDD in Spark (Python)

Source: Internet
Author: User
Tags: pyspark

Resilient Distributed Dataset (RDD)

Spark is built around the concept of the RDD, a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create an RDD: parallelizing a collection that already exists in your driver program, or referencing a dataset in an external storage system. One of the most important characteristics of an RDD is distributed storage: the data is partitioned across different worker nodes, so that it can be processed in parallel when it is needed. Resilience means that partitions are held in node memory where possible and can also spill to external storage, which makes large-scale data processing convenient. Another major feature of the RDD is lazy evaluation: a complete RDD job is divided into two kinds of operations, transformations and actions.
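For instance, here is a minimal sketch of this laziness (assuming a SparkContext named sc already exists, as in the full example further below): the map call returns immediately without touching the data, and the computation only runs when collect() is called.

nums = sc.parallelize([1, 2, 3, 4, 5])   # create an RDD from a driver-side list
doubled = nums.map(lambda x: x * 2)      # transformation: nothing is computed yet
print(doubled.collect())                 # action: triggers the job, prints [2, 4, 6, 8, 10]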

1. Transformation

A transformation is used to create a new RDD from an existing one; this is the only way RDDs are derived. There are many transformation methods, including map, filter, groupBy, and join, each of which takes an RDD and produces a new RDD. Note, however, that no matter how many transformations are chained, nothing is actually computed until an action requests the data in the RDD.
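As an illustrative sketch (again assuming an existing SparkContext sc), several transformations can be chained; each call only records the lineage of the new RDD and no job is launched:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
others = sc.parallelize([("a", "x"), ("b", "y")])
joined = pairs.join(others)                        # transformation: join the two RDDs on their keys
filtered = joined.filter(lambda kv: kv[1][0] > 1)  # transformation: keep pairs whose first value is > 1
# nothing has run yet; an action such as filtered.collect() is needed to execute the chain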

2. Action

An action is the part that actually executes the computation on the data, through operations such as count, reduce, and collect. In fact, all RDD operations run in lazy mode: invoking them does not immediately compute the final result; instead, Spark remembers the sequence of operations and executes it only when an action explicitly triggers the job. The advantage is that the bulk of the processing is declared up front as transformations, and when an action finally runs, Spark only computes what is needed to produce the requested result.
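For example (a minimal sketch, assuming an existing SparkContext sc), count() and reduce() are actions that immediately launch a job and return a concrete value to the driver:

rdd = sc.parallelize(range(1, 6))        # elements 1..5
print(rdd.count())                       # action: returns 5
print(rdd.reduce(lambda a, b: a + b))    # action: returns 15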

Below is how an RDD is created in Python, along with some basic transformation and action operations.

# -*- coding: utf-8 -*-
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
import math

appName = "jhl_spark_1"  # your application name
master = "local"  # run on a single machine
conf = SparkConf().setAppName(appName).setMaster(master)  # configure the SparkContext
sc = SparkContext(conf=conf)

# parallelize: distribute a local collection and convert it into an RDD
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data, numSlices=10)  # numSlices is the number of partitions, typically chosen based on the cluster

# textFile: read an external file line by line and convert it into an RDD
rdd = sc.textFile("./c2.txt")
print(rdd.collect())

# map: apply a function to every element of the dataset
def my_add(l):
    return (l, l)

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)  # parallelize the collection
result = distData.map(my_add)
print(result.collect())  # returns a distributed dataset

# filter: filter the data
def my_add(l):
    result = False
    if l > 2:
        result = True
    return result

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)  # parallelize the collection
result = distData.filter(my_add)
print(result.collect())  # returns a distributed dataset

# zip: pair the corresponding elements of two RDDs into tuples
x = sc.parallelize(range(0, 5))
y = sc.parallelize(range(1000, 1005))
print(x.zip(y).collect())

# union: combine two RDDs
print(x.union(x).collect())

# Action operations

# collect: return the data in the RDD
rdd = sc.parallelize(range(1, 10))
print(rdd)
print(rdd.collect())

# collectAsMap: treat the RDD elements as key-value pairs and return them as a dictionary
m = sc.parallelize([('a', 2), (3, 4)]).collectAsMap()
print(m['a'])
print(m[3])

# groupBy: group the RDD according to the provided function
rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
def fun(i):
    return i % 2

result = rdd.groupBy(fun).collect()
print([(x, sorted(y)) for (x, y) in result])

# reduce: aggregate the elements of the dataset
rdd = sc.parallelize(range(1, 10))
result = rdd.reduce(lambda a, b: a + b)
print(result)


In addition to the above, there are some other common RDD operations, such as the following (a short example appears after the list):

name() returns the name of the RDD

min() returns the minimum value in the RDD

sum() adds up all of the elements in the RDD

take(n) returns the first n elements of the RDD

count() returns the number of elements in the RDD
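As a minimal sketch of these methods (assuming the same SparkContext sc as above; setName() is used here only to give name() something to return):

rdd = sc.parallelize([3, 1, 4, 1, 5]).setName("digits")
print(rdd.name())   # 'digits'
print(rdd.min())    # 1
print(rdd.sum())    # 14
print(rdd.take(3))  # [3, 1, 4]
print(rdd.count())  # 5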

For more information, please refer to: http://spark.apache.org/docs/latest/api/python/index.html
