Common operations for RDD in Spark (Python)

Source: Internet
Author: User
Tags: pyspark

Resilient Distributed Dataset (RDD)

Spark is built around the concept of the RDD, a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create an RDD: parallelizing a collection that already exists in your driver program, or referencing a dataset in an external storage system. One of the most important characteristics of an RDD is distributed storage: the data is partitioned across different worker nodes, so that it can be processed in parallel when it is needed. Resilience means that partitions are held in node memory where possible and can also spill to external storage, which makes large-scale data processing convenient. Another major feature of the RDD is lazy evaluation: a complete RDD job is divided into two kinds of operations, transformations and actions.
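For instance, here is a minimal sketch of this laziness (assuming a SparkContext named sc already exists, as in the full example further below): the map call returns immediately without touching the data, and the computation only runs when collect() is called.

nums = sc.parallelize([1, 2, 3, 4, 5])   # create an RDD from a driver-side list
doubled = nums.map(lambda x: x * 2)      # transformation: nothing is computed yet
print(doubled.collect())                 # action: triggers the job, prints [2, 4, 6, 8, 10]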

1. Transformation

A transformation is used to create a new RDD from an existing one; this is the only way RDDs are derived. There are many transformation methods, including map, filter, groupBy, and join, each of which takes an RDD and produces a new RDD. Note, however, that no matter how many transformations are chained, nothing is actually computed until an action requests the data in the RDD.
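As an illustrative sketch (again assuming an existing SparkContext sc), several transformations can be chained; each call only records the lineage of the new RDD and no job is launched:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
others = sc.parallelize([("a", "x"), ("b", "y")])
joined = pairs.join(others)                        # transformation: join the two RDDs on their keys
filtered = joined.filter(lambda kv: kv[1][0] > 1)  # transformation: keep pairs whose first value is > 1
# nothing has run yet; an action such as filtered.collect() is needed to execute the chain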

2. Action

An action is the part that actually executes the computation on the data, through operations such as count, reduce, and collect. In fact, all RDD operations run in lazy mode: invoking them does not immediately compute the final result; instead, Spark remembers the sequence of operations and executes it only when an action explicitly triggers the job. The advantage is that the bulk of the processing is declared up front as transformations, and when an action finally runs, Spark only computes what is needed to produce the requested result.
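For example (a minimal sketch, assuming an existing SparkContext sc), count() and reduce() are actions that immediately launch a job and return a concrete value to the driver:

rdd = sc.parallelize(range(1, 6))        # elements 1..5
print(rdd.count())                       # action: returns 5
print(rdd.reduce(lambda a, b: a + b))    # action: returns 15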

Below is how an RDD is created in Python, along with some basic transformation and action operations.

# -*- coding: utf-8 -*-
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
import math

appName = "jhl_spark_1"  # your application name
master = "local"  # run on a single machine
conf = SparkConf().setAppName(appName).setMaster(master)  # configure the SparkContext
sc = SparkContext(conf=conf)

# parallelize: distribute a local collection and convert it into an RDD
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data, numSlices=10)  # numSlices is the number of partitions, typically chosen based on the cluster

# textFile: read an external file line by line and convert it into an RDD
rdd = sc.textFile("./c2.txt")
print(rdd.collect())

# map: apply a function to every element of the dataset
def my_add(l):
    return (l, l)

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)  # parallelize the collection
result = distData.map(my_add)
print(result.collect())  # returns a distributed dataset

# filter: filter the data
def my_add(l):
    result = False
    if l > 2:
        result = True
    return result

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)  # parallelize the collection
result = distData.filter(my_add)
print(result.collect())  # returns a distributed dataset

# zip: pair the corresponding elements of two RDDs into tuples
x = sc.parallelize(range(0, 5))
y = sc.parallelize(range(1000, 1005))
print(x.zip(y).collect())

# union: combine two RDDs
print(x.union(x).collect())

# Action operations

# collect: return the data in the RDD
rdd = sc.parallelize(range(1, 10))
print(rdd)
print(rdd.collect())

# collectAsMap: treat the RDD elements as key-value pairs and return them as a dictionary
m = sc.parallelize([('a', 2), (3, 4)]).collectAsMap()
print(m['a'])
print(m[3])

# groupBy: group the RDD according to the provided function
rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
def fun(i):
    return i % 2

result = rdd.groupBy(fun).collect()
print([(x, sorted(y)) for (x, y) in result])

# reduce: aggregate the elements of the dataset
rdd = sc.parallelize(range(1, 10))
result = rdd.reduce(lambda a, b: a + b)
print(result)


In addition to the above, there are some other common RDD operations, such as the following (a short example appears after the list):

name() returns the name of the RDD

min() returns the minimum value in the RDD

sum() adds up all of the elements in the RDD

take(n) returns the first n elements of the RDD

count() returns the number of elements in the RDD
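As a minimal sketch of these methods (assuming the same SparkContext sc as above; setName() is used here only to give name() something to return):

rdd = sc.parallelize([3, 1, 4, 1, 5]).setName("digits")
print(rdd.name())   # 'digits'
print(rdd.min())    # 1
print(rdd.sum())    # 14
print(rdd.take(3))  # [3, 1, 4]
print(rdd.count())  # 5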

For more information, please refer to: http://spark.apache.org/docs/latest/api/python/index.html
