Spark RDD Basic Operations

Tags: foreach, spark, rdd
Spark RDD programming in Scala

An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of objects. Each RDD is divided into partitions, which run on different nodes of the cluster. RDDs support two kinds of operations: transformations and actions. Spark evaluates RDDs lazily: a transformed RDD is not computed right away, but only the first time an action is run on it. If you want to reuse an RDD across multiple actions, call rdd.persist() to have Spark cache it; the sketch after the SparkContext setup below illustrates this.

0. Initialize SparkContext

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("Spark-rdd-demo"))
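
As a quick illustration of the lazy evaluation and persist() behavior described above, here is a minimal sketch (the sample values and variable names are mine, added for illustration):

// Transformations are only recorded; nothing runs yet
val nums = sc.parallelize(List(1, 2, 3, 3))
val doubled = nums.map(x => x * 2)

// Mark the RDD to be cached once it is first computed
doubled.persist()

// The first action triggers the actual computation ...
println(doubled.count())                    // 4
// ... and the second action reuses the cached partitions
println(doubled.collect().mkString(","))    // 2,4,6,6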
1. Create an RDD

Spark provides two ways to create an RDD:

1.1 Parallelize a collection in the driver program

val lines = sc.parallelize(List("Java", "Scala", "C++"))

1.2 Read an external dataset

val lines = sc.textFile("hdfs://dash-dev:9000/input/test.txt")
2. RDD Operations

2.1 Transformations

A transformation is an operation on an RDD that returns a new RDD. The common transformations are summarized below:

Table 1: Basic transformations on an RDD containing {1,2,3,3}

| Function | Purpose | Example | Result |
|---|---|---|---|
| map() | Apply a function to each element in the RDD; the return values form a new RDD | rdd.map(x => x + 1) | {2,3,4,4} |
| flatMap() | Apply a function to each element and flatten the returned iterators into one new RDD; commonly used to split lines into words | rdd.flatMap(x => x.to(2)) | {1,2,2} |
| filter() | Return an RDD consisting of the elements that pass the predicate given to filter() | rdd.filter(x => x > 2) | {3,3} |
| distinct() | Remove duplicates | rdd.distinct() | {1,2,3} |
| sample(withReplacement, fraction, [seed]) | Sample the RDD, with or without replacement | rdd.sample(false, 0.5) | Nondeterministic |
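
The table can be verified directly in the Spark shell. A minimal sketch (the println calls are mine, added for illustration):

val rdd = sc.parallelize(List(1, 2, 3, 3))

println(rdd.map(x => x + 1).collect().mkString(","))        // 2,3,4,4
println(rdd.flatMap(x => x.to(2)).collect().mkString(","))  // 1,2,2
println(rdd.filter(x => x > 2).collect().mkString(","))     // 3,3
println(rdd.distinct().collect().mkString(","))             // 1,2,3 (order may vary)
println(rdd.sample(false, 0.5).collect().mkString(","))     // nondeterministic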

Table 2: Transformations on two RDDs containing {1,2,3} and {2,3,4}

| Function | Purpose | Example | Result |
|---|---|---|---|
| union() | Union of two RDDs (duplicates are kept) | rdd.union(other) | {1,2,3,2,3,4} |
| intersection() | Intersection of two RDDs | rdd.intersection(other) | {2,3} |
| subtract() | Difference of two RDDs | rdd.subtract(other) | {1} |
| cartesian() | Cartesian product of two RDDs | rdd.cartesian(other) | {(1,2),(1,3),(1,4),...,(3,4)} |
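
A corresponding sketch for the two-RDD transformations (again, the surrounding println calls are illustrative):

val rdd = sc.parallelize(List(1, 2, 3))
val other = sc.parallelize(List(2, 3, 4))

println(rdd.union(other).collect().mkString(","))        // 1,2,3,2,3,4
println(rdd.intersection(other).collect().mkString(",")) // 2,3 (order may vary)
println(rdd.subtract(other).collect().mkString(","))     // 1
println(rdd.cartesian(other).collect().mkString(","))    // (1,2),(1,3),...,(3,4)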
2.2 Actions

An action computes a result and returns it to the driver program, or writes it to an external storage system.

Table 3: Basic actions on an RDD containing {1,2,3,3}

| Function | Purpose | Example | Result |
|---|---|---|---|
| reduce() | Combine all elements of the RDD in parallel | rdd.reduce((x, y) => x + y) | 9 |
| collect() | Return all elements of the RDD | rdd.collect() | {1,2,3,3} |
| count() | Count the number of elements in the RDD | rdd.count() | 4 |
| countByValue() | Count how many times each element occurs in the RDD | rdd.countByValue() | {(1,1),(2,1),(3,2)} |
| take(n) | Return n elements from the RDD | rdd.take(2) | {1,2} |
| top(n) | Return the top n elements from the RDD | rdd.top(3) | {3,3,2} |
| foreach(func) | Apply the given function to each element of the RDD | rdd.foreach(print) | 1,2,3,3 (order may vary) |
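
The same shell-style sketch for the actions above (println calls added for illustration):

val rdd = sc.parallelize(List(1, 2, 3, 3))

println(rdd.reduce((x, y) => x + y))   // 9
println(rdd.collect().mkString(","))   // 1,2,3,3
println(rdd.count())                   // 4
println(rdd.countByValue())            // Map(1 -> 1, 2 -> 1, 3 -> 2)
println(rdd.take(2).mkString(","))     // 1,2
println(rdd.top(3).mkString(","))      // 3,3,2
rdd.foreach(print)                     // 1233 (output order may vary)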
2.3 Passing functions to Spark

Most of Spark's transformations, and some of its actions, depend on functions passed in by the user. When the function you pass is the method of an object, or contains a reference to a field of an object (such as self.field), Spark ships the entire object to the worker nodes, which can be much larger than the piece you actually meant to send. The alternative is to copy the field you need into a local variable and pass the local variable instead:

import org.apache.spark.rdd.RDD

class SearchFunctions(val query: String) {
    def isMatch(s: String): Boolean = {
        s.contains(query)
    }

    def getMatchesFunctionReference(rdd: RDD[String]): RDD[String] = {
        // Problem: "isMatch" means "this.isMatch", so the entire "this" is shipped
        rdd.filter(isMatch)
    }

    def getMatchesFieldReference(rdd: RDD[String]): RDD[Array[String]] = {
        // Problem: "query" means "this.query", so the entire "this" is shipped
        rdd.map(x => x.split(query))
    }

    def getMatchesNoReference(rdd: RDD[String]): RDD[Array[String]] = {
        // Safe: copy just the field we need into a local variable
        val localQuery = this.query
        rdd.map(x => x.split(localQuery))
    }
}

Also, be aware that Spark requires the functions we pass in, and the data they reference, to be serializable (implementing Java's Serializable interface); otherwise a NotSerializableException is thrown.
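
A minimal usage sketch tying this together (the sample data and variable names are mine; assumes the sc from section 0):

val lines = sc.parallelize(List("ERROR disk full", "INFO all good", "ERROR timeout"))
val search = new SearchFunctions("ERROR")

// Safe: only the local String is captured by the closure
search.getMatchesNoReference(lines).collect().foreach(a => println(a.mkString("|")))

// Unsafe: captures "this"; since SearchFunctions is not Serializable, this
// fails with "Task not serializable" (caused by a NotSerializableException)
// search.getMatchesFunctionReference(lines).collect()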

Author @wusuopubupt
November 11, 2016
