Spark RDD Basic Operations

Tags: foreach, spark, rdd
Spark RDD programming in Scala

An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of objects. Each RDD is divided into partitions, which run on different nodes of the cluster. RDDs support two kinds of operations: transformations and actions. Spark evaluates RDDs lazily: a transformed RDD is not computed right away, but only the first time an action is run on it. If you want to reuse an RDD across multiple actions, call rdd.persist() to have Spark cache it; the sketch after the SparkContext setup below illustrates this.

0. Initialize SparkContext

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("Spark-rdd-demo"))
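
As a quick illustration of the lazy evaluation and persist() behavior described above, here is a minimal sketch (the sample values and variable names are mine, added for illustration):

// Transformations are only recorded; nothing runs yet
val nums = sc.parallelize(List(1, 2, 3, 3))
val doubled = nums.map(x => x * 2)

// Mark the RDD to be cached once it is first computed
doubled.persist()

// The first action triggers the actual computation ...
println(doubled.count())                    // 4
// ... and the second action reuses the cached partitions
println(doubled.collect().mkString(","))    // 2,4,6,6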
1. Create an RDD

Spark provides two ways to create an RDD:

1.1 Parallelize a collection in the driver program

val lines = sc.parallelize(List("Java", "Scala", "C++"))

1.2 Read an external dataset

val lines = sc.textFile("hdfs://dash-dev:9000/input/test.txt")
2. RDD Operations

2.1 Transformations

A transformation is an operation on an RDD that returns a new RDD. The common transformations are summarized below:

Table 1: Basic transformations on an RDD containing {1,2,3,3}

| Function | Purpose | Example | Result |
|---|---|---|---|
| map() | Apply a function to each element in the RDD; the return values form a new RDD | rdd.map(x => x + 1) | {2,3,4,4} |
| flatMap() | Apply a function to each element and flatten the returned iterators into one new RDD; commonly used to split lines into words | rdd.flatMap(x => x.to(2)) | {1,2,2} |
| filter() | Return an RDD consisting of the elements that pass the predicate given to filter() | rdd.filter(x => x > 2) | {3,3} |
| distinct() | Remove duplicates | rdd.distinct() | {1,2,3} |
| sample(withReplacement, fraction, [seed]) | Sample the RDD, with or without replacement | rdd.sample(false, 0.5) | Nondeterministic |
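
The table can be verified directly in the Spark shell. A minimal sketch (the println calls are mine, added for illustration):

val rdd = sc.parallelize(List(1, 2, 3, 3))

println(rdd.map(x => x + 1).collect().mkString(","))        // 2,3,4,4
println(rdd.flatMap(x => x.to(2)).collect().mkString(","))  // 1,2,2
println(rdd.filter(x => x > 2).collect().mkString(","))     // 3,3
println(rdd.distinct().collect().mkString(","))             // 1,2,3 (order may vary)
println(rdd.sample(false, 0.5).collect().mkString(","))     // nondeterministic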

Table 2: Transformations on two RDDs containing {1,2,3} and {2,3,4}

| Function | Purpose | Example | Result |
|---|---|---|---|
| union() | Union of two RDDs (duplicates are kept) | rdd.union(other) | {1,2,3,2,3,4} |
| intersection() | Intersection of two RDDs | rdd.intersection(other) | {2,3} |
| subtract() | Difference of two RDDs | rdd.subtract(other) | {1} |
| cartesian() | Cartesian product of two RDDs | rdd.cartesian(other) | {(1,2),(1,3),(1,4),...,(3,4)} |
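
A corresponding sketch for the two-RDD transformations (again, the surrounding println calls are illustrative):

val rdd = sc.parallelize(List(1, 2, 3))
val other = sc.parallelize(List(2, 3, 4))

println(rdd.union(other).collect().mkString(","))        // 1,2,3,2,3,4
println(rdd.intersection(other).collect().mkString(",")) // 2,3 (order may vary)
println(rdd.subtract(other).collect().mkString(","))     // 1
println(rdd.cartesian(other).collect().mkString(","))    // (1,2),(1,3),...,(3,4)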
2.2 Actions

An action computes a result and returns it to the driver program, or writes it to an external storage system.

Table 3: Basic actions on an RDD containing {1,2,3,3}

| Function | Purpose | Example | Result |
|---|---|---|---|
| reduce() | Combine all elements of the RDD in parallel | rdd.reduce((x, y) => x + y) | 9 |
| collect() | Return all elements of the RDD | rdd.collect() | {1,2,3,3} |
| count() | Count the number of elements in the RDD | rdd.count() | 4 |
| countByValue() | Count how many times each element occurs in the RDD | rdd.countByValue() | {(1,1),(2,1),(3,2)} |
| take(n) | Return n elements from the RDD | rdd.take(2) | {1,2} |
| top(n) | Return the top n elements from the RDD | rdd.top(3) | {3,3,2} |
| foreach(func) | Apply the given function to each element of the RDD | rdd.foreach(print) | 1,2,3,3 (order may vary) |
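
The same shell-style sketch for the actions above (println calls added for illustration):

val rdd = sc.parallelize(List(1, 2, 3, 3))

println(rdd.reduce((x, y) => x + y))   // 9
println(rdd.collect().mkString(","))   // 1,2,3,3
println(rdd.count())                   // 4
println(rdd.countByValue())            // Map(1 -> 1, 2 -> 1, 3 -> 2)
println(rdd.take(2).mkString(","))     // 1,2
println(rdd.top(3).mkString(","))      // 3,3,2
rdd.foreach(print)                     // 1233 (output order may vary)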
2.3 Passing functions to Spark

Most of Spark's transformations, and some of its actions, depend on functions passed in by the user. When the function you pass is the method of an object, or contains a reference to a field of an object (such as self.field), Spark ships the entire object to the worker nodes, which can be much larger than the piece you actually meant to send. The alternative is to copy the field you need into a local variable and pass the local variable instead:

import org.apache.spark.rdd.RDD

class SearchFunctions(val query: String) {
    def isMatch(s: String): Boolean = {
        s.contains(query)
    }

    def getMatchesFunctionReference(rdd: RDD[String]): RDD[String] = {
        // Problem: "isMatch" means "this.isMatch", so the entire "this" is shipped
        rdd.filter(isMatch)
    }

    def getMatchesFieldReference(rdd: RDD[String]): RDD[Array[String]] = {
        // Problem: "query" means "this.query", so the entire "this" is shipped
        rdd.map(x => x.split(query))
    }

    def getMatchesNoReference(rdd: RDD[String]): RDD[Array[String]] = {
        // Safe: copy just the field we need into a local variable
        val localQuery = this.query
        rdd.map(x => x.split(localQuery))
    }
}

Also, be aware that Spark requires the functions we pass in, and the data they reference, to be serializable (implementing Java's Serializable interface); otherwise a NotSerializableException is thrown.
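
A minimal usage sketch tying this together (the sample data and variable names are mine; assumes the sc from section 0):

val lines = sc.parallelize(List("ERROR disk full", "INFO all good", "ERROR timeout"))
val search = new SearchFunctions("ERROR")

// Safe: only the local String is captured by the closure
search.getMatchesNoReference(lines).collect().foreach(a => println(a.mkString("|")))

// Unsafe: captures "this"; since SearchFunctions is not Serializable, this
// fails with "Task not serializable" (caused by a NotSerializableException)
// search.getMatchesFunctionReference(lines).collect()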

Author @wusuopubupt
November 11, 2016
