Spark Learning: JavaRDD

RDD Introduction

The RDD, full name Resilient Distributed Dataset, is the core concept of Spark and the abstraction for Spark's data. An RDD is a distributed collection of elements that supports only read operations, and each RDD is split into multiple partitions stored on different nodes of the cluster. In addition, an RDD lets the user explicitly request that its data be kept in memory or on disk. Mastering RDD programming is the first step in Spark development.
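The snippets in this article assume an existing JavaSparkContext named sc (and, for the streaming variants, a JavaStreamingContext named jssc). A minimal local setup might look like the sketch below; the app name and master URL are illustrative:

// imports: org.apache.spark.SparkConf, org.apache.spark.api.java.JavaSparkContext
SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);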

1: Creation operations: creating an RDD is the responsibility of the SparkContext.
2: Transformation operations: convert one RDD into another RDD through some operation.
3: Action operations: Spark is lazy; an action on an RDD is what triggers the execution of a Spark job.
4: Control operations: persist the RDD, and so on (a sketch combining all four follows below).
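Putting the four kinds of operations together, a minimal sketch (using the sc set up above):

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3)); // 1: creation
JavaRDD<Integer> doubled = nums.map(v -> v * 2);                // 2: transformation (lazy)
doubled.cache();                                                // 4: control (persist in memory)
System.out.println(doubled.collect());                          // 3: action, prints [2, 4, 6]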

Demo code address: https://github.com/zhp8341/sparkdemo/blob/master/src/main/java/com/demo/spark/rdddemo/OneRDD.java

One: Creation operations

There are two ways of creating an RDD:
1: Reading a data set (SparkContext.textFile()). Note that the two snippets below are actually the Spark Streaming variants, which produce DStreams rather than plain RDDs:

JavaDStream<String> lines = jssc.textFileStream("/Users/huipeizhu/Documents/sparkdata/input/");
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
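For the core RDD API that this section names, the corresponding SparkContext.textFile() call reads a file (or a directory of files) into a JavaRDD of lines; the path below is illustrative:

JavaRDD<String> lines = sc.textFile("/Users/huipeizhu/Documents/sparkdata/input/");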


2: Reading a collection (SparkContext.parallelize()):

List<Integer> list = Arrays.asList(5, 4, 3, 2, 1);
JavaRDD<Integer> rdd = sc.parallelize(list);

Two: Conversion operations

1: Single-RDD conversion operations

map(): applies a function to each element, returning a new RDD
System.out.println("RDD with each element multiplied by 10: " + rdd.map(v -> v * 10).collect());


filter(): tests each element, returning a new RDD consisting of the elements that satisfy the predicate
System.out.println("RDD with the element 1 removed: " + rdd.filter(v -> v != 1).collect());

flatMap(): applies a function to each element and flattens the elements of the returned iterators into a new RDD (the original Scala-style x.to(3) is rendered here with java.util.stream.IntStream):
rdd.flatMap(x -> IntStream.rangeClosed(x, 3).boxed().iterator()).collect();

distinct(): deduplicates the RDD
System.out.println("RDD after deduplication: " + rdd.distinct().collect());

RDD maximum and minimum values:

Integer max = rdd.reduce((v1, v2) -> Math.max(v1, v2));
Integer min = rdd.reduce((v1, v2) -> Math.min(v1, v2));


2: Two-RDD conversion operations:

Using [1, 2, 3] and [3, 4, 5] as the two sample RDDs, the simple operations on a pair of RDDs are:

union(): merges the two RDDs without deduplication
System.out.println("Union of the two RDDs: " + rdd1.union(rdd2).collect());

intersection(): intersection of the two RDDs
System.out.println("Elements common to the two RDDs: " + rdd1.intersection(rdd2).collect());

cartesian(): Cartesian product
System.out.println("Cartesian product with the other RDD: " + rdd1.cartesian(rdd2).collect());

subtract(): removes the elements that also appear in the other RDD
rdd1.subtract(rdd2).collect();
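Putting these together for the sample sets [1, 2, 3] and [3, 4, 5] (a sketch using the sc from earlier; element order inside the results may vary, since RDDs are unordered):

JavaRDD<Integer> rdd1 = sc.parallelize(Arrays.asList(1, 2, 3));
JavaRDD<Integer> rdd2 = sc.parallelize(Arrays.asList(3, 4, 5));
System.out.println(rdd1.union(rdd2).collect());        // [1, 2, 3, 3, 4, 5] -- duplicates kept
System.out.println(rdd1.intersection(rdd2).collect()); // [3]
System.out.println(rdd1.cartesian(rdd2).collect());    // 3 x 3 = 9 pairs: (1,3), (1,4), ...
System.out.println(rdd1.subtract(rdd2).collect());     // [1, 2]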

Three: Action operations


collect(): returns all elements
System.out.println("Raw data: " + rdd.collect());

count(): returns the number of elements
System.out.println("Count of all elements in the RDD: " + rdd.count());

countByValue(): number of occurrences of each element
System.out.println("Number of occurrences of each element: " + rdd.countByValue());

take(num): returns the first num elements
System.out.println("Take 2 elements from the RDD: " + rdd.take(2));

top(num): returns the largest num elements
System.out.println("The top 2 elements of the RDD: " + rdd.top(2));


reduce(func): aggregates all the data in the RDD in parallel (the most commonly used action)
System.out.println("Aggregate all the data in the RDD (sum): " + rdd.reduce((v1, v2) -> v1 + v2));

foreach(func): applies func to each element
rdd.foreach(t -> System.out.print(t));
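For the sample data [5, 4, 3, 2, 1], a quick sketch of what these actions return (using the rdd defined earlier; the countByValue() map order may vary):

System.out.println(rdd.collect());                   // [5, 4, 3, 2, 1]
System.out.println(rdd.count());                     // 5
System.out.println(rdd.countByValue());              // {5=1, 4=1, 3=1, 2=1, 1=1}
System.out.println(rdd.take(2));                     // [5, 4] -- first two in partition order
System.out.println(rdd.top(2));                      // [5, 4] -- the two largest
System.out.println(rdd.reduce((v1, v2) -> v1 + v2)); // 15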


Four: Control operations


cache(): persists the RDD in memory only (equivalent to persist() with the default MEMORY_ONLY storage level)

persist(): persists the RDD at a chosen storage level while preserving its lineage (dependencies)

checkpoint(): saves the RDD to reliable storage and disconnects (truncates) its lineage
The so-called control operations are persistence.
You can persist an RDD through the persist() or cache() methods. The RDD is first computed when an action runs on it, and is then kept in the memory of the nodes. The Spark cache is a fault-tolerant technique: if any partition of the RDD is lost, it is automatically recomputed from the original transformations that created it.
In addition, each persisted RDD can be stored with a different storage level.
Spark automatically monitors cache usage on each node and evicts old data using a least-recently-used (LRU) policy. If you want to remove an RDD manually, you can use the rdd.unpersist() method.
In practice we can also use third-party systems for data persistence, such as Redis.
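A small sketch of the control operations (using the sc and rdd from earlier; the storage level and checkpoint directory are illustrative):

rdd.persist(StorageLevel.MEMORY_AND_DISK()); // org.apache.spark.storage.StorageLevel: spill to disk when memory is full
System.out.println(rdd.count());             // the first action computes and caches the RDD
rdd.unpersist();                             // remove it from the cache manually

sc.setCheckpointDir("/tmp/spark-checkpoint"); // illustrative directory; must be set before checkpoint()
rdd.checkpoint();                             // lineage is truncated once the RDD is materialized
rdd.count();                                  // this action writes the checkpoint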
