Spark Learning: JavaRDD

RDD Introduction

The RDD, full name Resilient Distributed Dataset, is the core concept of Spark and the abstraction for Spark's data. An RDD is a distributed collection of elements that supports only read operations, and each RDD is split into multiple partitions stored on different nodes of the cluster. In addition, an RDD lets the user explicitly request that its data be kept in memory or on disk. Mastering RDD programming is the first step in Spark development.
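The snippets in this article assume an existing JavaSparkContext named sc (and, for the streaming variants, a JavaStreamingContext named jssc). A minimal local setup might look like the sketch below; the app name and master URL are illustrative:

// imports: org.apache.spark.SparkConf, org.apache.spark.api.java.JavaSparkContext
SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);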

1: Creation operations: creating an RDD is the responsibility of the SparkContext.
2: Transformation operations: convert one RDD into another RDD through some operation.
3: Action operations: Spark is lazy; an action on an RDD is what triggers the execution of a Spark job.
4: Control operations: persist the RDD, and so on (a sketch combining all four follows below).
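Putting the four kinds of operations together, a minimal sketch (using the sc set up above):

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3)); // 1: creation
JavaRDD<Integer> doubled = nums.map(v -> v * 2);                // 2: transformation (lazy)
doubled.cache();                                                // 4: control (persist in memory)
System.out.println(doubled.collect());                          // 3: action, prints [2, 4, 6]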

Demo code address: https://github.com/zhp8341/sparkdemo/blob/master/src/main/java/com/demo/spark/rdddemo/OneRDD.java

One: Creation operations

There are two ways of creating an RDD:
1: Reading a data set (SparkContext.textFile()). Note that the two snippets below are actually the Spark Streaming variants, which produce DStreams rather than plain RDDs:

JavaDStream<String> lines = jssc.textFileStream("/Users/huipeizhu/Documents/sparkdata/input/");
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
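For the core RDD API that this section names, the corresponding SparkContext.textFile() call reads a file (or a directory of files) into a JavaRDD of lines; the path below is illustrative:

JavaRDD<String> lines = sc.textFile("/Users/huipeizhu/Documents/sparkdata/input/");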


2: Reading a collection (SparkContext.parallelize()):

List<Integer> list = Arrays.asList(5, 4, 3, 2, 1);
JavaRDD<Integer> rdd = sc.parallelize(list);

Two: Conversion operations

1: Single-RDD conversion operations

map(): applies a function to each element, returning a new RDD
System.out.println("RDD with each element multiplied by 10: " + rdd.map(v -> v * 10).collect());


filter(): tests each element, returning a new RDD consisting of the elements that satisfy the predicate
System.out.println("RDD with the element 1 removed: " + rdd.filter(v -> v != 1).collect());

flatMap(): applies a function to each element and flattens the elements of the returned iterators into a new RDD (the original Scala-style x.to(3) is rendered here with java.util.stream.IntStream):
rdd.flatMap(x -> IntStream.rangeClosed(x, 3).boxed().iterator()).collect();

distinct(): deduplicates the RDD
System.out.println("RDD after deduplication: " + rdd.distinct().collect());

RDD maximum and minimum values:

Integer max = rdd.reduce((v1, v2) -> Math.max(v1, v2));
Integer min = rdd.reduce((v1, v2) -> Math.min(v1, v2));


2: Two-RDD conversion operations:

Using [1, 2, 3] and [3, 4, 5] as the two sample RDDs, the simple operations on a pair of RDDs are:

union(): merges the two RDDs without deduplication
System.out.println("Union of the two RDDs: " + rdd1.union(rdd2).collect());

intersection(): intersection of the two RDDs
System.out.println("Elements common to the two RDDs: " + rdd1.intersection(rdd2).collect());

cartesian(): Cartesian product
System.out.println("Cartesian product with the other RDD: " + rdd1.cartesian(rdd2).collect());

subtract(): removes the elements that also appear in the other RDD
rdd1.subtract(rdd2).collect();
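Putting these together for the sample sets [1, 2, 3] and [3, 4, 5] (a sketch using the sc from earlier; element order inside the results may vary, since RDDs are unordered):

JavaRDD<Integer> rdd1 = sc.parallelize(Arrays.asList(1, 2, 3));
JavaRDD<Integer> rdd2 = sc.parallelize(Arrays.asList(3, 4, 5));
System.out.println(rdd1.union(rdd2).collect());        // [1, 2, 3, 3, 4, 5] -- duplicates kept
System.out.println(rdd1.intersection(rdd2).collect()); // [3]
System.out.println(rdd1.cartesian(rdd2).collect());    // 3 x 3 = 9 pairs: (1,3), (1,4), ...
System.out.println(rdd1.subtract(rdd2).collect());     // [1, 2]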

Three: Action operations


collect(): returns all elements
System.out.println("Raw data: " + rdd.collect());

count(): returns the number of elements
System.out.println("Count of all elements in the RDD: " + rdd.count());

countByValue(): number of occurrences of each element
System.out.println("Number of occurrences of each element: " + rdd.countByValue());

take(num): returns the first num elements
System.out.println("Take 2 elements from the RDD: " + rdd.take(2));

top(num): returns the largest num elements
System.out.println("The top 2 elements of the RDD: " + rdd.top(2));


reduce(func): aggregates all the data in the RDD in parallel (the most commonly used action)
System.out.println("Aggregate all the data in the RDD (sum): " + rdd.reduce((v1, v2) -> v1 + v2));

foreach(func): applies func to each element
rdd.foreach(t -> System.out.print(t));
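For the sample data [5, 4, 3, 2, 1], a quick sketch of what these actions return (using the rdd defined earlier; the countByValue() map order may vary):

System.out.println(rdd.collect());                   // [5, 4, 3, 2, 1]
System.out.println(rdd.count());                     // 5
System.out.println(rdd.countByValue());              // {5=1, 4=1, 3=1, 2=1, 1=1}
System.out.println(rdd.take(2));                     // [5, 4] -- first two in partition order
System.out.println(rdd.top(2));                      // [5, 4] -- the two largest
System.out.println(rdd.reduce((v1, v2) -> v1 + v2)); // 15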


Four: Control operations


cache(): persists the RDD in memory only (equivalent to persist() with the default MEMORY_ONLY storage level)

persist(): persists the RDD at a chosen storage level while preserving its lineage (dependencies)

checkpoint(): saves the RDD to reliable storage and disconnects (truncates) its lineage
The so-called control operations are persistence.
You can persist an RDD through the persist() or cache() methods. The RDD is first computed when an action runs on it, and is then kept in the memory of the nodes. The Spark cache is a fault-tolerant technique: if any partition of the RDD is lost, it is automatically recomputed from the original transformations that created it.
In addition, each persisted RDD can be stored with a different storage level.
Spark automatically monitors cache usage on each node and evicts old data using a least-recently-used (LRU) policy. If you want to remove an RDD manually, you can use the rdd.unpersist() method.
In practice we can also use third-party systems for data persistence, such as Redis.
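A small sketch of the control operations (using the sc and rdd from earlier; the storage level and checkpoint directory are illustrative):

rdd.persist(StorageLevel.MEMORY_AND_DISK()); // org.apache.spark.storage.StorageLevel: spill to disk when memory is full
System.out.println(rdd.count());             // the first action computes and caches the RDD
rdd.unpersist();                             // remove it from the cache manually

sc.setCheckpointDir("/tmp/spark-checkpoint"); // illustrative directory; must be set before checkpoint()
rdd.checkpoint();                             // lineage is truncated once the RDD is materialized
rdd.count();                                  // this action writes the checkpoint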
