RDD Introduction
The RDD, whose full name is Resilient Distributed Dataset, is the core concept of Spark and its basic data abstraction. An RDD is a distributed collection of elements: it is read-only, and it is split into multiple partitions that are stored on different nodes of the cluster. In addition, an RDD lets the user explicitly choose whether data is kept in memory or on disk. Mastering RDD programming is the first step in Spark development. An RDD supports four kinds of operations:
1: Creation operations: creating an RDD is the responsibility of the SparkContext.
2: Transformation operations: transform one RDD into another RDD through some operation.
3: Action operations: Spark is lazily evaluated; an action on an RDD triggers the execution of a Spark job.
4: Control operations: persisting the RDD, and so on.
Demo code address: https://github.com/zhp8341/sparkdemo/blob/master/src/main/java/com/demo/spark/rdddemo/OneRDD.java
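The snippets in the rest of this article assume a JavaSparkContext named sc and a JavaRDD<Integer> named rdd. Below is a minimal setup sketch, not taken from the linked demo (the class name, app name, and local master URL are illustrative), which also touches each of the four operation categories:
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDSetupDemo {
    public static void main(String[] args) {
        // Creation: the SparkContext is responsible for building RDDs
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Integer> list = Arrays.asList(5, 4, 3, 2, 1);
        JavaRDD<Integer> rdd = sc.parallelize(list);      // creation operation

        JavaRDD<Integer> doubled = rdd.map(v -> v * 2);   // transformation (lazy, nothing runs yet)
        doubled.cache();                                   // control operation (persistence)
        System.out.println(doubled.collect());             // action: triggers the Spark job

        sc.stop();
    }
}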
One: Creation operations
There are two ways to create an RDD:
1: Reading a data set (SparkContext.textFile()):
JavaDStream<String> lines = jssc.textFileStream("/users/huipeizhu/documents/sparkdata/input/");
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
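Note that the two lines above are Spark Streaming creations (a JavaDStream and a JavaReceiverInputDStream built on a JavaStreamingContext jssc). For a plain batch RDD, the corresponding call is SparkContext.textFile(); a small sketch, reusing the input path from the streaming example above:
// Read a text file (or directory of files) into an RDD of lines
JavaRDD<String> fileLines = sc.textFile("/users/huipeizhu/documents/sparkdata/input/");
System.out.println("Number of lines read: " + fileLines.count());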
2: Reading a collection (SparkContext.parallelize()):
List<Integer> list = Arrays.asList(5, 4, 3, 2, 1);
JavaRDD<Integer> rdd = sc.parallelize(list);
Two: Conversion operations
1: Single RDD conversion operation
map(): operates on each element, returning a new RDD
System.out.println("Each element of the RDD multiplied by 10: " + rdd.map(v -> v * 10).collect());
filter(): filters each element, returning a new RDD made up of the elements that satisfy the predicate
System.out.println("RDD with the element 1 removed: " + rdd.filter(v -> v != 1).collect());
flatMap(): operates on each element and flattens the elements of the returned iterators into a new RDD
System.out.println("Expand each element x into the range x..3: " + rdd.flatMap(x -> IntStream.rangeClosed(x, 3).boxed().iterator()).collect()); // java.util.stream.IntStream; Spark 2.x flatMap expects an Iterator
distinct(): deduplication
System.out.println("RDD after deduplication: " + rdd.distinct().collect());
RDD maximum and minimum values:
Integer max = rdd.reduce((v1, v2) -> Math.max(v1, v2));
Integer min = rdd.reduce((v1, v2) -> Math.min(v1, v2));
2: Two-RDD conversion operations:
Simple set-style operations on two RDDs, for example rdd1 = [1, 2, 3] and rdd2 = [3, 4, 5] (a construction sketch follows the operations below):
union(): merge, without deduplication
System.out.println("Union of the two RDDs: " + rdd1.union(rdd2).collect());
intersection(): intersection
System.out.println("Elements common to both RDDs: " + rdd1.intersection(rdd2).collect());
cartesian(): Cartesian product
System.out.println("Cartesian product with the other RDD: " + rdd1.cartesian(rdd2).collect());
subtract(): remove the elements that also appear in the other RDD
rdd1.subtract(rdd2).collect()
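A short sketch constructing the two example RDDs used above; the results shown in the comments are what these operations produce for [1, 2, 3] and [3, 4, 5], though element order may differ since the data is partitioned:
JavaRDD<Integer> rdd1 = sc.parallelize(Arrays.asList(1, 2, 3));
JavaRDD<Integer> rdd2 = sc.parallelize(Arrays.asList(3, 4, 5));

System.out.println(rdd1.union(rdd2).collect());        // [1, 2, 3, 3, 4, 5] -- no deduplication
System.out.println(rdd1.intersection(rdd2).collect()); // [3]
System.out.println(rdd1.subtract(rdd2).collect());     // [1, 2]
System.out.println(rdd1.cartesian(rdd2).collect());    // 3 x 3 = 9 pairs: (1,3), (1,4), (1,5), ...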
Three: Action operations
collect(): returns all elements
System.out.println("Original data: " + rdd.collect());
count(): returns the number of elements
System.out.println("Number of elements in the RDD: " + rdd.count());
countByValue(): the number of occurrences of each element
System.out.println("Occurrences of each element: " + rdd.countByValue());
take(num): returns num elements
System.out.println("Take 2 elements from the RDD: " + rdd.take(2));
top(num): returns the top num elements
System.out.println("The top 2 elements of the RDD: " + rdd.top(2));
reduce(func): aggregates all of the data in the RDD in parallel (the most commonly used action)
System.out.println("Aggregate all of the data in the RDD (sum): " + rdd.reduce((v1, v2) -> v1 + v2));
foreach(func): applies func to each element
rdd.foreach(t -> System.out.print(t));
Four: Control operations
cache(): persists the RDD in memory at the default storage level (MEMORY_ONLY)
persist(level: StorageLevel): persists the RDD at the given storage level; the RDD's dependencies (lineage) are preserved
checkpoint(): saves the RDD to reliable storage and disconnects (truncates) the RDD's dependencies
The so-called control operations are the persistence operations.
You can persist an RDD with the persist() or cache() method. The first time the RDD is computed in an action, it is kept in memory on the nodes. The Spark cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed from the transformations that originally created it.
In addition, each persisted RDD can be stored with a different storage level.
Spark automatically monitors cache usage on each node and evicts old data following a least-recently-used (LRU) policy. If you want to remove an RDD manually, use the RDD.unpersist() method.
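A minimal sketch of this persistence workflow; the choice of StorageLevel and the derived RDD are illustrative, not taken from the demo:
import org.apache.spark.storage.StorageLevel;

// Mark the RDD for persistence; nothing is materialized yet
JavaRDD<Integer> doubled = rdd.map(v -> v * 2);
doubled.persist(StorageLevel.MEMORY_AND_DISK());   // cache() is shorthand for persist(MEMORY_ONLY)

System.out.println(doubled.count());    // first action: computes the partitions and caches them
System.out.println(doubled.collect());  // second action: served from the cache

doubled.unpersist();                     // manually release the cached partitions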
In practice we can also persist data with a third-party store such as Redis.