Big data 10_02: Spark Streaming input sources, foreachRDD, transform, updateStateByKey, reduceByKeyAndWindow


Basic Data Sources

1. File Streams

Reading data from files in a directory:

lines = ssc.textFileStream("file:///usr/local/spark/mycode/streaming/logfile")
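For reference, a minimal runnable Java sketch of the same file stream (the path is taken from the line above; note that only files created in the directory after the stream starts are picked up):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FileStreamDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("FileStreamDemo");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
        // no receiver is needed for a file stream, so a single local thread suffices
        JavaDStream<String> lines =
                jsc.textFileStream("file:///usr/local/spark/mycode/streaming/logfile");
        lines.print();
        jsc.start();
        jsc.awaitTermination();
    }
}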

2. Socket Streams

Spark Streaming can listen on a socket port, receive the data, and process it accordingly:

JavaReceiverInputDStream<String> lines = jsc.socketTextStream("weekend10", 9999); // 9999 is the port opened by nc -lk 9999

3. RDD Queue Streams

When debugging a Spark Streaming application, we can use streamingContext.queueStream(queueOfRDDs) to create a DStream backed by a queue of RDDs, as in the sketch below.
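A minimal sketch of an RDD queue stream (the queue name, the batch cadence, and the numbers fed in are illustrative, not from the original):

import java.util.Arrays;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class QueueStreamDemo {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("QueueStreamDemo");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(1));

        // the queue that backs the DStream; each RDD pushed here becomes one batch
        Queue<JavaRDD<Integer>> rddQueue = new LinkedList<JavaRDD<Integer>>();
        JavaDStream<Integer> inputStream = jsc.queueStream(rddQueue);
        inputStream.count().print();

        jsc.start();
        // feed a few RDDs into the queue while the stream is running
        for (int i = 0; i < 5; i++) {
            rddQueue.add(jsc.sparkContext().parallelize(Arrays.asList(1, 2, 3, 4, 5)));
            Thread.sleep(1000);
        }
        jsc.awaitTermination();
    }
}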


Advanced Data Sources

1. Apache Kafka as a DStream data source (see the sketch after this list)

2. Apache Flume as a DStream data source (see the sketch after this list)

3. DStream transformation operations

4. DStream output operations
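Below is a hedged sketch of the receiver-based Kafka source, assuming Spark 1.x with the spark-streaming-kafka (Kafka 0.8) artifact on the classpath; the ZooKeeper address node5:2181, the group id, and the topic name are illustrative assumptions, not from the original:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class KafkaSourceSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("KafkaSourceSketch");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
        // topic name -> number of receiver threads (both assumed values)
        Map<String, Integer> topics = new HashMap<String, Integer>();
        topics.put("wordcount", 1);
        // receiver-based stream of (key, message) pairs, coordinated via ZooKeeper
        JavaPairReceiverInputDStream<String, String> messages =
                KafkaUtils.createStream(jsc, "node5:2181", "wordcount-group", topics);
        messages.print();
        jsc.start();
        jsc.awaitTermination();
    }
}

And a similar hedged sketch for Flume, assuming the spark-streaming-flume artifact; the host and port (node5:41414) that Flume's Avro sink pushes to are illustrative assumptions:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

public class FlumeSourceSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("FlumeSourceSketch");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
        // push-based receiver: Flume's Avro sink must point at this host/port
        JavaReceiverInputDStream<SparkFlumeEvent> events =
                FlumeUtils.createStream(jsc, "node5", 41414);
        events.count().print();
        jsc.start();
        jsc.awaitTermination();
    }
}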




The foreachRDD operator: its function is to expose the RDD underlying each batch of the DStream. Inside the foreachRDD operator, code written outside the RDD operators is executed on the driver side, and an action operator must be invoked on the retrieved RDD, otherwise the code inside the RDD operators never executes.

foreachRDD is executed once per batch interval, so it can be used to dynamically change a broadcast variable, as sketched below.
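As a hedged illustration of that last point (not from the original post; loadLatestBlacklist() is a hypothetical helper), the sketch below rebuilds a broadcast variable on the driver once per batch inside foreachRDD, then filters the batch against it on the executors; note the final count() action, without which the RDD code never runs. It is meant to be attached to a (word, count) pair DStream such as counts in the example further down:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.api.java.JavaPairDStream;

import scala.Tuple2;

public class ForeachRDDBroadcastSketch {
    // held on the driver; re-assigned once per batch
    private static volatile Broadcast<List<String>> blacklist;

    // hypothetical helper: in practice this would read from a DB or a file
    private static List<String> loadLatestBlacklist() {
        return Arrays.asList("spam");
    }

    public static void attach(JavaPairDStream<String, Integer> counts) {
        counts.foreachRDD(new VoidFunction<JavaPairRDD<String, Integer>>() {
            private static final long serialVersionUID = 1L;
            @Override
            public void call(JavaPairRDD<String, Integer> rdd) throws Exception {
                // driver side: runs once per batch interval, so the broadcast
                // can be rebuilt here before the executors use it
                if (blacklist != null) {
                    blacklist.unpersist();
                }
                blacklist = JavaSparkContext.fromSparkContext(rdd.context())
                        .broadcast(loadLatestBlacklist());
                final Broadcast<List<String>> bc = blacklist;
                long kept = rdd.filter(new Function<Tuple2<String, Integer>, Boolean>() {
                    private static final long serialVersionUID = 1L;
                    @Override
                    public Boolean call(Tuple2<String, Integer> t) throws Exception {
                        return !bc.value().contains(t._1()); // executor side
                    }
                }).count(); // an action is required, or the RDD code never runs
                System.out.println("non-blacklisted records: " + kept);
            }
        });
    }
}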

Note:

/*
 * counts is a DStream, and foreachRDD is an output operator.
 * The role of foreachRDD is to get the RDDs inside the DStream.
 * The parameter of the foreachRDD method is an RDD from the DStream; as long as that RDD holds pairs, it is a K,V-format RDD.
 * Inside the foreachRDD method, code outside the RDD operators is executed on the driver.
 */

/**
 * 1. When running locally, the number of simulated threads must be at least 2, because one thread is occupied by the receiver (the thread that accepts the data) and the other executes the job.
 * 2. The Durations setting is the batch interval, i.e. the delay we can accept; it should be tuned according to cluster resources and the monitored execution time of each job.
 * 3. A JavaStreamingContext can be created in two ways: from a SparkConf or from a JavaSparkContext.
 * 4. After the business logic is complete, there must be an output operator.
 * 5. Once JavaStreamingContext.start() has been called, no new business logic can be added to the streaming framework.
 * 6. JavaStreamingContext.stop() with no parameters also closes the SparkContext (the flag defaults to true); stop(false) leaves the SparkContext open.
 * 7. After JavaStreamingContext.stop(), start() cannot be called again.
 */
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class WordCountOnline {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WordCountOnline");
        /**
         * Set the batch interval when creating the StreamingContext
         */
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
        // Alternative: create it from an existing JavaSparkContext:
        // JavaSparkContext sc = new JavaSparkContext(conf);
        // JavaStreamingContext jsc = new JavaStreamingContext(sc, Durations.seconds(5));
        JavaReceiverInputDStream<String> lines = jsc.socketTextStream("node5", 9999);
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterable<String> call(String s) {
                return Arrays.asList(s.split(" "));
            }
        });
        JavaPairDStream<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
        JavaPairDStream<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });
        // output operator
        counts.print();
        /*counts.foreachRDD(new VoidFunction<JavaPairRDD<String, Integer>>() {
            private static final long serialVersionUID = 1L;
            @Override
            public void call(JavaPairRDD<String, Integer> pairRDD) throws Exception {
                pairRDD.foreach(new VoidFunction<Tuple2<String, Integer>>() {
                    private static final long serialVersionUID = 1L;
                    @Override
                    public void call(Tuple2<String, Integer> tuple) throws Exception {
                        System.out.println("tuple----" + tuple);
                    }
                });
            }
        });*/
        jsc.start();
        // wait for the Spark program to be terminated
        jsc.awaitTermination();
        jsc.stop(false);
    }
}
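To test the example, first start the socket source on the listening host, e.g. nc -lk 9999 on node5, then run the program and type words into the nc session; each 5-second batch prints its word counts.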

The transform operator: like foreachRDD, it exposes the RDD underlying each batch, and it lets you apply arbitrary RDD operations to convert the RDDs in a DStream into RDDs of other types.

Inside the transform operator, code written outside the RDD operators is executed on the driver side.

transform is also executed once per batch interval, so this operator can likewise be used to dynamically change a broadcast variable.

/**
 * transform:
 * Returns a new DStream by applying an RDD-to-RDD function to every RDD in the DStream. This can be used to perform arbitrary RDD operations on the DStream.
 */
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class Operate_transform {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Operate_transform");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
        JavaDStream<String> textFileStream = jsc.textFileStream("data");
        /*
         * The role of the transform operator: obtain the RDDs inside the DStream, then reformat them, e.g. convert a non-K,V-format RDD into a K,V-format RDD.
         * The RDD format of the DStream that is finally returned is the converted RDD format.
         */
        textFileStream.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
            private static final long serialVersionUID = 1L;
            public JavaRDD<String> call(JavaRDD<String> v1) throws Exception {
                v1.foreach(new VoidFunction<String>() {
                    private static final long serialVersionUID = 1L;
                    public void call(String t) throws Exception {
                        System.err.println("**************" + t);
                    }
                });
                return v1;
            }
        }).print();
        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}

The updateStateByKey operator: a transformation operator.

updateStateByKey's function: the statistics accumulate over all incoming batches from the very first one, rather than counting each batch independently. It cannot restrict the count to, say, the last hour or day (window operations serve that purpose).

Note: constantly updating the state of each key necessarily involves saving the state and fault tolerance, so the checkpoint mechanism must be enabled.

/**
 * updateStateByKey:
 * Returns a new "state" DStream in which the state of each key is updated by applying the given func to the previous state of the key and its new values. It can be used to maintain arbitrary state data for each key.
 * Note: it operates on DStreams in (K,V) format.
 *
 * Main points about updateStateByKey:
 * 1. Spark Streaming maintains a state for each key; the type can be arbitrary, e.g. a custom object, and the update function can also be customized.
 * 2. Through the update function, the state of each key is continuously updated: for every new batch, Spark Streaming applies the update function to existing keys (and does the same for every newly appearing key).
 * Constantly updating the state of each key necessarily involves saving the state and fault tolerance, so the checkpoint mechanism must be enabled.
 *
 * @author Root
 */
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Spark 1.x uses Guava's Optional here; in Spark 2.x it is org.apache.spark.api.java.Optional
import com.google.common.base.Optional;

import scala.Tuple2;

public class Operate_updatestatebykey {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Operate_count");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
        jsc.checkpoint("checkpoint");
        JavaDStream<String> textFileStream = jsc.textFileStream("data");
        /**
         * Implement cumulative word counting
         */
        JavaPairDStream<String, Integer> mapToPair = textFileStream.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            public Iterable<String> call(String t) throws Exception {
                return Arrays.asList(t.split(" "));
            }
        }).mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            public Tuple2<String, Integer> call(String t) throws Exception {
                return new Tuple2<String, Integer>(t.trim(), 1);
            }
        });
        JavaPairDStream<String, Integer> updateStateByKey = mapToPair.updateStateByKey(new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
            private static final long serialVersionUID = 1L;
            public Optional<Integer> call(List<Integer> values, Optional<Integer> state) throws Exception {
                /**
                 * values: the values [1,1,1,1,1] grouped for this key in the current batch
                 * state: the state of this key before this batch
                 */
                Integer updateValue = 0;
                if (state.isPresent()) {
                    updateValue = state.get();
                }
                for (Integer i : values) {
                    updateValue += i;
                }
                return Optional.of(updateValue);
            }
        });
        updateStateByKey.print();
        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}

Window functions: reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) aggregates the data within a certain window of time.

When used on a DStream in (K,V) format, the values for each key are aggregated by the given func, returning a new DStream also in (K,V) format.

In the method below, Durations.seconds(20) is the window length of 20 seconds and Durations.seconds(10) is the slide interval, which means the window function processes the last 20 seconds of data every 10 seconds.


import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class Operate_reducebykeyandwindow {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Operate_countbywindow");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
        jsc.checkpoint("checkpoint");
        JavaDStream<String> textFileStream = jsc.textFileStream("data");
        /**
         * First convert textFileStream to tuple format to count words
         */
        JavaPairDStream<String, Integer> mapToPair = textFileStream.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            public Iterable<String> call(String t) throws Exception {
                return Arrays.asList(t.split(" "));
            }
        }).mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            public Tuple2<String, Integer> call(String t) throws Exception {
                return new Tuple2<String, Integer>(t.trim(), 1);
            }
        });

        JavaPairDStream<String, Integer> reduceByKeyAndWindow =
                mapToPair.reduceByKeyAndWindow(new Function2<Integer, Integer, Integer>() {
                    private static final long serialVersionUID = 1L;
                    public Integer call(Integer v1, Integer v2) throws Exception {
                        return v1 + v2;
                    }
                }, Durations.seconds(20), Durations.seconds(10));

        reduceByKeyAndWindow.print();

        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}

Understanding diagram for window operations (image not included):

Assume one batch every 5s, a window length of 15s, and a window slide interval of 10s.

The window length and the slide interval must both be integer multiples of the batch interval; if they are not, an error is reported. Optimization of window functions:

/**
 * reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]):
 * Window length (windowLength): the duration of the window
 * Slide interval (slideInterval): the interval at which the window operation is performed
 * This is a more efficient version than the previous reduceByKeyAndWindow():
 * it calculates the reduce value of the current window incrementally, based on the reduce value of the previous window.
 * This is done by reducing the new data entering the sliding window and "inverse reducing" the old data that leaves the window.
 * An example is "adding" and "subtracting" the counts of keys as the window slides.
 * However, it applies only to "invertible reduce functions", i.e. reduce functions that have a corresponding "inverse reduce" function (the invFunc parameter).
 * As with reduceByKeyAndWindow, the number of reduce tasks can be configured with the optional parameter.
 * Note that checkpointing must be enabled to use this operation; i.e. the optimized window function requires a checkpoint.
 * In other words: the one-function reduceByKeyAndWindow recomputes every window from scratch even though each window spans multiple overlapping batches, which is inefficient because data is processed repeatedly;
 * the two-function reduceByKeyAndWindow computes each key's result incrementally based on the previous window's result.
 *
 * @author Root
 */
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class Operate_reducebykeyandwindow_2 {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("Operate_countbywindow");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
        jsc.checkpoint("checkpoint");
        JavaDStream<String> textFileStream = jsc.textFileStream("data");
        /**
         * First convert textFileStream to tuple format to count words
         */
        JavaPairDStream<String, Integer> mapToPair = textFileStream.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            public Iterable<String> call(String t) throws Exception {
                return Arrays.asList(t.split(" "));
            }
        }).mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            public Tuple2<String, Integer> call(String t) throws Exception {
                return new Tuple2<String, Integer>(t.trim(), 1);
            }
        });

        JavaPairDStream<String, Integer> reduceByKeyAndWindow = mapToPair.reduceByKeyAndWindow(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            /**
             * Here v1 is the accumulated value of the key carried over from the previous window (once batches have left the window, it is the value returned by the second function below), and v2 is the newly read value.
             */
            public Integer call(Integer v1, Integer v2) throws Exception {
                System.out.println("***********v1*************" + v1);
                System.out.println("***********v2*************" + v2);
                return v1 + v2;
            }
        }, new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            /**
             * This second function (invFunc) takes effect once the stream has run longer than windowLength: v1 is the previous window's accumulated value for the key,
             * and v2 is the value of the batches that leave the window during the slide interval.
             * The value returned here becomes the v1 input of the function above.
             */
            public Integer call(Integer v1, Integer v2) throws Exception {
                System.out.println("^^^^^^^^^^^^v1^^^^^^^^^^^^" + v1);
                System.out.println("^^^^^^^^^^^^v2^^^^^^^^^^^^" + v2);
                // return v1 - v2 - 1; // variant: decrement 1 per output
                return v1 - v2;
            }
        }, Durations.seconds(20), Durations.seconds(10));
        reduceByKeyAndWindow.print();

        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}

Other transformation operators of Spark Streaming
