Overview
Hadoop's MapReduce and Spark SQL can only perform offline computation and cannot satisfy business scenarios with strict real-time requirements, such as real-time recommendation and real-time website performance analysis. Streaming computation solves these problems, and Spark Streaming is a commonly used streaming computation framework. As one of the five core components of Spark, Spark Streaming natively supports multiple data sources and can be used together with Spark MLlib and GraphX. It offers high throughput and a fault-tolerance mechanism. Data can come from Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed with high-level operators such as map, reduce, join, and window; the processed data can then be pushed out to file systems and databases. In short, Spark Streaming processes data from different data sources in real time and outputs the results to an external file system.
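To make this concrete, here is a minimal sketch of such a pipeline: it reads from a TCP socket, applies a windowed word count, and writes the results to a file system. It is not taken from the official examples; the host, port, window sizes, application name, and output path are all assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Build a streaming context with a 1-second batch interval (illustrative values)
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingOverviewSketch")
val ssc = new StreamingContext(conf, Seconds(1))
// Ingest lines from a TCP socket; Kafka, Flume, etc. would use their own connectors
val lines = ssc.socketTextStream("localhost", 9999)
// High-level operators: map, reduce, window, ...
val words = lines.flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
// Push the processed data out, e.g. to a file system (the path is an assumption)
counts.saveAsTextFiles("hdfs:///tmp/wordcounts")
ssc.start()
ssc.awaitTermination()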
Working principle
Coarse-grained
Spark Streaming receives the real-time data stream and cuts the data into small blocks according to a specified time interval, then hands each small block to the Spark engine for processing.
Fine-grained
Spark Streaming receives the real-time input data stream and splits the data into batches. For example, the data collected every 1 second is encapsulated into one batch; each batch is then handed to Spark's computation engine for processing, producing a result stream that is likewise composed of batches.
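For example, the batch interval is fixed when the StreamingContext is created. A minimal sketch, assuming an existing SparkContext sc (as in spark-shell) and a local socket source:
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Every 1 second of received data is packaged into one batch (one RDD)
val ssc = new StreamingContext(sc, Seconds(1))
val input = ssc.socketTextStream("localhost", 9999)
// The result is again a stream of batches: one count is printed per batch interval
input.count().print()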
Spark Streaming provides a high-level abstraction called DStream (Discretized Stream), which represents a continuous data stream. A DStream can be created from input data sources such as Kafka, Flume, ZeroMQ, and Kinesis, or by applying high-level operators such as map, reduce, join, and window to other DStreams.
Internally, a DStream is a series of continuously generated RDDs. The RDD is Spark Core's core abstraction: an immutable, distributed dataset. Each RDD in a DStream contains the data of one time interval.
Operators applied to a DStream, such as map, are translated under the hood into operations on each RDD in the DStream. For example, applying a map operation to a DStream produces a new DStream; internally, the map is applied to the RDD of each time interval in the input DStream, and each resulting RDD becomes the RDD of the corresponding time interval in the new DStream. These underlying RDD transformations are still executed by Spark Core's computation engine; Spark Streaming wraps Spark Core, hides the details, and provides developers with an easy-to-use high-level API.
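The following sketch illustrates this, assuming an existing StreamingContext ssc and a socket source (all names are illustrative): a map on a DStream amounts to mapping over the RDD of every batch, and foreachRDD exposes those per-batch RDDs directly.
val lines = ssc.socketTextStream("localhost", 9999)
// map on the DStream ...
val upper = lines.map(_.toUpperCase)
// ... is equivalent to applying the same map to the RDD of every batch
val upperViaTransform = lines.transform(rdd => rdd.map(_.toUpperCase))
// foreachRDD hands over the underlying RDD of each time interval
upper.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
}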
(Figure: basic working principle of Spark Streaming)
Hands-on practice
WordCount example (real-time counting)
Requirement: type characters interactively and count the occurrences of the entered characters in real time with Spark Streaming.
Code description
A similar Spark Streaming example ships in the examples directory of the Spark installation, and the corresponding source code can be viewed on GitHub. We use the JavaNetworkWordCount example here; its usage is documented in the code.
We submit the job to Spark in the following two ways.
spark-submit
./spark-submit --master local[2] --class org.apache.spark.examples.streaming.JavaNetworkWordCount --name NetworkWordCount ../examples/jars/spark-examples_2.11-2.1.0.jar localhost 9999
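Here --master local[2] runs locally with two threads (one for the receiver and one for processing), --class selects the example's main class inside the examples jar, and the trailing localhost 9999 are the hostname and port the example connects to for its text stream; adjust the jar path and version to match your installation.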
Test
nc -lk 9999
If the prompt nc: command not found appears, the nc package is not installed; install it with the following commands:
yum install nc -y
yum install nmap -y
spark-shell submission
Start spark-shell
./spark-shell --master local[2]
Execute the following code after startup
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Create a streaming context from the existing SparkContext with a 1-second batch interval
val ssc = new StreamingContext(sc, Seconds(1))
// Connect to the socket opened by nc; replace 192.168.30.130 with the host where nc is running
val lines = ssc.socketTextStream("192.168.30.130", 9999)
// Split each line into words and count the occurrences of each word within the batch
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
// Print the counts of the current batch to the console
wordCounts.print()
// Start the computation and block until it terminates
ssc.start()
ssc.awaitTermination()
The difference between the two:
spark-submit is used in production environments, while spark-shell is used for testing code during development.
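For reference, here is a minimal sketch of the same word count written as a standalone application that could be packaged into a jar and submitted with spark-submit; the object name, application name, master setting, and host are illustrative, not from the original example.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCountApp {
  def main(args: Array[String]): Unit = {
    // local[2]: at least two threads are needed locally, one for the receiver and one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCountApp")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}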