Getting Started with Spark Streaming

Overview
Hadoop MapReduce and Spark SQL can only perform offline (batch) computation and cannot satisfy business scenarios with strong real-time requirements, such as real-time recommendation and real-time website performance analysis. Stream computing solves these problems, and Spark Streaming is one of the most commonly used stream computing frameworks.

As one of the five core components of Spark, Spark Streaming natively supports multiple data sources and can be used together with Spark MLlib and GraphX; it offers high throughput and a fault-tolerance mechanism. Data can be ingested from Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets and processed with high-level operators such as map, reduce, join, and window, and the processed results can then be pushed to file systems and databases. In short, Spark Streaming processes data from different data sources in real time and outputs the results to external file systems.


Working Principle

Coarse-grained
Spark Streaming receives the real-time data stream and cuts the data into small blocks according to a specified time interval.
The small blocks are then passed to the Spark engine for processing.

Fine-grained
Spark Streaming receives the real-time input data stream and splits the data into multiple batches. For example, the data collected every 1 second is encapsulated into one batch; each batch is then handed to Spark's computing engine for processing, and the final output is a result data stream that is itself made up of batches.
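As a minimal sketch of this batching model (the local master, the 1-second interval, the application name, and the socket source are illustrative assumptions, not a fixed recipe):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Every 1 second of received data is packed into one batch (one RDD)
val conf = new SparkConf().setMaster("local[2]").setAppName("BatchingSketch")
val ssc = new StreamingContext(conf, Seconds(1))

// Each line received on the socket within a 1-second interval lands in that batch
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()   // prints how many records each batch contains

ssc.start()
ssc.awaitTermination()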


Spark Streaming provides a high-level abstraction called DStream (Discretized Stream), which represents a continuous stream of data. A DStream can be created from an input data source such as Kafka, Flume, ZeroMQ, or Kinesis, or by applying high-level operators such as map, reduce, join, and window to other DStreams.
Internally, a DStream is a series of continuously generated RDDs. The RDD is the core abstraction of Spark Core: an immutable, distributed dataset. Each RDD in a DStream contains the data of one time period.
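To make this concrete, here is a small sketch (the socket source, host, and port are assumptions for illustration; sc is the SparkContext provided by spark-shell): foreachRDD exposes the RDD that backs each batch.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)   // DStream[String]

// Every batch of the DStream is backed by one RDD covering that time period
lines.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
}

ssc.start()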


Operators applied to a DStream, such as map, are translated at the bottom layer into operations on each RDD inside the DStream. For example, applying map to a DStream generates a new DStream; under the hood, the map operation is applied to the RDD of each time period in the input DStream, and each newly generated RDD becomes the RDD of the corresponding time period in the new DStream. The underlying RDD transformations are still executed by Spark Core's computing engine; Spark Streaming wraps Spark Core, hides these details, and provides developers with an easy-to-use high-level API.
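A brief sketch of this translation (assuming a DStream[String] named lines, as in the sketches above): calling map on the DStream is equivalent to applying map to the RDD of each batch through transform.

// Applying map directly to the DStream...
val upper = lines.map(_.toUpperCase)

// ...is, under the hood, the same as applying map to each batch's underlying RDD
val upperViaTransform = lines.transform(rdd => rdd.map(_.toUpperCase))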


(Figure: the basic working principle of Spark Streaming)
Hands-On Practice
WordCount example (real-time statistics)
Requirement: enter words dynamically and use Spark Streaming to count, in real time, how many times each entered word has occurred.

Code description
Spark ships a similar Spark Streaming example in its examples directory, and the corresponding code can be viewed on GitHub. We use the JavaNetworkWordCount example; its usage is documented in the code.
We submit the job to Spark in the following two ways.

spark-submit
./spark-submit --master local[2] --class org.apache.spark.examples.streaming.JavaNetworkWordCount --name NetworkWordCount ../examples/jars/spark-examples_2.11-2.1.0.jar localhost 9999
Test
nc -lk 9999
If the prompt nc: command not found appears, the nc package is not installed; install it with the following commands.
yum install nc -y
yum install nmap -y


spark-shell submission
Start spark-shell
./spark-shell --master local[2]
Execute the following code after startup

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a StreamingContext from the shell's SparkContext with a 1-second batch interval
val ssc = new StreamingContext(sc, Seconds(1))
// Read lines from the socket opened by nc (replace the host with your own address)
val lines = ssc.socketTextStream("192.168.30.130", 9999)
// Split each line into words, map each word to (word, 1), and sum the counts per word
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
// Print each batch's result, then start the computation and wait for termination
wordCounts.print()
ssc.start()
ssc.awaitTermination()
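With nc -lk 9999 running in another terminal (adjust the host passed to socketTextStream to match the machine running nc), type a few words separated by spaces; every second the job prints the word counts for that batch as (word, count) pairs. Stop the job with Ctrl+C.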


The difference between the two:
spark-submit is used in production environments, while spark-shell is used for testing code during development.