This lesson covers:
(1) What stream processing is, and an introduction to Spark Streaming
(2) A first experience with Spark Streaming
First, what stream processing is and an introduction to Spark Streaming
A stream (streaming), in the big data era, refers to processing data in motion. Like flowing water, data flows; and since we are processing a stream of data, we naturally think in terms of data flowing in, data being processed, and data flowing out.
Data sources abound in daily work and life. In the industrial era, automobile manufacturing, monitoring devices, and industrial equipment generate large amounts of source data. In the information age, e-commerce websites, log servers, social networks, financial trading systems, hacker attacks, spam, and traffic monitoring all produce data. In the communication era, mobile phones, tablets, smart devices, and the Internet of Things produce real-time data constantly. Data streams are everywhere.
What can Spark Streaming do in the big data age?
Most users have experience shopping online. With Spark Streaming's stream processing, a user's operations on the site can be monitored, and the user's purchasing preferences, interests, transactions, and so on can be analyzed as behavior. In the financial sector, Spark Streaming can monitor high-volume accounts to prevent money laundering, illicit property transfers, fraud, and the like. In network security, where hacker attacks happen frequently, Spark Streaming can monitor suspicious IPs and, combined with a trained machine learning model, judge whether the current request is an attack. Other applications, such as spam monitoring and filtering, traffic monitoring, network monitoring, and industrial equipment monitoring, are also backed by Spark Streaming's powerful stream processing.
In the era of big data, how is the value of data generally defined?
Data that is never stream-processed is largely invalid or worthless: the greatest value comes from processing data immediately after it is generated, and the longer data sits, the lower its value becomes. In the past, most e-commerce profit came from traffic (the number of users). Today an e-commerce site must look beyond traffic and transaction volume: by stream-processing its various data flows and analyzing them in real time, it can mine all kinds of valuable information, for example building user profiles for users with different transaction volumes so as to offer different levels of service, or detecting which sections of the site a user favors and recommending relevant content in time.
Spark Streaming vs. Hadoop MapReduce:
Spark Streaming is a near-real-time stream processing framework, while Hadoop MapReduce is an offline, batch processing framework. Measured by how quickly data value is extracted, Spark Streaming clearly beats Hadoop MapReduce.
Spark Streaming vs. Storm:
Spark Streaming is a near-real-time stream processing framework that processes data in small batches per unit of time, so its latency on real-time data is at the second level; Storm is a true real-time stream processing framework that responds at the millisecond level. Which framework to choose therefore depends on the specific business scenario. One thing to clarify: many people now believe that Spark Streaming is unstable, loses data, or supports transactions poorly, but in most cases this is because they have not yet mastered Spark Streaming and Spark itself. As for latency, the upcoming customized Spark version from DT_Spark (Big Data DreamWorks) aims to push Spark Streaming's latency from the second level down to 100 milliseconds or less.
Spark Streaming's advantages:
1. It provides a rich API, so enterprises can quickly implement all kinds of complex business logic.
2. Data flowing into Spark Streaming can be combined with machine learning algorithms and graph computation.
3. Spark Streaming builds on Spark's excellent pedigree, sharing the same engine and ecosystem.
Can Spark Streaming process data record by record, the way Storm does?
Storm processes data one record at a time, while Spark Streaming processes data per unit of time, in micro-batches. Can Spark Streaming approximate Storm? The answer is yes: shrink the batch interval, as the sketch below shows.
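To make this concrete, here is a minimal sketch (the application name is made up) showing that the batch interval passed to StreamingContext is the knob that moves Spark Streaming's micro-batches toward Storm-like latency:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// A minimal sketch: shrinking the batch interval reduces latency.
// 500 ms is purely an illustrative value; how low you can go
// depends on your cluster and workload.
val conf = new SparkConf().setAppName("LowLatencyDemo")
val ssc = new StreamingContext(conf, Milliseconds(500))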
A common practice for achieving this effect is to pair Spark Streaming with Kafka as the input:
Kafka is the industry-recognized, most mainstream distributed messaging framework; it supports both the message broadcast (publish-subscribe) pattern and the message queue pattern.
Technologies Kafka uses internally:
1. Caching
2. Interface
3. Persistence (by default, data is retained for up to one week)
4. Zero-copy, which lets Kafka reach a throughput of hundreds of megabytes per second; data only needs to be loaded into the kernel once and can then be served to other applications
Various external data sources push data into Kafka, and Spark Streaming then pulls data from Kafka; how much data to process per second can be decided according to your actual situation, as the sketch below illustrates.
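As an illustration, here is a sketch of pulling from Kafka with the direct API in the spark-streaming-kafka artifact (the 0.8-era API that matches Spark 1.x); the broker address, topic name, and rate limit below are made-up examples:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("KafkaPullDemo")
  // Cap how many records each Kafka partition contributes per second,
  // so that each batch stays within what the cluster can process in time.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
val ssc = new StreamingContext(conf, Seconds(1))

// Pull data directly from the Kafka brokers; Spark Streaming tracks offsets itself.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("user-events")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)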
Second, a first experience with Spark Streaming
Below we get hands-on with Spark Streaming through a WordCount example.
The program counts the occurrences of the words flowing in during each time period; that is, it calculates how many times each word appears within the batch interval you specify. A sketch of such a program follows.
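Here is a minimal sketch, closely following the official NetworkWordCount example: it assumes a 1-second batch interval and a socket source on localhost port 9999 (the port used in the steps below).

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCount")
    // 1-second batch interval: the framework generates a job every second.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Receive lines of text from a socket source on port 9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Classic WordCount over each batch of the stream.
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}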
1. Start the Spark cluster:
We start the cluster and run the program (the example can also be found on the Spark official website).
Receiving this data and processing it is stream processing; the WordCount above uses 1 second as its unit. Why is there no result when it first runs? Because a data source is required.
2. Get the data source:
Open a new command terminal and start a socket data source with netcat on port 9999:
$ nc -lk 9999
Now paste some text into the netcat terminal as the data source, and the running program will process it; an illustrative run is shown below:
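For example, if during one batch interval we type the following (hypothetical) line into the netcat terminal:

hello spark hello streaming

the running program prints something like this for that batch (the timestamp is illustrative):

-------------------------------------------
Time: 1465000000000 ms
-------------------------------------------
(hello,2)
(spark,1)
(streaming,1)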
The relationship between DStream and RDD:
When there is no input data, an empty result is printed:
In fact, job execution here is triggered by the Spark Streaming framework rather than by the Spark business logic we wrote, and the framework's execution interval can be configured manually, for example to generate a job every second. So writing Spark code (such as flatMap, map, collect) does not by itself cause a job to run; the jobs are generated by the Spark Streaming framework.
The data coming into Spark Streaming is a DStream, but the Spark core framework only knows RDDs. Doesn't that create a contradiction?
In the Spark Streaming framework, job instances are generated based on RDD instances, and the code you write is the template of the job; that is, the RDD is a template for the job, and when the RDD runs, the template is executed, with an action actually processing the data. The template of the RDD, in turn, is the DStream (discretized stream): just as RDDs have dependencies between them, DStreams have dependencies, which form a DStream directed acyclic graph (DAG). This DAG is also a template. Spark Streaming is just a thin layer of encapsulation on top of RDDs. The code you write does not create jobs; only the framework generates them. If a batch's data cannot be processed within one second, the application has to be tuned. A small sketch of this template relationship follows.
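To see the template relationship in code, here is a short sketch reusing the wordCounts DStream from the earlier WordCount sketch: on each batch interval the framework materializes the DStream template into a concrete RDD and hands it to us.

// Each batch interval, the DStream template yields one concrete RDD;
// actions on that RDD are what the generated job actually executes.
wordCounts.foreachRDD { rdd =>
  println(s"This batch produced ${rdd.count()} distinct words")
}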
Summary: In this lesson we learned what stream processing is and how Spark Streaming processes streams, experienced the power of Spark Streaming hands-on through a WordCount example, and briefly analyzed the relationship between DStream and RDD. This is very important for understanding Spark Streaming, and we will analyze these topics in depth in later lessons.
Copyright notice: This is the blogger's original article; reproduction without the blogger's permission is prohibited.