1. What is Spark Streaming?
A, what is Spark Streaming?
Spark Streaming is similar to Apache Storm and is used for stream data processing. According to its official documentation, Spark Streaming features high throughput and fault tolerance. Spark Streaming supports a wide range of data input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. The input data can be processed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be saved in many places, such as HDFS or a database. Spark Streaming also integrates well with MLlib (machine learning) and GraphX.
B, Spark Streaming features?
Ease of use, fault tolerance, and easy integration into the Spark ecosystem.
2. Spark vs Storm
A, Spark development language: Scala; Storm development language: Clojure.
B, Spark programming model: DStream; Storm programming model: Spout/Bolt.
C, comparison of Spark and Storm:
Spark:
Storm:
3. What is a DStream?
3.1. Discretized Stream (DStream) is the basic abstraction of Spark Streaming. It represents a continuous stream of data, either the input data stream or the data stream produced by applying the various Spark primitives. Internally, a DStream is represented by a series of consecutive RDDs. Each RDD contains the data from a certain interval of time, for example:
Operations on the data are also applied on a per-RDD basis:
The computation process is carried out by the Spark engine:
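To make the relationship between a DStream and its RDDs concrete, here is a minimal sketch (not taken from the original post; the host, port, and batch interval are illustrative) that creates a StreamingContext with a 5-second batch interval and uses foreachRDD to look at the single RDD produced for each batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsRdds {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamAsRdds").setMaster("local[2]")
    // Every 5 seconds, the data received in that interval is packaged into one RDD of the DStream.
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD exposes the underlying RDD of each batch.
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}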
3.2, DStream related operations:
The primitives on DStreams are similar to those on RDDs and are divided into Transformations (conversions) and Output Operations (outputs). In addition, there are some special primitives among the transformations, such as updateStateByKey(), transform(), and the various window-related primitives.
A, transformations on DStream:
Transformation | Meaning
map(func) | Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items.
filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: by default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks]) | When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func) | Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey(func) | Return a new "state" DStream where the state for each key is updated by applying the given function to the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
Special transformations:
1. The updateStateByKey operation: the updateStateByKey primitive is used to record historical state, and it is what keeps the running totals in the word count example. If you do not use updateStateByKey to update the state, then each batch is analyzed and its result output independently, and the result is not carried over to later batches.
2. The transform operation: the transform primitive allows arbitrary RDD-to-RDD functions to be executed on the DStream. This makes it easy to extend the Spark API; it is also how MLlib (machine learning) and GraphX are combined with Spark Streaming.
3. Window operations: window operations are a bit like state in Storm; you can set the size of the window and the sliding interval to dynamically obtain the current streaming state within the window.
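As a rough illustration of these three kinds of primitives, the following sketch (my own example, assuming a socket word stream on localhost:9999 and a local checkpoint directory) keeps a running word count with updateStateByKey, filters each batch with transform, and counts words over a sliding window:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SpecialTransformations {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SpecialTransformations").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // updateStateByKey needs a checkpoint directory to store the running state.
    ssc.checkpoint("./streaming-checkpoint")

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))

    // 1. updateStateByKey: keep a running word count across batches.
    val totalCounts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))
    }

    // 2. transform: run an arbitrary RDD-to-RDD function on every batch,
    //    here dropping an example "blacklisted" word.
    val filtered = pairs.transform(rdd => rdd.filter { case (word, _) => word != "the" })

    // 3. window: count words over the last 30 seconds, sliding every 10 seconds.
    val windowedCounts = filtered.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

    totalCounts.print()
    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}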
B, Output Operations on DStreams:
Output operations export DStream data to an external system such as a database or a file system. When an output operation primitive is invoked (similar to an RDD action), the streaming program starts the actual computation.
Output Operation | Meaning
print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.
saveAsTextFiles(prefix, [suffix]) | Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix]) | Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix]) | Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
foreachRDD(func) | The most generic output operator: it applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that force the computation of the streaming RDDs.
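The foreachRDD pattern in the last row deserves a small sketch (my own, not from the original post): the function passed to foreachRDD runs on the driver, so connections to the external system should be created inside foreachPartition so that they are opened on the executors. ExternalStore below is a hypothetical stand-in for a real database client.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRddExample {
  // Hypothetical client; replace with a real driver or connection pool.
  class ExternalStore {
    def save(record: String): Unit = println(s"saved: $record")
    def close(): Unit = ()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ForeachRddExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // This closure runs in the driver; the body of foreachPartition
      // runs on the executors, where the connection is created and closed.
      rdd.foreachPartition { records =>
        val store = new ExternalStore()
        records.foreach(store.save)
        store.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}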
4. Spark Streaming in practice:
Read data from a socket in real time and process it in real time. First, test whether nc is installed.
Check whether nc is installed: # which nc
Then install nc: # yum install -y nc (this installation method reported an error, so it is not recommended)
# wget http://vault.centos.org/6.6/os/x86_64/Packages/nc-1.84-22.el6.x86_64.rpm
# rpm -iUv nc-1.84-22.el6.x86_64.rpm
Then execute the following command in one window: # nc -lk 9999 (type input messages here).
Then open a second copy of the window and execute: # nc slaver1 9999 (it can receive the input messages).
5. Start testing:
$ nc -lk 9999
$ ./bin/run-example streaming.NetworkWordCount 192.168.19.131 9999 (run from the spark-1.5.1-bin-hadoop2.4 directory)
Then, in the first window, type something like: Hello World, world of Hadoop world, Spark World, Flume world, Hello World
Check whether the second window prints the word counts.
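For reference, here is a sketch of roughly what the bundled streaming.NetworkWordCount example does (an approximation, not the exact source shipped with Spark): it counts the words received from the given TCP socket every second and prints the counts.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // args: <hostname> <port>, e.g. 192.168.19.131 9999
    val Array(host, port) = args
    val conf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream(host, port.toInt)
    val wordCounts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}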
1. Spark SQL and DataFrame
A, what is Spark SQL?
Spark SQL is the Spark module for processing structured data. It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine.
B, why learn Spark SQL?
We have already learned Hive, which converts Hive SQL into MapReduce jobs and submits them to the cluster for execution, greatly reducing the complexity of writing MapReduce programs. However, because the MapReduce computation model executes rather slowly, Spark SQL came into being: it converts Spark SQL into RDDs and submits them to the cluster for execution, so execution is very efficient!
C, Spark SQL features:
Easy integration, unified data access, Hive compatibility, and standard data connectivity.
D, what is a DataFrame?
Like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional table in a traditional database: in addition to the data itself, it also records the structure information of the data, that is, its schema. Also, similar to Hive, DataFrames support nested data types (struct, array, and map). From the point of view of API usability, the DataFrame API provides a set of high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API. Similar to the DataFrames of R and Pandas, Spark DataFrames inherit the development experience of traditional single-machine data analysis.
2. Create DataFrames
In Spark SQL, SQLContext is the entry point for creating DataFrames and executing SQL, and a SQLContext is already built into the spark-1.5.2 shell:
1. Create a file locally with three columns, id, name, and age, separated by spaces, and upload it to HDFS:
hdfs dfs -put person.txt /
2. In the spark shell, execute the following command to read the data and split each line on the column delimiter:
val lineRDD = sc.textFile("hdfs://node1.itcast.cn:9000/person.txt").map(_.split(" "))
3. Define a case class (equivalent to the table schema):
case class Person(id: Int, name: String, age: Int)
4. Associate the RDD with the case class:
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
5. Convert the RDD to a DataFrame:
val personDF = personRDD.toDF
6. Process the DataFrame:
personDF.show
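The same steps can be pasted into spark-shell in one go. This is just a sketch assuming Spark 1.5.x: in spark-shell the SQLContext and its implicits are already set up, whereas in a standalone program you would create them yourself, as noted in the comments.

// In a standalone application (not needed in spark-shell):
// val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// import sqlContext.implicits._   // brings .toDF into scope

case class Person(id: Int, name: String, age: Int)

val lineRDD = sc.textFile("hdfs://node1.itcast.cn:9000/person.txt").map(_.split(" "))
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
val personDF = personRDD.toDF

personDF.show()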
3. Common DataFrame operations:
DSL style syntax:
// See what is in the DataFrame
personDF.show
// View selected columns of the DataFrame
personDF.select(personDF.col("name")).show
personDF.select(col("name"), col("age")).show
personDF.select("name").show
// Print the schema information of the DataFrame
personDF.printSchema
// Query all names and ages, and add 1 to age
personDF.select(col("id"), col("name"), col("age") + 1).show
personDF.select(personDF("id"), personDF("name"), personDF("age") + 1).show
Filter people whose age is greater than or equal to 18:
personDF.filter(col("age") >= 18).show
Group by age and count the number of people of each age:
personDF.groupBy("age").count().show()
4. SQL style Syntax:
If you want to use SQL-style syntax, you need to register the DataFrame as a table first:
personDF.registerTempTable("t_person")
// Query the two oldest people
sqlContext.sql("select * from t_person order by age desc limit 2").show
// Display the schema information of the table
sqlContext.sql("desc t_person").show
To be continued......