Spark Cultivation (Advanced), Spark for Beginners: Section 13. Spark Streaming: Spark SQL, DataFrame, and Spark Streaming
Main content: Spark SQL, DataFrame, and Spark Streaming

1. Spark SQL, DataFrame, and Spark Streaming
The source code can be viewed directly in the Spark repository: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Time, Seconds, StreamingContext}
import org.apache.spark.util.IntParam
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

object SqlNetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    // Logging helper from the Spark examples package
    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 2 second batch size
    val sparkConf = new SparkConf().setAppName("SqlNetworkWordCount").setMaster("local[4]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create a socket stream on the target ip:port and count the words in the
    // input stream of \n delimited text (e.g. generated by 'nc').
    // Note that a storage level without replication is acceptable only when
    // running locally; replication is necessary in a distributed scenario
    // for fault tolerance.
    // Use a socket as the data source
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    // Split each line into a DStream of words
    val words = lines.flatMap(_.split(" "))

    // Convert the RDDs of the words DStream to DataFrames and run a SQL query.
    // foreachRDD is called on every RDD produced by the DStream.
    words.foreachRDD((rdd: RDD[String], time: Time) => {
      // Get the singleton instance of SQLContext
      val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
      import sqlContext.implicits._

      // Convert RDD[String] to RDD[case class] to DataFrame
      val wordsDataFrame = rdd.map(w => Record(w)).toDF()

      // Register as a table
      wordsDataFrame.registerTempTable("words")

      // Do word count on the table using SQL and print it
      val wordCountsDataFrame =
        sqlContext.sql("select word, count(*) as total from words group by word")
      println(s"========= $time =========")
      wordCountsDataFrame.show()
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

/** Case class for converting RDD to DataFrame */
case class Record(word: String)

/** Lazily instantiated singleton instance of SQLContext */
object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
With the program running, send it input from another terminal using netcat:
root@sparkmaster:~# nc -lk 9999
Spark is a fast and general cluster computing system for Big Data
Spark is a fast and general cluster computing system for Big Data
Spark is a fast and general cluster computing system for Big Data
Spark is a fast and general cluster computing system for Big Data
Spark is a fast and general cluster computing system for Big Data
Spark is a fast and general cluster computing system for Big Data
Spark is a fast and general cluster computing system for Big Data
Processing result (batches during which no new lines arrive print empty tables):
========= 1448783840000 ms =========
+---------+-----+
|     word|total|
+---------+-----+
|    Spark|   12|
|   system|   12|
|  general|   12|
|     fast|   12|
|      and|   12|
|computing|   12|
|        a|   12|
|       is|   12|
|      for|   12|
|      Big|   12|
|  cluster|   12|
|     Data|   12|
+---------+-----+

========= 1448783842000 ms =========
+----+-----+
|word|total|
+----+-----+
+----+-----+

========= 1448783844000 ms =========
+----+-----+
|word|total|
+----+-----+
+----+-----+
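As a side note, on Spark 2.x and later SQLContext and registerTempTable are deprecated; SparkSession and createOrReplaceTempView take their place, and SparkSession.builder.getOrCreate() returns any existing session, so the hand-rolled SQLContextSingleton object is no longer needed. Below is a minimal sketch of the same foreachRDD body under that assumption, reusing words and Record from the listing above:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Time

words.foreachRDD { (rdd: RDD[String], time: Time) =>
  // getOrCreate() reuses the session tied to this SparkContext,
  // replacing the SQLContextSingleton pattern
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  // Same conversion as before: RDD[String] -> RDD[Record] -> DataFrame
  val wordsDataFrame = rdd.map(w => Record(w)).toDF()

  // createOrReplaceTempView supersedes registerTempTable
  wordsDataFrame.createOrReplaceTempView("words")

  val wordCountsDataFrame =
    spark.sql("select word, count(*) as total from words group by word")
  println(s"========= $time =========")
  wordCountsDataFrame.show()
}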