Main content
1. Spark SQL, DataFrame, and Spark Streaming
Source (direct reference): https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Time, Seconds, StreamingContext}
import org.apache.spark.util.IntParam
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

object SqlNetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 2 second batch size
    val sparkConf = new SparkConf().setAppName("SqlNetworkWordCount").setMaster("local[4]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create a socket stream on the target ip:port and count the words in the
    // input stream of \n delimited text (e.g. generated by 'nc').
    // Note that the non-replicated storage level is only for running locally;
    // replication is necessary in a distributed scenario for fault tolerance.
    // The socket is the data source.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)

    // words DStream
    val words = lines.flatMap(_.split(" "))

    // Convert the RDDs of the words DStream to DataFrames and run a SQL query.
    // foreachRDD visits each RDD in the DStream.
    words.foreachRDD((rdd: RDD[String], time: Time) => {
      // Get the singleton instance of SQLContext
      val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
      import sqlContext.implicits._

      // Convert RDD[String] to RDD[case class] to DataFrame
      val wordsDataFrame = rdd.map(w => Record(w)).toDF()

      // Register as a table
      wordsDataFrame.registerTempTable("words")

      // Do the word count on the table using SQL and print it
      val wordCountsDataFrame =
        sqlContext.sql("select word, count(*) as total from words group by word")
      println(s"========= $time =========")
      wordCountsDataFrame.show()
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

/** Case class for converting an RDD to a DataFrame */
case class Record(word: String)

/** Lazily instantiated singleton instance of SQLContext */
object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
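Since this listing is the bundled Spark example, it can be launched from a Spark distribution with the run-example script (host and port here are illustrative):

./bin/run-example streaming.SqlNetworkWordCount localhost 9999

For readers on Spark 2.x or later, SQLContext and registerTempTable are deprecated in favor of SparkSession and createOrReplaceTempView. Below is a minimal sketch of the same job on that API; the object name, host, and port are illustrative and not from the original post:

// A minimal sketch, assuming Spark 2.x+: SparkSession replaces SQLContext and
// createOrReplaceTempView replaces the deprecated registerTempTable.
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

case class Record(word: String)

object SqlNetworkWordCountModern {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SqlNetworkWordCountModern").setMaster("local[4]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))

    words.foreachRDD { (rdd: RDD[String], time: Time) =>
      // getOrCreate() returns the one session per JVM,
      // so no hand-rolled singleton object is needed
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      val wordsDataFrame = rdd.map(w => Record(w)).toDF()
      wordsDataFrame.createOrReplaceTempView("words")

      val wordCounts = spark.sql("select word, count(*) as total from words group by word")
      println(s"========= $time =========")
      wordCounts.show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Because SparkSession.builder.getOrCreate() already caches a single session per JVM, the SQLContextSingleton helper from the original listing has no counterpart here.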
After starting the program, run the following command in another terminal to feed it input:
# nc -lk 9999
Spark is a fast and general cluster computing system for Big Data
Spark is a fast and general cluster computing system for Big Data
Spark is a fast and general cluster computing system for Big Data
Spark is a fast and general cluster computing system for Big Data
Spark is a fast and general cluster computing system for Big Data
Spark is a fast and general cluster computing system for Big Data
Spark is a fast and general cluster computing system for Big Data
Processing results:
========= 1448783840000 ms =========
+---------+-----+
|     word|total|
+---------+-----+
|    Spark|   12|
|   system|   12|
|  general|   12|
|     fast|   12|
|      and|   12|
|computing|   12|
|        a|   12|
|       is|   12|
|      for|   12|
|      Big|   12|
|  cluster|   12|
|     Data|   12|
+---------+-----+

========= 1448783842000 ms =========
+----+-----+
|word|total|
+----+-----+
+----+-----+

========= 1448783844000 ms =========
+----+-----+
|word|total|
+----+-----+
+----+-----+
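The second and third batches print empty tables because nothing new arrived on the socket during those 2-second windows; foreachRDD still fires for every batch. As a small, hedged tweak that is not part of the original example, the batch body can be guarded with RDD.isEmpty so that empty batches produce no output (Record and SQLContextSingleton are the helpers from the listing above):

// Hedged variant of the foreachRDD body: skip the query when a batch is empty
words.foreachRDD { (rdd: RDD[String], time: Time) =>
  if (!rdd.isEmpty()) {
    val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
    import sqlContext.implicits._
    val wordsDataFrame = rdd.map(w => Record(w)).toDF()
    wordsDataFrame.registerTempTable("words")
    val wordCountsDataFrame =
      sqlContext.sql("select word, count(*) as total from words group by word")
    println(s"========= $time =========")
    wordCountsDataFrame.show()
  }
}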