Using Spark SQL together with Spark Streaming
Flume + Kafka + Spark Streaming has become a fairly mature architecture for real-time log collection and computation. With Kafka in the middle, the data can flow into HDFS for offline analysis while being consumed in real time by multiple consumers, including Spark Streaming. However, when the statistics in a Spark Streaming program involve complex business logic, it is hard to keep the Scala code easy to understand. If SQL could be used inside Spark Streaming to do the statistical analysis, wouldn't that be much simpler?
This article describes how to combine Spark SQL with Spark Streaming and use SQL to compute statistics over real-time log data. The Spark Streaming program runs on YARN in yarn-cluster mode; no separate Spark cluster is deployed.
Environment deployment
hadoop-2.3.0-cdh5.0.0 (YARN)
spark-1.5.0-bin-hadoop2.3
kafka_2.10-0.8.2.1
In addition, the jar that Spark Streaming uses to read data from Kafka:
spark-streaming-kafka_2.10-1.5.0.jar
Deployment of this environment is not covered in this article; please see the related reading at the end of the article.
Real-time statistics requirements
At 60-second intervals, count the PVs, the number of distinct IPs, and the UVs within each 60-second window.
The final result includes:
Time point: PV : IPs : UV
Original log format
2015-11-11T14:59:59|~|xxx|~|202.109.201.181|~|xxx|~|xxx|~|xxx|~|B5c96dca0003db546e7
2015-11-11T14:59:59|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|b1611d0e00003857808
2015-11-11T14:59:59|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|1555bd0100016f2e76f
2015-11-11T15:00:00|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|c0ea13670e0b942e70e
2015-11-11T15:00:00|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|C0ea13670e0b942e70e
2015-11-11T15:00:01|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|4E3512790001039FDB9
Each log line contains 7 fields separated by the delimiter |~|; the 3rd field is the IP and the 7th field is the cookieid. Assume the original logs have already been collected by Flume and streamed into Kafka.
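As a quick illustration (the sample line is the first log line above), splitting on the |~| delimiter yields 7 fields, with the IP at index 2 and the cookieid at index 6:

val line = "2015-11-11T14:59:59|~|xxx|~|202.109.201.181|~|xxx|~|xxx|~|xxx|~|B5c96dca0003db546e7"
val fields = line.split("\\|~\\|", -1)  // -1 keeps trailing empty fields
println(fields.length)                  // 7
println(fields(2))                      // 202.109.201.181 (IP)
println(fields(6))                      // B5c96dca0003db546e7 (cookieid)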
Spark Streaming program code
The following SQL statement is used in the program to compute the statistics for one batch:
select date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss') as time,
count(1) as pv,
count(distinct ip) as ips,
count(distinct cookieid) as uv
from daplog
The complete Spark Streaming program:
package com.lxw.test

import scala.reflect.runtime.universe

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.kafka.KafkaUtils

/**
 * auth: lxw1234
 * http://lxw1234.com
 */
object DapLogStreaming {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("yarn-cluster").setAppName("DapLogStreaming")
    // One batch every 60 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(60))
    // Read data from Kafka; the topic is daplog and it contains two partitions
    val kafkaStream = KafkaUtils.createStream(
      ssc,
      "bj11-65:2181",                                  // ZooKeeper used by the Kafka cluster
      "group_spark_streaming",                         // group.id used by the consumer
      Map[String, Int]("daplog" -> 0, "daplog" -> 1),  // the log topic in Kafka and its partitions
      StorageLevel.MEMORY_AND_DISK_SER)
      .map(x => x._2.split("\\|~\\|", -1))             // the log is delimited by |~|

    kafkaStream.foreachRDD((rdd: RDD[Array[String]], time: Time) => {
      val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
      import sqlContext.implicits._
      // Use the case class DapLog to extract the corresponding fields from the log
      val logDataFrame = rdd.map(w => DapLog(w(0).substring(0, 10), w(2), w(6))).toDF()
      // Register as a temporary table
      logDataFrame.registerTempTable("daplog")
      // Query the number of PVs, IPs and UVs for this batch
      val logCountsDataFrame = sqlContext.sql("select date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss') as time,count(1) as pv,count(distinct ip) as ips,count(distinct cookieid) as uv from daplog")
      // Print the query result
      logCountsDataFrame.show()
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

case class DapLog(day: String, ip: String, cookieid: String)

object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
In this example the real-time statistics are only printed to standard output; in a real-world scenario the results would typically be persisted to a database.
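As an illustration, here is a minimal sketch of persisting each batch's result with plain JDBC; it would replace logCountsDataFrame.show() inside foreachRDD. The MySQL URL, credentials, and the log_stats table are assumptions, not part of the original program:

// Hypothetical sketch: write the batch result into an assumed MySQL table log_stats(time, pv, ips, uv)
logCountsDataFrame.foreachPartition { rows =>
  if (rows.nonEmpty) {
    // Connection details below are placeholders, not from the original article
    val conn = java.sql.DriverManager.getConnection("jdbc:mysql://dbhost:3306/stats", "user", "password")
    val stmt = conn.prepareStatement("insert into log_stats(time, pv, ips, uv) values (?, ?, ?, ?)")
    rows.foreach { row =>
      stmt.setString(1, row.getString(0)) // time
      stmt.setLong(2, row.getLong(1))     // pv
      stmt.setLong(3, row.getLong(2))     // ips
      stmt.setLong(4, row.getLong(3))     // uv
      stmt.executeUpdate()
    }
    stmt.close()
    conn.close()
  }
}

Opening one connection per partition (rather than per row) keeps the connection count manageable; the MySQL JDBC driver jar would also need to be added to --jars when submitting.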
Package the program into daplogstreaming.jar and upload it to the gateway machine.
Running the Spark Streaming program
Go into $SPARK_HOME/bin and execute the following command to submit the Spark Streaming program to YARN:
./spark-submit --class com.lxw.test.DapLogStreaming \
--master yarn-cluster \
--executor-memory 2G \
--num-executors 6 \
--jars /home/liuxiaowen/kafka-clients-0.8.2.1.jar,/home/liuxiaowen/metrics-core-2.2.0.jar,/home/liuxiaowen/zkclient-0.3.jar,/home/liuxiaowen/spark-streaming-kafka_2.10-1.5.0.jar,/home/liuxiaowen/kafka_2.10-0.8.2.1.jar \
/home/liuxiaowen/daplogstreaming.jar
Note: Spark Streaming and the Kafka plugin depend on the corresponding jar packages at runtime, which are passed via --jars above.
View the run results
Open the YARN ResourceManager web UI, find the application corresponding to the program, and click the ApplicationMaster link to enter the Spark UI:
A job is generated for each batch (every 60 seconds).
Click the "Streaming" tab to go to the streaming monitoring page:
The bottom of the page shows the batches currently being processed and the batches that have already completed, including the number of events in each batch.
Finally, and most importantly, click the ApplicationMaster's logs link to view the stdout output:
The statistics are printed with the fields defined in the SQL, one batch every 60 seconds.
Precautions
Since kafka_2.10-0.8.2.1 is built on Scala 2.10, Spark, Spark's Kafka plugin, and the Spark Streaming application all need to use Scala 2.10. If Scala 2.11 is used instead, the run will fail with errors caused by the Scala version mismatch, such as:
15/11/11 15:36 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
    at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:59)
    at com.lxw.test.DapLogStreaming$.main(DapLogStreaming.scala:23)
    at com.lxw.test.DapLogStreaming.main(DapLogStreaming.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.sc
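To keep the whole build on Scala 2.10, the project can pin its Scala version and dependencies accordingly. A minimal build.sbt sketch is shown below; the article does not specify a build tool, so the sbt layout itself is an assumption (only the versions come from the environment listed above):

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-sql"             % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.5.0"   // pulls in the Kafka 0.8 consumer
)

The %% operator resolves to the _2.10 artifacts, so everything stays consistent with kafka_2.10-0.8.2.1 and spark-streaming-kafka_2.10-1.5.0.jar.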
Spark Streaming with Spark SQL example