Spark Streaming combined with Spark SQL: an example


Using Spark SQL together with Spark Streaming

Flume + Kafka + Spark Streaming has evolved into a fairly mature architecture for real-time log collection and computation. With Kafka in the middle, data can flow to HDFS for offline analysis while multiple consumers, including Spark Streaming, consume the same data in real time. However, when a Spark Streaming program has to compute complex business logic, the Scala code can quickly become hard to write and understand. If SQL could be used inside Spark Streaming for the statistical analysis, wouldn't that be much simpler?

This article describes how to combine Spark SQL with Spark Streaming, using SQL to compute statistics over real-time log data. The Spark Streaming program runs on YARN in yarn-cluster mode; no standalone Spark cluster is deployed.

Environment deployment

hadoop-2.3.0-cdh5.0.0 (YARN)

spark-1.5.0-bin-hadoop2.3

kafka_2.10-0.8.2.1

In addition, the following jar is required for Spark Streaming to read data from Kafka:

spark-streaming-kafka_2.10-1.5.0.jar

This article does not cover deploying these components; please refer to the related reading at the end of the article.

Real-time statistics requirements

In 60-second intervals, count the number of PVs, the number of distinct IPs, and the UV (number of distinct cookies) for each batch.

The final results include:

Point in time : PV : IPs : UV

Original log format
2015-11-11T14:59:59|~|xxx|~|202.109.201.181|~|xxx|~|xxx|~|xxx|~|B5c96dca0003db546e7
2015-11-11T14:59:59|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|b1611d0e00003857808
2015-11-11T14:59:59|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|1555bd0100016f2e76f
2015-11-11T15:00:00|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|c0ea13670e0b942e70e
2015-11-11T15:00:00|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|C0ea13670e0b942e70e
2015-11-11T15:00:01|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|4E3512790001039FDB9

Each log line contains 7 fields delimited by |~|; the 3rd field is the IP and the 7th field is the cookie ID. Assume the original logs have already been collected by Flume and written to Kafka.
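To make the field positions concrete, here is a minimal, self-contained sketch of the parsing step that the streaming job below performs on every Kafka message. The DapLog case class, the delimiter handling and the date substring mirror the program further down; only the object name ParseExample and the use of a single hard-coded sample line are added for illustration.

case class DapLog(day: String, ip: String, cookieid: String)

object ParseExample {
  def main(args: Array[String]): Unit = {
    // Sample line taken from the log excerpt above
    val line = "2015-11-11T14:59:59|~|xxx|~|202.109.201.181|~|xxx|~|xxx|~|xxx|~|B5c96dca0003db546e7"
    // The delimiter |~| must be regex-escaped; -1 keeps trailing empty fields
    val fields = line.split("\\|~\\|", -1)
    // Field 0 is the timestamp (keep only the date part), field 2 is the IP, field 6 is the cookie ID
    val log = DapLog(fields(0).substring(0, 10), fields(2), fields(6))
    println(log)  // DapLog(2015-11-11,202.109.201.181,B5c96dca0003db546e7)
  }
}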

Spark Streaming program code

The following SQL statement is used in the program to compute the statistics for one batch:

SELECT date_format(current_timestamp(), 'yyyy-MM-dd HH:mm:ss') AS time,
       count(1) AS pv,
       count(DISTINCT ip) AS ips,
       count(DISTINCT cookieid) AS uv
FROM daplog

The complete Spark Streaming program code:

package com.lxw.test

import scala.reflect.runtime.universe

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.kafka.KafkaUtils

/**
 * auth: lxw1234
 * http://lxw1234.com
 */
object DapLogStreaming {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("yarn-cluster").setAppName("DapLogStreaming")
    // One batch every 60 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(60))
    // Read data from Kafka; the topic is daplog and it has two partitions
    val kafkaStream = KafkaUtils.createStream(
        ssc,
        "bj11-65:2181",                  // ZooKeeper quorum used by the Kafka cluster
        "group_spark_streaming",         // group.id used by the consumer
        Map[String, Int]("daplog" -> 2), // topic and its number of partitions (consumer threads)
        StorageLevel.MEMORY_AND_DISK_SER)
      .map(x => x._2.split("\\|~\\|", -1)) // the log is delimited by |~|

    kafkaStream.foreachRDD((rdd: RDD[Array[String]], time: Time) => {
      val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
      import sqlContext.implicits._
      // Build the DapLog case class from the corresponding fields of each log line
      val logDataFrame = rdd.map(w => DapLog(w(0).substring(0, 10), w(2), w(6))).toDF()
      // Register as a temporary table named daplog
      logDataFrame.registerTempTable("daplog")
      // Query the number of PVs, IPs and UVs for this batch
      val logCountsDataFrame = sqlContext.sql(
        "SELECT date_format(current_timestamp(), 'yyyy-MM-dd HH:mm:ss') AS time, " +
        "count(1) AS pv, count(DISTINCT ip) AS ips, count(DISTINCT cookieid) AS uv FROM daplog")
      // Print the query result
      logCountsDataFrame.show()
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

case class DapLog(day: String, ip: String, cookieid: String)

object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}

In this example, the real-time statistics are only printed to standard output; in real-world scenarios the results would typically be persisted to a database.
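The original program stops at show(). As a rough sketch only (not part of the original code), the per-batch result could be written out over plain JDBC. The table log_stats, the connection URL, and the credentials below are hypothetical placeholders; the helper would live inside the DapLogStreaming object and be called right after logCountsDataFrame is computed inside foreachRDD, and the JDBC driver jar would need to be on the classpath.

import java.sql.DriverManager
import org.apache.spark.sql.DataFrame

// Hypothetical helper: writes the single result row of a batch to a table
// log_stats(time VARCHAR, pv BIGINT, ips BIGINT, uv BIGINT).
// The JDBC URL, user and password are placeholders.
def saveBatchResult(df: DataFrame): Unit = {
  df.collect().foreach { row =>
    val conn = DriverManager.getConnection("jdbc:mysql://dbhost:3306/logdb", "user", "password")
    try {
      val ps = conn.prepareStatement("INSERT INTO log_stats (time, pv, ips, uv) VALUES (?, ?, ?, ?)")
      ps.setString(1, row.getString(0))  // time
      ps.setLong(2, row.getLong(1))      // pv
      ps.setLong(3, row.getLong(2))      // ips
      ps.setLong(4, row.getLong(3))      // uv
      ps.executeUpdate()
      ps.close()
    } finally {
      conn.close()
    }
  }
}

// Usage inside foreachRDD, after logCountsDataFrame.show():
// saveBatchResult(logCountsDataFrame)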

Package the program into a daplogstreaming.jar and upload it to the Gateway machine.

Running the Spark Streaming program

Go to $SPARK_HOME/bin and execute the following command to submit the Spark Streaming program to YARN:

./spark-submit \
  --class com.lxw.test.DapLogStreaming \
  --master yarn-cluster \
  --executor-memory 2G \
  --num-executors 6 \
  --jars /home/liuxiaowen/kafka-clients-0.8.2.1.jar,/home/liuxiaowen/metrics-core-2.2.0.jar,/home/liuxiaowen/zkclient-0.3.jar,/home/liuxiaowen/spark-streaming-kafka_2.10-1.5.0.jar,/home/liuxiaowen/kafka_2.10-0.8.2.1.jar \
  /home/liuxiaowen/daplogstreaming.jar

Note: at runtime, the Spark Streaming Kafka integration and the Kafka client depend on the jars passed via --jars above; they must be available on the classpath.
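For reference, if the job is built with sbt, the dependencies behind these jars might be declared roughly as follows. This is a sketch based on the versions listed in the environment section; the exact Scala patch version (2.10.4 here) is an assumption, and the Spark modules are marked provided because the cluster supplies them.

// build.sbt (sketch)
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-sql"             % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.5.0",
  "org.apache.kafka" %% "kafka"                 % "0.8.2.1"
)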

Viewing the results

Open the YARN ResourceManager web UI, find the application corresponding to the program, and click its ApplicationMaster link to enter the Spark UI:

For each batch (60 seconds), a job is generated.

Click tab "Streaming" to go to the streaming monitoring page:

At the bottom, it displays the batches currently being processed and the batches that have completed, including the number of events in each batch.

Finally, and most importantly, click the ApplicationMaster logs link to view the stdout output:

The statistics are printed according to the fields defined in the SQL, one batch every 60 seconds.

Precautions

Since kafka_2.10-0.8.2.1 is built against Scala 2.10, Spark, Spark's Kafka integration, and the Spark Streaming application all need to use Scala 2.10 as well. If Scala 2.11 is used instead, the job fails at runtime with errors caused by the Scala version mismatch, such as:

15/11/11 15:36 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
    at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:59)
    at com.lxw.test.DapLogStreaming$.main(DapLogStreaming.scala:23)
    at com.lxw.test.DapLogStreaming.main(DapLogStreaming.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.sc

