Using Spark SQL together with Spark Streaming
Flume + Kafka + Spark Streaming has become a fairly mature architecture for real-time log collection and computation. With Kafka in the middle, the data can flow into HDFS for offline analysis while being consumed in real time by multiple consumers, including Spark Streaming. However, when the statistics in a Spark Streaming program involve complex business logic, it is hard to keep the Scala code easy to understand. If SQL could be used inside Spark Streaming to do the statistical analysis, wouldn't that be much simpler?
This article describes how to combine Spark SQL with Spark Streaming and use SQL to compute statistics over real-time log data. The Spark Streaming program runs on YARN in yarn-cluster mode; no separate Spark cluster is deployed.
Environment deployment
hadoop-2.3.0-cdh5.0.0 (YARN)
spark-1.5.0-bin-hadoop2.3
kafka_2.10-0.8.2.1
In addition, the jar that Spark Streaming uses to read data from Kafka:
spark-streaming-kafka_2.10-1.5.0.jar
Deployment of this environment is not covered in this article; please see the related reading at the end of the article.
Real-time statistics requirements
At 60-second intervals, count the PVs, the number of distinct IPs, and the UVs within each 60-second window.
The final result includes:
Time point: PV : IPs : UV
Original log format
2015-11-11T14:59:59|~|xxx|~|202.109.201.181|~|xxx|~|xxx|~|xxx|~|B5c96dca0003db546e7
2015-11-11T14:59:59|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|b1611d0e00003857808
2015-11-11T14:59:59|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|1555bd0100016f2e76f
2015-11-11T15:00:00|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|c0ea13670e0b942e70e
2015-11-11T15:00:00|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|C0ea13670e0b942e70e
2015-11-11T15:00:01|~|xxx|~|125.119.144.252|~|xxx|~|xxx|~|xxx|~|4E3512790001039FDB9
Each log line contains 7 fields separated by the delimiter |~|; the 3rd field is the IP and the 7th field is the cookieid. Assume the original logs have already been collected by Flume and streamed into Kafka.
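As a quick illustration (the sample line is the first log line above), splitting on the |~| delimiter yields 7 fields, with the IP at index 2 and the cookieid at index 6:

val line = "2015-11-11T14:59:59|~|xxx|~|202.109.201.181|~|xxx|~|xxx|~|xxx|~|B5c96dca0003db546e7"
val fields = line.split("\\|~\\|", -1)  // -1 keeps trailing empty fields
println(fields.length)                  // 7
println(fields(2))                      // 202.109.201.181 (IP)
println(fields(6))                      // B5c96dca0003db546e7 (cookieid)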
Spark Streaming program code
The following SQL statement is used in the program to compute the statistics for one batch:
select date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss') as time,
count(1) as pv,
count(distinct ip) as ips,
count(distinct cookieid) as uv
from daplog
The complete Spark Streaming program:
package com.lxw.test

import scala.reflect.runtime.universe

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.kafka.KafkaUtils

/**
 * auth: lxw1234
 * http://lxw1234.com
 */
object DapLogStreaming {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("yarn-cluster").setAppName("DapLogStreaming")
    // One batch every 60 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(60))
    // Read data from Kafka; the topic is daplog and it contains two partitions
    val kafkaStream = KafkaUtils.createStream(
      ssc,
      "bj11-65:2181",                                  // ZooKeeper used by the Kafka cluster
      "group_spark_streaming",                         // group.id used by the consumer
      Map[String, Int]("daplog" -> 0, "daplog" -> 1),  // the log topic in Kafka and its partitions
      StorageLevel.MEMORY_AND_DISK_SER)
      .map(x => x._2.split("\\|~\\|", -1))             // the log is delimited by |~|

    kafkaStream.foreachRDD((rdd: RDD[Array[String]], time: Time) => {
      val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
      import sqlContext.implicits._
      // Use the case class DapLog to extract the corresponding fields from the log
      val logDataFrame = rdd.map(w => DapLog(w(0).substring(0, 10), w(2), w(6))).toDF()
      // Register as a temporary table
      logDataFrame.registerTempTable("daplog")
      // Query the number of PVs, IPs and UVs for this batch
      val logCountsDataFrame = sqlContext.sql("select date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss') as time,count(1) as pv,count(distinct ip) as ips,count(distinct cookieid) as uv from daplog")
      // Print the query result
      logCountsDataFrame.show()
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

case class DapLog(day: String, ip: String, cookieid: String)

object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
In this example the real-time statistics are only printed to standard output; in a real-world scenario the results would typically be persisted to a database.
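As an illustration, here is a minimal sketch of persisting each batch's result with plain JDBC; it would replace logCountsDataFrame.show() inside foreachRDD. The MySQL URL, credentials, and the log_stats table are assumptions, not part of the original program:

// Hypothetical sketch: write the batch result into an assumed MySQL table log_stats(time, pv, ips, uv)
logCountsDataFrame.foreachPartition { rows =>
  if (rows.nonEmpty) {
    // Connection details below are placeholders, not from the original article
    val conn = java.sql.DriverManager.getConnection("jdbc:mysql://dbhost:3306/stats", "user", "password")
    val stmt = conn.prepareStatement("insert into log_stats(time, pv, ips, uv) values (?, ?, ?, ?)")
    rows.foreach { row =>
      stmt.setString(1, row.getString(0)) // time
      stmt.setLong(2, row.getLong(1))     // pv
      stmt.setLong(3, row.getLong(2))     // ips
      stmt.setLong(4, row.getLong(3))     // uv
      stmt.executeUpdate()
    }
    stmt.close()
    conn.close()
  }
}

Opening one connection per partition (rather than per row) keeps the connection count manageable; the MySQL JDBC driver jar would also need to be added to --jars when submitting.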
Package the program into daplogstreaming.jar and upload it to the gateway machine.
Running the Spark Streaming program
Go into $SPARK_HOME/bin and execute the following command to submit the Spark Streaming program to YARN:
./spark-submit --class com.lxw.test.DapLogStreaming \
--master yarn-cluster \
--executor-memory 2G \
--num-executors 6 \
--jars /home/liuxiaowen/kafka-clients-0.8.2.1.jar,/home/liuxiaowen/metrics-core-2.2.0.jar,/home/liuxiaowen/zkclient-0.3.jar,/home/liuxiaowen/spark-streaming-kafka_2.10-1.5.0.jar,/home/liuxiaowen/kafka_2.10-0.8.2.1.jar \
/home/liuxiaowen/daplogstreaming.jar
Note: Spark Streaming and the Kafka plugin depend on the corresponding jar packages at runtime, which are passed via --jars above.
View the run results
Open the YARN ResourceManager web UI, find the application corresponding to the program, and click the ApplicationMaster link to enter the Spark UI:
A job is generated for each batch (every 60 seconds).
Click the "Streaming" tab to go to the streaming monitoring page:
The bottom of the page shows the batches currently being processed and the batches that have already completed, including the number of events in each batch.
Finally, and most importantly, click the ApplicationMaster's logs link to view the stdout output:
The statistics are printed with the fields defined in the SQL, one batch every 60 seconds.
Precautions
Since kafka_2.10-0.8.2.1 is built on Scala 2.10, Spark, Spark's Kafka plugin, and the Spark Streaming application all need to use Scala 2.10. If Scala 2.11 is used instead, the run will fail with errors caused by the Scala version mismatch, such as:
15/11/11 15:36 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
    at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:59)
    at com.lxw.test.DapLogStreaming$.main(DapLogStreaming.scala:23)
    at com.lxw.test.DapLogStreaming.main(DapLogStreaming.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.sc
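To keep the whole build on Scala 2.10, the project can pin its Scala version and dependencies accordingly. A minimal build.sbt sketch is shown below; the article does not specify a build tool, so the sbt layout itself is an assumption (only the versions come from the environment listed above):

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-sql"             % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.5.0"   // pulls in the Kafka 0.8 consumer
)

The %% operator resolves to the _2.10 artifacts, so everything stays consistent with kafka_2.10-0.8.2.1 and spark-streaming-kafka_2.10-1.5.0.jar.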
Spark Streaming with Spark SQL example