A thorough understanding of Spark Streaming through a case study

Source: Internet
Author: User

Contents of this issue:

1 An alternative online experiment with Spark Streaming

2 Instantly understand the essence of Spark Streaming

In the era of stream processing, Spark Streaming has strong appeal and good development prospects. Combined with Spark's ecosystem, Spark Streaming can easily call other powerful frameworks such as SQL and MLlib, which gives it a prominent position. Choosing Spark Streaming as the starting point for building a custom version of Spark is also the general trend.

Tip: enlarging the batch interval is equivalent to watching a slow-motion replay of the streaming job, which makes it easier to understand each of its stages. Here the blacklist-filtering program is used as the example for the test.

Case Source

package com.dt.spark.sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds

/**
 * A Spark online blacklist-filtering program, developed in Scala, for running on a cluster.
 *
 * Background: in an ad-click billing system, we filter out clicks from blacklisted users online,
 * so as to protect the advertisers' interests and only bill valid ad clicks.
 * The same approach applies to anti-cheating systems for ratings (or traffic), where invalid votes,
 * ratings or traffic are filtered out.
 * Implementation: use the transform API to program directly against RDDs and perform a join.
 */
object OnlineBlackListFilter {
  def main(args: Array[String]) {
    /**
     * Step 1: create the SparkConf object and set the runtime configuration of the Spark program.
     */
    val conf = new SparkConf() // create the SparkConf object
    conf.setAppName("OnlineBlackListFilter") // set the application name, visible in the monitoring UI
    conf.setMaster("spark://Master:7077") // the program runs on the Spark cluster

    val ssc = new StreamingContext(conf, Seconds(30)) // the batch interval can be enlarged here, e.g. to 300 seconds, to better observe the internals of Streaming

    /**
     * Prepare the blacklist data. In practice the blacklist is usually dynamic, e.g. stored in Redis or a database,
     * and generating it often involves complex business logic that varies from case to case;
     * what matters is that Spark Streaming can access the complete blacklist every time it processes a batch.
     */
    val blackList = Array(("hadoop", true), ("mahout", true))
    val blackListRDD = ssc.sparkContext.parallelize(blackList, 8)

    val adsClickStream = ssc.socketTextStream("Master", 9999)

    /**
     * Each simulated ad-click record has the format: time name.
     * The map operation below turns it into the format: (name, "time name").
     */
    val adsClickStreamFormatted = adsClickStream.map { ads => (ads.split(" ")(1), ads) }
    adsClickStreamFormatted.transform(userClickRDD => {
      // leftOuterJoin keeps all the user ad-click records on the left side and tells us whether each clicked name is in the blacklist
      val joinedBlackListRDD = userClickRDD.leftOuterJoin(blackListRDD)

      /**
       * The input element of the filter is a Tuple: (name, ((time name), Option[Boolean])).
       * The first element is the name that may be blacklisted; the second element of the second element
       * indicates whether a matching value was found during the leftOuterJoin.
       * If it exists, the current ad click comes from the blacklist and must be filtered out;
       * otherwise it is a valid click.
       */
      val validClicked = joinedBlackListRDD.filter(joinedItem => {
        if (joinedItem._2._2.getOrElse(false)) {
          false
        } else {
          true
        }
      })

      validClicked.map(validClick => { validClick._2._1 })
    }).print()

    /**
     * The valid records computed here would normally be written to Kafka, and the downstream billing system
     * would pull the valid data from Kafka for billing.
     */
    ssc.start()
    ssc.awaitTermination()
  }
}
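The comment above mentions that the valid clicks would normally be written to Kafka for the downstream billing system. Below is a minimal sketch of what that could look like, assuming the result of the transform is assigned to a DStream (e.g. val validClickDStream = adsClickStreamFormatted.transform(...)) instead of being printed, and using the standard Kafka producer API; the broker address Master:9092 and the topic name validAdClicks are assumptions for illustration only.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Instead of .print(), push the valid clicks of every batch to Kafka.
validClickDStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Producers are not serializable, so build one per partition on the executors.
    val props = new Properties()
    props.put("bootstrap.servers", "Master:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    partition.foreach { validClick =>
      producer.send(new ProducerRecord[String, String]("validAdClicks", validClick)) // assumed topic name
    }
    producer.close()
  }
}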

Run analysis

Start the HDFS and Spark clusters and turn on the HistoryServer, then package the above code into a jar and put it into the /root/Documents/SparkApps/ directory; for convenience it is still named WordCount here. Edit the script file wordcount.sh as follows:

  

/usr/local/spark-1.6.1-bin-hadoop2.4/bin/spark-submit --class com.dt.spark.sparkstreaming.OnlineBlackListFilter --master spark://Master:7077 /root/Documents/SparkApps/WordCount.jar

Run the script file

Note that you need to start nc -lk 9999 on Master first, or the program will fail with an error when it tries to read from the socket.
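Once nc is listening, you can type test records in the time name format the program expects, one per line, for example (the timestamps are arbitrary):

134343 hadoop
134344 spark
134345 mahout
134346 flink

With the blacklist defined above, the hadoop and mahout clicks are filtered out, and only the spark and flink records are printed for that batch.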

After the run, go to the Spark UI to look at the jobs.

Click through to the DAG visualization.

As the DAG diagram shows, this is not simply the logic of our application: Spark Streaming itself behaves like an application in its own right, automatically starting some jobs when it launches and executing several of them. Looking at the details, you will find a receiver receiving the data, with one task running for about 1.5 minutes, while the HistoryServer shows the whole application ran for about 2 minutes. That 1.5-minute task is the receiver looping continuously to receive data. From this we can see that Spark Streaming starts the receiver through a job, and the receiver accepting data is no different from an ordinary job's task. The data is received on a single machine, but it can be processed on multiple machines, maximizing the use of cluster resources. Although many jobs appear during the run, only one of them actually executes our business logic.

Spark Streaming itself turns the data flowing in into jobs generated over time, and the streaming engine triggers those jobs to execute on the cluster. In essence, it is batch processing with a time dimension added: every batch interval, a certain amount of data flows in, the DStream acts as a template that continually generates RDDs from it, and a job is triggered to process each batch.
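A small sketch of that idea, using the adsClickStreamFormatted DStream from the program above: foreachRDD exposes the RDD that the DStream template generates for each batch interval, together with the batch time (this is for illustration only and is not part of the original program).

// Each batch interval the DStream "template" materializes a new RDD; foreachRDD lets us observe it.
adsClickStreamFormatted.foreachRDD { (rdd, time) =>
  println(s"Batch at $time generated an RDD with ${rdd.count()} records")
}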
