Case Introduction and Programming Implementation
1. Case Introduction
In this case, we assume that a forum needs to compute the popularity ("heat") of its web pages in near real time, based on users' clicks on each page, their dwell time, and whether they like the page, and then dynamically update the site's "today's hotspots" module with links to the hottest topics.
2. Case study
For a user visiting the forum, we first need to abstract their behavioral data so that we can explain how the popularity value of a web page topic is calculated.
First, we use a vector to describe the user's behavior on a web page: the number of clicks, the dwell time, and whether the page was liked. It can be expressed as follows:
(page001.html, 1, 0.5, 1)
The first item of the vector is the ID of the web page, the second is the number of clicks from entering the site until leaving the page, the third is the dwell time in minutes, and the fourth indicates whether the page was liked: 1 for a like, -1 for a dislike, and 0 for neutral.
Second, we assign each behavior a weight according to how much it contributes to the topic's popularity. In this article, we assume the weight of clicks is 0.8, because a user may revisit a topic simply because there is no better topic to read. The weight of dwell time is also 0.8, because a user may open several tab pages at the same time while really focusing on only one of them. The weight of a like is 1, because a like generally means the user is genuinely interested in the topic of the page.
Finally, we define the contribution of a single behavior record to a page's popularity with the following formula, where x is the click count, y is the dwell time, and z is the like value:
f(x, y, z) = 0.8x + 0.8y + z
So for the behavioral data above, (page001.html, 1, 0.5, 1), the formula gives:
H(page001) = f(x, y, z) = 0.8x + 0.8y + z = 0.8*1 + 0.8*0.5 + 1*1 = 2.2
Note that throughout this process we ignore the user's identity; that is, we do not care who the user is, only how much their behavior contributes to the page's popularity.
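To make the arithmetic concrete, here is a minimal Scala sketch of the formula above. The object and function names (PopularityFormulaDemo, popularity) are ours for illustration and are not part of the case's code; the weights and the sample record (page001.html, 1, 0.5, 1) are the ones used in this section.

import scala.Predef._

object PopularityFormulaDemo {
  // f(x, y, z) = 0.8x + 0.8y + z
  // x = click count, y = dwell time in minutes, z = like value (1, -1, or 0)
  def popularity(clicks: Double, stayTimeMinutes: Double, likeOrNot: Double): Double =
    0.8 * clicks + 0.8 * stayTimeMinutes + likeOrNot

  def main(args: Array[String]): Unit = {
    // H(page001) = 0.8*1 + 0.8*0.5 + 1*1 = 2.2
    val h = popularity(1, 0.5, 1)
    println(f"H(page001) = $h%.2f") // prints: H(page001) = 2.20
  }
}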
3. Producing Behavior Data Messages
In this case we use a program to simulate user behavior. Every 5 seconds the program randomly pushes 0 to 50 behavior data messages to the user-behavior-topic topic. This program obviously plays the role of the message producer, a role that in real applications is typically filled by some upstream system. To simplify message processing, we define the message format as follows:
page ID|click count|stay time (minutes)|like or not
We also assume that the site has only 100 pages. The Scala implementation of this producer class is shown below.
UserBehaviorMsgProducer class source code
import scala.util.Random
import java.util.Properties
import kafka.producer.KeyedMessage
import kafka.producer.ProducerConfig
import kafka.producer.Producer

class UserBehaviorMsgProducer(brokers: String, topic: String) extends Runnable {
  private val brokerList = brokers
  private val targetTopic = topic
  private val props = new Properties()
  props.put("metadata.broker.list", this.brokerList)
  props.put("serializer.class", "kafka.serializer.StringEncoder")
  props.put("producer.type", "async")
  private val config = new ProducerConfig(this.props)
  private val producer = new Producer[String, String](this.config)

  private val PAGE_NUM = 100
  private val MAX_MSG_NUM = 3
  private val MAX_CLICK_TIME = 5
  // the MAX_STAY_TIME value was garbled in the source text; 10 minutes is assumed here
  private val MAX_STAY_TIME = 10
  // like: 1; dislike: -1; no feeling: 0
  private val LIKE_OR_NOT = Array[Int](1, 0, -1)

  def run(): Unit = {
    val rand = new Random()
    while (true) {
      // how many user behavior messages will be produced in this batch
      val msgNum = rand.nextInt(MAX_MSG_NUM) + 1
      try {
        // generate messages with a format like page1|2|7.123|1
        for (i <- 0 to msgNum) {
          var msg = new StringBuilder()
          msg.append("page" + (rand.nextInt(PAGE_NUM) + 1))
          msg.append("|")
          msg.append(rand.nextInt(MAX_CLICK_TIME) + 1)
          msg.append("|")
          msg.append(rand.nextInt(MAX_CLICK_TIME) + rand.nextFloat())
          msg.append("|")
          msg.append(LIKE_OR_NOT(rand.nextInt(3)))
          println(msg.toString())
          // send the generated message to the broker
          sendMessage(msg.toString())
        }
        println("%d user behavior messages produced.".format(msgNum + 1))
      } catch {
        case e: Exception => println(e)
      }
      try {
        // sleep for 5 seconds after sending a micro batch of messages
        Thread.sleep(5000)
      } catch {
        case e: Exception => println(e)
      }
    }
  }

  def sendMessage(message: String) = {
    try {
      val data = new KeyedMessage[String, String](this.targetTopic, message)
      producer.send(data)
    } catch {
      case e: Exception => println(e)
    }
  }
}

object UserBehaviorMsgProducerClient {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("Usage: UserBehaviorMsgProducerClient 192.168.1.1:9092 user-behavior-topic")
      System.exit(1)
    }
    // start the message producer thread
    new Thread(new UserBehaviorMsgProducer(args(0), args(1))).start()
  }
}
4. Writing the Spark Streaming Program to Consume Messages
After clarifying the problem to solve, we can start coding. For this case, the basic implementation steps are as follows:
Build a Spark StreamingContext instance and enable checkpointing, because we need the updateStateByKey primitive to update the web page popularity values.
Use the KafkaUtils.createStream method provided by Spark to consume the message topic; this method returns a ReceiverInputDStream instance.
For each message, calculate the popularity value of the corresponding web page topic using the formula above.
Define an anonymous function that adds the previously accumulated popularity value of a page to the newly calculated value, yielding the latest popularity value.
Call the updateStateByKey primitive, passing in the anonymous function defined above, to update the page popularity values (see the sketch after this list).
Finally, sort the latest results and print the 10 pages with the highest popularity values.
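Before looking at the complete listing, here is a minimal sketch of just the state-update step. It assumes popularityData is a DStream[(String, Double)] of (pageId, popularityValue) pairs and that checkpointing has already been enabled on the StreamingContext; the object and method names are ours, and the simple one-argument form of updateStateByKey is used here only for illustration, whereas the full program below uses the variant that also takes a partitioner and an initial RDD.

import org.apache.spark.streaming.dstream.DStream

object PopularityStateUpdateSketch {
  // For each page, add the popularity values from the current batch to the
  // previously accumulated value (0.0 for a page seen for the first time).
  def withRunningTotals(popularityData: DStream[(String, Double)]): DStream[(String, Double)] = {
    val update = (newValues: Seq[Double], state: Option[Double]) =>
      Some(newValues.sum + state.getOrElse(0.0))
    popularityData.updateStateByKey[Double](update)
  }
}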
The source code is as follows.
WebPagePopularityValueCalculator class source code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.Duration

object WebPagePopularityValueCalculator {
  private val checkpointDir = "popularity-data-checkpoint"
  private val msgConsumerGroup = "user-behavior-topic-message-consumer-group"

  def main(args: Array[String]) {
    if (args.length < 2) {
      println("Usage: WebPagePopularityValueCalculator zkserver1:2181,zkserver2:2181,zkserver3:2181 consumeMsgDataTimeInterval(secs)")
      System.exit(1)
    }
    val Array(zkServers, processingInterval) = args
    val conf = new SparkConf().setAppName("Web Page Popularity Value Calculator")
    val ssc = new StreamingContext(conf, Seconds(processingInterval.toInt))
    // using updateStateByKey requires enabling checkpointing
    ssc.checkpoint(checkpointDir)
    val kafkaStream = KafkaUtils.createStream(
      // Spark streaming context
      ssc,
      // zookeeper quorum, e.g. zkserver1:2181,zkserver2:2181,...
      zkServers,
      // kafka message consumer group ID
      msgConsumerGroup,
      // Map of (topic_name -> numPartitions) to consume;
      // each partition is consumed in its own thread
      Map("user-behavior-topic" -> 3))
    val msgDataRDD = kafkaStream.map(_._2)
    // for debug use only
    // println("Coming data in this interval...")
    // msgDataRDD.print()
    // e.g. page37|5|1.5119122|-1
    val popularityData = msgDataRDD.map { msgLine =>
      val dataArr: Array[String] = msgLine.split("\\|")
      val pageID = dataArr(0)
      // calculate the popularity value
      val popValue: Double = dataArr(1).toFloat * 0.8 + dataArr(2).toFloat * 0.8 + dataArr(3).toFloat * 1
      (pageID, popValue)
    }
    // sum the previous popularity value and the current value
    val updatePopularityValue = (iterator: Iterator[(String, Seq[Double], Option[Double])]) => {
      iterator.flatMap(t => {
        val newValue: Double = t._2.sum
        val stateValue: Double = t._3.getOrElse(0)
        Some(newValue + stateValue)
      }.map(sumedValue => (t._1, sumedValue)))
    }
    val initialRDD = ssc.sparkContext.parallelize(List(("page1", 0.00)))
    val stateDStream = popularityData.updateStateByKey[Double](updatePopularityValue,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)
    // set the checkpoint interval to avoid too frequent data checkpointing,
    // which may significantly reduce operation throughput
    stateDStream.checkpoint(Duration(8 * processingInterval.toInt * 1000))
    // after calculation, sort the result and only show the top 10 hot pages
    stateDStream.foreachRDD { rdd =>
      val sortedData = rdd.map { case (k, v) => (v, k) }.sortByKey(false)
      val topKData = sortedData.take(10).map { case (v, k) => (k, v) }
      topKData.foreach(x => println(x))
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
WebPagePopularityValueCalculator launch command
bin/spark-submit \
--jars $SPARK_HOME/lib/spark-streaming-kafka_2.10-1.3.1.jar,\
$SPARK_HOME/lib/spark-streaming-kafka-assembly_2.10-1.3.1.jar,\
$SPARK_HOME/lib/kafka_2.10-0.8.2.1.jar,\
$SPARK_HOME/lib/kafka-clients-0.8.2.1.jar \
--class com.ibm.spark.exercise.streaming.WebPagePopularityValueCalculator \
--master spark://<spark_master_ip>:7077 \
--num-executors 4 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 2 \
/home/fams/sparkexercise.jar \
192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181 2
The last two arguments of the command are the ZooKeeper server list and the message processing interval in seconds, matching the usage string in the code above. Because the program uses, and indirectly invokes, the Kafka API, and also calls the Kafka integration API provided by Spark Streaming (KafkaUtils.createStream), we need to upload the jar packages referenced in the launch command to every machine in the Spark cluster in advance (in this case we upload them to the lib directory of the Spark installation, i.e. $SPARK_HOME/lib) and reference them in the launch command.