Lesson 1: A First Strike at Spark Streaming: Decrypting Spark Streaming through an Alternative Experiment and Analyzing the Essence of Spark Streaming


Contents of this issue:

    1. Spark Streaming online alternative experiment

    2. Instantly understanding the essence of Spark Streaming


Spark Streaming is a sub-framework built on Spark Core, and if we can fully master one sub-framework, we will be better able to harness Spark. Spark Streaming and Spark SQL are the most popular sub-frameworks, but from a research point of view SQL optimization involves too many specialized problems and is not well suited for deep study. Spark Streaming, unlike the other frameworks, is more like an application built on Spark Core. If we gain a deep understanding of Spark Streaming, we can write very complex applications.

Spark Streaming's advantage is that it can be combined with Spark SQL, graph computation, machine learning, and other frameworks for even more powerful functionality. In this era, simple stream computation alone can no longer meet customers' needs. Spark Streaming is also the most problem-prone part of Spark, because it runs continuously and its internals are quite complex.


This experiment is based on the program code in the following blog post:

Lesson 94 of the IMF course: a Spark Streaming implementation of online blacklist filtering for an ad billing system

http://lqding.blog.51cto.com/9123978/1769290
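For orientation, the core of that program is sketched below. This is a minimal reconstruction based only on what is shown later in this article (the blacklist array, the socketTextStream on port 9999, the leftOuterJoin, and the print); the real code in the referenced blog may differ in its details.

package com.dt.spark.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineBlackListFilter {
  def main(args: Array[String]): Unit = {
    // The master URL is supplied via spark-submit --master
    val conf = new SparkConf().setAppName("OnlineBlackListFilter")
    // Batch interval; the experiment below raises this to 5 minutes for easier observation
    val ssc = new StreamingContext(conf, Seconds(30))

    // Blacklisted user names; in a real system these would come from an external store
    val blacklist = Array(("Hadoop", true), ("Mathou", true))
    val blacklistRDD = ssc.sparkContext.parallelize(blacklist)

    // Each input line looks like "<id> <name>", e.g. "134343 Hadoop"
    val adsClickStream = ssc.socketTextStream("spark-master", 9999)

    val validClicks = adsClickStream
      .map(line => (line.split(" ")(1), line))   // key each click by user name
      .transform { rdd =>
        rdd.leftOuterJoin(blacklistRDD)          // join each batch RDD with the blacklist RDD
          .filter { case (_, (_, blacklisted)) => !blacklisted.getOrElse(false) } // drop blacklisted users
          .map { case (_, (clickLine, _)) => clickLine }                          // keep the original log line
      }

    validClicks.print()

    ssc.start()
    ssc.awaitTermination()
  }
}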


In order to get a better view of how the jobs run, we start the history-server:

root@spark-master:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# ./start-history-server.sh

The history-server failed to start. The log reported the following:

Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:235)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.IllegalArgumentException: Log directory specified does not exist: file:/tmp/spark-events. Did you configure the correct one through spark.history.fs.logDirectory?

From the error message it is clear that the log directory does not exist, so we create the directory:

root@spark-master:/tmp# hdfs dfs -mkdir /historyserverforspark/

Configure spark-env.sh by adding an environment variable so that the history server's log directory points to the directory created above:

export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://spark-master:8020/historyserverforspark"

To configure spark-defaults.conf, add the following configuration items:

# Whether to record events generated by jobs (job, stage, etc.)
spark.eventLog.enabled true
# If job events are recorded, where the events are written
spark.eventLog.dir hdfs://spark-master:8020/historyserverforspark
# The HTTP listening port of the history server, accessed through http://hadoop.master:18080
spark.history.ui.port 18080
# Location of the Spark history logs
spark.history.fs.logDirectory hdfs://spark-master:8020/historyserverforspark


Start the history-server again, and the problem is solved.

The Web interface is as follows:

[Figure: Spark history server web UI]

In order to get a clearer view of each stage of the streaming run, we can set the batchInterval to a larger value, for example 5 minutes.
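In terms of the sketch above, that simply means constructing the StreamingContext with a 5-minute batch duration, for example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("OnlineBlackListFilter")
// A 5-minute batch interval leaves plenty of time to inspect each batch's jobs in the UI
val ssc = new StreamingContext(conf, Minutes(5))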


Upload the program to the Spark cluster.


Run the Spark program

root@spark-master:~# /usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class com.dt.spark.streaming.OnlineBlackListFilter --master spark://spark-master:7077 ./spark.jar


Open netcat and send some data:

root@spark-master:~# nc -lk 9999
134343 Hadoop
343434 spark
3432777 Java
0983743 Hbase
893434 Mathou


The program's output is:

16/05/01 14:00:01 INFO scheduler.DAGScheduler: Job 3 finished: print at OnlineBlackListFilter.scala:63, took 0.098316 s
-------------------------------------------
Time: 1462082400000 ms
-------------------------------------------
3432777 Java
343434 spark
0983743 Hbase


After the results are computed, the Spark Streaming program is terminated.


Next, we look at the Web UI to analyze the Spark Streaming execution process.

[Figure: history server application list]

The part marked in red is the record of the program we just ran. (On the first run I could not see it under Completed Applications; it appeared under Show Incomplete Applications even though the program had already exited.)


We click on it to see more information:

[Figure: job list for the application] We can see that this program launched 4 jobs while it was running.

First, look at Job 0 for more information:

[Figure: details for Job 0] This job clearly corresponds to generating the blacklistRDD data that we defined. The corresponding code is:

val blacklist = Array(("Hadoop", true), ("Mathou", true))
// Put the Array into an RDD
val blacklistRDD = ssc.sparkContext.parallelize(blacklist)

It also performs a reduceByKey operation (this step is not in our code; the Spark Streaming framework generates it itself).

There are two stages: Stage 0 and Stage 1.


Now let's look at job 1 for more information

[Figure: details for Job 1] Here we also see a makeRDD. This job is the receiver continuously receiving data from the data stream; when the elapsed time reaches the batchInterval, all of the received data becomes one RDD. It is also the longest-running job, taking 59 s.


Special note: as you can see here, the receiver also runs as an independent job. From this we can conclude that we can launch multiple jobs within one application and have those jobs cooperate with one another, which lays the groundwork for writing complex applications.
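As a small illustration of that point (this is not part of the experiment's code, just a sketch): any Spark application can run independent jobs concurrently by submitting actions from different threads, which is essentially what Spark Streaming does with the long-running receiver job and the per-batch processing jobs.

import org.apache.spark.{SparkConf, SparkContext}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

object ConcurrentJobsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ConcurrentJobsSketch").setMaster("local[4]"))

    // Two independent actions submitted from different threads run as two concurrent jobs
    val jobA = Future { sc.parallelize(1 to 1000000).sum() }
    val jobB = Future { sc.parallelize(1 to 1000).map(_ * 2).count() }

    println(s"jobA result: ${Await.result(jobA, Duration.Inf)}")
    println(s"jobB result: ${Await.result(jobB, Duration.Inf)}")
    sc.stop()
  }
}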

We click on "start at OnlineBlackListFilter.scala:64" above to see more information:

[Figure: task details for the receiver job] From this information we can see that only one executor is receiving data. Most importantly, the data locality shown in the red box is PROCESS_LOCAL, which means the receiver stores the data it receives in memory; as long as memory is sufficient, the data is not written to disk.

This is the case even though the default storage level specified when the receiver is created is MEMORY_AND_DISK_SER_2:

def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}
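So if the default does not suit your environment, you can pass an explicit StorageLevel when creating the receiver. A small sketch (hostname and port taken from this experiment; the chosen level is just an example):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("StorageLevelExample").setMaster("local[2]"), Seconds(30))

// Override the default MEMORY_AND_DISK_SER_2, e.g. keep a single replica instead of two
val lines = ssc.socketTextStream("spark-master", 9999, StorageLevel.MEMORY_AND_DISK_SER)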

Let's look at job 2 for more information:

[Figures: details and DAG visualization for Job 2]

Job 2 performs a leftOuterJoin on the RDDs generated by the first two jobs.

From the stage ID numbers you can see that it depends on those two jobs.

The receiver receives the data on the spark-master node, but when Job 2 processes the data, the data is already on spark-worker1. (My environment has only two workers, so the data is not spread across all worker nodes; if there were more worker nodes, the situation might be different, with every node processing data.)

Click on Stage 3 above to view more information:

[Figure: task details for Stage 3] It runs on one executor and has 5 tasks.


Let's look at job 3 for more information:

[Figures: details and DAG visualization for Job 3]

The DAG here is the same as Job 2's, but Stage 6 and Stage 7 are skipped. The detailed reason will be explained in later lessons.


Summary: we can see that a single batchInterval does not trigger just one job on its own.


From the above, we now have a more detailed understanding of the relationship between DStream and RDD: a DStream is composed of the RDDs produced over successive batchInterval periods. A DStream simply adds the time dimension to RDDs, making it an unbounded collection.

[Figure: a DStream as a time-ordered sequence of RDDs]
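To make this concrete, every DStream output operation is ultimately executed against the RDD of each individual batch, which you can observe directly with foreachRDD. A small illustrative sketch (names and ports are examples, not from the experiment):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsRDDs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("DStreamAsRDDs").setMaster("local[2]"), Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)

    // A DStream is a sequence of RDDs indexed by batch time:
    // for every batchInterval, foreachRDD hands us that batch's RDD.
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}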



