Building Spark Streaming integrated with Kafka using SBT (Scala version)


Preface:

    
Recently I have been looking into Spark and Kafka, wanting to take the data obtained from the Kafka side and do some computation on it with Spark Streaming. Building the whole environment was really not easy, so I am writing down the process and sharing it here, hoping it can save everyone a few detours!

Environment Preparation:

Operating system: Ubuntu 14.04 LTS

Hadoop 2.7.1 (pseudo-distributed setup)

sbt-0.13.9

kafka_2.11-0.8.2.2

spark-1.3.1-bin-hadoop2.6

Scala version: 2.10.4

     

     Note: Please pay attention to the versions. I originally used spark-1.4.1 with Scala 2.11.7, and the jobs submitted with spark-submit always failed, so be careful about this!

     

For the Hadoop 2.7.1 pseudo-distributed setup, you can refer to http://www.wjxfpf.com/2015/10/517149.html

Kafka Installation and testing:

  1. Download kafka_2.11-0.8.2.2.tgz from the official site http://kafka.apache.org/downloads.html
  2. Go to the download directory, open a terminal, and enter the following command to unzip it into the /usr/local directory: sudo tar -xvzf kafka_2.11-0.8.2.2.tgz -C /usr/local
  3. After typing your user password, Kafka is unzipped successfully; continue with the following commands:
      1. cd /usr/local to jump to the /usr/local/ directory;
      2. sudo chmod -R 777 kafka_2.11-0.8.2.2 to get full permissions on the directory; then gedit ~/.bashrc to open your personal configuration and add at the end:
         export KAFKA_HOME=/usr/local/kafka_2.11-0.8.2.2
         export PATH=$PATH:$KAFKA_HOME/bin
      3. Save, then run source ~/.bashrc in the terminal

Kafka ships with its own default ZooKeeper, which saves us some effort. Now you can start testing Kafka:

  • Open a new terminal and enter cd $KAFKA_HOME to go into the Kafka directory (for convenience we call this terminal 1)
  • bin/zookeeper-server-start.sh config/zookeeper.properties & runs ZooKeeper in the background
  • bin/kafka-server-start.sh config/server.properties & starts the Kafka server in the background
  • bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test creates a new topic called test.
  • bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test starts the command-line producer that Kafka provides; it reads messages from an input file or from the command line and sends them to the Kafka cluster, one message per line.
  • Open a new terminal (for convenience we call this terminal 2), enter the Kafka directory, and run: bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
  • Now type haha in terminal 1; if terminal 2 outputs haha, the Kafka test succeeded! (An optional programmatic producer is sketched right after this list.)
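As an optional aside that is not part of the original walkthrough, the same test message can also be sent programmatically with the Kafka 0.8 Scala producer API (the kafka.producer package that the demo code later imports). This is only a minimal sketch; it assumes the kafka_2.10 client jar, which spark-streaming-kafka pulls in transitively, is on the classpath:

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

// Minimal sketch: send one message to the "test" topic,
// equivalent to typing haha in kafka-console-producer.sh
object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("metadata.broker.list", "localhost:9092")             // the broker started above
    props.put("serializer.class", "kafka.serializer.StringEncoder") // plain string messages
    val producer = new Producer[String, String](new ProducerConfig(props))
    producer.send(new KeyedMessage[String, String]("test", "haha"))
    producer.close()
  }
}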

Building a Scala word-count program with SBT

  • Create a new folder named spark_kafka
  • Inside spark_kafka, create the directory hierarchy src/main/scala/ and create a new KafkaDemo.scala in it
  • Create a new project directory under spark_kafka, and inside project create a new plugins.sbt
  • Create a new assembly.sbt under the spark_kafka directory
  • Finally, the directory structure you see is as follows
        • spark_kafka/
        • spark_kafka/src
        • spark_kafka/src/main
        • spark_kafka/src/main/scala
        • spark_kafka/src/main/scala/KafkaDemo.scala
        • spark_kafka/project
        • spark_kafka/project/plugins.sbt
        • spark_kafka/assembly.sbt

The KafkaDemo.scala code is as follows:

import java.util.Properties
import kafka.producer._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf

object KafkaDemo {
  def main(args: Array[String]) {
    val zkQuorum = "127.0.0.1:2181"
    val group = "test-consumer-group"
    val topics = "test"
    val numThreads = 2
    val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    ssc.checkpoint("checkpoint")
    val topicMap = topics.split(",").map((_, numThreads)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
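A quick note on the code above: KafkaUtils.createStream is the receiver-based API in Spark 1.3 and connects through ZooKeeper (zkQuorum), not through the broker port. It returns a DStream of (key, message) pairs, which is why .map(_._2) keeps only the message body, and Seconds(10) is the batch interval, so word counts are printed every 10 seconds.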

The assembly.sbt code is as follows:

name := "KafkaDemo"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(("org.apache.spark" %% "spark-core" % "1.3.1" % "provided"))

libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.3.1" % "provided"

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.3.0"

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("org", "apache", xs @ _*) => MergeStrategy.first
    case PathList(ps @ _*) if ps.last endsWith "axiom.xml" => MergeStrategy.filterDistinctLines
    case PathList(ps @ _*) if ps.last endsWith "Log$Logger.class" => MergeStrategy.first
    case PathList(ps @ _*) if ps.last endsWith "ILoggerFactory.class" => MergeStrategy.first
    case x => old(x)
  }
}

resolvers += "OSChina Maven Repository" at "http://maven.oschina.net/content/groups/public/"

externalResolvers := Resolver.withDefaultResolvers(resolvers.value, mavenCentral = false)
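A note on the dependency scopes above: spark-core and spark-streaming are marked "provided", so they are not bundled into the fat jar, since spark-submit supplies them at runtime. spark-streaming-kafka is not marked "provided", so it and its Kafka client dependencies do get packaged into the assembly jar, which is exactly what the job needs when it runs.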

  

The plugins.sbt content is as follows:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

Please note:

  

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("org", "apache", xs @ _*) => MergeStrategy.first
    case PathList(ps @ _*) if ps.last endsWith "axiom.xml" => MergeStrategy.filterDistinctLines
    case PathList(ps @ _*) if ps.last endsWith "Log$Logger.class" => MergeStrategy.first
    case PathList(ps @ _*) if ps.last endsWith "ILoggerFactory.class" => MergeStrategy.first
    case x => old(x)
  }
}

This merge strategy is just the way I resolved conflicting dependencies on my own machine. Without this code, packaging fails with dependency conflicts because different jars contain the same classes, and the solution is to tell sbt-assembly how to merge them. The error I got is pasted below:

[error] (*:assembly) deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/org.apache.spark/spark-streaming-kafka_2.10/jars/spark-streaming-kafka_2.10-1.3.0.jar:org/apache/spark/unused/UnusedStubClass.class
[error] /home/hadoop/.ivy2/cache/org.spark-project.spark/unused/jars/unused-1.0.0.jar:org/apache/spark/unused/UnusedStubClass.class

Pay attention to the conflicting paths reported in the error. When you run into other dependency conflicts, you can resolve them by following the same pattern and adding a matching case to the merge strategy, as sketched below.
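For example, here is a minimal sketch of handling one more conflict. The file name some-conflicting.class is made up purely for illustration and is not taken from a real error; substitute whatever file sbt assembly actually reports as duplicated:

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    // hypothetical extra case: keep the first copy of the duplicated file reported by sbt assembly
    case PathList(ps @ _*) if ps.last endsWith "some-conflicting.class" => MergeStrategy.first
    case x => old(x)
  }
}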

Next, do the packaging in a good network environment: open a terminal, enter the spark_kafka directory, and run sbt assembly, then wait patiently for the dependencies to download and the packaging to finish.

Connecting Spark Streaming to the Kafka producer
    • Start Hadoop
    • Start the Kafka ZooKeeper and server in the background
    • Start the command-line producer (the strings entered here will be word-counted by Spark)
    • Open a new terminal, enter the spark_kafka directory, and run:

      $SPARK_HOME/bin/spark-submit --class "KafkaDemo" target/scala-2.10/KafkaDemo-assembly-1.0.jar

      (If the packaging succeeded, there will be a target directory containing scala-2.10/KafkaDemo-assembly-1.0.jar.)
    • Then enter a series of strings in the producer, and Spark Streaming will process them; the output looks roughly like the sketch below.
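For reference, a rough sketch of what wordCounts.print() shows for each 10-second batch, assuming the line hello hello spark was typed in the producer (the timestamp and word counts depend entirely on your input):

-------------------------------------------
Time: 1445324865000 ms
-------------------------------------------
(hello,2)
(spark,1)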

If you can see the result, congratulations.

Getting this to work actually took a while; the main problems were dependency resolution and version mismatches. If during the process you run into errors such as Scala NoSuchMethodError and the like, it means the Scala versions do not match.

Other questions you can Google. I also want to stress that the commands above depend on my personal directory environment; for example, $SPARK_HOME stands for my own Spark path, so if your directories differ from mine, change them accordingly.

This article is aimed at readers who use Linux and understand basic environment configuration; that is the minimum requirement. I also wrote it for myself, because it really was hard work!
