Seamless combination of Spark Streaming 2.0.0 and Kafka

Kafka is a distributed publish-subscribe messaging system, essentially a message queue whose main advantage is that data is persisted to disk (Kafka itself is not the focus of this article, so I will not go into detail). Kafka fits many scenarios, for example as a buffer queue between asynchronous systems. A common design looks like this: write some data (such as logs) to Kafka for durable storage, have another service consume that data from Kafka and perform business-level analysis, and then write the analysis results to HBase or HDFS. Because this design is so generic, big-data streaming frameworks such as Storm already support seamless connections to Kafka, and Spark, as a newcomer, also provides native support for Kafka.
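
To make the pipeline described above concrete, here is a minimal sketch of the producer side, i.e. pushing log lines into Kafka so that a downstream Spark Streaming job can consume them. It uses the plain Kafka client API; the broker address, the topic name "access-log" and the sample log lines are assumptions for illustration, not part of the original article.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducerSketch {
  def main(args: Array[String]): Unit = {
    // Assumed broker address; replace with your own values
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // In a real system these lines would come from an application log
      val sampleLogLines = Seq(
        "127.0.0.1 - - [01/Jan/2017:00:00:01 +0000] \"GET /index.html HTTP/1.1\" 200 2326",
        "127.0.0.1 - - [01/Jan/2017:00:00:02 +0000] \"GET /about.html HTTP/1.1\" 404 512")
      // Send each log line to the (assumed) "access-log" topic
      sampleLogLines.foreach { line =>
        producer.send(new ProducerRecord[String, String]("access-log", line))
      }
    } finally {
      producer.close()
    }
  }
}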

This article walks through a hands-on example of using Spark Streaming 2.0.0 together with Kafka. First, the Maven pom.xml with the required dependencies:


<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>sprakStream</groupId>
  <artifactId>sprakStream</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <dependencies>
    <!-- Spark core, SQL, Streaming and MLlib; provided by the cluster at runtime -->
    <dependency>
      <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.11</artifactId>
      <version>2.0.0</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId> <artifactId>spark-sql_2.11</artifactId>
      <version>2.0.0</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId> <artifactId>spark-streaming_2.11</artifactId>
      <version>2.0.0</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId> <artifactId>spark-mllib_2.11</artifactId>
      <version>2.0.0</version> <scope>provided</scope>
    </dependency>
    <!-- Spark Streaming integration for Kafka 0.10+; bundled into the job jar -->
    <dependency>
      <groupId>org.apache.spark</groupId> <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId> <artifactId>hbase-client</artifactId>
      <version>1.2.1</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId> <artifactId>hbase-server</artifactId>
      <version>1.2.1</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>redis.clients</groupId> <artifactId>jedis</artifactId>
      <version>2.8.0</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.postgresql</groupId> <artifactId>postgresql</artifactId>
      <version>9.4-1202-jdbc4</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>net.sf.json-lib</groupId> <artifactId>json-lib</artifactId>
      <version>2.2.3</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId> <artifactId>commons-pool2</artifactId>
      <version>2.2</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>
    <testSourceDirectory>${basedir}/src/test/scala</testSourceDirectory>
    <resources>
      <resource>
        <directory>${basedir}/src/main/resources</directory>
      </resource>
    </resources>
    <testResources>
      <testResource>
        <directory>${basedir}/src/test/resources</directory>
      </testResource>
    </testResources>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <!-- Shade plugin builds an uber jar from the non-provided dependencies -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.2</version>
        <configuration>
          <createDependencyReducedPom>true</createDependencyReducedPom>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <artifactSet>
                <includes>
                  <include>*:*</include>
                </includes>
              </artifactSet>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                  <resource>reference.conf</resource>
                </transformer>
                <transformer implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
                  <resource>log4j.properties</resource>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
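
With this pom.xml, running mvn clean package produces a shaded (uber) jar under target/ (by Maven's default naming, something like sprakStream-0.0.1-SNAPSHOT.jar). Because the Spark artifacts are marked provided, only the Kafka integration and the other non-provided dependencies are bundled into it; the jar can then be run locally or handed to spark-submit with --class pointing at the main object shown below. The exact submit command depends on your cluster, so treat this as a general note rather than a prescribed command.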





Next is the Spark Streaming job itself. It subscribes to the Kafka topic starting from explicitly specified offsets and records the offset range of each partition while consuming the data:

package com.sprakStream.demo

import java.util.Properties

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies, OffsetRange}

import com.sprakStream.util.AppConstant
import com.logger.util.LoggerUtil

object KafkaExampleOffset {

  def main(args: Array[String]): Unit = {
    // Home environment
    // System.setProperty("spark.sql.warehouse.dir", "d:\\tools\\spark-2.0.0-bin-hadoop2.6")
    System.setProperty("hadoop.home.dir", "d:\\tools\\hadoop-2.6.0")
    // Office environment
    System.setProperty("spark.sql.warehouse.dir", "d:\\developtool\\spark-2.0.0-bin-hadoop2.6")
    println("Success to Init...")

    // PostgreSQL connection settings (used when persisting analysis results)
    val url = "jdbc:postgresql://172.16.12.190:5432/dataex_tmp"
    val prop = new Properties()
    prop.put("user", "postgres")
    prop.put("password", "issing")

    val conf = new SparkConf().setAppName("WordCount").setMaster("local")
    val ssc = new StreamingContext(conf, Seconds(2))
    val sparkSession = SparkSession.builder().config(conf).getOrCreate()

    // Utilities is a project helper: it sets up logging and builds a regular
    // expression (regex) to extract fields from raw Apache log lines
    val util = Utilities
    util.setupLogging()
    val pattern = util.apacheLogPattern()

    // hostname:port of the Kafka brokers, not ZooKeeper
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> AppConstant.KAFKA_HOST,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example",
      "enable.auto.commit" -> (false: java.lang.Boolean),
      // "auto.offset.reset" -> "latest",  // automatically reset to the latest offset (default)
      // "auto.offset.reset" -> "none",    // throw an exception if no previous offset exists for the consumer group
      "auto.offset.reset" -> "earliest")   // automatically reset to the earliest offset

    // List of topics to listen to in Kafka
    val topics = List(AppConstant.KAFKA_TOPIC).toSet

    // Kafka offsets: start reading Kafka data from the specified positions.
    // Note: thanks to the exactly-once mechanism, each record is consumed only once.
    // Specifying starting offsets lets the job resume from where the last run stopped.
    // In testing, only partitions listed in this map are consumed; consumption is per partition.
    // 5000L: L denotes a Long, 5000 means "start from the message whose offset is 5000".
    val offsets = Map[TopicPartition, Long](
      new TopicPartition(AppConstant.KAFKA_TOPIC, 0) -> 5000L,
      new TopicPartition(AppConstant.KAFKA_TOPIC, 1) -> 5000L,
      new TopicPartition(AppConstant.KAFKA_TOPIC, 2) -> 5000L)

    // Obtain the Kafka stream via KafkaUtils.createDirectStream(...);
    // the Kafka-related parameters are given by kafkaParams
    val stream = KafkaUtils.createDirectStream(
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, offsets))

    // Process the data
    stream.foreachRDD(mess => {
      // Get the offset ranges of this batch
      val offsetsList = mess.asInstanceOf[HasOffsetRanges].offsetRanges
      mess.foreachPartition(lines => {
        lines.foreach(line => {
          val o: OffsetRange = offsetsList(TaskContext.get.partitionId)
          println("+++++++++++++++ here record the offset +++++++++++++++")
          println("--topic: " + o.topic + " --partition: " + o.partition +
            " --fromOffset: " + o.fromOffset + " --untilOffset: " + o.untilOffset)
          println("+++++++++++++++ here consume the data +++++++++++++++")
          println("The kafka line is " + line)
          LoggerUtil.loggerToBuffer(line.toString())
        })
      })
    })

    // Kick it off
    ssc.checkpoint("/user/root/spark/checkpoint")
    ssc.start()
    ssc.awaitTermination()
    println("KafkaExample - end...")
  }
}

object SQLContextSingleton2 {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
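
The url and prop values defined at the top of the job are not used in the snippet above; in a full pipeline they would be used to persist the analysis results. Below is a minimal sketch of that last step, turning a micro-batch of Kafka values into a DataFrame and appending it to PostgreSQL through Spark's JDBC writer. The helper object, the table name "kafka_lines" and the one-column schema are assumptions for illustration, not something defined in the original project.

import java.util.Properties

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

object PostgresSinkSketch {
  // Append the raw Kafka values of one micro-batch to a PostgreSQL table.
  // "kafka_lines" is a hypothetical table name; url/prop mirror the ones defined in the job above.
  def saveBatch(rdd: RDD[ConsumerRecord[String, String]], url: String, prop: Properties): Unit = {
    val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._

    // A real job would parse the Apache log fields here instead of keeping the raw line
    val df = rdd.map(record => record.value()).toDF("raw_line")
    df.write.mode(SaveMode.Append).jdbc(url, "kafka_lines", prop)
  }
}

It could be called from the streaming job as stream.foreachRDD(rdd => PostgresSinkSketch.saveBatch(rdd, url, prop)).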

Note: the code above has been tested and works; modify it as needed. If you have any questions, please leave a message.

