Seamless combination of Spark Streaming 2.0.0 and Kafka

Kafka is a distributed publish-subscribe messaging system, essentially a message queue whose main advantage is that data is persisted to disk (Kafka itself is not the focus of this article, so I will not go into detail). Kafka fits many scenarios, for example as a buffer queue between asynchronous systems. A common design looks like this: write some data (such as logs) to Kafka for durable storage, have another service consume that data from Kafka and perform business-level analysis, and then write the analysis results to HBase or HDFS. Because this design is so generic, big-data streaming frameworks such as Storm already support seamless connections to Kafka, and Spark, as a newcomer, also provides native support for Kafka.
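
To make the pipeline described above concrete, here is a minimal sketch of the producer side, i.e. pushing log lines into Kafka so that a downstream Spark Streaming job can consume them. It uses the plain Kafka client API; the broker address, the topic name "access-log" and the sample log lines are assumptions for illustration, not part of the original article.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducerSketch {
  def main(args: Array[String]): Unit = {
    // Assumed broker address; replace with your own values
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // In a real system these lines would come from an application log
      val sampleLogLines = Seq(
        "127.0.0.1 - - [01/Jan/2017:00:00:01 +0000] \"GET /index.html HTTP/1.1\" 200 2326",
        "127.0.0.1 - - [01/Jan/2017:00:00:02 +0000] \"GET /about.html HTTP/1.1\" 404 512")
      // Send each log line to the (assumed) "access-log" topic
      sampleLogLines.foreach { line =>
        producer.send(new ProducerRecord[String, String]("access-log", line))
      }
    } finally {
      producer.close()
    }
  }
}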

This article walks through a hands-on example of using Spark Streaming 2.0.0 together with Kafka. First, the Maven pom.xml with the required dependencies:


<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>sprakStream</groupId>
  <artifactId>sprakStream</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <dependencies>
    <!-- Spark core, SQL, Streaming and MLlib; provided by the cluster at runtime -->
    <dependency>
      <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.11</artifactId>
      <version>2.0.0</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId> <artifactId>spark-sql_2.11</artifactId>
      <version>2.0.0</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId> <artifactId>spark-streaming_2.11</artifactId>
      <version>2.0.0</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId> <artifactId>spark-mllib_2.11</artifactId>
      <version>2.0.0</version> <scope>provided</scope>
    </dependency>
    <!-- Spark Streaming integration for Kafka 0.10+; bundled into the job jar -->
    <dependency>
      <groupId>org.apache.spark</groupId> <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId> <artifactId>hbase-client</artifactId>
      <version>1.2.1</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId> <artifactId>hbase-server</artifactId>
      <version>1.2.1</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>redis.clients</groupId> <artifactId>jedis</artifactId>
      <version>2.8.0</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.postgresql</groupId> <artifactId>postgresql</artifactId>
      <version>9.4-1202-jdbc4</version> <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>net.sf.json-lib</groupId> <artifactId>json-lib</artifactId>
      <version>2.2.3</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId> <artifactId>commons-pool2</artifactId>
      <version>2.2</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>
    <testSourceDirectory>${basedir}/src/test/scala</testSourceDirectory>
    <resources>
      <resource>
        <directory>${basedir}/src/main/resources</directory>
      </resource>
    </resources>
    <testResources>
      <testResource>
        <directory>${basedir}/src/test/resources</directory>
      </testResource>
    </testResources>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <!-- Shade plugin builds an uber jar from the non-provided dependencies -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.2</version>
        <configuration>
          <createDependencyReducedPom>true</createDependencyReducedPom>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <artifactSet>
                <includes>
                  <include>*:*</include>
                </includes>
              </artifactSet>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                  <resource>reference.conf</resource>
                </transformer>
                <transformer implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
                  <resource>log4j.properties</resource>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
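
With this pom.xml, running mvn clean package produces a shaded (uber) jar under target/ (by Maven's default naming, something like sprakStream-0.0.1-SNAPSHOT.jar). Because the Spark artifacts are marked provided, only the Kafka integration and the other non-provided dependencies are bundled into it; the jar can then be run locally or handed to spark-submit with --class pointing at the main object shown below. The exact submit command depends on your cluster, so treat this as a general note rather than a prescribed command.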





Next is the Spark Streaming job itself. It subscribes to the Kafka topic starting from explicitly specified offsets and records the offset range of each partition while consuming the data:

package com.sprakStream.demo

import java.util.Properties

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies, OffsetRange}

import com.sprakStream.util.AppConstant
import com.logger.util.LoggerUtil

object KafkaExampleOffset {

  def main(args: Array[String]): Unit = {
    // Home environment
    // System.setProperty("spark.sql.warehouse.dir", "d:\\tools\\spark-2.0.0-bin-hadoop2.6")
    System.setProperty("hadoop.home.dir", "d:\\tools\\hadoop-2.6.0")
    // Office environment
    System.setProperty("spark.sql.warehouse.dir", "d:\\developtool\\spark-2.0.0-bin-hadoop2.6")
    println("Success to Init...")

    // PostgreSQL connection settings (used when persisting analysis results)
    val url = "jdbc:postgresql://172.16.12.190:5432/dataex_tmp"
    val prop = new Properties()
    prop.put("user", "postgres")
    prop.put("password", "issing")

    val conf = new SparkConf().setAppName("WordCount").setMaster("local")
    val ssc = new StreamingContext(conf, Seconds(2))
    val sparkSession = SparkSession.builder().config(conf).getOrCreate()

    // Utilities is a project helper: it sets up logging and builds a regular
    // expression (regex) to extract fields from raw Apache log lines
    val util = Utilities
    util.setupLogging()
    val pattern = util.apacheLogPattern()

    // hostname:port of the Kafka brokers, not ZooKeeper
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> AppConstant.KAFKA_HOST,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example",
      "enable.auto.commit" -> (false: java.lang.Boolean),
      // "auto.offset.reset" -> "latest",  // automatically reset to the latest offset (default)
      // "auto.offset.reset" -> "none",    // throw an exception if no previous offset exists for the consumer group
      "auto.offset.reset" -> "earliest")   // automatically reset to the earliest offset

    // List of topics to listen to in Kafka
    val topics = List(AppConstant.KAFKA_TOPIC).toSet

    // Kafka offsets: start reading Kafka data from the specified positions.
    // Note: thanks to the exactly-once mechanism, each record is consumed only once.
    // Specifying starting offsets lets the job resume from where the last run stopped.
    // In testing, only partitions listed in this map are consumed; consumption is per partition.
    // 5000L: L denotes a Long, 5000 means "start from the message whose offset is 5000".
    val offsets = Map[TopicPartition, Long](
      new TopicPartition(AppConstant.KAFKA_TOPIC, 0) -> 5000L,
      new TopicPartition(AppConstant.KAFKA_TOPIC, 1) -> 5000L,
      new TopicPartition(AppConstant.KAFKA_TOPIC, 2) -> 5000L)

    // Obtain the Kafka stream via KafkaUtils.createDirectStream(...);
    // the Kafka-related parameters are given by kafkaParams
    val stream = KafkaUtils.createDirectStream(
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, offsets))

    // Process the data
    stream.foreachRDD(mess => {
      // Get the offset ranges of this batch
      val offsetsList = mess.asInstanceOf[HasOffsetRanges].offsetRanges
      mess.foreachPartition(lines => {
        lines.foreach(line => {
          val o: OffsetRange = offsetsList(TaskContext.get.partitionId)
          println("+++++++++++++++ here record the offset +++++++++++++++")
          println("--topic: " + o.topic + " --partition: " + o.partition +
            " --fromOffset: " + o.fromOffset + " --untilOffset: " + o.untilOffset)
          println("+++++++++++++++ here consume the data +++++++++++++++")
          println("The kafka line is " + line)
          LoggerUtil.loggerToBuffer(line.toString())
        })
      })
    })

    // Kick it off
    ssc.checkpoint("/user/root/spark/checkpoint")
    ssc.start()
    ssc.awaitTermination()
    println("KafkaExample - end...")
  }
}

object SQLContextSingleton2 {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
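
The url and prop values defined at the top of the job are not used in the snippet above; in a full pipeline they would be used to persist the analysis results. Below is a minimal sketch of that last step, turning a micro-batch of Kafka values into a DataFrame and appending it to PostgreSQL through Spark's JDBC writer. The helper object, the table name "kafka_lines" and the one-column schema are assumptions for illustration, not something defined in the original project.

import java.util.Properties

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

object PostgresSinkSketch {
  // Append the raw Kafka values of one micro-batch to a PostgreSQL table.
  // "kafka_lines" is a hypothetical table name; url/prop mirror the ones defined in the job above.
  def saveBatch(rdd: RDD[ConsumerRecord[String, String]], url: String, prop: Properties): Unit = {
    val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._

    // A real job would parse the Apache log fields here instead of keeping the raw line
    val df = rdd.map(record => record.value()).toDF("raw_line")
    df.write.mode(SaveMode.Append).jdbc(url, "kafka_lines", prop)
  }
}

It could be called from the streaming job as stream.foreachRDD(rdd => PostgresSinkSketch.saveBatch(rdd, url, prop)).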

Note: the code above has been tested and works; modify it as needed. If you have any questions, please leave a message.

