Spark Streaming: working with a database through JDBC

Source: Internet
Author: User
Tags: postgresql

This article documents the process of learning how to use Spark Streaming to write to a database through JDBC, where the source data is read from Kafka.

Kafka introduced a new consumer API in version 0.10 that differs from the 0.8 API, so Spark Streaming provides two integration packages to match: spark-streaming-kafka-0-8 supports brokers from Kafka 0.8.2.1 onward, while spark-streaming-kafka-0-10 supports brokers 0.10.0 and above and is still experimental. The two are compared in the following table.

|                            | spark-streaming-kafka-0-8 | spark-streaming-kafka-0-10 |
|----------------------------|---------------------------|----------------------------|
| Broker Version             | 0.8.2.1 or higher         | 0.10.0 or higher           |
| API Stability              | Stable                    | Experimental               |
| Language Support           | Scala, Java, Python       | Scala, Java                |
| Receiver DStream           | Yes                       | No                         |
| Direct DStream             | Yes                       | Yes                        |
| SSL/TLS Support            | No                        | Yes                        |
| Offset Commit API          | No                        | Yes                        |
| Dynamic Topic Subscription | No                        | Yes                        |

For instructions on integrating Spark Streaming with Kafka, refer to the following documentation:

spark-streaming-kafka-0-10 http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

spark-streaming-kafka-0-8 http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html

1.2 Runtime components

The example in this article is built with JDK 1.8 and Scala 2.11; the components it depends on are deployed as shown in the following table.

| Component        | Deployment Method  | IP Address    | Operating System |
|------------------|--------------------|---------------|------------------|
| spark-2.0.1      | Pseudo-distributed | 192.168.1.91  | CentOS 7.1       |
| kafka-0.10.0.1   | Pseudo-distributed | 192.168.1.90  | CentOS 7.1       |
| postgresql-9.4.5 | Single node        | 192.168.1.213 | CentOS 7.1       |

A table named kafka_message has been created in the database, with three fields, all of type varchar.

CREATE TABLE kafka_message (
    timeseq varchar(16),
    thread  varchar(32),
    message varchar(255)
);
2. Code

2.1 pom.xml

The required dependencies are as follows.

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>com.jolbox</groupId>
        <artifactId>bonecp</artifactId>
        <version>0.8.0.RELEASE</version>
    </dependency>
    <dependency>
        <groupId>postgresql</groupId>
        <artifactId>postgresql</artifactId>
        <version>9.1-901-1.jdbc4</version>
    </dependency>
</dependencies>
2.2 Database Connection Pool

import java.sql.Connection

import com.jolbox.bonecp.{BoneCP, BoneCPConfig}
import org.slf4j.LoggerFactory

/**
 * Database connection pool, backed by BoneCP.
 */
object ConnectionPool {
  val logger = LoggerFactory.getLogger(this.getClass)

  // Connection pool configuration
  private val connectionPool: Option[BoneCP] = {
    try {
      Class.forName("org.postgresql.Driver")
      val config = new BoneCPConfig()
      config.setJdbcUrl("jdbc:postgresql://192.168.1.213/yourdb")
      config.setUsername("postgres")
      config.setPassword("******")
      config.setLazyInit(true)
      config.setMinConnectionsPerPartition(3)
      config.setMaxConnectionsPerPartition(5)
      config.setPartitionCount(5)
      config.setCloseConnectionWatch(true)
      config.setLogStatementsEnabled(false)
      Some(new BoneCP(config))
    } catch {
      case exception: Exception =>
        logger.warn("Create connection error: \n" + exception.printStackTrace())
        None
    }
  }

  // Get a database connection
  def getConnection: Option[Connection] = {
    connectionPool match {
      case Some(pool) => Some(pool.getConnection)
      case None => None
    }
  }

  // Release a database connection
  def closeConnection(connection: Connection): Unit = {
    if (!connection.isClosed) {
      connection.close()
    }
  }
}
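
The pool can be sanity-checked outside of Spark with a minimal sketch like the one below. ConnectionPoolDemo is a hypothetical helper added here for illustration only (it is not part of the original project), and it assumes the kafka_message table from section 1.2 already exists.

// Hypothetical standalone check of the connection pool defined above.
object ConnectionPoolDemo {
  def main(args: Array[String]): Unit = {
    ConnectionPool.getConnection match {
      case Some(conn) =>
        // Run a trivial query against the kafka_message table created earlier
        val ps = conn.prepareStatement("SELECT count(*) FROM kafka_message")
        val rs = ps.executeQuery()
        if (rs.next()) println("rows in kafka_message: " + rs.getLong(1))
        rs.close()
        ps.close()
        ConnectionPool.closeConnection(conn)
      case None =>
        println("Could not obtain a connection from the pool")
    }
  }
}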
2.3 Kafka, Spark Streaming and JDBC

Spark Streaming reads data from Kafka and writes it to the database. The basic sequence of a Spark Streaming program is:

    1. Create the Spark Streaming context
    2. Create a DStream from the data source
    3. Apply transformations on the DStream
    4. Specify where the computation results are stored
    5. Start the computation

The code is as follows; see the comments for details.

import java.sql.Connection

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies._
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.slf4j.LoggerFactory

/**
 * Reads data from Kafka and writes it to the database.
 */
object KafkaToDB {
  val logger = LoggerFactory.getLogger(this.getClass)

  def main(args: Array[String]): Unit = {
    // Check the number of arguments
    if (args.length < 2) {
      System.err.println(
        s"""
           |Usage: KafkaToDB <brokers> <topics>
           |  <brokers> broker list, at least 1, comma-delimited
           |  <topics>  topic list, at least one, comma-delimited
           |""".stripMargin)
      System.exit(1)
    }

    // Process the arguments
    val Array(brokers, topics) = args
    val topicSet: Set[String] = topics.split(",").toSet
    val kafkaParams: Map[String, Object] = Map[String, Object](
      "bootstrap.servers" -> brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // 1. Create the context, treating the data of each 1-second interval as a batch
    val sparkConf = new SparkConf().setAppName("KafkaToDB")
    val streamingContext = new StreamingContext(sparkConf, Seconds(1))

    // 2. Create an input stream to get the data.
    //    Stream operations are based on DStream; InputDStream inherits from DStream.
    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      PreferConsistent,
      Subscribe[String, String](topicSet, kafkaParams))

    // 3. Transformations on the DStream:
    //    take the value of each message, split it by commas, and turn it into a Tuple3
    val values = stream.map(_.value.split(","))
      .filter(x => x.length == 3)
      .map(x => new Tuple3[String, String, String](x(0), x(1), x(2)))

    // Print the first 10 elements to the console for easier debugging
    values.print()

    // 4. Save to the database with foreachRDD
    val sql = "INSERT INTO kafka_message (timeseq, thread, message) VALUES (?, ?, ?)"
    values.foreachRDD(rdd => {
      val count = rdd.count()
      println("-----------------count:" + count)
      if (count > 0) {
        rdd.foreachPartition(partitionOfRecords => {
          val conn = ConnectionPool.getConnection.orNull
          if (conn != null) {
            partitionOfRecords.foreach(data => insert(conn, sql, data))
            ConnectionPool.closeConnection(conn)
          }
        })
      }
    })

    // 5. Start the computation
    streamingContext.start()
    streamingContext.awaitTermination() // wait for interruption to end the computation
  }

  /**
   * Save data to the database
   * @param conn database connection
   * @param sql  prepared statement SQL
   * @param data data to be saved, a Tuple3
   */
  def insert(conn: Connection, sql: String, data: (String, String, String)): Unit = {
    try {
      val ps = conn.prepareStatement(sql)
      ps.setString(1, data._1)
      ps.setString(2, data._2)
      ps.setString(3, data._3)
      ps.executeUpdate()
      ps.close()
    } catch {
      case e: Exception =>
        logger.error("Error in execution of insert. " + e.getMessage)
    }
  }
}
3. Running the task

3.1 Database driver configuration

Since this program runs on a pseudo-distributed Spark standalone cluster, setting SPARK_CLASSPATH is equivalent to specifying both the driver and the executor classpath.

Edit spark-env.sh.

vi $SPARK_HOME/conf/spark-env.sh

Add the following line, and note that the PostgreSQL driver jar must actually be placed at the corresponding location.

export SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/jobs/postgresql-9.1-901-1.jdbc4.jar

In fact, Spark no longer recommends using SPARK_CLASSPATH. When the application starts, the following warning appears in the log, suggesting spark.executor.extraClassPath and spark.driver.extraClassPath instead. In Spark local mode, specifying spark.driver.extraClassPath alone is sufficient.

The log is as follows.

16/10/21 15:15:33 WARN SparkConf: SPARK_CLASSPATH was detected (set to ':/opt/jobs/postgresql-9.1-901-1.jdbc4.jar').
This is deprecated in Spark 1.0+.
Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath
16/10/21 15:15:33 WARN SparkConf: Setting 'spark.executor.extraClassPath' to ':/opt/jobs/postgresql-9.1-901-1.jdbc4.jar' as a work-around.
16/10/21 15:15:33 WARN SparkConf: Setting 'spark.driver.extraClassPath' to ':/opt/jobs/postgresql-9.1-901-1.jdbc4.jar' as a work-around.
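
Following the hint in that warning, the driver jar can instead be passed on the spark-submit command line. The sketch below reuses the jar path, master URL, class name and arguments from this article; treat it as an illustrative alternative and adjust the values to your own environment.

$SPARK_HOME/bin/spark-submit \
  --driver-class-path /opt/jobs/postgresql-9.1-901-1.jdbc4.jar \
  --conf spark.executor.extraClassPath=/opt/jobs/postgresql-9.1-901-1.jdbc4.jar \
  --master spark://192.168.1.91:7077 \
  --class com.xxx.streaming.KafkaToDB \
  spark-streaming-demo.jar 192.168.1.90:9092 my-topic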
3.2 Starting a task
$SPARK_HOME/bin/spark-submit \
  --master spark://192.168.1.91:7077 \
  --class com.xxx.streaming.KafkaToDB \
  spark-streaming-demo.jar \
  192.168.1.90:9092 my-topic

On 192.168.1.90, send messages via the Kafka console producer with the following command:

bin/kafka-console-producer.sh --broker-list 192.168.1.90:9092 --topic mytopic

The messages are as follows:

1475589918658    thread-15    msg-0
1475589920177    thread-15    msg-1
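
Note that KafkaToDB splits each message on commas and keeps only records with exactly three fields, so for rows to actually reach kafka_message each message should be comma-delimited. For example (the same sample fields as above, joined with commas):

1475589918658,thread-15,msg-0
1475589920177,thread-15,msg-1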

After that, we can see the messages printed on the console, and we can also query the data in the database.
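
For example, a quick check against the kafka_message table defined in section 1.2 (an illustrative query, run with any PostgreSQL client):

SELECT timeseq, thread, message FROM kafka_message ORDER BY timeseq DESC LIMIT 10;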

4. Spark program cannot find the JDBC driver

The console reported an exception saying the JDBC driver could not be found; the log is as follows.

java.sql.SQLException: No suitable driver found for jdbc:……

The problem can be resolved with the database driver configuration method described earlier. The following blog post also gives a good explanation of this problem:
http://www.codexiu.cn/spark/blog/12672/

Blog source: https://www.jianshu.com/p/a73c0c95d2fe
