Spark Streaming: working with a database through JDBC

Source: Internet
Author: User
Tags: postgresql

This article documents the process of learning how to use Spark Streaming to write to a database through JDBC, where the source data is read from Kafka.

Kafka introduced a new consumer API in version 0.10 that differs from the 0.8 API, so Spark Streaming provides two integration packages to match: spark-streaming-kafka-0-8 supports brokers from Kafka 0.8.2.1 onward, while spark-streaming-kafka-0-10 supports brokers 0.10.0 and above and is still experimental. The two are compared in the following table.

|                            | spark-streaming-kafka-0-8 | spark-streaming-kafka-0-10 |
|----------------------------|---------------------------|----------------------------|
| Broker Version             | 0.8.2.1 or higher         | 0.10.0 or higher           |
| API Stability              | Stable                    | Experimental               |
| Language Support           | Scala, Java, Python       | Scala, Java                |
| Receiver DStream           | Yes                       | No                         |
| Direct DStream             | Yes                       | Yes                        |
| SSL/TLS Support            | No                        | Yes                        |
| Offset Commit API          | No                        | Yes                        |
| Dynamic Topic Subscription | No                        | Yes                        |

For instructions on integrating Spark Streaming with Kafka, refer to the following documentation:

spark-streaming-kafka-0-10 http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

spark-streaming-kafka-0-8 http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html

1.2 Runtime components

The example in this article is built with JDK 1.8 and Scala 2.11; the components it depends on are deployed as shown in the following table.

| Component        | Deployment Method  | IP Address    | Operating System |
|------------------|--------------------|---------------|------------------|
| spark-2.0.1      | Pseudo-distributed | 192.168.1.91  | CentOS 7.1       |
| kafka-0.10.0.1   | Pseudo-distributed | 192.168.1.90  | CentOS 7.1       |
| postgresql-9.4.5 | Single node        | 192.168.1.213 | CentOS 7.1       |

A table named kafka_message has been created in the database, with three fields, all of type varchar.

CREATE TABLE kafka_message (
    timeseq varchar(16),
    thread  varchar(32),
    message varchar(255)
);
2. Code

2.1 pom.xml

The required dependencies are as follows.

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>com.jolbox</groupId>
        <artifactId>bonecp</artifactId>
        <version>0.8.0.RELEASE</version>
    </dependency>
    <dependency>
        <groupId>postgresql</groupId>
        <artifactId>postgresql</artifactId>
        <version>9.1-901-1.jdbc4</version>
    </dependency>
</dependencies>
2.2 Database Connection Pool

import java.sql.Connection

import com.jolbox.bonecp.{BoneCP, BoneCPConfig}
import org.slf4j.LoggerFactory

/**
 * Database connection pool, backed by BoneCP.
 */
object ConnectionPool {
  val logger = LoggerFactory.getLogger(this.getClass)

  // Connection pool configuration
  private val connectionPool: Option[BoneCP] = {
    try {
      Class.forName("org.postgresql.Driver")
      val config = new BoneCPConfig()
      config.setJdbcUrl("jdbc:postgresql://192.168.1.213/yourdb")
      config.setUsername("postgres")
      config.setPassword("******")
      config.setLazyInit(true)
      config.setMinConnectionsPerPartition(3)
      config.setMaxConnectionsPerPartition(5)
      config.setPartitionCount(5)
      config.setCloseConnectionWatch(true)
      config.setLogStatementsEnabled(false)
      Some(new BoneCP(config))
    } catch {
      case exception: Exception =>
        logger.warn("Create connection error: \n" + exception.printStackTrace())
        None
    }
  }

  // Get a database connection
  def getConnection: Option[Connection] = {
    connectionPool match {
      case Some(pool) => Some(pool.getConnection)
      case None => None
    }
  }

  // Release a database connection
  def closeConnection(connection: Connection): Unit = {
    if (!connection.isClosed) {
      connection.close()
    }
  }
}
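
The pool can be sanity-checked outside of Spark with a minimal sketch like the one below. ConnectionPoolDemo is a hypothetical helper added here for illustration only (it is not part of the original project), and it assumes the kafka_message table from section 1.2 already exists.

// Hypothetical standalone check of the connection pool defined above.
object ConnectionPoolDemo {
  def main(args: Array[String]): Unit = {
    ConnectionPool.getConnection match {
      case Some(conn) =>
        // Run a trivial query against the kafka_message table created earlier
        val ps = conn.prepareStatement("SELECT count(*) FROM kafka_message")
        val rs = ps.executeQuery()
        if (rs.next()) println("rows in kafka_message: " + rs.getLong(1))
        rs.close()
        ps.close()
        ConnectionPool.closeConnection(conn)
      case None =>
        println("Could not obtain a connection from the pool")
    }
  }
}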
2.3 Kafka, Spark Streaming and JDBC

Spark Streaming reads data from Kafka and writes it to the database. The basic sequence of a Spark Streaming program is:

    1. Create the Spark Streaming context
    2. Create a DStream from the data source
    3. Apply transformations on the DStream
    4. Specify where the computation results are stored
    5. Start the computation

The code is as follows; see the comments for details.

import java.sql.Connection

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies._
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.slf4j.LoggerFactory

/**
 * Reads data from Kafka and writes it to the database.
 */
object KafkaToDB {
  val logger = LoggerFactory.getLogger(this.getClass)

  def main(args: Array[String]): Unit = {
    // Check the number of arguments
    if (args.length < 2) {
      System.err.println(
        s"""
           |Usage: KafkaToDB <brokers> <topics>
           |  <brokers> broker list, at least 1, comma-delimited
           |  <topics>  topic list, at least one, comma-delimited
           |""".stripMargin)
      System.exit(1)
    }

    // Process the arguments
    val Array(brokers, topics) = args
    val topicSet: Set[String] = topics.split(",").toSet
    val kafkaParams: Map[String, Object] = Map[String, Object](
      "bootstrap.servers" -> brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // 1. Create the context, treating the data of each 1-second interval as a batch
    val sparkConf = new SparkConf().setAppName("KafkaToDB")
    val streamingContext = new StreamingContext(sparkConf, Seconds(1))

    // 2. Create an input stream to get the data.
    //    Stream operations are based on DStream; InputDStream inherits from DStream.
    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      PreferConsistent,
      Subscribe[String, String](topicSet, kafkaParams))

    // 3. Transformations on the DStream:
    //    take the value of each message, split it by commas, and turn it into a Tuple3
    val values = stream.map(_.value.split(","))
      .filter(x => x.length == 3)
      .map(x => new Tuple3[String, String, String](x(0), x(1), x(2)))

    // Print the first 10 elements to the console for easier debugging
    values.print()

    // 4. Save to the database with foreachRDD
    val sql = "INSERT INTO kafka_message (timeseq, thread, message) VALUES (?, ?, ?)"
    values.foreachRDD(rdd => {
      val count = rdd.count()
      println("-----------------count:" + count)
      if (count > 0) {
        rdd.foreachPartition(partitionOfRecords => {
          val conn = ConnectionPool.getConnection.orNull
          if (conn != null) {
            partitionOfRecords.foreach(data => insert(conn, sql, data))
            ConnectionPool.closeConnection(conn)
          }
        })
      }
    })

    // 5. Start the computation
    streamingContext.start()
    streamingContext.awaitTermination() // wait for interruption to end the computation
  }

  /**
   * Save data to the database
   * @param conn database connection
   * @param sql  prepared statement SQL
   * @param data data to be saved, a Tuple3
   */
  def insert(conn: Connection, sql: String, data: (String, String, String)): Unit = {
    try {
      val ps = conn.prepareStatement(sql)
      ps.setString(1, data._1)
      ps.setString(2, data._2)
      ps.setString(3, data._3)
      ps.executeUpdate()
      ps.close()
    } catch {
      case e: Exception =>
        logger.error("Error in execution of insert. " + e.getMessage)
    }
  }
}
3. Running the task

3.1 Database driver configuration

Since this program runs on a pseudo-distributed Spark standalone cluster, setting SPARK_CLASSPATH is equivalent to specifying both the driver and the executor classpath.

Edit spark-env.sh.

vi $SPARK_HOME/conf/spark-env.sh

Add the following line, and note that the PostgreSQL driver jar must actually be placed at the corresponding location.

export SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/jobs/postgresql-9.1-901-1.jdbc4.jar

In fact, Spark no longer recommends using SPARK_CLASSPATH. When the application starts, the following warning appears in the log, suggesting spark.executor.extraClassPath and spark.driver.extraClassPath instead. In Spark local mode, specifying spark.driver.extraClassPath alone is sufficient.

The log is as follows.

16/10/21 15:15:33 WARN SparkConf: SPARK_CLASSPATH was detected (set to ':/opt/jobs/postgresql-9.1-901-1.jdbc4.jar').
This is deprecated in Spark 1.0+.
Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath
16/10/21 15:15:33 WARN SparkConf: Setting 'spark.executor.extraClassPath' to ':/opt/jobs/postgresql-9.1-901-1.jdbc4.jar' as a work-around.
16/10/21 15:15:33 WARN SparkConf: Setting 'spark.driver.extraClassPath' to ':/opt/jobs/postgresql-9.1-901-1.jdbc4.jar' as a work-around.
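
Following the hint in that warning, the driver jar can instead be passed on the spark-submit command line. The sketch below reuses the jar path, master URL, class name and arguments from this article; treat it as an illustrative alternative and adjust the values to your own environment.

$SPARK_HOME/bin/spark-submit \
  --driver-class-path /opt/jobs/postgresql-9.1-901-1.jdbc4.jar \
  --conf spark.executor.extraClassPath=/opt/jobs/postgresql-9.1-901-1.jdbc4.jar \
  --master spark://192.168.1.91:7077 \
  --class com.xxx.streaming.KafkaToDB \
  spark-streaming-demo.jar 192.168.1.90:9092 my-topic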
3.2 Starting a task
$SPARK_HOME/bin/spark-submit \
  --master spark://192.168.1.91:7077 \
  --class com.xxx.streaming.KafkaToDB \
  spark-streaming-demo.jar \
  192.168.1.90:9092 my-topic

On 192.168.1.90, send messages via the Kafka console producer with the following command:

bin/kafka-console-producer.sh --broker-list 192.168.1.90:9092 --topic mytopic

The messages are as follows:

1475589918658    thread-15    msg-0
1475589920177    thread-15    msg-1
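
Note that KafkaToDB splits each message on commas and keeps only records with exactly three fields, so for rows to actually reach kafka_message each message should be comma-delimited. For example (the same sample fields as above, joined with commas):

1475589918658,thread-15,msg-0
1475589920177,thread-15,msg-1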

After that, we can see the messages printed on the console, and we can also query the data in the database.
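
For example, a quick check against the kafka_message table defined in section 1.2 (an illustrative query, run with any PostgreSQL client):

SELECT timeseq, thread, message FROM kafka_message ORDER BY timeseq DESC LIMIT 10;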

4. Spark program cannot find the JDBC driver

The console reported an exception saying the JDBC driver could not be found; the log is as follows.

java.sql.SQLException: No suitable driver found for jdbc:……

The problem can be resolved with the database driver configuration method described earlier. The following blog post also gives a good explanation of this problem:
http://www.codexiu.cn/spark/blog/12672/

Blog source: https://www.jianshu.com/p/a73c0c95d2fe
