Spark SQL External Data Sources: A Simple JDBC Implementation


The most anticipated feature in Spark 1.2 is External Data Sources, which lets you register an external data source directly as a temporary table that can then be queried via SQL alongside existing tables. The External Data Sources API code lives in the org.apache.spark.sql package.


A detailed analysis can be found in OopsOutOfMemory's two excellent blog posts:

http://blog.csdn.net/oopsoom/article/details/42061077

http://blog.csdn.net/oopsoom/article/details/42064075

As an exercise, I implemented a simple external data source that reads from relational databases; the code is at: https://github.com/luogankun/spark-jdbc

It supports MySQL, Oracle, and DB2, as well as a few simple data types. For now it only supports TableScan, not PrunedScan or PrunedFilteredScan; those will be added later. A rough outline of what a TableScan-only source looks like is sketched below.
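For context, here is a minimal sketch (not the actual spark-jdbc code) of a TableScan-only data source against the Spark 1.2 Data Sources API; the class names, the hard-coded schema, and the bounds passed to JdbcRDD are all illustrative:

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.{JdbcRDD, RDD}
import org.apache.spark.sql._
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}

// Spark looks up a class named DefaultSource in the package named after USING.
class DefaultSource extends RelationProvider {
  // Receives the OPTIONS map from CREATE TEMPORARY TABLE ... USING ... OPTIONS (...)
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation = {
    val url      = parameters.getOrElse("url", sys.error("Option 'url' is required"))
    val user     = parameters.getOrElse("user", "")
    val password = parameters.getOrElse("password", "")
    new SimpleJdbcRelation(url, user, password)(sqlContext)
  }
}

class SimpleJdbcRelation(url: String, user: String, password: String)
                        (@transient val sqlContext: SQLContext) extends TableScan {

  // Schema exposed to Spark SQL; hard-coded here, but parsed from the
  // sparksql_table_schema option in a real implementation.
  override def schema: StructType = StructType(Seq(
    StructField("tbl_id", IntegerType, nullable = true),
    StructField("tbl_name", StringType, nullable = true),
    StructField("tbl_type", StringType, nullable = true)))

  // TableScan only: every column of every matching row is always returned,
  // so Spark cannot push column pruning or filters down to the database.
  override def buildScan(): RDD[Row] = {
    new JdbcRDD(
      sqlContext.sparkContext,
      () => DriverManager.getConnection(url, user, password),
      "SELECT tbl_id, tbl_name, tbl_type FROM TBLS WHERE ? <= tbl_id AND tbl_id <= ?",
      1, 10000, 6, // lowerBound, upperBound, numPartitions
      (rs: ResultSet) => Row(rs.getInt(1), rs.getString(2), rs.getString(3)))
  }
}

With a class like this on the classpath, the USING clause in step 3 below simply points Spark at the package containing DefaultSource.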

Steps to use:

1. Compile the spark-jdbc code:

sbt package
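The project builds with sbt. A hypothetical minimal build.sbt for such a project (the actual spark-jdbc build definition may differ) would look like:

// Minimal sbt build for a Spark 1.2 external data source (assumed, not the actual file)
name := "spark-jdbc"

version := "0.1"

scalaVersion := "2.10.4"

// spark-sql provides the Data Sources API; "provided" because the cluster supplies Spark at runtime
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.2.0" % "provided"

The packaged jar then lands under target/scala-2.10/, which is the path referenced in the next step.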

2. Add the jars to spark-env.sh:

export SPARK_CLASSPATH=/home/spark/software/source/spark_package/spark-jdbc/target/scala-2.10/spark-jdbc_2.10-0.1.jar:$SPARK_CLASSPATH
export SPARK_CLASSPATH=/home/spark/lib/ojdbc14-10.2.0.3.jar:$SPARK_CLASSPATH
export SPARK_CLASSPATH=/home/spark/lib/db2jcc-9.7.jar:$SPARK_CLASSPATH
export SPARK_CLASSPATH=/home/spark/lib/mysql-connector-java-3.0.jar:$SPARK_CLASSPATH

3. Start spark-sql and register the temporary table:

CREATE TEMPORARY TABLE jdbc_table
USING com.luogankun.spark.jdbc
OPTIONS (
    sparksql_table_schema '(tbl_id int, tbl_name string, tbl_type string)',
    jdbc_table_name 'TBLS',
    jdbc_table_schema '(tbl_id, tbl_name, tbl_type)',
    url 'jdbc:mysql://hadoop000:3306/hive',
    user 'root',
    password 'root',
    num_partitions '6',
    where "tbl_id > 766 and tbl_name = 'order_created_4_partition'"
);

Parameter description:

sparksql_table_schema: field names and types of the Spark SQL table

jdbc_table_name: relational database table name

jdbc_table_schema: relational database table field names

url: JDBC connection URL of the relational database

user: relational database username

password: relational database password

num_partitions: number of partitions; defaults to 5 and can be omitted

where: filter condition applied to the query; can be omitted

SELECT * FROM jdbc_table;
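For intuition, the jdbc_table_schema, jdbc_table_name, and where options plausibly combine into the query text sent over JDBC, along the lines of this hypothetical helper (the actual spark-jdbc code may assemble it differently):

// Hypothetical helper (not the actual spark-jdbc code): combining the
// OPTIONS values into the SQL statement run against the database.
def buildQuery(opts: Map[String, String]): String = {
  val columns = opts("jdbc_table_schema").stripPrefix("(").stripSuffix(")")
  val table   = opts("jdbc_table_name")
  val where   = opts.get("where").map(" WHERE " + _).getOrElse("")
  s"SELECT $columns FROM $table$where"
}
// For the OPTIONS above this would produce:
//   SELECT tbl_id, tbl_name, tbl_type FROM TBLS
//   WHERE tbl_id > 766 and tbl_name = 'order_created_4_partition'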

Problems encountered during the testing process:

The code above works fine against MySQL, but against Oracle or DB2 it fails with the following error:

[Executor task launch worker-0] ERROR Logging$class: Error in TaskCompletionListener
java.lang.AbstractMethodError: oracle.jdbc.driver.OracleResultSetImpl.isClosed()Z
    at org.apache.spark.rdd.JdbcRDD$$anon$1.close(JdbcRDD.scala)
    at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala)
    at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala)
    at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala)
    at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala)
    at org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala)
    at org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:108)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala)
    at org.apache.spark.TaskContext.markTaskCompleted(TaskContext.scala:108)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala)
    at org.apache.spark.scheduler.Task.run(Task.scala)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
[Executor task launch worker-1] ERROR Logging$class: Error in TaskCompletionListener

Digging into the JdbcRDD source code, the problem turned out to be this: JdbcRDD's cleanup code calls ResultSet.isClosed(), which is a JDBC 4.0 method. The Oracle driver I used here, ojdbc14-10.2.0.3.jar, is an older pre-JDBC-4.0 driver whose ResultSet implementation does not provide that method, hence the AbstractMethodError.

See the JIRA issue: https://issues.apache.org/jira/browse/SPARK-5239

Workarounds:

1. Upgrade the driver to one that implements the JDBC 4.0 isClosed() method;

2. Temporarily remove the two isClosed() checks (https://github.com/apache/spark/pull/4033), as sketched below.
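Roughly, the cleanup code in JdbcRDD's iterator looks like the following, and the second workaround amounts to dropping the isClosed() guards on the ResultSet and Statement (a sketch based on the Spark 1.2 JdbcRDD; the exact code in the patch may differ):

override def close() {
  try {
    // rs.isClosed() is what triggers the AbstractMethodError on pre-JDBC-4.0
    // drivers such as ojdbc14; the workaround closes unconditionally instead.
    if (null != rs) {          // originally: if (null != rs && !rs.isClosed())
      rs.close()
    }
  } catch {
    case e: Exception => logWarning("Exception closing resultset", e)
  }
  try {
    if (null != stmt) {        // originally: if (null != stmt && !stmt.isClosed())
      stmt.close()
    }
  } catch {
    case e: Exception => logWarning("Exception closing statement", e)
  }
  try {
    // Connection.isClosed() predates JDBC 4.0, so this guard is safe to keep.
    if (null != conn && !conn.isClosed()) {
      conn.close()
    }
  } catch {
    case e: Exception => logWarning("Exception closing connection", e)
  }
}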

I will continue to improve this by implementing PrunedScan and PrunedFilteredScan. The current implementation is admittedly "ugly"; treat it as a first cut.
