The most anticipated feature in Spark 1.2 is External Data Sources, which lets you register an external data source as a temporary table and query it with SQL alongside existing tables. The External Data Sources API lives under the org.apache.spark.sql package.
For a detailed analysis, see OopsOutOfMemory's two excellent blog posts:
http://blog.csdn.net/oopsoom/article/details/42061077
http://blog.csdn.net/oopsoom/article/details/42064075
I tried implementing a simple external data source that reads from a relational database; the code is at https://github.com/luogankun/spark-jdbc.
It supports MySQL/Oracle/DB2 and a few simple data types. For now it does not support PrunedScan or PrunedFilteredScan, only TableScan; I will improve this in a follow-up.
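To give a feel for what a TableScan-only data source involves, here is a rough sketch against the Spark 1.2 sources API. It is not the actual spark-jdbc code: the class and option names just mirror the usage below, the schema is hardcoded, and the scan naively pulls everything through a single connection on the driver.

package com.luogankun.spark.jdbc

import java.sql.DriverManager

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.sources._

// Spark resolves the USING clause to <package>.DefaultSource.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    JdbcRelation(
      parameters("jdbc_table_name"),
      parameters("url"),
      parameters("user"),
      parameters("password"))(sqlContext)
}

case class JdbcRelation(table: String, url: String, user: String, password: String)
                       (@transient val sqlContext: SQLContext) extends TableScan {

  // Hardcoded for the sketch; the real code parses the sparksql_table_schema option instead.
  override def schema: StructType = StructType(Seq(
    StructField("tbl_id", IntegerType, nullable = true),
    StructField("tbl_name", StringType, nullable = true)))

  // Full table scan: fetch all rows over JDBC on the driver, then parallelize.
  // Good enough to illustrate TableScan, not suitable for large tables.
  override def buildScan(): RDD[Row] = {
    val conn = DriverManager.getConnection(url, user, password)
    val rows = try {
      val rs = conn.createStatement().executeQuery(s"SELECT tbl_id, tbl_name FROM $table")
      val buf = scala.collection.mutable.ArrayBuffer[Row]()
      while (rs.next()) buf += Row(rs.getInt(1), rs.getString(2))
      buf
    } finally {
      conn.close()
    }
    sqlContext.sparkContext.parallelize(rows)
  }
}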
Steps to use:
1. Compile the spark-jdbc code:
sbt package
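If you build it yourself, the build definition is roughly of this shape (a sketch only: the artifact name matches the jar referenced in the next step, but the version numbers are assumptions, so check the repository's own build.sbt):

// build.sbt (sketch; exact names and versions are assumptions)
name := "spark-jdbc"

version := "0.1"

scalaVersion := "2.10.4"

// "provided" because the Spark jars are already on the classpath at runtime.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.2.0" % "provided"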
2. Add the jar packages to spark-env.sh:
export SPARK_CLASSPATH=/home/spark/software/source/spark_package/spark-jdbc/target/scala-2.10/spark-jdbc_2.10-0.1.jar:$SPARK_CLASSPATH
export SPARK_CLASSPATH=/home/spark/lib/ojdbc14-10.2.0.3.jar:$SPARK_CLASSPATH
export SPARK_CLASSPATH=/home/spark/lib/db2jcc-9.7.jar:$SPARK_CLASSPATH
export SPARK_CLASSPATH=/home/spark/lib/mysql-connector-java-3.0.jar:$SPARK_CLASSPATH
3. Start spark-sql and create the temporary table:
CREATE TEMPORARY TABLE jdbc_table
USING com.luogankun.spark.jdbc
OPTIONS (
  sparksql_table_schema '(tbl_id int, tbl_name string, tbl_type string)',
  jdbc_table_name 'TBLS',
  jdbc_table_schema '(tbl_id, tbl_name, tbl_type)',
  url 'jdbc:mysql://hadoop000:3306/hive',
  user 'root',
  password 'root',
  num_partitions '6',
  where "tbl_id > 766 and tbl_name = 'order_created_4_partition'"
);
Parameter description:
sparksql_table_schema: Spark SQL table field names and types
jdbc_table_name: relational database table name
jdbc_table_schema: relational database table field names
url: relational database JDBC URL
user: relational database username
password: relational database password
num_partitions: number of partitions, defaults to 5, can be omitted
where: filter condition, can be omitted
SELECT * FROM jdbc_table;
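Oracle and DB2 are registered the same way; only the url option (and the driver jar on the classpath) changes. A hypothetical Oracle example, with made-up host, SID, table, and credentials:

CREATE TEMPORARY TABLE oracle_table
USING com.luogankun.spark.jdbc
OPTIONS (
  sparksql_table_schema '(emp_id int, emp_name string)',
  jdbc_table_name 'EMP',
  jdbc_table_schema '(emp_id, emp_name)',
  url 'jdbc:oracle:thin:@hadoop000:1521:orcl',
  user 'scott',
  password 'tiger'
);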
Problems encountered during the testing process:
The code above works fine against MySQL, but when running against Oracle or DB2 it fails with the following error:
[Executor task launch worker-0] ERROR Logging$class: Error in TaskCompletionListener
java.lang.AbstractMethodError: oracle.jdbc.driver.OracleResultSetImpl.isClosed()Z
    at org.apache.spark.rdd.JdbcRDD$$anon$1.close(JdbcRDD.scala)
    at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala)
    at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala)
    at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala)
    at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala)
    at org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala)
    at org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:108)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala)
    at org.apache.spark.TaskContext.markTaskCompleted(TaskContext.scala:108)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala)
    at org.apache.spark.scheduler.Task.run(Task.scala)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
[Executor task launch worker-1] ERROR Logging$class: Error in TaskCompletionListener
Digging into the JdbcRDD source code, the problem turned out to be:
the Oracle driver I used here is ojdbc14-10.2.0.3.jar, and from what I could find, that driver's ResultSet implementation class does not implement the isClosed() method;
The issue is tracked at: https://issues.apache.org/jira/browse/SPARK-5239
Workarounds:
1. Upgrade the driver jar;
2. Temporarily remove the two isClosed() checks (https://github.com/apache/spark/pull/4033); see the sketch below.
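To make the second workaround concrete, here is a paraphrased sketch (my reconstruction of the idea, not the exact JdbcRDD source): isClosed() only exists since JDBC 4, so a driver compiled against JDBC 3, such as ojdbc14, throws AbstractMethodError when it is called; and because AbstractMethodError is an Error rather than an Exception, the surrounding catch does not swallow it and the task fails. Dropping the guard avoids the call entirely.

import java.sql.ResultSet

object CloseSketch {
  // Problematic pattern: guard close() with isClosed(). With a JDBC 3 driver the
  // isClosed() call itself throws AbstractMethodError, which the catch below does
  // not handle (it only catches Exception), so the task fails.
  def closeWithGuard(rs: ResultSet): Unit = {
    try {
      if (rs != null && !rs.isClosed()) rs.close()
    } catch {
      case e: Exception => println(s"Exception closing resultset: $e")
    }
  }

  // Workaround: skip isClosed() and rely on the try/catch to absorb close() failures.
  def closeQuietly(rs: ResultSet): Unit = {
    try {
      if (rs != null) rs.close()
    } catch {
      case e: Exception => println(s"Exception closing resultset: $e")
    }
  }
}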
I will continue to improve this with PrunedScan and PrunedFilteredScan implementations; the current implementation is admittedly "ugly", so please bear with it for now.