Spark 1.4: loading MySQL data to create a DataFrame, and problems with join methods


First, we use the new API to connect to MySQL, load the data, and create the DataFrames:

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{SaveMode, DataFrame}
import org.apache.spark.sql.hive.HiveContext
import scala.collection.mutable.ArrayBuffer
import java.sql.{Connection, DriverManager}

val sqlContext = new HiveContext(sc)
val mysqlUrl = "jdbc:mysql://10.180.211.100:3306/appcocdb?user=appcoc&password=asia123"
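Two assumptions are baked into the snippet above: `sc` is an existing SparkContext (as in spark-shell), and the MySQL connector jar is already on the classpath (for example, shipped via spark-submit --jars). If the connector is missing, the load fails with a "No suitable driver" error before any query runs; with some older connector versions the driver class also has to be registered explicitly, along these lines:

// Explicit registration; only needed if DriverManager cannot
// locate the MySQL driver on its own.
Class.forName("com.mysql.jdbc.Driver")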

val ci_mda_sys_table = sqlContext.jdbc(mysqlUrl, "ci_mda_sys_table").cache()

val ci_mda_sys_table_column = sqlContext.jdbc(mysqlUrl, "ci_mda_sys_table_column").cache()

val ci_label_ext_info = sqlContext.jdbc(mysqlUrl, "ci_label_ext_info").cache()

val ci_label_info = sqlContext.jdbc(mysqlUrl, "ci_label_info").cache()

val ci_approve_status = sqlContext.jdbc(mysqlUrl, "ci_approve_status").cache()

val dim_coc_label_count_rules = sqlContext.jdbc(mysqlUrl, "dim_coc_label_count_rules").cache()
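Incidentally, Spark 1.4 deprecates SQLContext.jdbc in favor of the DataFrameReader API, so each of the loads above could also be written as follows (a minimal sketch using the same URL and table names):

import java.util.Properties

// Same load expressed through the Spark 1.4 DataFrameReader API.
val props = new Properties()  // extra JDBC connection options could go here
val ci_mda_sys_table = sqlContext.read.jdbc(mysqlUrl, "ci_mda_sys_table", props).cache()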

Next, associate the tables on their shared ID columns:

val labels = ci_mda_sys_table.join(ci_mda_sys_table_column,
  ci_mda_sys_table("table_id") === ci_mda_sys_table_column("table_id"), "inner").cache()
labels.join(ci_label_ext_info,
  ci_mda_sys_table_column("column_id") === ci_label_ext_info("column_id"), "inner").cache()
labels.join(ci_label_info,
  ci_label_ext_info("label_id") === ci_label_info("label_id"), "inner").cache()
labels.join(ci_approve_status,
  ci_label_info("label_id") === ci_approve_status("resource_id"), "inner").cache()
labels.filter(
  ci_approve_status("curr_approve_status_id") === 107 and
  (ci_label_info("data_status_id") === 1 || ci_label_info("data_status_id") === 2) and
  ci_label_ext_info("count_rules_code").isNotNull and
  ci_mda_sys_table("update_cycle") === 1).cache()

This promptly errors out: at the third join, the ID column cannot be found. A very strange problem...

With nothing else to try, I turned to the join method specified in the official Spark 1.4 API documentation:

val labels = ci_mda_sys_table.join(ci_mda_sys_table_column, "table_id")
labels.join(ci_label_ext_info, "column_id")
labels.join(ci_label_info, "label_id")
labels.join(ci_approve_status).where($"label_id" === $"resource_id")

Again it errors out; the ID column still cannot be found...

Finally, out of options, I fell back on the original method: create a temporary table as a soft link and load the data through SQL. That worked, though I do not understand why...

val ci_mda_sys_table_ddl = s"""
  |CREATE TEMPORARY TABLE ci_mda_sys_table
  |USING org.apache.spark.sql.jdbc
  |OPTIONS (
  |  url '${mysqlUrl}',
  |  dbtable 'ci_mda_sys_table'
  |)
""".stripMargin

sqlContext.sql(ci_mda_sys_table_ddl)
val ci_mda_sys_table = sqlContext.sql("SELECT * FROM ci_mda_sys_table").cache()
// val ci_mda_sys_table = sqlContext.jdbc(mysqlUrl, "ci_mda_sys_table").cache()

val ci_mda_sys_table_column_ddl = s"""
  |CREATE TEMPORARY TABLE ci_mda_sys_table_column
  |USING org.apache.spark.sql.jdbc
  |OPTIONS (
  |  url '${mysqlUrl}',
  |  dbtable 'ci_mda_sys_table_column'
  |)
""".stripMargin

sqlContext.sql(ci_mda_sys_table_column_ddl)
val ci_mda_sys_table_column = sqlContext.sql("SELECT * FROM ci_mda_sys_table_column").cache()
// val ci_mda_sys_table_column = sqlContext.jdbc(mysqlUrl, "ci_mda_sys_table_column").cache()

.........
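With the tables registered this way, the joins can also be expressed directly in SQL. A sketch, assuming the remaining tables (elided above) were registered the same way:

// Same multi-table association as before, pushed into a single SQL query
// over the registered temporary tables.
val labels = sqlContext.sql("""
  SELECT *
  FROM ci_mda_sys_table t
  JOIN ci_mda_sys_table_column c ON t.table_id = c.table_id
  JOIN ci_label_ext_info e ON c.column_id = e.column_id
  JOIN ci_label_info l ON e.label_id = l.label_id
  JOIN ci_approve_status a ON l.label_id = a.resource_id
  WHERE a.curr_approve_status_id = 107
    AND l.data_status_id IN (1, 2)
    AND e.count_rules_code IS NOT NULL
    AND t.update_cycle = 1
""").cache()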

That solved the immediate problem, but why the direct load does not work still needs to be studied.
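One plausible explanation, offered here as an observation rather than something the original post verified: DataFrames are immutable, and each `labels.join(...)` call above returns a new DataFrame that is never assigned to anything, so `labels` remains just the first two-table join. By the third join, the condition references `ci_label_ext_info` columns that `labels` does not contain, which would produce exactly this kind of "column not found" error. A sketch of the same pipeline with each result threaded into the next step:

// Chain the joins so every intermediate result feeds the next one
// instead of being discarded.
val joined = ci_mda_sys_table
  .join(ci_mda_sys_table_column,
    ci_mda_sys_table("table_id") === ci_mda_sys_table_column("table_id"), "inner")
  .join(ci_label_ext_info,
    ci_mda_sys_table_column("column_id") === ci_label_ext_info("column_id"), "inner")
  .join(ci_label_info,
    ci_label_ext_info("label_id") === ci_label_info("label_id"), "inner")
  .join(ci_approve_status,
    ci_label_info("label_id") === ci_approve_status("resource_id"), "inner")
  .filter(
    ci_approve_status("curr_approve_status_id") === 107 &&
    (ci_label_info("data_status_id") === 1 || ci_label_info("data_status_id") === 2) &&
    ci_label_ext_info("count_rules_code").isNotNull &&
    ci_mda_sys_table("update_cycle") === 1)
  .cache()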

Along the way, here is the solution if you run into the following error:

15/11/19 10:57:12 INFO BlockManagerInfo: Removed broadcast_3_piece0 on cbg6aocdp9:49897 in memory (size: 8.4 KB, free: 1060.3 MB)
15/11/19 10:57:12 INFO BlockManagerInfo: Removed broadcast_3_piece0 on cbg6aocdp5:45978 in memory (size: 8.4 KB, free: 1060.3 MB)
15/11/19 10:57:12 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 10.176.238.11:38968 in memory (size: 8.2 KB, free: 4.7 GB)
15/11/19 10:57:12 INFO BlockManagerInfo: Removed broadcast_2_piece0 on cbg6aocdp4:55199 in memory (size: 8.2 KB, free: 1060.3 MB)
15/11/19 10:57:12 INFO ContextCleaner: Cleaned shuffle 0
15/11/19 10:57:12 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 10.176.238.11:38968 in memory (size: 6.5 KB, free: 4.7 GB)
15/11/19 10:57:12 INFO BlockManagerInfo: Removed broadcast_1_piece0 on cbg6aocdp8:55706 in memory (size: 6.5 KB, free: 1060.3 MB)
Target_table_code: ========================it03
Exception in thread "main" java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
        at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:121)
        at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:125)
        at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1269)
        at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1203)
        at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1262)
        at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:176)
        at org.apache.spark.sql.DataFrame.show(DataFrame.scala:331)
        at main.asiainfo.coc.impl.IndexMakerObj$$anonfun$makeIndexsAndLabels$1.apply(IndexMakerObj.scala:218)
        at main.asiainfo.coc.impl.IndexMakerObj$$anonfun$makeIndexsAndLabels$1.apply(IndexMakerObj.scala:137)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at main.asiainfo.coc.impl.IndexMakerObj$.makeIndexsAndLabels(IndexMakerObj.scala:137)
        at main.asiainfo.coc.CocDss$.main(CocDss.scala:23)
        at main.asiainfo.coc.CocDss.main(CocDss.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
        ... 71 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
        at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
        at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:175)
        at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
        ... 76 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2018)
        at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)
        ... more

In the end, the cause is clear: the Hadoop data is compressed in LZO format, so for Spark to read it, the hadoop-lzo jar must be added to the classpath.
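Concretely, that means shipping the jar with the job, for example with spark-submit --jars, or pointing the executor classpath at it in the SparkConf. A sketch, where the jar location is an assumption about the cluster layout:

import org.apache.spark.SparkConf

// Example path only; point this at wherever hadoop-lzo is installed
// on your cluster. The driver-side classpath normally has to be
// supplied at launch time (spark-submit --driver-class-path), since
// the driver JVM is already running by the time this conf is read.
val conf = new SparkConf()
  .set("spark.executor.extraClassPath", "/usr/lib/hadoop/lib/hadoop-lzo.jar")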

