Spark SQL 1.3
Refer to the official documentation: Spark SQL and DataFrame Guide
Overview reference: "Approachable and inclusive: an overview of Spark SQL 1.3.0"
DataFrame provides a channel that connects all the major data sources and automatically converts them into a format that can be processed in parallel. Through it, Spark can serve every player in the big data ecosystem, whether a data scientist using R, a business analyst working in SQL, or an engineer who cares about efficiency and real-time performance.
Consider a common scenario: log parsing. Sometimes we need some extra structured data (such as an IP-to-address mapping), and such data usually lives in MySQL. There used to be two ways to access it: have each worker connect to the database remotely, at the cost of extra network I/O; or use JdbcRDD to pull the table into RDD form and then write fairly involved functions to do the lookups, which obviously means more code. Now Spark can turn a MySQL table into a DataFrame in a single line of code, and SQL queries are supported on it.
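For comparison, here is a minimal sketch of the older JdbcRDD route; the ip_mapping table and its columns are made up for illustration, and the connection URL is the one used in the MySQL example below.

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

// The query must contain two "?" bind parameters for partitioning,
// and every row has to be mapped out of the ResultSet by hand.
val ipMapping = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://192.168.0.110:3306/hidata?user=root&password=123456"),
  "select ip, address from ip_mapping where id >= ? and id <= ?",
  1, 1000000, 3,
  rs => (rs.getString("ip"), rs.getString("address")))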
In a previous article the text formats were already tested; this time we test join queries between Hive, HBase, MySQL, Oracle, and temporary tables.
1. Access MySQL
In addition to JSON, DataFrame can now use external data sources such as MySQL, Hive, HDFS, and PostgreSQL; relational databases are read through JDBC.
bin/spark-shell --driver-class-path ./lib/mysql-connector-java-5.1.24-bin.jar

val sc = new org.apache.spark.SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://192.168.0.110:3306/hidata?user=root&password=123456",
  "dbtable" -> "loadinfo"))

bin/spark-sql --driver-class-path ./lib/mysql-connector-java-5.1.24-bin.jar

spark-sql> create temporary table jdbcmysql using org.apache.spark.sql.jdbc options (url "jdbc:mysql://192.168.0.110:3306/hidata?user=root&password=123456", dbtable "loadinfo")
spark-sql> select * from jdbcmysql;
Note that src is a table that already exists in Hive, not a temporary table in Spark SQL, so it can be used directly. A federated query between the Hive table and the MySQL table:
select * from src join jdbcmysql on (src.key=jdbcmysql.id);
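In spark-shell the loaded DataFrame can likewise be queried with SQL by registering it as a temporary table first; a small sketch reusing the jdbcDF defined above:

// Register the MySQL-backed DataFrame as a temporary table and query it with SQL.
jdbcDF.registerTempTable("jdbcmysql")
sqlContext.sql("select * from jdbcmysql").show()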
2. Access Oracle
The approach is similar, but note that the connection URL is different; it took quite a while to get right.
bin/spark-shell --driver-class-path ./lib/ojdbc6.jar

val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:oracle:thin:kang/123456@192.168.0.110:1521:orcl",
  "dbtable" -> "TEST"))
Once this works, Spark can happily query Oracle as well.
The wrong URL:
val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:@192.168.0.110:1521:orcl&user=kang&password=123456", "dbtable" -> "TEST"))
val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:@192.168.0.110:1521/orcl&user=kang&password=123456", "dbtable" -> "TEST"))
The errors encountered, and the most promising-looking solutions, are recorded here for later reference:
java.sql.SQLException: Io exception: NL Exception was generated - error resolution (JDBC data source problem)
Resolving Oracle ORA-12505, TNS:listener does not currently know of SID given in connect descriptor
With the first form, the error says the SID is not recognized: it actually treats orcl&user=kang&password=123456 as the SID, which is close but not right. When connecting to a database over JDBC, the URL, user, and password are usually passed separately; lesson learned.
Oracle JDBC URLs come in three forms:
1. Plain SID form: jdbc:oracle:thin:username/password@x.x.x.1:1521:SID
2. Service name form: jdbc:oracle:thin:username/password@//x.x.x.1:1522/ABCD
3. RAC form: jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.1)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.2)(PORT=1521)))(LOAD_BALANCE=yes)(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=xxrac)))
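As a sketch, the second (service name) form plugged into the load call from above looks like this; the host, port, service name, and credentials are placeholders:

// The user and password sit in front of the "@", not as URL query parameters.
val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:oracle:thin:kang/123456@//192.168.0.110:1521/orcl",
  "dbtable" -> "TEST"))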
See here for details
3. Access Hive
For the relationship between Hive and Spark SQL, see
In fact, Spark SQL has supported Hive from the start. Spark provides a HiveContext, which is actually a subclass of SQLContext, but in practice sqlContext also supports Hive data sources. As long as Spark is built with the Hive option and the existing hive-site.xml file is copied to the $SPARK_HOME/conf path, we can query Hive tables whose metadata already exists directly from Spark.
1. The spark-sql way
spark-sql is an executable script in Spark's bin directory, designed to execute Hive commands: the instructions you used to type at the hive> prompt can be typed at the spark-sql> prompt instead.
spark-sql can use the built-in Hive metastore, or the metastore of a Hive installation that has already been deployed independently.
Configuration steps:
1. Copy hive-site.xml from Hive's conf directory to Spark's conf directory
2. Remove the time-unit suffixes (such as ms, s) from time-valued properties in hive-site.xml
Otherwise you get the error: Exception in thread "main" java.lang.RuntimeException: java.lang.NumberFormatException: For input string: "5s" (at first I thought it was an input format issue)
3. Add the MySQL JDBC driver to Spark's classpath
[... bin]$ ./spark-sql
Spark assembly has been built with Hive, including Datanucleus jars on classpath
set spark.sql.hive.version=0.13.1
If instead you get a hint that Spark must be compiled with the two Hive parameters, recompile it:
./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -DskipTests -Dhadoop.version=2.4.1 -Phive -Phive-thriftserver
The driver classpath can also be specified in spark-defaults.conf instead of on the command line.
Create a table
spark-sql> create table word6 (id int, word string) row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 10.852 seconds
Import data
spark-sql> load data local inpath '/home/hadoop/word.txt' into table word6;
Copying data from file:/home/hadoop/word.txt
Copying file: file:/home/hadoop/word.txt
Loading data to table default.word6
Table default.word6 stats: [numFiles=1, numRows=0, totalSize=31, rawDataSize=0]
OK
Federated queries with other data sources
select * from src join jdbcmysql on (src.key=jdbcmysql.id);
2. The spark-shell way
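A minimal sketch of the same workflow from spark-shell, assuming the hive-site.xml configuration above and the MySQL driver on the classpath (the tables word6, loadinfo, and src are the ones used earlier):

bin/spark-shell --driver-class-path ./lib/mysql-connector-java-5.1.24-bin.jar

// HiveContext picks up the metastore configured in $SPARK_HOME/conf/hive-site.xml.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Query an existing Hive table directly.
hiveContext.sql("select * from word6").collect().foreach(println)

// Load the MySQL table as a DataFrame and join it with the Hive table in SQL.
val jdbcDF = hiveContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://192.168.0.110:3306/hidata?user=root&password=123456",
  "dbtable" -> "loadinfo"))
jdbcDF.registerTempTable("jdbcmysql")
hiveContext.sql("select * from src join jdbcmysql on (src.key = jdbcmysql.id)").collect()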
4. Write DataFrame data to a Hive partitioned table
When a DataFrame writes data to Hive, it writes to Hive's default database by default, and insertInto has no parameter for specifying the database. Use the approach below to write data into a specific Hive table or into a partition of a Hive table.
1. Write DataFrame data to a Hive table
Looking at the DataFrame class, there are several write APIs related to Hive tables:
registerTempTable(tableName: String): Unit
insertInto(tableName: String): Unit
insertInto(tableName: String, overwrite: Boolean): Unit
saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit
(There are many other overloads that are not listed here.)
The registerTempTable function creates a Spark temporary table.
The insertInto function writes data into an existing table; as you can see, it cannot specify information such as the database or partition, so it cannot write to a partitioned table directly.
Writing data into the Hive data warehouse requires specifying the database. The Hive table can be created in Hive itself, or with hiveContext.sql("create table ....").
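For example, assuming a HiveContext like the one in the example below, the target table could be created first; the database, table, and column names here are the placeholders used in that example:

hiveContext.sql("use DataBaseName")
hiveContext.sql("create table if not exists tableName (name string, col1 int, col2 string)")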
The following statements write data to the specified database and table:
case class Person(name: String, col1: Int, col2: String)
val sc = new org.apache.spark.SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._
hiveContext.sql("use DataBaseName")
val data = sc.textFile("path").map(x => x.split("\\s+")).map(x => Person(x(0), x(1).toInt, x(2)))
data.toDF().insertInto("tableName")
Create a case class to convert the data in the RDD to the case class type, then turn it into a DataFrame with toDF. Before calling insertInto, specify the database with the hiveContext.sql("use DataBaseName") statement; the DataFrame data is then written into the Hive table.
2. Write DataFrame data to a specified partition of a Hive table
The Hive table can be created in Hive itself or with hiveContext.sql("create table ..."). When using saveAsTable the storage format is restricted: the default format is parquet, and it can be set to json; if another format is needed, create the Hive table with an explicit create-table statement instead.
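As a sketch of that point, the saveAsTable overload listed above can write a DataFrame out as a json-backed table instead of the default parquet; the table name person_json is hypothetical and data is the DataFrame built in the previous example:

import org.apache.spark.sql.SaveMode

// Store the table as json rather than the default parquet; SaveMode controls
// what happens if the table already exists (Overwrite / Append / ErrorIfExists).
data.toDF().saveAsTable("person_json", "json", SaveMode.Overwrite, Map.empty[String, String])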
The idea behind writing to a partitioned table is to first write the DataFrame data into a temporary table, and then write it into the Hive partitioned table with a hiveContext.sql statement. Here's how:
case class Person(name: String, col1: Int, col2: String)
val sc = new org.apache.spark.SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._
hiveContext.sql("use DataBaseName")
val data = sc.textFile("path").map(x => x.split("\\s+")).map(x => Person(x(0), x(1).toInt, x(2)))
data.toDF().registerTempTable("table1")
hiveContext.sql("insert into table2 partition (date='2015-04-02') select name, col1, col2 from table1")
With the above approach, DataFrame data can be written into a Hive partitioned table.