Spark SQL 1.3 Test


Spark SQL 1.3

Refer to the official documentation: Spark SQL and DataFrame Guide.

Overview (introduction adapted from "Approachable and Inclusive: A Spark SQL 1.3.0 Overview")

  DataFrame provides a channel that connects all the major data sources and automatically converts them into a format that can be processed in parallel. Through it, Spark can serve every player in the big data ecosystem, whether a data scientist using R, a business analyst working in SQL, or an engineer who cares about efficiency and real-time performance.

Take a common scenario, log parsing: sometimes we need extra structured data (such as an IP-to-address mapping), which usually lives in MySQL. There used to be two ways to access it: have each worker fetch it from the database remotely, at the cost of extra network I/O; or use the JdbcRDD API to pull it into an RDD and then write fairly involved functions to do the lookup, which obviously means more code. Now one line of Spark code can turn the MySQL table into a DataFrame, with SQL queries supported on top of it.
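For contrast, here is a minimal sketch of the older JdbcRDD route. The connection URL, table name, columns, and id bounds are placeholders invented for illustration, not taken from the test environment described later; the one-line DataFrame alternative is shown in section 1.

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Sketch of the JdbcRDD route (Spark core API); the SQL must contain exactly two '?'
// placeholders, which Spark fills with the partition bounds.
val ipMapping = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://host:3306/db?user=root&password=123456"),
  "SELECT ip, address FROM ip_address WHERE ? <= id AND id <= ?",
  1, 1000000, 4, // lowerBound, upperBound, numPartitions
  (rs: ResultSet) => (rs.getString("ip"), rs.getString("address")))

On top of the resulting RDD you would still write your own join logic, which is exactly the extra code the DataFrame route avoids.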

In the previous article the text formats were already tested; this one tests Hive, HBase, MySQL, and Oracle, as well as join queries against temporary tables.

1. Access MySQL

Besides JSON, a DataFrame can now be backed by external data sources such as MySQL, Hive, HDFS, and PostgreSQL; relational databases are read through jdbc.

bin/spark-shell --driver-class-path ./lib/mysql-connector-java-5.1.24-bin.jar
val sc = new org.apache.spark.SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:mysql://192.168.0.110:3306/hidata?user=root&password=123456", "dbtable" -> "loadinfo"))

bin/spark-sql --driver-class-path ./lib/mysql-connector-java-5.1.24-bin.jar
spark-sql> create temporary table jdbcmysql using org.apache.spark.sql.jdbc options (url "jdbc:mysql://192.168.0.110:3306/hidata?user=root&password=123456", dbtable "loadinfo")
spark-sql> select * from jdbcmysql;
-- Note: src is a table that already exists in Hive, not a temporary table in Spark SQL, so it can be queried directly.
-- A federated query between the Hive table and the MySQL table:
select * from src join jdbcmysql on (src.key = jdbcmysql.id);
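In the spark-shell session, the loaded DataFrame can also be registered as a temporary table and queried with SQL. A minimal sketch, assuming the Spark 1.3 shell session above; the temporary table name is arbitrary:

// Register the JDBC-backed DataFrame and query it with SQL from spark-shell.
jdbcDF.registerTempTable("loadinfo_tmp")
val rows = sqlContext.sql("select * from loadinfo_tmp limit 10")
rows.collect().foreach(println) // print a few rows to verify the connection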

2. Access Oracle

The approach is similar, but note that the connection URL is not the same; it took quite a while of trial and error to get it right.

bin/spark-shell --driver-class-path ./lib/ojdbc6.jar
val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:kang/123456@192.168.0.110:1521:orcl", "dbtable" -> "TEST"))

With the right URL, Spark is once again very handy.

The wrong URLs:

val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:@192.168.0.110:1521:orcl&user=kang&password=123456", "dbtable" -> "TEST"))
val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:@192.168.0.110:1521/orcl&user=kang&password=123456", "dbtable" -> "TEST"))

The errors encountered, and the most likely-looking solutions, saved here for later reference:

java.sql.SQLException: Io exception: NL Exception was generated (a JDBC data source problem)

Resolving Oracle ORA-12505, TNS: listener does not currently know of SID given in connect descriptor

The first form fails with an unrecognized SID: the driver actually takes orcl&user=kang&password=123456 as the SID when connecting, which is close but wrong. With plain JDBC the URL, user, and password are usually passed separately, which is where that habit came from.

Oracle JDBC URLs come in three forms:

1. Common SID form: jdbc:oracle:thin:username/password@x.x.x.1:1521:SID
2. Service name form: jdbc:oracle:thin:username/password@//x.x.x.1:1522/ABCD
3. RAC form: jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.1)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.2)(PORT=1521)))(LOAD_BALANCE=yes)(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=xxrac)))
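Applied back to this test, the service-name form would look roughly as follows. This is a sketch only: it assumes orcl is also the database's service name, and embeds the credentials in the URL just like the SID form above.

// Sketch: the same load call using the service-name (//host:port/service) form of the URL.
val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:oracle:thin:kang/123456@//192.168.0.110:1521/orcl",
  "dbtable" -> "TEST"))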

See here for details

3. Access Hive

For the relationship between Hive and Spark SQL, see the reference.

In fact, Spark SQL has supported Hive from the start. Spark provides a HiveContext, which is actually a subclass of SQLContext, but in practice sqlContext also supports Hive data sources. As long as Spark is deployed with the Hive option and the existing hive-site.xml file is copied into the $SPARK_HOME/conf path, we can query existing Hive tables, metadata included, directly from Spark.

1. The spark-sql way

spark-sql is an executable script in Spark's bin directory, designed to run Hive commands: instructions that used to be entered at the hive> prompt can now be entered at the spark-sql> prompt.
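For example (a trivial sketch, assuming an already working setup), the same command runs at either prompt:

hive> show tables;
spark-sql> show tables;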

spark-sql can use the built-in Hive metastore, or it can use the metastore of a Hive installation that already exists independently.

Configuration steps:

1. Copy hive-site.xml from Hive's conf directory into Spark's conf directory.

2. In hive-site.xml, remove the time units (such as ms and s) from time-valued configuration properties.

Otherwise the error is: Exception in thread "main" java.lang.RuntimeException: java.lang.NumberFormatException: For input string: "5s". At first glance it looks like an input format issue; the real cause is the time unit, as sketched below.
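A minimal sketch of the fix, assuming the offending property is hive.metastore.client.connect.retry.delay (the actual property carrying the "5s" value in your hive-site.xml may differ):

<!-- hypothetical before: the bundled Hive 0.13 config parser cannot read the "5s" suffix -->
<property>
  <name>hive.metastore.client.connect.retry.delay</name>
  <value>5s</value>
</property>
<!-- after: a plain number of seconds -->
<property>
  <name>hive.metastore.client.connect.retry.delay</name>
  <value>5</value>
</property>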

3. Add the MySQL JDBC driver to Spark's classpath.

$ ./spark-sql
Spark assembly has been built with Hive, including Datanucleus jars on classpath
SET spark.sql.hive.version=0.13.1

If the assembly was not built with Hive, the script hints that Spark must be recompiled with two extra parameters (-Phive and -Phive-thriftserver).

Recompile: ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -DskipTests -Dhadoop.version=2.4.1 -Phive -Phive-thriftserver

(The driver classpath has already been specified in spark-defaults.conf.)

Create a table

spark-sql> create table word6 (id int, word string) row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 10.852 seconds

Import data

spark-sql> load data local inpath '/home/hadoop/word.txt' into table word6;
Copying data from file:/home/hadoop/word.txt
Copying file: file:/home/hadoop/word.txt
Loading data to table default.word6
Table default.word6 stats: [numFiles=1, numRows=0, totalSize=31, rawDataSize=0]
OK

Federated queries with other data sources

select * from src join jdbcmysql on (src.key = jdbcmysql.id);

2. The spark-shell way
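The same Hive tables can be queried from spark-shell through a HiveContext. A minimal sketch, assuming the same configuration as above and the word6 table created earlier:

// In spark-shell (sc is provided by the shell); HiveContext picks up hive-site.xml from conf.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df = hiveContext.sql("select word, count(*) as cnt from word6 group by word")
df.collect().foreach(println)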

  

4. Write DataFrame data to a Hive partition table

When a DataFrame writes data to Hive, the default target is Hive's default database, and insertInto has no parameter for specifying a database. Use the following approach to write data into a Hive table, or into a partition of a Hive table.

1. Write DataFrame data to a Hive table

Looking at the DataFrame class, there are several write APIs related to Hive tables:

registerTempTable(tableName: String): Unit
insertInto(tableName: String): Unit
insertInto(tableName: String, overwrite: Boolean): Unit
saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit

There are other overloads that are not listed here.

registerTempTable creates a Spark temporary table.

insertInto writes data into a table; as the signature shows, it cannot specify the database or partition, so it cannot write into a partition directly.

Writing into the Hive data warehouse requires specifying a database. The Hive table can be created in Hive itself, or with hiveContext.sql("CREATE TABLE ...").

The following code writes data into a table in the specified database:

case class Person(name: String, col1: Int, col2: String)
val sc = new org.apache.spark.SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._
hiveContext.sql("use DataBaseName")
val data = sc.textFile("path").map(x => x.split("\\s+")).map(x => Person(x(0), x(1).toInt, x(2)))
data.toDF().insertInto("tableName")

Create a case class to convert the records in the RDD into the case class type, then convert to a DataFrame with toDF. Before calling insertInto, first select the database with the hiveContext.sql("use DataBaseName") statement; the DataFrame data is then written into that Hive table.

2. Write DataFrame data to a specified partition of a Hive table

The Hive table can be created in Hive itself, or with hiveContext.sql("CREATE TABLE ..."). When using saveAsTable the storage format is limited: the default format is parquet, and json can be specified instead; if another format is needed, create the Hive table with an explicit statement as far as possible.
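For instance, a sketch using the saveAsTable overload listed above with json as the source; the table name is a placeholder, and df stands for any DataFrame such as data.toDF() from the previous example:

import org.apache.spark.sql.SaveMode

// Sketch: register the DataFrame as a metastore table stored as JSON (Spark 1.3 API).
val df = data.toDF()
df.saveAsTable("person_json", "json", SaveMode.Overwrite, Map.empty[String, String])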

The idea for writing into a partitioned table is to first register the DataFrame as a temporary table, then use a hiveContext.sql statement to write the data into the Hive partition table. Here is how:

case class Person(name: String, col1: Int, col2: String)
val sc = new org.apache.spark.SparkContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._
hiveContext.sql("use DataBaseName")
val data = sc.textFile("path").map(x => x.split("\\s+")).map(x => Person(x(0), x(1).toInt, x(2)))
data.toDF().registerTempTable("table1")
hiveContext.sql("insert into table2 partition (date='2015-04-02') select name, col1, col2 from table1")


Using the method above, DataFrame data can be written into a Hive partition table.
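A quick, hypothetical way to check the result is to read the partition back; show() prints the first rows of the DataFrame:

// Verify the write by querying the partition just written.
hiveContext.sql("select * from table2 where date = '2015-04-02'").show()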
