Spark SQL often needs to access the Hive metastore, through which it obtains the metadata of Hive tables. Starting with Spark 1.4.0, Spark SQL can talk to different versions of the Hive metastore with a simple configuration. Note that the metastore version is independent of the Hive version Spark SQL uses internally: for its internal operations (SerDes, UDFs, UDAFs, etc.) Spark SQL calls the Hive 1.2.1 classes.
Original article (discuss with the author): http://www.cnblogs.com/intsmaze/p/6618841.html
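A minimal sketch of that configuration, assuming Spark 1.4+ and the standard properties spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars (the application name below is made up); the properties can be set on the SparkConf before the HiveContext is created, or equivalently in spark-defaults.conf:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Point Spark SQL at a specific Hive metastore version; internally it still uses
// its built-in Hive 1.2.1 classes for SerDes, UDFs, UDAFs, etc.
val conf = new SparkConf()
  .setAppName("hive-metastore-version-demo")
  .set("spark.sql.hive.metastore.version", "1.2.1") // version of the external metastore
  .set("spark.sql.hive.metastore.jars", "builtin")  // use Spark's built-in Hive client jars

val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)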
Spark SQL and Hive can share one metastore database, but Spark SQL can also create a metastore of its own without relying on Hive, so you do not need to start Hive; as long as a metastore exists, Spark SQL can be used. However, if you want to persist the mapping between files and tables the way Hive does, you can reuse Hive's metastore through the HiveContext class provided by Spark, without starting the Hive service itself.
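A minimal sketch of this, assuming a spark-shell session where sc already exists and the metastore configured in hive-site.xml is reachable: tables created through HiveContext are registered in that shared metastore, so they persist across sessions just like tables created from Hive itself.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// The table definition (schema and HDFS location) is recorded in the shared metastore.
hiveContext.sql("CREATE TABLE IF NOT EXISTS default.person (id INT, name STRING, age INT)")
hiveContext.sql("SHOW TABLES").show() // the new table is now visible to Hive as well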
1. Copy Hive's hive-site.xml into the $SPARK_HOME/conf directory. This file tells the Hive metastore which database to store its metadata in; if that database does not yet exist, you can still point to one of your choosing, and the corresponding metastore database will be created automatically when the Spark job runs.
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.19.131:3306/hivedb?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hadoop</value>
  </property>
</configuration>
2. If HDFS is configured for high availability, also copy the hdfs-site.xml and core-site.xml files from the Hadoop cluster into the spark/conf folder.
3. Specify the location of the MySQL connector driver when starting spark-shell.
Spark standalone cluster mode:
bin/spark-shell --master spark://intsmaze:7077 --executor-memory 512m --driver-class-path /home/intsmaze/mysql-connector-java-5.1.35-bin.jar
Spark on YARN mode:
bin/spark-shell --master yarn --executor-memory 512m --total-executor-cores 2 --driver-class-path /home/intsmaze/mysql-connector-java-5.1.35-bin.jar
4. Execute SQL statements.
Call HQL through sqlContext.sql:
val rdd = sqlContext.sql("select * from default.person limit 2") // SQL can be used directly; just specify which database and table to query
rdd.write.json("hdfs://192.168.19.131:9000/personresult")
Using org.apache.spark.sql.hive.HiveContext:
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
hiveContext.sql("select * from default.person")
5. Start a SQL shell with the spark-sql command.
Specify the location of the MySQL connector driver when starting spark-sql (once spark-sql is running, it behaves just like the Hive CLI: you can type SQL statements directly).
bin/spark-sql --master spark://intsmaze:7077 --executor-memory 512m --total-executor-cores 3 --driver-class-path /home/intsmaze/mysql-connector-java-5.1.35-bin.jar
Then write SQL statements directly at the prompt:
SELECT * FROM Default.person limit 2
How does Spark SQL add data through the metastore? It doesn't: the metastore only holds the HDFS location of each table's data, not the data itself, so Spark SQL can create tables via the metastore but cannot add data to them with statements such as INSERT. Note that this is a different concept from saving a DataFrame's data to a database (a short sketch contrasting the two follows the driver options below).
6. Thrift JDBC/ODBC Server
Spark SQL implements a Thrift JDBC/ODBC server, which means we can send SQL statements to Spark SQL and execute them over a remote JDBC connection, just as with Hive. Before doing this, copy ${HIVE_HOME}/conf/hive-site.xml into the ${SPARK_HOME}/conf directory. Because my Hive stores its metadata in MySQL, Spark needs the MySQL connector driver to access that metadata. There are three ways to add the driver:
The first is to add the following line to spark-defaults.conf in the ${SPARK_HOME}/conf directory: spark.jars /intsmaze/lib/mysql-connector-java-5.1.26-bin.jar.
The second is to add: spark.driver.extraClassPath /intsmaze/lib2/mysql-connector-java-5.1.26-bin.jar. This approach also makes it convenient to add multiple dependent jars.
The third is to add --jars /intsmaze/lib2/mysql-connector-java-5.1.26-bin.jar at launch time.
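As mentioned above, here is a short sketch contrasting the two concepts, assuming a HiveContext named hiveContext, an existing table default.person, and made-up output names: the metastore only records table metadata and HDFS locations, while DataFrame writers are what actually produce data files.

val df = hiveContext.sql("select * from default.person")

// Writes plain result files to HDFS; the metastore is not involved at all.
df.write.json("hdfs://192.168.19.131:9000/person_json")

// Registers a new table in the metastore and writes its data files under the Hive warehouse directory.
df.write.saveAsTable("person_copy")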
Start the Thrift server
From the Spark root directory, execute ./sbin/start-thriftserver.sh to start the Thrift server.
./sbin/start-thriftserver.sh --jars /home/hadoop/mysql-connector-java-5.1.35-bin.jar --master yarn
start-thriftserver.sh is used much like spark-submit: it accepts all spark-submit parameters and additionally accepts --hiveconf parameters. Run it without any parameters to start in local mode; the default listening port is 10000.
Test with Beeline
From the Spark root directory, execute: ./bin/beeline
Connect to the JDBC/ODBC server:
beeline> !connect jdbc:hive2://localhost:10000
After connecting you will be prompted for a user name and password; the user name can be the currently logged-in Linux user name, and the password is empty.
JDBC connection in Java code
Next, open Eclipse and use JDBC to connect to HiveServer2; connecting to Hive works the same way. Add the following dependencies to pom.xml:
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>1.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.4.1</version>
</dependency>
<dependency>
  <groupId>jdk.tools</groupId>
  <artifactId>jdk.tools</artifactId>
  <version>1.6</version>
  <scope>system</scope>
  <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
Driver: org.apache.hive.jdbc.HiveDriver
URL: jdbc:hive2://192.168.19.131:10000/default
User name: hadoop (the Linux user that started the Thrift server)
Password: "" (the default password is empty)
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class Test1 {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:hive2://192.168.19.131:10000/default";
        try {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
        Connection conn = DriverManager.getConnection(url, "hadoop", "");
        Statement stmt = conn.createStatement();
        String sql = "select * from person limit 10";
        ResultSet res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println("id: " + res.getInt(1) + "\tname: " + res.getString(2) + "\tage: " + res.getInt(3));
        }
    }
}
Started this way, you can see in the YARN management UI that a long-running application has been submitted; this application is responsible for running the SQL statements, but it does not run them in parallel: if two users submit queries at the same time, the second one must wait until the first has finished.
Spark SQL Visualization
The first scenario:
The Spark SQL code is packaged into a jar that takes the SQL statement and the result storage location as parameters. Java code collects these parameters, assembles them into a command, and invokes a script that submits the jar to the cluster.
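A minimal sketch of such a packaged job (the object name SqlRunner and the argument layout are assumptions): args(0) is the SQL statement and args(1) is the HDFS directory where the result is stored.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SqlRunner {
  def main(args: Array[String]): Unit = {
    val Array(sql, outputPath) = args
    val sc = new SparkContext(new SparkConf().setAppName("sql-runner"))
    val hiveContext = new HiveContext(sc)
    hiveContext.sql(sql).write.json(outputPath) // store the query result where the caller asked
    sc.stop()
  }
}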
The second scenario:
Use the Thrift JDBC/ODBC server implemented by Spark SQL, as described on the Spark website and demonstrated above.
Finally, a note: this article sat as an unpublished draft for a long time. I have not worked with Spark for the past year, as I have been focusing on core Java technology.