3. Spark SQL Integrated with Hive


Spark SQL often needs to access the Hive metastore, through which it can obtain the metadata of Hive tables. Starting with Spark 1.4.0, Spark SQL can access different versions of the Hive metastore with a simple configuration. Note that this setting only affects how Spark SQL talks to the metastore: for its internal operations (SerDes, UDFs, UDAFs, etc.) Spark SQL always calls the Hive 1.2.1 classes that it ships with.
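As an illustration, here is a minimal sketch, assuming the standard spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars properties and a Spark 1.x HiveContext; the version and application name below are examples only:

// Minimal sketch: pin the Hive metastore version that Spark SQL talks to.
// Internally Spark SQL still uses its built-in Hive 1.2.1 classes for SerDes/UDFs/UDAFs.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("spark-sql-hive-metastore")
  .set("spark.sql.hive.metastore.version", "1.2.1") // version of the metastore being accessed (example)
  .set("spark.sql.hive.metastore.jars", "builtin")  // use the Hive jars bundled with Spark
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)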

The original article can be discussed with the author at: http://www.cnblogs.com/intsmaze/p/6618841.html

Spark SQL and Hive can share one set of metadata. Spark SQL can also create a metastore of its own; it does not have to rely on Hive to create one, so you do not need to start Hive at all: as long as there is a metastore, Spark SQL can be used. However, if you want to persist the mapping between files and tables the way Hive does, you can use Hive's metastore through the HiveContext class provided by Spark, without starting the Hive program itself.

1. Copy Hive's hive-site.xml into the $SPARK_HOME/conf directory. This file configures the database location where the Hive metastore stores its metadata; if that database does not exist, we can still define one, and the corresponding metastore database will be created automatically when the Spark cluster runs.
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.19.131:3306/hivedb?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hadoop</value>
    </property>
</configuration>
2. If HDFS is configured for high availability, also copy the hdfs-site.xml and core-site.xml files from the Hadoop cluster into the spark/conf folder.

3. Specify the location of the MySQL connection driver when starting spark-shell.

Spark cluster (standalone) mode:
bin/spark-shell --master spark://intsmaze:7077 --executor-memory 512m --driver-class-path /home/intsmaze/mysql-connector-java-5.1.35-bin.jar

Spark on YARN mode

bin/spark-shell --master yarn --executor-memory 512m --total-executor-cores 2 --driver-class-path /home/intsmaze/mysql-connector-java-5.1.35-bin.jar
4. Execute SQL statements.

Call HQL using sqlContext.sql:

val rdd = sqlContext.sql("select * from default.person limit 2") // SQL statements can be used directly; just specify which table of which database to query
rdd.write.json("hdfs://192.168.19.131:9000/personresult")

Using org.apache.spark.sql.hive.HiveContext:

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
hiveContext.sql("select * from default.person")
5. Start shell mode with the spark-sql command. Specify the location of the MySQL connection driver when starting spark-sql (once spark-sql is started, it works just like Hive: you can write SQL statements directly inside it).
bin/spark-sql --master spark://intsmaze:7077 --executor-memory 512m --total-executor-cores 3 --driver-class-path /home/intsmaze/mysql-connector-java-5.1.35-bin.jar
Write the SQL statement directly inside.
SELECT * FROM Default.person limit 2
How does Spark SQL add data to a table, then? The metastore only holds the address of a table's data in HDFS; it does not hold the table data itself. Spark SQL can therefore create tables, but it cannot add rows to them with statements such as INSERT. Note that this is not the same concept as storing DataFrame data in a database.
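As a minimal sketch of this point (the table name, schema, and HDFS path below are assumptions for illustration, not from the original article), a table can be created and then pointed at files that already exist in HDFS, instead of inserting rows:

// Assumes the hiveContext created in step 4 and a comma-delimited file already uploaded to HDFS.
hiveContext.sql("CREATE TABLE IF NOT EXISTS default.person (id INT, name STRING, age INT) " +
  "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
// LOAD DATA moves the files under the table's HDFS location; the metastore still stores only
// the table definition and that location, never the rows themselves.
hiveContext.sql("LOAD DATA INPATH 'hdfs://192.168.19.131:9000/person.txt' INTO TABLE default.person")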

6. Thrift JDBC/ODBC Server. Spark SQL implements a Thrift JDBC/ODBC server, which means we can send SQL statements over a remote JDBC connection and have Spark SQL execute them, just as with Hive. Before doing this, copy ${HIVE_HOME}/conf/hive-site.xml to the ${SPARK_HOME}/conf directory; because my Hive is configured to store its metadata in MySQL, Spark needs the MySQL connection driver when it accesses that metadata. There are three ways to add the driver:

The first is to add the following in spark-defaults.conf under the ${SPARK_HOME}/conf directory: spark.jars /intsmaze/lib/mysql-connector-java-5.1.26-bin.jar

The second is to add spark.driver.extraClassPath /intsmaze/lib2/mysql-connector-java-5.1.26-bin.jar; this method can also add multiple dependent jars, which is more convenient.

The third is to add --jars /intsmaze/lib2/mysql-connector-java-5.1.26-bin.jar at run time.

Start the Thrift server

Execute ./sbin/start-thriftserver.sh under the Spark root directory to start the Thrift server.

./start-thriftserver.sh --jars /home/hadoop/mysql-connector-java-5.1.35-bin.jar --master yarn

start-thriftserver.sh is used much like spark-submit: it accepts all spark-submit parameters, plus a --hiveconf parameter. Run it without any parameters and it runs in local mode; the default listening port is 10000.
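For example, a sketch of overriding the listening port via --hiveconf (hive.server2.thrift.port is the HiveServer2 property for this; the port number here is only an illustration):

./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --master yarn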

Test with Beeline. Execute ./bin/beeline under the Spark root directory.
Connect to the JDBC/ODBC server:

beeline> !connect jdbc:hive2://localhost:10000
After connecting you will be prompted for a user name and password; the user name can be the Linux user you are currently logged in as, and the password is empty.

JDBC connection in Java code. Next, open Eclipse and use JDBC to connect to HiveServer2 (connecting to Hive works the same way). Add the following dependencies in pom.xml:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.4.1</version>
</dependency>
<dependency>
    <groupId>jdk.tools</groupId>
    <artifactId>jdk.tools</artifactId>
    <version>1.6</version>
    <scope>system</scope>
    <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
Driver: org.apache.hive.jdbc.HiveDriver
URL: jdbc:hive2://192.168.19.131:10000/default
User name: hadoop (the Linux user that started the thriftserver)
Password: "" (the default password is empty)
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class Test1 {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:hive2://192.168.19.131:10000/default";
        try {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
        Connection conn = DriverManager.getConnection(url, "hadoop", "");
        Statement stmt = conn.createStatement();
        String sql = "select * from person limit 10";
        ResultSet res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println("id:" + res.getInt(1) + "\tname:" + res.getString(2) + "\tage:" + res.getInt(3));
        }
    }
}

Run this way, you can see in the YARN management interface that one long-running task is started; that task is responsible for running the SQL statements, but it does not run them in parallel: if two users submit query statements at the same time, the second must wait for the first to finish before it runs.

Spark SQL Visualization

The first scenario:
Package the Spark SQL code into a jar and pass the SQL statement and the result storage location in as parameters; after the Java code collects these parameters, it assembles them into a command that invokes a script to submit the jar package to the cluster.
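A minimal sketch of such a driver program, assuming (as an illustration only) that the SQL text and the HDFS result path arrive as the first two command-line arguments:

// Sketch only: args(0) = SQL statement assembled by the caller, args(1) = HDFS output directory.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SqlRunner {
  def main(args: Array[String]): Unit = {
    val sqlText = args(0)
    val outputPath = args(1)
    val sc = new SparkContext(new SparkConf().setAppName("SqlRunner"))
    val hiveContext = new HiveContext(sc)
    // Run the query and persist the result as JSON on HDFS, as in step 4 above.
    hiveContext.sql(sqlText).write.json(outputPath)
    sc.stop()
  }
}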

The second scenario:
Use the Thrift JDBC/ODBC server that Spark SQL implements, as described above and on the Spark website.

Finally: this article sat in draft for a long time before I published it. I have not worked with Spark for a year now, and I am focusing on Java core technology instead.

 

