Apache Spark Source Code Reading 13: HiveQL on Spark Implementation


You are welcome to reprint it. Please indicate the source.

Summary

The newly released Spark 1.0 adds an SQL module which, more interestingly, also provides good support for Hive's HiveQL. As a piece of source code analysis, it is well worth examining how Spark supports HQL.

Introduction to Hive

The following is adapted from the Hive chapter of Hadoop: The Definitive Guide.

"Hive was designed by Facebook to allow SQL-savvy analysts to analyze and query massive datasets stored on HDFS.

Hive greatly lowers the barrier to analyzing large-scale datasets (no strong programming skills required) and has become a killer application in the Hadoop ecosystem. Many organizations have since adopted Hive as a general, scalable data processing platform."

Data Model

All Hive data is stored on HDFS, and Hive has the following data model:

  • Table: corresponds to a table in a relational database. Each table has a corresponding HDFS directory; table data is serialized and stored in files under that directory. Hive also supports storing table data in other file systems, such as NFS or the local file system.
  • Partition: a partition in Hive plays a role similar to an index in an RDBMS. Each partition has its own subdirectory, which reduces the amount of data scanned during a query.
  • Bucket: even after partitioning, a single partition may still be large, so data can further be divided into multiple buckets based on the hash of a key column, with each bucket corresponding to a file.
Query Language

HiveQL is the SQL-like query language supported by Hive. HiveQL statements fall into the following categories:

  1. DDL (Data Definition Language): for example, creating and deleting databases and tables.
  2. DML (Data Manipulation Language): loading and querying data.
  3. UDF (User Defined Function): Hive also supports user-defined query functions.
Hive Architecture

The figure below shows the overall Hive architecture.

[Figure: overall Hive architecture]

It can be seen that the overall Hive architecture can be divided into the following parts:

  1. User interfaces: CLI, JDBC, and the Web UI are supported.
  2. Driver: responsible for translating user commands into the corresponding MapReduce jobs.
  3. MetaStore: the metadata repository, holding for example the definitions of databases and tables; by default the Derby storage engine is used.
HiveQL Execution Process

The HiveQL execution process is as follows:

  1. Parser: parses the HiveQL statement into a syntax tree (AST).
  2. Semantic Analyser: performs semantic analysis on the AST.
  3. Logical Plan Generation: generates the corresponding logical plan.
  4. Query Plan Generation: generates the physical query plan.
  5. Optimizer: optimizes the plan.

Finally, a MapReduce job is generated and submitted to the Hadoop MapReduce framework for execution.

A Hive Example

The best way to learn is through practice, so the Hive section ends with a concrete example.

The prerequisite is that Hadoop is already installed. For installation details, refer to installments 9 or 11 of this source-code reading series.

Step 1: Create a warehouse

The warehouse directory is used to store the raw data.

$ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
Step 2: Start hive CLI
$ export HIVE_HOME=
Step 3: Create a table

Creating a table writes the table schema into the MetaStore and also creates a subdirectory named after the table under the warehouse directory.

CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Step 4: Import data

The imported data is stored in the table directory created in step 3.

LOAD DATA LOCAL INPATH '/u.data'
OVERWRITE INTO TABLE u_data;
Step 5: Query
SELECT COUNT(*) FROM u_data;
HiveQL on Spark

Q: The previous sections spent a lot of space introducing the origin and architecture of Hive and the HiveQL execution process. What do these have to do with the HiveQL on Spark topic in our title?

Ans: The overall Hive solution is good, but there are points worth improving. One of them is that it takes a long time from submitting a query to getting the result back; queries simply take too long. A main reason for the long query time is that Hive is built on MapReduce, and there is no way around that within Hive itself. The natural thought is: instead of generating MapReduce jobs, generate Spark jobs, and make full use of Spark's fast execution capability to shorten HiveQL response time.

The figure below shows the libraries shipped with Spark 1.0; SQL is the only newly added library, which shows the important position SQL occupies in Spark 1.0.

[Figure: Spark 1.0 library stack]

HiveContext

HiveContext is the user-facing entry point provided by Spark for HiveQL, and HiveContext inherits from SQLContext.

Let's first review the relationships between the classes involved in SQLContext; for a detailed analysis, see installment 11 of this source-code reading series.

Since HiveContext inherits from SQLContext, we can compare the analysis and execution steps of plain SQL with those of HiveQL.
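For a concrete, usage-level feel for this comparison, here is a minimal sketch (assuming a Spark 1.0 spark-shell session; the people.txt file ships with the Spark examples, and the src table is the Hive table created in the Lab section further below). A plain SQLContext queries tables registered from RDDs in an in-memory catalog, while a HiveContext queries tables recorded in the Hive MetaStore:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// Plain SQL: the table exists only in this context's in-memory catalog,
// and the statement is parsed by Catalyst's own SqlParser.
val sqlContext = new SQLContext(sc)
import sqlContext._
case class Person(name: String, age: Int)
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
sql("SELECT name FROM people WHERE age > 20").collect().foreach(println)

// HiveQL: the table definition lives in the Hive MetaStore and the statement
// is parsed by HiveQl.parseSql before entering the same Catalyst pipeline.
val hiveContext = new HiveContext(sc)
hiveContext.hql("SELECT count(*) FROM src").collect().foreach(println)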

 

With the above comparison, we can identify several key points to focus on during source code analysis:

  1. Entry point: HiveContext.scala
  2. QueryExecution: HiveContext.scala
    1. Parser: HiveQl.scala
    2. Optimizer
Data

There are two types of data used:

  1. Schema data: the database and table definitions, which are stored in the MetaStore.
  2. Raw data: the actual files to be analyzed.
Entry Point

hiveql is the entry point for the whole process, and hql is simply shorthand for hiveql.

  def hiveql(hqlQuery: String): SchemaRDD = {
    val result = new SchemaRDD(this, HiveQl.parseSql(hqlQuery))
    // We force query optimization to happen right away instead of letting it happen lazily like
    // when using the query DSL.  This is so DDL commands behave as expected.  This is only
    // generates the RDD lineage for DML queries, but does not perform any execution.
    result.queryExecution.toRdd
    result
  }

The definition of hiveql is almost identical to that of sql in SQLContext. The only difference is the parser used: hiveql passes the result of HiveQl.parseSql to SchemaRDD as its LogicalPlan, whereas sql uses SQLContext's own parser.
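For comparison, the sql entry point in SQLContext looks roughly like this (paraphrased from memory of the Spark 1.0 source, so details may differ slightly):

  def sql(sqlText: String): SchemaRDD = {
    // Same pattern as hiveql, but the statement is parsed by Catalyst's own
    // SqlParser (via SQLContext's parseSql) instead of HiveQl.parseSql.
    val result = new SchemaRDD(this, parseSql(sqlText))
    result.queryExecution.toRdd
    result
  }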

HiveQl, the Parser

The parseSql function definition is shown in the code below. During parsing, commands are divided into two categories:

  • NativeCommand: non-SELECT statements. Their execution time does not vary much with the query conditions, and they generally complete within a short period of time.
  • Non-NativeCommand: mainly SELECT statements.
def parseSql(sql: String): LogicalPlan = {
    try {
      if (sql.toLowerCase.startsWith("set")) {
        NativeCommand(sql)
      } else if (sql.toLowerCase.startsWith("add jar")) {
        AddJar(sql.drop(8))
      } else if (sql.toLowerCase.startsWith("add file")) {
        AddFile(sql.drop(9))
      } else if (sql.startsWith("dfs")) {
        DfsCommand(sql)
      } else if (sql.startsWith("source")) {
        SourceCommand(sql.split(" ").toSeq match { case Seq("source", filePath) => filePath })
      } else if (sql.startsWith("!")) {
        ShellCommand(sql.drop(1))
      } else {
        val tree = getAst(sql)
        if (nativeCommands contains tree.getText) {
          NativeCommand(sql)
        } else {
          nodeToPlan(tree) match {
            case NativePlaceholder => NativeCommand(sql)
            case other => other
          }
        }
      }
    } catch {
      case e: Exception => throw new ParseException(sql, e)
      case e: NotImplementedError => sys.error(
        s"""
          |Unsupported language features in query: $sql
          |${dumpTree(getAst(sql))}
        """.stripMargin)
    }
  }

Which commands are NativeCommand? The answer is in the nativeCommands variable in HiveQl.scala; the list is long and is not reproduced here.

For non-NativeCommand statements, the most important parsing function is nodeToPlan.
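The real nodeToPlan is several hundred lines long, so it is not reproduced here. The toy sketch below only illustrates its shape (the AST and plan types are made up for illustration and are not Spark's actual classes): it recursively pattern-matches on the parsed AST and builds a logical plan, and anything it cannot translate becomes a placeholder so the statement falls back to Hive as a NativeCommand:

// Toy model only: illustrates the shape of nodeToPlan, not the real implementation.
sealed trait Ast
case class Node(token: String, children: List[Ast] = Nil) extends Ast

sealed trait Plan
case class Relation(table: String)                     extends Plan
case class Project(columns: List[String], child: Plan) extends Plan
case object NativePlaceholder                          extends Plan  // "let Hive handle it"

object ToyHiveQl {
  def nodeToPlan(ast: Ast): Plan = ast match {
    case Node("TOK_TABREF", List(Node(table, Nil))) =>
      Relation(table)
    case Node("TOK_SELECT", cols :+ from) =>
      Project(cols.collect { case Node(c, Nil) => c }, nodeToPlan(from))
    case _ =>
      NativePlaceholder  // unsupported construct: handled upstream as a NativeCommand
  }

  def main(args: Array[String]): Unit = {
    val ast = Node("TOK_SELECT", List(Node("key"), Node("value"),
      Node("TOK_TABREF", List(Node("src")))))
    println(nodeToPlan(ast))  // Project(List(key, value), Relation(src))
  }
}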

toRdd

Spark's optimization of HiveQL is mainly reflected in query-related operations; other statements still use Hive's native execution engine.

toRdd is the most critical element in the conversion from LogicalPlan to physical plan:

override lazy val toRdd: RDD[Row] =
      analyzed match {
        case NativeCommand(cmd) =>
          val output = runSqlHive(cmd)
          if (output.size == 0) {
            emptyResult
          } else {
            val asRows = output.map(r => new GenericRow(r.split("\t").asInstanceOf[Array[Any]]))
            sparkContext.parallelize(asRows, 1)
          }
        case _ =>
          executedPlan.execute().map(_.copy())
      }
Native Command Execution Process

Because a native command is not a time-consuming operation, it can be executed directly by Hive's original execution engine. The execution of these commands proceeds as follows.
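The delegation happens in HiveContext's runSqlHive helper, which toRdd calls for a NativeCommand as shown above. The sketch below is a simplified paraphrase reconstructed from memory of the Spark 1.0 source (names and details may differ; it is not the literal code): the first token of the statement selects a Hive CommandProcessor, full statements go through Hive's Driver, and the textual results are collected and later parallelized by toRdd:

  // Simplified paraphrase, not the literal Spark source.
  import scala.collection.JavaConverters._
  import org.apache.hadoop.hive.ql.Driver
  import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory

  protected def runSqlHive(sql: String): Seq[String] = {
    val cmd = sql.trim
    val tokens = cmd.split("\\s+")
    // Pick the Hive processor for this command type ("set", "dfs", plain SQL, ...).
    val proc = CommandProcessorFactory.get(tokens(0), hiveconf)
    proc match {
      case driver: Driver =>
        // Full statements run through Hive's Driver, i.e. Hive's own execution engine.
        val results = new java.util.ArrayList[String]()
        driver.run(cmd)
        driver.getResults(results)  // collect the textual rows
        driver.close()
        results.asScala.toSeq
      case other =>
        // Simple commands ("set x=y", "dfs -ls", ...) go to the matching processor.
        other.run(cmd.substring(tokens(0).length).trim)
        Seq.empty[String]
    }
  }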

Analyzer

HiveTypeCoercion defines the type coercion rules applied during analysis:

val typeCoercionRules =
    List(PropagateTypes, ConvertNaNs, WidenTypes, PromoteStrings, BooleanComparisons, BooleanCasts,
      StringToIntegralCasts, FunctionArgumentConversion)
Optimizer

In the optimization step, catalog.CreateTables ensures that the target table already exists before an insert is executed, while catalog.PreInsertionCasts casts the data being inserted so that it matches the table's schema.

override lazy val optimizedPlan =
      optimizer(catalog.PreInsertionCasts(catalog.CreateTables(analyzed)))

Note the use of catalog here: catalog is a HiveMetastoreCatalog.

HiveMetastoreCatalog is Spark's wrapper for accessing the Hive MetaStore. Through the Hive API, it can look up the tables in a database and their partitions, or create new tables and partitions.

HiveMetastoreCatalog

HiveMetastoreCatalog uses the Hive client to access the metadata in the MetaStore, relying heavily on Hive APIs, including the well-known serde (serialization/deserialization) library.

The createTable function is used as an example to illustrate this dependency on the Hive libraries.

def createTable(
      databaseName: String,
      tableName: String,
      schema: Seq[Attribute],
      allowExisting: Boolean = false): Unit = {
    val table = new Table(databaseName, tableName)
    val hiveSchema =
      schema.map(attr => new FieldSchema(attr.name, toMetastoreType(attr.dataType), ""))
    table.setFields(hiveSchema)

    val sd = new StorageDescriptor()
    table.getTTable.setSd(sd)
    sd.setCols(hiveSchema)

    // TODO: THESE ARE ALL DEFAULTS, WE NEED TO PARSE / UNDERSTAND the output specs.
    sd.setCompressed(false)
    sd.setParameters(Map[String, String]())
    sd.setInputFormat("org.apache.hadoop.mapred.TextInputFormat")
    sd.setOutputFormat("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")
    val serDeInfo = new SerDeInfo()
    serDeInfo.setName(tableName)
    serDeInfo.setSerializationLib("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")
    serDeInfo.setParameters(Map[String, String]())
    sd.setSerdeInfo(serDeInfo)

    try client.createTable(table) catch {
      case e: org.apache.hadoop.hive.ql.metadata.HiveException
        if e.getCause.isInstanceOf[org.apache.hadoop.hive.metastore.api.AlreadyExistsException] &&
           allowExisting => // Do nothing.
    }
  }
Lab

To tie this back to the source code, let's walk through a simple example.

You may wonder: since Spark also supports HQL, can Spark be used to access the databases and tables created with the Hive CLI? The answer may surprise you: not with the default configuration. Why?

Hive's metadata is stored by default in the Derby storage engine, which allows only a single process to access it at a time, even when both sessions log in as the same user. The way around this limitation is to store the MetaStore in MySQL or another database that supports concurrent multi-user access.

A Concrete Example

  1. Create a table
  2. Import data
  3. Query
  4. Drop the table

Before starting spark-shell, you must set the environment variables HIVE_HOME and HADOOP_HOME.

After spark-shell is started, run the following code:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Importing the SQL context gives access to all the public SQL functions and implicit conversions.
import hiveContext._

hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
hql("FROM src SELECT key, value").collect().foreach(println)
hql("drop table src")

The create operation creates an src directory under /user/hive/warehouse/. You can verify this with the following command:

$ $HADOOP_HOME/bin/hdfs dfs -ls /user/hive/warehouse/

When you drop a table, not only is the corresponding record in the MetaStore deleted, the raw data is deleted as well: the table's directory under the warehouse directory is removed as a whole. (This is the behavior for managed tables; for external tables, Hive removes only the metadata.)

The effect of the preceding create, load, and query operations on the MetaStore and the raw data can be summarized as follows: create writes the table schema into the MetaStore and creates the table directory, load copies the raw file into that directory, and the query reads the raw file without changing the MetaStore.

hive-site.xml

If you want to change the default Hive configuration, use a hive-site.xml file.

The procedure is as follows:

  • Create a hive-site.xml under the $SPARK_HOME/conf directory.
  • Set the values of the configuration items you need. A convenient starting point is to copy hive-default.xml from the $HIVE_HOME/conf directory into $SPARK_HOME/conf and rename it hive-site.xml.

New SQL Features

To further improve SQL execution speed, the Spark development team plans to adopt code generation (codegen) after the 1.0 release. Codegen is somewhat similar to JIT technology in the JVM and makes full use of Scala's language features.
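Spark's actual codegen generates expression-evaluation code at runtime; the sketch below is only a toy analogy (all types and names are made up for illustration) to show why the idea helps: instead of interpreting an expression tree for every row, the tree is translated once into an ordinary Scala closure, so each row pays only a direct function call.

// Toy analogy of code generation vs. interpretation; not Spark's real codegen.
sealed trait Expr
case class Col(index: Int)              extends Expr  // reference to a column in the row
case class Lit(value: Int)              extends Expr  // integer literal
case class Add(left: Expr, right: Expr) extends Expr

object ToyCodegen {
  // Interpreted evaluation: walks the expression tree again for every row.
  def eval(e: Expr, row: Array[Int]): Int = e match {
    case Col(i)    => row(i)
    case Lit(v)    => v
    case Add(l, r) => eval(l, row) + eval(r, row)
  }

  // "Compiled" evaluation: the tree is translated once into a closure,
  // so the per-row cost is a plain function call with no tree traversal.
  def compile(e: Expr): Array[Int] => Int = e match {
    case Col(i) => row => row(i)
    case Lit(v) => _ => v
    case Add(l, r) =>
      val lf = compile(l)
      val rf = compile(r)
      row => lf(row) + rf(row)
  }

  def main(args: Array[String]): Unit = {
    val expr = Add(Col(0), Lit(10))
    val fast = compile(expr)        // done once per query
    println(eval(expr, Array(32)))  // 42, interpreted per row
    println(fast(Array(32)))        // 42, via the pre-built closure
  }
}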

Outlook

Spark still lacks a truly influential application, a so-called killer application. SQL is Spark's active attempt to find one, and it is currently the hottest topic around Spark. Whether attracting potential Spark users by improving Hive's execution speed is the right breakthrough direction, however, remains to be proven by the market.

Besides being criticized for execution speed, Hive's other big problem is concurrent multi-user access, and this second problem is arguably even more serious than the first. Presto, launched by Facebook after Hive, and Impala, launched by Cloudera, both address these problems and have already built up a clear advantage.

Summary

This article analyzed in detail how HiveQL is supported, touching on the following questions:

  1. What is Hive?
  2. What are Hive's shortcomings? (Without them there would have been no need for Shark or Spark SQL.)
  3. Which of those shortcomings does Spark mainly set out to improve?
  4. How does Spark improve them?
References
  1. Programming Hive
  2. Shark vs. Impala
  3. Hive Design
