The SQL module was newly added in the just-released Spark 1.0, and it also provides good support for HiveQL, Hive's query language. For a source-code analysis enthusiast, it is very interesting to find out how Spark supports HQL.
The following background material is adapted from the Hive coverage in Hadoop: The Definitive Guide.
HiveQL is a SQL-like query language supported by Hive. HiveQL statements roughly fall into two types: data definition (DDL) statements such as CREATE TABLE, and data manipulation and query (DML) statements such as LOAD DATA and SELECT.
The figure below shows the overall Hive framework.
It can be seen that the Hive architecture is made up of several parts: the user interfaces (CLI, JDBC/ODBC, web UI), the MetaStore holding the metadata, the Driver with its compiler and optimizer, and the execution engine sitting on top of Hadoop.
A HiveQL query is ultimately compiled into a MapReduce job, which is handed to the Hadoop MapReduce framework for execution.
The best way to learn is through practice, so this Hive overview ends with a concrete example.
The prerequisite is that Hadoop is already installed; for installation details, see installments 9 or 11 of this source-code reading series.
Step 3: Create a table. Creating a table writes the table schema into the MetaStore and also creates a subdirectory named after the table under the warehouse directory.
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Step 4: Import data. The imported data is stored under the table directory created in step 3.
LOAD DATA LOCAL INPATH '/u.data' OVERWRITE INTO TABLE u_data;
Step 5: Query.
SELECT COUNT(*) FROM u_data;
HiveQL on Spark

Q: The previous section spent a lot of space introducing the origin and architecture of Hive and how a HiveQL statement is executed. What does all of that have to do with Hive on Spark, the topic of this article?
A: The overall Hive solution works, but there are points worth improving. One of them is that it simply takes too long from submitting a query to getting the result back. A major reason for the long query time is that Hive is built on MapReduce, and within Hive itself there is no way around that. The obvious idea, which you have probably already thought of, is: instead of generating MapReduce jobs, generate Spark jobs, and take full advantage of Spark's fast execution to shorten HiveQL's response time.
The figure shows the libraries shipped with Spark 1.0; SQL is the only newly added library, from which we can see the important position SQL occupies in Spark 1.0.
HiveContext

HiveContext is the user-facing entry point provided by Spark for HiveQL; it inherits from SQLContext.
Let's first review the relationships between the classes involved in SQLContext, shown in the figure below; for a detailed analysis, see installment 11 of this source-code reading series.
Since HiveContext inherits from SQLContext, we can compare the analysis and execution steps of plain SQL with those of HiveQL.
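To make the comparison concrete, here is a minimal, hedged sketch (assuming the Spark 1.0 spark-shell, where sc is predefined, and illustrative table names): the two contexts expose parallel entry points, sql() backed by Catalyst's SqlParser and hql() backed by HiveQl.parseSql, and both return a SchemaRDD.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val sqlContext  = new SQLContext(sc)   // plain SQL: statements parsed by Catalyst's SqlParser
val hiveContext = new HiveContext(sc)  // HiveQL: statements parsed by HiveQl.parseSql, tables resolved via the MetaStore

// A table for the plain-SQL side (Spark 1.0 API: registerAsTable).
case class KV(key: Int, value: String)
import sqlContext.createSchemaRDD
sc.parallelize(Seq(KV(1, "a"), KV(2, "b"))).registerAsTable("kv")

val fromSql  = sqlContext.sql("SELECT key, value FROM kv")    // SchemaRDD
val fromHive = hiveContext.hql("SELECT key, value FROM src")  // SchemaRDD, assuming a table src exists in Hive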
With the above comparison in mind, here are the key points to keep in mind during source-code analysis:
- Entry point: HiveContext.scala
- QueryExecution: HiveContext.scala
- Parser: HiveQl.scala
- Optimizer
Data

Two kinds of data are involved:
- Schema data, such as database definitions and table structures, which is stored in the MetaStore.
- Raw data, i.e. the files to be analyzed.
Entry Point

hiveql is the entry point for the whole process, and hql is simply shorthand for hiveql.
def hiveql(hqlQuery: String): SchemaRDD = {
  val result = new SchemaRDD(this, HiveQl.parseSql(hqlQuery))
  // We force query optimization to happen right away instead of letting it happen lazily like
  // when using the query DSL. This is so DDL commands behave as expected. This is only
  // generates the RDD lineage for DML queries, but does not perform any execution.
  result.queryExecution.toRdd
  result
}
The definition of hiveql is almost identical to that of sql; the only difference is that the result of HiveQl.parseSql is used as the input parameter when constructing the SchemaRDD.
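As a usage sketch (assuming the hiveContext from the earlier snippet and the u_data table from the Hive example above):

// hql is shorthand for hiveql; the call builds the SchemaRDD and its lineage,
// while actual execution is triggered only by an action such as collect().
val avgRatings = hiveContext.hql("SELECT movieid, AVG(rating) FROM u_data GROUP BY movieid")
avgRatings.collect().foreach(println)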
HiveQl, the Parser

The definition of the parseSql function is shown in the code below. During parsing, commands are divided into two categories:
- NativeCommand: non-SELECT statements. Their execution time does not vary much with the input, and they generally finish within a short period of time.
- Non-NativeCommand: mainly SELECT statements.
def parseSql(sql: String): LogicalPlan = {
  try {
    if (sql.toLowerCase.startsWith("set")) {
      NativeCommand(sql)
    } else if (sql.toLowerCase.startsWith("add jar")) {
      AddJar(sql.drop(8))
    } else if (sql.toLowerCase.startsWith("add file")) {
      AddFile(sql.drop(9))
    } else if (sql.startsWith("dfs")) {
      DfsCommand(sql)
    } else if (sql.startsWith("source")) {
      SourceCommand(sql.split(" ").toSeq match { case Seq("source", filePath) => filePath })
    } else if (sql.startsWith("!")) {
      ShellCommand(sql.drop(1))
    } else {
      val tree = getAst(sql)
      if (nativeCommands contains tree.getText) {
        NativeCommand(sql)
      } else {
        nodeToPlan(tree) match {
          case NativePlaceholder => NativeCommand(sql)
          case other => other
        }
      }
    }
  } catch {
    case e: Exception => throw new ParseException(sql, e)
    case e: NotImplementedError => sys.error(
      s"""
        |Unsupported language features in query: $sql
        |${dumpTree(getAst(sql))}
      """.stripMargin)
  }
}
Which commands count as native commands? The answer is in the nativeCommands variable in HiveQl.scala; the list is long and is not reproduced here.
For non-native commands, the most important parsing function is nodeToPlan, which turns the Hive AST into a Catalyst LogicalPlan.
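To make the two branches concrete, here is a hedged sketch (assuming the hiveContext from earlier): a "set" statement is recognized as a NativeCommand by the prefix check, while a SELECT goes through getAst and nodeToPlan and becomes a Catalyst LogicalPlan.

// Illustrative only: two statements taking the two branches of parseSql shown above.
hiveContext.hql("set hive.exec.dynamic.partition=true")  // NativeCommand branch
hiveContext.hql("SELECT COUNT(*) FROM u_data")           // getAst -> nodeToPlan -> LogicalPlan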
toRdd

Spark's speed-up of HiveQL shows up mainly in query-related operations; everything else still uses Hive's original execution engine.
In the conversion from a LogicalPlan to a physical plan and its execution, toRdd is the most critical piece:
override lazy val toRdd: RDD[Row] = analyzed match {
  case NativeCommand(cmd) =>
    val output = runSqlHive(cmd)
    if (output.size == 0) {
      emptyResult
    } else {
      val asRows = output.map(r => new GenericRow(r.split("\t").asInstanceOf[Array[Any]]))
      sparkContext.parallelize(asRows, 1)
    }
  case _ =>
    executedPlan.execute().map(_.copy())
}
Native Command Execution Process

Because native commands are not time-consuming operations, they are simply executed with Hive's original execution engine: as the toRdd code above shows, the command string is handed to runSqlHive and the textual output is turned into rows.
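For example (a hedged sketch, assuming the hiveContext from earlier), SHOW TABLES is on the nativeCommands list, so it never becomes a Catalyst plan; toRdd runs it through runSqlHive and parallelizes the resulting lines.

// Illustrative only: the statement is executed by Hive's own engine and its textual output comes back as Rows.
hiveContext.hql("SHOW TABLES").collect().foreach(println)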
Analyzer

HiveTypeCoercion defines the type-coercion rules applied during analysis:
val typeCoercionRules = List(PropagateTypes, ConvertNaNs, WidenTypes, PromoteStrings, BooleanComparisons, BooleanCasts, StringToIntegralCasts, FunctionArgumentConversion)
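To see why rules such as PromoteStrings and StringToIntegralCasts are needed, consider a hedged example (assuming the hiveContext and the u_data table from earlier): HiveQL allows comparing an INT column with a string literal, and the analyzer makes the comparison well-typed by inserting a cast.

// Illustrative only: userid is INT but the literal is a STRING; one of the coercion rules above
// rewrites the predicate with an appropriate cast.
hiveContext.hql("SELECT COUNT(*) FROM u_data WHERE userid = '196'").collect()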
Optimizer

The purpose of PreInsertionCasts is to ensure that, before an insert is executed, the corresponding table already exists and the data being written matches its schema.
override lazy val optimizedPlan = optimizer(catalog.PreInsertionCasts(catalog.CreateTables(analyzed)))
Note the use of catalog here; catalog is a HiveMetastoreCatalog.
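As a hedged illustration (u_data_copy is a hypothetical table, and the hiveContext is the one from earlier), this is the shape of statement these rules exist for:

// Illustrative only: before this INSERT is optimized and executed, PreInsertionCasts rewrites the query
// so the selected columns are cast to the target table's column types, while CreateTables handles the
// case where the target table is created by the same statement (CREATE TABLE ... AS SELECT).
hiveContext.hql(
  "INSERT OVERWRITE TABLE u_data_copy SELECT userid, movieid, rating, unixtime FROM u_data")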
HiveMetastoreCatalog is Spark's wrapper for accessing the Hive MetaStore. Through the Hive API it can obtain the tables in a database and the partitions of a table, and it can also create new tables and partitions.
HiveMetastoreCatalog

HiveMetastoreCatalog uses the Hive client to access metadata in the MetaStore and relies on a large number of Hive APIs, including the well-known serialization/deserialization (SerDe) library.
The createTable function below illustrates the dependency on the Hive libraries.
def createTable(
    databaseName: String,
    tableName: String,
    schema: Seq[Attribute],
    allowExisting: Boolean = false): Unit = {
  val table = new Table(databaseName, tableName)
  val hiveSchema =
    schema.map(attr => new FieldSchema(attr.name, toMetastoreType(attr.dataType), ""))
  table.setFields(hiveSchema)

  val sd = new StorageDescriptor()
  table.getTTable.setSd(sd)
  sd.setCols(hiveSchema)

  // TODO: THESE ARE ALL DEFAULTS, WE NEED TO PARSE / UNDERSTAND the output specs.
  sd.setCompressed(false)
  sd.setParameters(Map[String, String]())
  sd.setInputFormat("org.apache.hadoop.mapred.TextInputFormat")
  sd.setOutputFormat("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")
  val serDeInfo = new SerDeInfo()
  serDeInfo.setName(tableName)
  serDeInfo.setSerializationLib("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")
  serDeInfo.setParameters(Map[String, String]())
  sd.setSerdeInfo(serDeInfo)

  try client.createTable(table) catch {
    case e: org.apache.hadoop.hive.ql.metadata.HiveException
      if e.getCause.isInstanceOf[org.apache.hadoop.hive.metastore.api.AlreadyExistsException] &&
        allowExisting => // Do nothing.
  }
}
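For instance (a sketch, not taken from the source), a CREATE TABLE ... AS SELECT issued through hql() ends up in this code path: the CreateTables rule calls createTable, which registers the new table in the MetaStore through the Hive client before the selected rows are written into it.

// Illustrative only: u_data_top is registered in the MetaStore via HiveMetastoreCatalog.createTable,
// then the SELECT results are inserted into it.
hiveContext.hql("CREATE TABLE u_data_top AS SELECT userid, rating FROM u_data WHERE rating = 5")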
Lab

With the source code in mind, let's walk through a simple example.
You may be wondering: since Spark also supports HQL, can Spark be used to access the databases and tables created with the Hive CLI? The answer may surprise you: not with the default configuration. Why?
By default Hive stores its metadata in Derby, a storage engine that allows only a single active connection: only one session can access it at a time, even if you log in as the same user. To get around this limitation, the usual solution is to store the MetaStore in MySQL or another database that supports multiple concurrent users.
The specific example consists of the following steps:
- Create a table
- Import Data
- Query
- Delete table
Before starting spark-shell, you must set the environment variables HIVE_HOME and HADOOP_HOME.
After spark-shell is started, run the following code:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Importing the SQL context gives access to all the public SQL functions and implicit conversions.
import hiveContext._

hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
hql("FROM src SELECT key, value").collect().foreach(println)
hql("drop table src")
The create operation creates an src directory under /user/hive/warehouse/. You can verify this with the following command:
$HADOOP_HOME/bin/hdfs dfs -ls /user/hive/warehouse/
When you drop a table, not only is the corresponding record deleted from the MetaStore, the raw data is deleted as well; that is, the table's directory under the warehouse directory is removed as a whole.
The effect of the preceding create, load, query and drop operations on the MetaStore and on the raw data is illustrated in the figure below.
hive-site.xml

If you want to change the default Hive configuration, you can provide a hive-site.xml.
The procedure is as follows:
- Create a hive-site.xml under the $SPARK_HOME/conf directory; a convenient way is to copy hive-default.xml from the $HIVE_HOME/conf directory into $SPARK_HOME/conf and rename it hive-site.xml.
- Fill in the values of the configuration items you need to change.
New SQL Features

To further improve SQL execution speed, after the 1.0 release the Spark development team will use code generation (codegen) to speed up execution. Codegen is somewhat similar to JIT compilation in the JVM, and it makes full use of Scala's language features.
Outlook

Spark still lacks a truly influential application, a so-called killer application; SQL is Spark's active attempt to find one, and it is currently the hottest topic around Spark. However, whether attracting potential Spark users by speeding up Hive execution is the right breakthrough direction remains to be proven by the market.
Besides being criticized for its execution speed, Hive's other big problem is multi-user access, and this second problem is arguably more fatal than the first. Presto, launched by Facebook after Hive, and Impala, launched by Cloudera, both target the second problem and have already gained a significant advantage there.
Summary

This article analyzed in detail how HiveQL is supported, touching on the following questions:
- What is Hive?
- What are Hive's shortcomings? (Without them, there would have been no Shark or Spark SQL.)
- Which of Hive's shortcomings does Spark mainly set out to improve?
- How does Spark improve it?
References
- Programming Hive
- Shark vs. Impala
- Hive Design