Past and present: Hive, Shark, and Spark SQL

Source: Internet
Author: User
Tags base64 hadoop ecosystem


Hive (http://en.wikipedia.org/wiki/Apache_Hive) (loose translation, not in source order)
Apache Hive is a data warehouse framework built on Hadoop that provides data summarization, query, and analysis capabilities. It was originally developed by Facebook and is now also used by companies such as Netflix. Amazon maintains a customized fork of Hive for Amazon Web Services. Hive provides a SQL-like language, HiveQL, which translates relational-style queries into jobs for the MapReduce, Apache Tez, and Spark execution engines. All three execution engines can run under the Hadoop YARN framework. To speed up queries, Hive provides indexes, including bitmap indexes. Other features:
  • Indexing to accelerate queries (what is special about it?)
  • Multiple storage types, such as plain text, RCFile, HBase, ORC, and others.
  • Metadata is stored in a relational database: Apache Derby by default, replaceable with MySQL and others.
  • Operates on compressed data stored in the Hadoop ecosystem, with support for multiple algorithms: gzip, bzip2, snappy, etc.
  • Built-in user-defined functions (UDFs).
  • SQL-like queries, which are converted into MapReduce jobs for execution.
HiveQL is not fully compatible with the SQL-92 standard:
  1) It adds extensions such as multi-table INSERT and CREATE TABLE AS SELECT.
  2) It supports only basic indexing.
  3) It does not support transactions or materialized views.
  4) It supports only limited subqueries.
Inside Hive, the compiler translates a HiveQL statement into a directed acyclic graph (DAG) of MapReduce jobs, which is then submitted to Hadoop for execution.
Related projects:
    • Apache Pig
    • Sqoop
    • Cloudera Impala
    • Apache Drill
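
The compilation step described above (a HiveQL statement becoming MapReduce work) can be illustrated with a minimal sketch in plain Python. It is not Hive's compiler or any Hive API: the table `visits`, the query, and all function names are hypothetical, and only a single map/shuffle/reduce job is shown, whereas a real query may compile into a DAG of several such jobs. The query being mimicked is `SELECT page, COUNT(*) FROM visits GROUP BY page`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical rows standing in for a Hive table visits(user, page).
VISITS: List[Tuple[str, str]] = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
    ("carol", "/home"),
    ("bob", "/cart"),
    ("carol", "/help"),
]


def map_phase(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, int]]:
    """Map: emit (group key, 1) for every input row."""
    for _user, page in rows:
        yield page, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: collect all values that share the same key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: aggregate each key's values, here COUNT(*) as a sum of ones."""
    return {key: sum(values) for key, values in grouped.items()}


def run_group_by_count(rows: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Chain the three phases, as a single MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for page, count in sorted(run_group_by_count(VISITS).items()):
        print(page, count)
```

In this sketch the GROUP BY column becomes the map output key and the aggregate becomes the reduce function. A query with a join or a second aggregation would need further jobs chained after this one, which is where the DAG comes from.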

Shark (https://github.com/amplab/shark/wiki/Shark-User-Guide)
Shark is a large-scale data warehouse system designed for Spark and compatible with Hive ...
Shark, Spark SQL, Hive on Spark, and the future of SQL on Spark
Development of Shark will stop. Spark SQL will replace it, remaining compatible with all features of Shark 0.9 while providing additional functionality.
The disadvantages of Hive:
    • Poor performance.
    • To run interactive queries, organizations had to deploy expensive, proprietary enterprise data warehouses (EDWs), which require rigid and lengthy ETL processing.
The large performance gap between Hive and EDWs led the industry to question whether general-purpose data processing engines were inherently flawed for query processing. Many people believed that interactive SQL required an expensive, specialized query system (such as an EDW) rather than a general-purpose engine. Shark was one of the first interactive SQL tools built on a Hadoop system and the only one built on Spark. Shark showed that Hive's shortcomings were not inherent, and that a general-purpose engine such as Spark could be both as fast as an EDW and as scalable as Hive/MapReduce.
From Shark to Spark SQL
Shark was built on the Hive codebase and gained its speed by swapping out the physical execution engine part of Hive. This approach let Shark users accelerate their Hive queries, but Shark also inherited Hive's large and complex codebase, which made Shark difficult to optimize and maintain. As we pushed the limits of performance optimization and of integrating complex analytics with SQL, we found that Hive's MapReduce-oriented design constrained Shark's development. For these reasons, we stopped developing Shark as a standalone project and turned to Spark SQL. Spark SQL is a new Spark component, designed from scratch to take full advantage of Spark. This new design lets us move faster and ultimately deliver a better and more powerful tool to users.
For SQL users, Spark SQL provides good performance (an improvement of about one order of magnitude) and is compatible with Shark and Hive. For Spark users, Spark SQL provides a simple "narrow waist" for operating on structured data, and it unifies SQL (Structured Query Language) with imperative code for advanced data analysis. For open source developers, Spark SQL provides a new and elegant way to build a query planner, and new optimizations are easy to add within this framework. We have also been moved by the enthusiasm of the open source contributors ...
The Hive on Spark project (HIVE-7292) reflects the community's wish for Hive to support Spark as an execution engine as soon as possible, together with an optimistic view of its future ...






