Hive on Spark vs. SparkSQL vs. Hive on Tez

Source: Internet
Author: User

http://blog.csdn.net/wtq1993/article/details/52435563

http://blog.csdn.net/yeruby/article/details/51448188

The previous article covered SparkSQL, and SparkSQL also has a ThriftServer service. So why also set up Hive-on-Spark? Here are the reasons:

    • SparkSQL-ThriftServer keeps all results in memory. That is fast, but it cannot meet the demand for queries that return large amounts of data: with tens of millions of rows, SparkSQL becomes unreliable. With Hive-on-Spark, only the computation runs on Spark and all the other logic is Hive's, so results are first written to HDFS and then returned to the client gradually.
    • The SparkSQL-ThriftServer code is entirely rewritten in Scala, so it is not necessarily compatible with existing Hive workloads.
    • One of SparkSQL-ThriftServer's biggest advantages is that the whole server is equivalent to a single Hive-on-Spark session, so the web monitoring UI is clean and clear. With Hive-on-Spark, each session is a separate application. (Update, 2016-4-13 20:57:23: with dynamic allocation enabled, SparkSQL-ThriftServer did not feel much faster.)
    • SparkSQL is memory-based and also optimizes some aspects of scheduling. Take [limit] as an example: Hive computes everything rigidly, whereas SparkSQL scans a growing amount of data round by round until it has enough rows. SparkSQL can do this because the data it has already computed sits in memory.
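
The [limit] behaviour in the last bullet can be sketched in plain Python. This is a toy model, not Spark's or Hive's code: the function names `full_scan_limit` and `incremental_limit` are made up for illustration, partitions are just lists, and the growth factor of 4 is an assumed value.

```python
from typing import List, Sequence, Tuple

Partition = Sequence[int]


def full_scan_limit(partitions: Sequence[Partition], n: int) -> Tuple[List[int], int]:
    """Read every partition, then keep the first n rows (the 'compute everything' style)."""
    rows = [row for part in partitions for row in part]
    return rows[:n], len(partitions)


def incremental_limit(
    partitions: Sequence[Partition], n: int, scale_up: int = 4
) -> Tuple[List[int], int]:
    """Read partitions in growing batches and stop as soon as n rows are collected.

    scale_up is an assumed growth factor for the batch size.
    """
    rows: List[int] = []
    scanned = 0  # number of partitions read so far
    batch = 1    # start with a single partition
    while len(rows) < n and scanned < len(partitions):
        for part in partitions[scanned:scanned + batch]:
            rows.extend(part)
        scanned = min(scanned + batch, len(partitions))
        batch *= scale_up  # widen the next round if we still need more rows
    return rows[:n], scanned
```

Both functions return the selected rows together with the number of partitions they had to read. With 50 partitions of 100 rows each and a limit of 250, the incremental version reads 5 partitions (1 in the first round, 4 in the second), while the full scan reads all 50.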

Hive and SparkSQL follow different ideas: Hive stores its data on HDFS, whereas SparkSQL treats HDFS only as a persistence layer and keeps its working data mostly in memory.

Check the Hive log: once the action has finished, you can see the results being moved into place on HDFS, with entries like this:

 ...,687 INFO exec.FileSinkOperator (Utilities.java:mvFileToFinalPath(1882)) - Moving tmp dir: hdfs://zfcluster/hive/scratchdir/hadoop/de2b263e-9601-4df7-bc38-ba932ae83f42/hive_2016-03-28_19-38-08_834_7914607982986605890-1/-mr-10000/.hive-staging_hive_2016-03-28_19-38-08_834_7914607982986605890-1/_tmp.-ext-10001 to: hdfs://zfcluster/hive/scratchdir/hadoop/de2b263e-9601-4df7-bc38-ba932ae83f42/hive_2016-03-28_19-38-08_834_7914607982986605890-1/-mr-10000/.hive-staging_hive_2016-03-28_19-38-08_834_7914607982986605890-1/-ext-10001
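
The pattern behind that log entry (write results to a temporary directory, move it to the final path, then let the client read it back in batches) can be sketched as follows. This is only an illustration under loud assumptions: the local filesystem stands in for HDFS, `write_then_publish` and `fetch` are hypothetical names, and the file name `000000_0` is just a placeholder. It is not Hive's implementation.

```python
import os
from pathlib import Path
from typing import Iterable, Iterator, List


def write_then_publish(rows: Iterable[str], final_dir: Path) -> Path:
    """Write rows into a '_tmp.' staging directory, then rename it to final_dir."""
    staging = final_dir.parent / ("_tmp." + final_dir.name)
    staging.mkdir(parents=True)
    # One output file is enough for the sketch; the name is a placeholder.
    (staging / "000000_0").write_text("".join(row + "\n" for row in rows))
    os.rename(staging, final_dir)  # the "Moving tmp dir ... to ..." step
    return final_dir


def fetch(final_dir: Path, batch_size: int) -> Iterator[List[str]]:
    """Stream the published result back in small batches instead of holding it all in memory."""
    batch: List[str] = []
    for path in sorted(final_dir.iterdir()):
        with path.open() as handle:
            for line in handle:
                batch.append(line.rstrip("\n"))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch
```

Because the result lives on disk after the rename, the reader only ever holds `batch_size` rows at a time, which is the trade-off described earlier: slower than answering from memory, but not limited by memory when the result is large.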

As for Hive on Tez:

    • Spark has every advantage that Tez has, and Tez's caching advantage is actually small. Spark's caching effect is more pronounced and lets results come back quickly. For example, if you query 30,000 rows, Tez runs the whole query before returning, whereas SparkSQL fetches the 30,000 rows and computes nothing more. (This is how it appears in practice; I have not read the source. Annoyingly, Hive-on-Spark still runs everything.)
    • Tez task caches cannot be shared. Spark is finer-grained and can cache at the process level, reusing results and data loaded by earlier computations. For example, if you fetch the records and also run a count over the same data, some tasks run at the PROCESS_LOCAL locality level, which Tez cannot match.
    • Spark's log UI is more convenient to read.
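
The process-level cache point in the list above can be illustrated with a small toy model. Everything here is hypothetical (`ToyExecutor`, `fetch_rows`, `count_rows`); it models one long-lived worker process that remembers the partitions it has loaded, and says nothing about how Spark or Tez are actually implemented.

```python
from typing import Dict, List


class ToyExecutor:
    """A long-lived worker process that keeps loaded partitions in memory."""

    def __init__(self, storage: Dict[int, List[str]]) -> None:
        self.storage = storage                 # partition id -> rows; stands in for HDFS
        self.cache: Dict[int, List[str]] = {}  # partitions already loaded by this process
        self.loads = 0                         # how many times storage was actually read

    def partition(self, pid: int) -> List[str]:
        if pid not in self.cache:
            self.loads += 1
            self.cache[pid] = list(self.storage[pid])
        return self.cache[pid]


def fetch_rows(executor: ToyExecutor, pids: List[int]) -> List[str]:
    """First action: return the records themselves."""
    return [row for pid in pids for row in executor.partition(pid)]


def count_rows(executor: ToyExecutor, pids: List[int]) -> int:
    """Second action: count the same records."""
    return sum(len(executor.partition(pid)) for pid in pids)
```

If both actions go through the same `ToyExecutor`, each partition is loaded from storage once and the count is served from the process's own memory. If every action gets a fresh executor, which is how the text describes caches that cannot be shared, each partition is loaded twice.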

Taken point by point, Spark wins across the board.
