Difference between spark hive and sparkhive

Source: Internet
Author: User

I. What is Hive essentially?

1: Hive is a distributed data warehouse combined with a query engine. Spark SQL replaces only the query engine part of Hive, so enterprises generally use Hive + Spark SQL together for development.

2: Main work of Hive

1> Translate HQL into MapReduce code; a single query may generate many MapReduce jobs.

2> Package the generated MapReduce code and related resources into a JAR and submit it to the Hadoop cluster to run.
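
To make the translation step concrete, here is a minimal pure-Python sketch of how a query such as `SELECT dept, COUNT(*) FROM emp GROUP BY dept` breaks down into a map phase and a reduce phase. It is only an illustration of the idea, not the code Hive actually generates; the table `emp`, its rows, and the function names are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical input rows for a table emp(name, dept); not from the article.
rows = [
    ("alice", "sales"),
    ("bob", "sales"),
    ("carol", "ops"),
]

def map_phase(records):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _name, dept in records:
        yield dept, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key, which yields COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}

# Equivalent in spirit to: SELECT dept, COUNT(*) FROM emp GROUP BY dept
counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)  # {'sales': 2, 'ops': 1}
```

A query with joins or several aggregations would chain more than one such map/reduce pair, which is why one HQL statement can turn into many jobs.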

3: Hive architecture


4: By default, Hive stores its metadata in Derby. In its embedded mode, Derby can be opened by only one process at a time, so production environments generally use a multi-user database, usually MySQL, for metadata storage, with read/write splitting and backups. Typically the master node handles writes and a slave node serves reads.

5: Specific storage of Hive data warehouse data

II. Spark SQL and DataFrame

1: Spark SQL processes data on all storage media and in various formats (it can be extended to read more types of data).

2: Spark SQL pushes computing speed to a new level (and will be more powerful once Tungsten matures).

3: With DataFrame, Spark SQL lets the data warehouse directly use machine learning, graph computing, and other complex algorithms.

4: Hive + Spark SQL + DataFrame:

I> Hive: responsible for low-cost data warehouse storage

II> Spark SQL: responsible for high-speed computing

III> DataFrame: responsible for complex data mining
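
Point 1 of this section says that more data formats can be added over time. The sketch below shows that idea in plain Python: each format gets its own reader, and every reader returns the same row structure, so the code downstream does not care where the data came from. This is not Spark's data source API; the `READERS` registry, the `register` decorator, and the `load` function are hypothetical names used only for illustration.

```python
import csv
import io
import json

# Hypothetical registry mapping a format name to a reader function.
READERS = {}

def register(fmt):
    """Decorator that adds a reader for one format to the registry."""
    def wrap(func):
        READERS[fmt] = func
        return func
    return wrap

@register("csv")
def read_csv(text):
    """Parse CSV text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

@register("json")
def read_json_lines(text):
    """Parse one JSON object per line into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def load(fmt, text):
    """Return rows as a list of dicts regardless of the source format."""
    return READERS[fmt](text)

csv_rows = load("csv", "name,dept\nalice,sales\n")
json_rows = load("json", '{"name": "bob", "dept": "ops"}\n')
print(csv_rows + json_rows)
```

Supporting a new format then means registering one more reader, without touching the code that consumes the rows.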

III. DataFrame and RDD

1: A DataFrame is a distributed table.

2: Fundamental differences between RDD and DataFrame

1> An RDD is organized in units of records; the engine does not know the internal structure of each record.

2> A DataFrame carries metadata (a schema) for every record. In other words, DataFrame optimization works at the column level, while RDD optimization works at the row level.
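
The difference can be illustrated without Spark. In the sketch below, the same data is held once as opaque records and once as named columns with a schema. It only aims to show why schema information allows column-level work, and it says nothing about how Spark lays out data internally; the sample data and all names are assumptions for the example.

```python
# Row-oriented: each record is an opaque object; the engine sees no schema.
records = [
    ("alice", 34, "sales"),
    ("bob", 45, "sales"),
    ("carol", 29, "ops"),
]

# Column-oriented with metadata: a schema names and types every column.
schema = [("name", str), ("age", int), ("dept", str)]
columns = {
    field: [record[i] for record in records]
    for i, (field, _type) in enumerate(schema)
}

def mean_age_from_records(recs):
    """Without a schema, every whole record must be visited and unpacked."""
    return sum(rec[1] for rec in recs) / len(recs)

def mean_age_from_columns(cols):
    """With a schema, only the 'age' column needs to be read."""
    ages = cols["age"]
    return sum(ages) / len(ages)

assert mean_age_from_records(records) == mean_age_from_columns(columns)
print(mean_age_from_columns(columns))  # 36.0
```

Because the second form knows which columns exist and what types they hold, an optimizer can skip the `name` and `dept` columns entirely for this query, which is not possible when each record is a black box.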





