Difference between Hive and Spark SQL
I. What is Hive essentially?
1: Hive is a distributed data warehouse and also a query engine. Spark SQL replaces only the query engine part of Hive. Enterprises generally use Hive + Spark SQL together for development.
2: Main work of Hive
1> Translate HQL into MapReduce code, which is often long; a single query may generate many MapReduce jobs.
2> Package the generated MapReduce code and related resources into a JAR and submit it to the Hadoop cluster to run.
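The translation step above can be pictured with a small sketch. It is plain Python, not code that Hive generates, and it assumes a hypothetical query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept; the table, the column names, and the three helper functions are made up for illustration.

```python
from collections import defaultdict

# Hypothetical input rows for an illustrative table: (name, dept).
rows = [("ann", "eng"), ("bob", "ops"), ("cat", "eng"), ("dan", "eng")]


def map_phase(records):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _name, dept in records:
        yield dept, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle_phase(map_phase(rows)))
print(counts)
```

A query with joins or nested aggregations would need several such map/shuffle/reduce rounds chained together, which is why one HQL statement may turn into many MapReduce jobs.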
3: Hive architecture
4: By default, Hive stores its metadata in an embedded Derby database, which allows only one active session at a time. In production, a multi-user database is therefore generally used for metadata storage, with read/write splitting and backups: writes go to the master node and reads go to a replica node. MySQL is the usual choice.
5: How Hive data warehouse data is actually stored
II. Spark SQL and DataFrame
1: Spark SQL can process data on all storage media and in various formats (it can also be extended to read more types of data).
2: Spark SQL pushes computing speed to a new height (and will be even more powerful once Tungsten matures).
3: The DataFrame introduced with Spark SQL lets the data warehouse directly use machine learning, graph computing, and other complex algorithms.
4: Hive + Spark SQL + DataFrame:
i> Hive: responsible for low-cost data warehouse storage
ii> Spark SQL: responsible for high-speed computing
iii> DataFrame: responsible for complex data mining
III. DataFrame and RDD
1: A DataFrame is a distributed table.
2: Fundamental differences between RDD and DataFrame
1. An RDD works in units of records. Spark treats each record as an opaque object and cannot see the fields inside it.
2. A DataFrame carries metadata (the schema) for each record. That is to say, a DataFrame can be optimized at the column level, while an RDD can only be optimized at the row level.
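The row versus column distinction can be sketched in plain Python. This is only a toy model, not how Spark stores data internally; the sample records, the schema dictionary, and the select helper are all hypothetical names chosen for illustration.

```python
# RDD-like view: a list of opaque records. The engine only knows that
# records exist, so any work means running user code over whole records.
records = [
    {"name": "ann", "age": 31, "city": "Oslo"},
    {"name": "bob", "age": 45, "city": "Lima"},
    {"name": "cat", "age": 28, "city": "Oslo"},
]
ages_from_records = [r["age"] for r in records]  # visits every full record

# DataFrame-like view: a schema (the metadata) plus one list per column.
schema = {"name": str, "age": int, "city": str}
columns = {field: [r[field] for r in records] for field in schema}


def select(cols, wanted):
    """Return only the requested columns; the others are never read."""
    return {name: cols[name] for name in wanted}


ages_from_columns = select(columns, ["age"])["age"]
average_age = sum(ages_from_columns) / len(ages_from_columns)
print(ages_from_columns, average_age)
```

Because the schema names every column up front, a query that needs only age can skip name and city entirely. That is the kind of column-level optimization the record-oriented view cannot express.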