I: What is Hive, in essence?
1: Hive is a distributed data warehouse and also a query engine. Spark SQL replaces only the query-engine part of Hive, so enterprises generally develop with Hive + Spark SQL.
2: Hive's main work
1> Translate HQL into the lengthy MapReduce code it stands for; a single query may generate many MapReduce jobs.
2> Package the MapReduce code and related resources into a JAR, publish it to the Hadoop cluster, and run it.
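To make the translation step concrete, here is a minimal pure-Python sketch of how a simple aggregation query could be split into map, shuffle, and reduce phases. It is only a conceptual model, not Hive's actual planner or generated code; the table emp(dept, salary), the sample rows, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a Hive table emp(dept, salary).
EMP_ROWS = [
    ("sales", 100),
    ("eng", 300),
    ("sales", 150),
    ("eng", 200),
    ("ops", 50),
]


def map_phase(rows):
    """Emit one (key, value) pair per input row: key = dept, value = salary."""
    for dept, salary in rows:
        yield dept, salary


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into one output value (here: SUM)."""
    return {key: sum(values) for key, values in groups.items()}


def run_group_by_sum(rows):
    # Conceptual equivalent of:
    #   SELECT dept, SUM(salary) FROM emp GROUP BY dept
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_group_by_sum(EMP_ROWS))
```

A query with joins or several aggregation stages would chain more than one such map/shuffle/reduce round, which is why one HQL statement can turn into many MapReduce jobs.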
3: Hive architecture
4: By default Hive stores its metadata in Derby. Production environments therefore generally keep the metadata in a multi-user database, usually MySQL, which also allows backups and read/write separation: writes generally go to the master node and reads to the slave nodes.
5: Hive stores the actual data of the data warehouse
II: Spark SQL and DataFrame
1: Spark SQL handles all storage media and data in a wide range of formats (it can be extended to read more types of data)
2: Spark SQL pushes the computing speed of the data warehouse to a new level (and Tungsten will make it more powerful still once it matures)
3: The DataFrame introduced by Spark SQL lets the data warehouse directly use machine learning, graph computation, and other complex algorithms
4: Hive + Spark SQL + DataFrame:
I> Hive: responsible for low-cost data warehouse storage
II> Spark SQL: responsible for high-speed computing
III> DataFrame: responsible for complex data mining
III: DataFrame and RDD
1: A DataFrame is a distributed table
2: The fundamental differences between RDD and DataFrame
1. An RDD works in units of whole records.
2. A DataFrame carries metadata (the schema) for each record, which means that DataFrame optimization is column-based, whereas RDD optimization is row-based.
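The row-versus-column distinction above can be illustrated with a small pure-Python sketch. It only mimics the idea and is not Spark's internal representation; the sample data, the schema layout, and the function names are assumptions made for illustration.

```python
# RDD-like view: a list of opaque records. The engine only knows
# "there are N records"; field meaning lives in the user's code.
people_records = [
    ("Ann", 31, "Berlin"),
    ("Bob", 45, "Paris"),
    ("Cid", 28, "Berlin"),
]

# DataFrame-like view: the same data plus a schema (metadata) naming
# and typing every field, stored column by column.
PEOPLE_SCHEMA = [("name", str), ("age", int), ("city", str)]


def to_columns(records, schema):
    """Pivot row records into one list per column, keyed by column name."""
    names = [name for name, _type in schema]
    return {name: [rec[i] for rec in records] for i, name in enumerate(names)}


people_table = {
    "schema": PEOPLE_SCHEMA,
    "columns": to_columns(people_records, PEOPLE_SCHEMA),
}


def avg_age_from_records(records):
    # Row-based: every whole record is visited, and the position of the
    # age field (index 1) is knowledge held only by this function.
    ages = [rec[1] for rec in records]
    return sum(ages) / len(ages)


def avg_age_from_table(table):
    # Column-based: the schema identifies the "age" column, so the
    # "name" and "city" columns are never read (column pruning).
    ages = table["columns"]["age"]
    return sum(ages) / len(ages)


if __name__ == "__main__":
    print(avg_age_from_records(people_records))
    print(avg_age_from_table(people_table))
```

Both functions return the same average; the difference is that the second one can be planned from the schema alone, which is the kind of information a column-based optimizer relies on.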
Spark Hive Differences