First, Spark SQL and DataFrame
Spark SQL is one of the largest and most closely watched Spark components apart from Spark Core, for several reasons:
a) It can handle all storage media and data in various formats (you can also easily extend Spark SQL's capabilities to support more data sources, such as Kudu).
b) Spark SQL pushes the computing power of the data warehouse to a new level. Not only is its computational speed unrivaled (Spark SQL is an order of magnitude faster than Shark, and Shark is an order of magnitude faster than Hive), especially as Project Tungsten matures; more importantly, it raises the computational complexity a data warehouse can handle to a historic high (Spark's subsequent introduction of the DataFrame allows data warehouses to use algorithms such as machine learning and graph computing to mine deep value from the data).
c) Spark SQL (DataFrame, Dataset) is not only the engine of the data warehouse, but also the engine of data mining; more importantly, Spark SQL is the engine of scientific computing and analysis.
d) The subsequent DataFrame made Spark SQL the technical overlord of big data computing engines (especially with the strong support of Project Tungsten).
e) Hive + Spark SQL + DataFrame:
1) Hive is responsible for low-cost data storage
2) Spark SQL is responsible for high-speed computing
3) DataFrame is responsible for complex data mining
Second, DataFrame and RDD
a) Both R and Python have DataFrames; the biggest difference with Spark's DataFrame is that it is inherently distributed. You can simply think of a DataFrame as a distributed table, of the following form:
name   | age | tel
-------|-----|-----
string | int | long
string | int | long
string | int | long
string | int | long
string | int | long
string | int | long
The form of the RDD is as follows:
Person
Person
Person
Person
Person
Person
The RDD does not know the properties of each data row, whereas the DataFrame knows the column information of the data.
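The contrast above can be sketched in plain Python (this is an analogy for illustration only, not actual Spark code; the `Person` class and the column layout are hypothetical):

```python
# An RDD is, conceptually, a distributed collection of opaque objects:
# the engine sees each element only as "a Person", with no visibility
# into its fields or their types.
class Person:
    def __init__(self, name, age, tel):
        self.name = name
        self.age = age
        self.tel = tel

rdd_like = [Person("Alice", 30, 13800000000),
            Person("Bob", 25, 13900000000)]

# A DataFrame additionally carries a schema: named, typed columns.
# The engine can therefore reason about individual columns without
# inspecting the data itself.
schema = [("name", "string"), ("age", "int"), ("tel", "long")]
df_like = {
    "name": ["Alice", "Bob"],
    "age": [30, 25],
    "tel": [13800000000, 13900000000],
}

# With a schema, "which columns exist and what are their types?"
# is answerable up front.
print([col for col, _ in schema])
```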
b) The fundamental difference between RDD and DataFrame
An RDD treats each record as its basic unit: Spark cannot see into the internal details of the records when processing an RDD, so no further optimization is possible, which limits the performance Spark SQL could reach on RDDs.
A DataFrame carries metadata (a schema) for every record, which means optimization can be performed column by column inside the data, rather than row by row as with an RDD.
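One way to see why column metadata matters: with a columnar layout, the engine can scan just the values of one column, while row-at-a-time RDD processing must touch every whole record. A minimal plain-Python sketch of the idea (illustrative only; real Spark uses Tungsten's binary columnar format, and the data here is made up):

```python
# Row-oriented (RDD-like): every full record must be touched
# to read a single field.
rows = [("Alice", 30, 13800000000),
        ("Bob", 25, 13900000000),
        ("Carol", 41, 13700000000)]
total_age_row = sum(r[1] for r in rows)  # iterates whole records

# Column-oriented (DataFrame-like): the "age" column is stored
# contiguously, so an aggregation reads only that column.
columns = {
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 41],
    "tel": [13800000000, 13900000000, 13700000000],
}
total_age_col = sum(columns["age"])  # touches one column only

assert total_age_row == total_age_col == 96
```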
Third, Spark enterprise-class best practices
Phase 1: file system + C-language processing
Phase 2: Java EE + traditional database (poor extensibility, no distributed support; even the databases that are distributed are very slow because of transactional consistency requirements)
Phase 3: Hive (limited computational power, and very slow)
Phase 4: Hive migrating to Hive + Spark SQL
Phase 5: Hive + Spark SQL + DataFrame
Phase 6: Hive + Spark SQL + DataFrame + Dataset
This article is from the "Ding Dong" blog; please be sure to keep this source: http://lqding.blog.51cto.com/9123978/1751056
Lesson 56: The Nature of Spark SQL and DataFrame