First, Spark SQL and DataFrame
Spark SQL is one of the largest and most closely watched Spark components apart from Spark Core, for several reasons:
a) It can handle all storage media and data in various formats (you can also easily extend Spark SQL's capabilities to support more data sources, such as Kudu).
b) Spark SQL pushes the computing power of the data warehouse to a new level. Not only is its computational speed unrivaled (Spark SQL is an order of magnitude faster than Shark, and Shark is an order of magnitude faster than Hive), especially as Project Tungsten matures; more importantly, it raises the computational complexity a data warehouse can handle to a historic high (Spark's subsequent introduction of the DataFrame allows data warehouses to use algorithms such as machine learning and graph computing to mine deep value from the data).
c) Spark SQL (DataFrame, Dataset) is not only the engine of the data warehouse, but also the engine of data mining; more importantly, Spark SQL is the engine of scientific computing and analysis.
d) The subsequent DataFrame made Spark SQL the technical overlord of big data computing engines (especially with the strong support of Project Tungsten).
e) Hive + Spark SQL + DataFrame:
1) Hive is responsible for low-cost data storage
2) Spark SQL is responsible for high-speed computing
3) DataFrame is responsible for complex data mining
Second, DataFrame and RDD
a) Both R and Python have DataFrames; the biggest difference with Spark's DataFrame is that it is inherently distributed. You can simply think of a DataFrame as a distributed table, of the following form:
name   | age | tel
-------|-----|-----
string | int | long
string | int | long
string | int | long
string | int | long
string | int | long
string | int | long
The form of the RDD is as follows:
Person
Person
Person
Person
Person
Person
The RDD does not know the properties of each data row, whereas the DataFrame knows the column information of the data.
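The contrast above can be sketched in plain Python (this is an analogy for illustration only, not actual Spark code; the `Person` class and the column layout are hypothetical):

```python
# An RDD is, conceptually, a distributed collection of opaque objects:
# the engine sees each element only as "a Person", with no visibility
# into its fields or their types.
class Person:
    def __init__(self, name, age, tel):
        self.name = name
        self.age = age
        self.tel = tel

rdd_like = [Person("Alice", 30, 13800000000),
            Person("Bob", 25, 13900000000)]

# A DataFrame additionally carries a schema: named, typed columns.
# The engine can therefore reason about individual columns without
# inspecting the data itself.
schema = [("name", "string"), ("age", "int"), ("tel", "long")]
df_like = {
    "name": ["Alice", "Bob"],
    "age": [30, 25],
    "tel": [13800000000, 13900000000],
}

# With a schema, "which columns exist and what are their types?"
# is answerable up front.
print([col for col, _ in schema])
```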
b) The fundamental difference between RDD and DataFrame
An RDD treats each record as its basic unit: Spark cannot see into the internal details of the records when processing an RDD, so no further optimization is possible, which limits the performance Spark SQL could reach on RDDs.
A DataFrame carries metadata (a schema) for every record, which means optimization can be performed column by column inside the data, rather than row by row as with an RDD.
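One way to see why column metadata matters: with a columnar layout, the engine can scan just the values of one column, while row-at-a-time RDD processing must touch every whole record. A minimal plain-Python sketch of the idea (illustrative only; real Spark uses Tungsten's binary columnar format, and the data here is made up):

```python
# Row-oriented (RDD-like): every full record must be touched
# to read a single field.
rows = [("Alice", 30, 13800000000),
        ("Bob", 25, 13900000000),
        ("Carol", 41, 13700000000)]
total_age_row = sum(r[1] for r in rows)  # iterates whole records

# Column-oriented (DataFrame-like): the "age" column is stored
# contiguously, so an aggregation reads only that column.
columns = {
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 41],
    "tel": [13800000000, 13900000000, 13700000000],
}
total_age_col = sum(columns["age"])  # touches one column only

assert total_age_row == total_age_col == 96
```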
Third, Spark enterprise-class best practices
Phase 1: file system + C-language processing
Phase 2: Java EE + traditional database (poor extensibility, no distributed support; even the databases that are distributed are very slow because of transactional consistency requirements)
Phase 3: Hive (limited computational power, and very slow)
Phase 4: Hive migrating to Hive + Spark SQL
Phase 5: Hive + Spark SQL + DataFrame
Phase 6: Hive + Spark SQL + DataFrame + Dataset
This article is from the "Ding Dong" blog; please be sure to keep this source: http://lqding.blog.51cto.com/9123978/1751056
Lesson 56: The Nature of Spark SQL and DataFrame