Lesson 56: The Nature of Spark SQL and DataFrame


First, Spark SQL and DataFrame

Spark SQL is one of the largest and most closely watched components of Spark apart from Spark Core, for the following reasons:

a) It can handle data on all kinds of storage media and in a wide variety of formats (you can also easily extend Spark SQL to support more data sources, such as Kudu).

b) Spark SQL pushes the computing power of the data warehouse to a new level. Not only is its computing speed unrivaled (Spark SQL is an order of magnitude faster than Shark, and Shark is an order of magnitude faster than Hive), especially as Project Tungsten matures, but more importantly, the computational complexity a data warehouse can handle has been pushed to a historic high: Spark's subsequent introduction of the DataFrame allows data warehouses to use algorithms such as machine learning and graph computation to mine deep value from the data.

c) Spark SQL (DataFrame, Dataset) is not only the engine of the data warehouse but also an engine for data mining; more importantly, Spark SQL is an engine for scientific computing and analysis.

d) The subsequent DataFrame made Spark SQL the technical overlord among big data computing engines (especially with the strong support of Project Tungsten).

e) Hive + Spark SQL + DataFrame is a commonly used stack (a minimal sketch of this stack follows the list below):

1) Hive is responsible for low-cost data storage

2) Spark SQL is responsible for high-speed computation

3) DataFrame is responsible for complex data mining
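
To make the division of labor concrete, here is a minimal Scala sketch of this stack. The Hive table warehouse.orders and its customer_id, amount, and quantity columns are illustrative assumptions, not from the original post:

// Hive stores the data at low cost; Spark SQL reads it through the Hive metastore.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

val spark = SparkSession.builder()
  .appName("HiveSparkSqlDataFrame")
  .enableHiveSupport()
  .getOrCreate()

// Spark SQL: high-speed computation over the Hive table.
val orders = spark.sql(
  "SELECT customer_id, amount, quantity FROM warehouse.orders")

// DataFrame (with MLlib): the complex data-mining step, here a simple clustering.
val features = new VectorAssembler()
  .setInputCols(Array("amount", "quantity"))
  .setOutputCol("features")
  .transform(orders)

val model = new KMeans().setK(3).setFeaturesCol("features").fit(features)
model.transform(features).show()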


Second, DataFrame and RDD

a) Both R and Python have DataFrames; the biggest difference in Spark's DataFrame is that it is inherently distributed. You can simply think of a DataFrame as a distributed table, in the following form:

name     age   tel
String   Int   Long
String   Int   Long
String   Int   Long
String   Int   Long
String   Int   Long
String   Int   Long

The form of the RDD is as follows:

Person
Person
Person
Person
Person
Person

An RDD does not know the internal structure of each data row, whereas a DataFrame knows the column information (schema) of the data.
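
A minimal Scala sketch of the two forms above, assuming an illustrative Person case class with name, age, and tel fields (the sample records are made up):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int, tel: Long)

val spark = SparkSession.builder().appName("RddVsDataFrame").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(Person("Alice", 30, 13800000000L), Person("Bob", 25, 13900000000L))

// RDD form: a distributed collection of opaque Person objects; Spark sees only whole records.
val rdd = spark.sparkContext.parallelize(people)

// DataFrame form: the same data, but Spark also knows each column name and type.
val df = people.toDF()
df.printSchema()   // name: string, age: int, tel: long
df.show()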

b) The fundamental difference between an RDD and a DataFrame

An RDD takes the whole record as its basic unit; Spark cannot see inside a record when processing an RDD, so no further optimization is possible, which limits what Spark SQL could achieve on top of it.

A DataFrame, by contrast, carries metadata (schema) for every record, which means its optimizations can work at the level of individual columns inside a record rather than on whole rows, as with an RDD.
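
A brief sketch of what that column-level knowledge buys, reusing the df from the Person example above: because the DataFrame knows its schema, the optimizer can prune the age and tel columns when a query only needs name, whereas an equivalent RDD transformation would have to deserialize whole Person records.

val names = df.filter($"age" > 21).select("name")
names.explain(true)   // prints the parsed, analyzed, optimized, and physical plans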


Third, Spark Enterprise-class Best Practices

Phase 1: File system + C language processing

Phase 2: Java EE + traditional databases (poor scalability and no real distributed support; even when a database is distributed, transactional consistency makes it very slow)

Phase 3: Hive (limited computational expressiveness and very slow)

Phase 4: Hive migrating to Hive + Spark SQL

Phase 5: Hive + Spark SQL + DataFrame

Phase 6: Hive + Spark SQL + DataFrame + Dataset (a minimal sketch of the Dataset step follows this list)
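
A minimal sketch of the Dataset step in Phase 6, reusing the illustrative Person case class and data from the examples above; the Dataset adds compile-time types on top of the DataFrame's optimizations:

// A typed Dataset[Person]: same Catalyst/Tungsten optimizations, plus static typing.
val ds = spark.createDataset(people)
val adults = ds.filter(_.age >= 18)   // typed lambda, checked at compile time
adults.show()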



This article is from the "Ding Dong" blog; please be sure to keep this source: http://lqding.blog.51cto.com/9123978/1751056
