Is Spark SQL really far beyond MPP SQL?

Source: Internet
Author: User

Why Spark SQL goes far beyond MPP SQL

Objective

This article is not about performance, and I did not try to run a comparison (for reasons explained below). Instead, it looks from a higher level at why Spark SQL goes far beyond MPP SQL.

  

In a nutshell, Spark SQL and MPP SQL do not really sit on the same dimension:

MPP SQL is a subset of Spark SQL

Spark SQL becomes a cross-domain interaction pattern

MPP SQL is a subset of Spark SQL

The technical problem MPP SQL sets out to solve is querying massive data sets. Depending on the actual scenario, you can add qualifiers such as second-level latency, ad hoc, and so on.

In actual business use, this covers:

Exploration, such as multidimensional KPI analysis, user-profile queries, and data scientists getting a first feel for the data.

Business operations, such as reports (many BI systems are now built almost entirely on SQL) and all kinds of ad hoc statistical requests.

Analysis, although only at a fairly shallow level. Real analysis should rely mainly on statistics, machine learning, and similar techniques.

IT operations, such as querying huge volumes of system logs in real time.

MPP SQL has a certain performance advantage; HAWQ, Impala, and others are all built on the MPP architecture. However, that is where it ends. Spark SQL now covers all of these capabilities, and whatever MPP SQL can do, Spark SQL already does well.

Building on the Spark platform as a whole (the elegant DataSource API and the efforts of individual vendors), Spark SQL can connect to almost any heterogeneous data source for analysis and querying.
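To make the idea of one query layer over heterogeneous sources concrete, here is a toy model in plain Python. It is not Spark's DataSource API; the `READERS` registry, the `load` function, and the sample data are all hypothetical names invented for this sketch. The point is only that once every source is exposed as rows with named columns, the query on top no longer cares where the rows came from.

```python
import csv
import io
import json
from typing import Any, Callable, Dict, Iterator, List

Row = Dict[str, Any]


def read_csv(text: str) -> Iterator[Row]:
    """Expose CSV text as rows keyed by the header names."""
    yield from csv.DictReader(io.StringIO(text))


def read_json_lines(text: str) -> Iterator[Row]:
    """Expose newline-delimited JSON as rows, one object per line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


# Hypothetical registry: format name -> reader. A new source only has to
# provide a function that yields rows.
READERS: Dict[str, Callable[[str], Iterator[Row]]] = {
    "csv": read_csv,
    "jsonl": read_json_lines,
}


def load(fmt: str, text: str) -> List[Row]:
    """Load any registered format into the same row representation."""
    return list(READERS[fmt](text))


users = load("csv", "user_id,name\n1,Ann\n2,Bob\n")
orders = load(
    "jsonl",
    '{"user_id": "1", "amount": 30}\n'
    '{"user_id": "1", "amount": 12}\n'
    '{"user_id": "2", "amount": 7}\n',
)

# The "query": join orders to users and sum the amount per user name.
names = {u["user_id"]: u["name"] for u in users}
totals: Dict[str, int] = {}
for order in orders:
    name = names[order["user_id"]]
    totals[name] = totals.get(name, 0) + order["amount"]

print(totals)  # {'Ann': 42, 'Bob': 7}
```

The join at the bottom is written once against rows and never mentions CSV or JSON, which is the property the paragraph above attributes to Spark SQL.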

Two more remarks on performance:

Thanks to the emergence of file formats with more sophisticated storage layouts, such as CarbonData, Spark SQL can already achieve second-level queries over massive data.

Spark itself keeps getting faster through optimization projects such as Tungsten (especially code generation), and JVM problems such as GC can be further reduced by using off-heap memory.

So the performance gap between Spark SQL and MPP SQL keeps getting smaller.

Spark SQL becomes a cross-domain interaction pattern

Spark greatly enriches its interaction semantics through DS (Dataset). Spark 2.0 unified DF (DataFrame) and DS on a single SQL engine, which means you can use SQL (DS) as one unified interaction language for streaming, batch processing, interactive querying, machine learning, and other common big data scenarios. This is rare in any system, and it shows the Spark team's capacity for abstraction.
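What "one language for batch and streaming" buys you can be sketched with a toy model in plain Python. This is not Spark code; `errors_per_service`, `run_batch`, and `run_streaming` are hypothetical names made up for the illustration. The query is written once, and only the way it is driven differs between the two modes.

```python
from collections import Counter
from typing import Dict, Iterable, List

Row = Dict[str, str]


def errors_per_service(rows: Iterable[Row]) -> Counter:
    """The query, written once: count ERROR rows per service."""
    return Counter(r["service"] for r in rows if r["level"] == "ERROR")


def run_batch(table: List[Row]) -> Counter:
    """Batch mode: evaluate the query over the whole table at once."""
    return errors_per_service(table)


def run_streaming(micro_batches: Iterable[List[Row]]) -> Counter:
    """Streaming mode: evaluate the same query on each micro-batch
    as it arrives and merge the partial counts into a running result."""
    running: Counter = Counter()
    for batch in micro_batches:
        running.update(errors_per_service(batch))
    return running


logs = [
    {"service": "api", "level": "ERROR"},
    {"service": "api", "level": "INFO"},
    {"service": "db", "level": "ERROR"},
    {"service": "api", "level": "ERROR"},
]

batch_result = run_batch(logs)
stream_result = run_streaming([logs[:2], logs[2:]])

print(batch_result == stream_result)  # True
```

A count is easy to merge incrementally; a real engine has to handle much harder cases, so treat this only as a picture of the programming model, not of how Spark executes it.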

What prompted this article was actually an author grumbling that the Spark team pays too little attention to the Spark Core (RDD) layer.

Now let's go back and look at some of our common workloads:

Real-time analytics

Exploration

Predictive analysis

Operational reporting

First, all of these workloads can be implemented with Spark. Second, the unified interface is DS (DF/SQL), and DS/SQL is extremely easy to use and already widely popular and accepted.

Of course, Spark did not get here in one step. Originally, stream processing and batch processing had two separate sets of APIs, and so did DF and DS. The Databricks team kept reflecting and maturing along the way, and only after that accumulated experience could it take this step.

So in essence DS/SQL has become, alongside the RDD API, another general-purpose, unified interaction API that covers streaming, batch processing, interactive querying, machine learning, and other big data fields. This is the first time such a unification has been achieved, and so far it has only been achieved on the Spark platform. It further lowers the barrier to using and learning big data technology, which should pay off well into the future.

RDD vs. DS/SQL

First of all, DS/SQL works on typed data, and the expressive power of its language is limited. This means the Spark team can optimize performance much better, and it also means a lower barrier to entry, so ease of use and performance can be well balanced.
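Why a limited expression language helps the optimizer can be shown with a small toy model in plain Python. It is not Spark's optimizer, and `Partition`, `GreaterThan`, `filter_opaque`, and `filter_declarative` are hypothetical names for this sketch. An arbitrary function (the RDD style) is a black box, so every row has to be passed through it. A declarative predicate can be inspected, so the engine can compare it against stored statistics and skip whole partitions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


@dataclass
class Partition:
    rows: List[Row]
    # Per-column maximum, computed once when the partition is created,
    # playing the role of statistics stored alongside the data.
    max_stats: Dict[str, int] = field(init=False)

    def __post_init__(self) -> None:
        columns = self.rows[0].keys()
        self.max_stats = {c: max(r[c] for r in self.rows) for c in columns}


@dataclass(frozen=True)
class GreaterThan:
    """Declarative predicate `column > value`; its fields are readable."""
    column: str
    value: int

    def evaluate(self, row: Row) -> bool:
        return row[self.column] > self.value


def filter_opaque(
    partitions: List[Partition], fn: Callable[[Row], bool]
) -> Tuple[List[Row], int]:
    """Black-box function: the engine must read every row."""
    out, rows_read = [], 0
    for part in partitions:
        for row in part.rows:
            rows_read += 1
            if fn(row):
                out.append(row)
    return out, rows_read


def filter_declarative(
    partitions: List[Partition], pred: GreaterThan
) -> Tuple[List[Row], int]:
    """Inspectable predicate: skip partitions that cannot match."""
    out, rows_read = [], 0
    for part in partitions:
        if part.max_stats[pred.column] <= pred.value:
            continue  # no row in this partition can satisfy the predicate
        for row in part.rows:
            rows_read += 1
            if pred.evaluate(row):
                out.append(row)
    return out, rows_read


data = [
    Partition([{"age": 10}, {"age": 20}]),
    Partition([{"age": 30}, {"age": 40}]),
    Partition([{"age": 5}, {"age": 15}]),
]

opaque_rows, opaque_read = filter_opaque(data, lambda r: r["age"] > 25)
decl_rows, decl_read = filter_declarative(data, GreaterThan("age", 25))

print(opaque_rows == decl_rows)  # True
print(opaque_read, decl_read)    # 6 2
```

Both calls return the same rows, but the declarative version reads 2 rows instead of 6. The user gave up the freedom to write arbitrary code in the filter, and in exchange the engine gained room to optimize.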
