Shark and Spark SQL

Source: Internet
Author: User

First, let's introduce the concept of Shark.
Shark is, simply put, Hive on Spark: its lower layers rely on Hive's engine, but because it runs on the Spark platform, Shark is much faster than Hive.
It is the embodiment of Hive on Spark: an upgraded, powerful data warehouse that is compatible with Hive syntax.

Here is a Shark architecture diagram from the Internet:

As can be seen from the diagram, the bottom of Shark's stack is largely based on HDFS, and the data in Shark corresponds to files on HDFS.
The green boxes show that the HiveQL engine still occupies an inseparable part of Shark's overall structure; and since the Metastore is the root of Hive, its importance to Shark is self-evident.

The code format for creating an external partitioned table in Shark is as follows:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name (col_name data_type, ...)
[PARTITIONED BY (col_name data_type, ...)]
[ROW FORMAT row_format]
[FIELDS TERMINATED BY '\t']
[LINES TERMINATED BY '\n']
[STORED AS file_format]
[LOCATION hdfs_path]

This is basically no different from the Hive format.
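As a concrete illustration of the template above (the table name, columns, and HDFS path here are made up for this sketch, not taken from the article), an external partitioned table might be declared like this:

```sql
-- Hypothetical example: an external table of access logs,
-- partitioned by date, stored as tab-separated text on HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
  ip  STRING,
  url STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/data/access_logs';
```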
Shark also has an efficient kind of table called the cached table.
Here's how to create a cache table:
CREATE TABLE xx_cached as SELECT ...

Just append _cached to the table name.
A cached table, as the name implies, stores the queried data in memory, so repeated queries against it are dramatically faster.

Usage of Shark:
Run the shark script in Spark's bin directory to enter the client program.
shark -f <path-to-.sql-file> executes a SQL script.
After execution, the corresponding tables are created and can be queried with SQL statements in the client.

But...
Even though Shark performed this well compared to Hive,
starting with the Spark 1.0 release, Shark was abandoned by the official team ...
Why?
The reason is that Shark relied too heavily on Hive, which made it impossible to add new optimization strategies when executing tasks.

So the Spark team decided to develop a data warehouse framework completely independent of Hive, based on the Spark platform.
Thus Spark SQL was born.
What are the advantages of Spark SQL relative to Shark?
First, and most fundamentally, Spark SQL breaks completely free of Hive's limitations.
Second, Spark SQL supports querying native RDDs, which is extremely important. The RDD is the core concept of the Spark platform and the basis for the many scenarios in which Spark efficiently handles big data.
Third, you can write SQL statements in Scala. It supports simple SQL syntax checking, and you can write HiveQL statements in Scala to access Hive data and retrieve the results as an RDD.
Fourth, Catalyst. Catalyst helps users optimize queries: even if a user is not highly skilled and does not write efficient code, Catalyst can still perform a certain degree of performance optimization.

From the points above it is clear that, compared to Shark, Spark SQL has improved by several levels in both performance and usability.

Here is a Spark SQL architecture diagram from the Internet:

It is obvious that Hive, which was at the core of Shark, has become an optional, pluggable module at the top level.
Spark SQL also supports database interfaces such as JDBC/ODBC and JSON-formatted data, which is icing on the cake.

At the end of the article, here is a Spark SQL example (Scala). Note that the age bounds in the WHERE clause were lost in the original; they are filled in here as 13 and 19 to match the "teenagers" query:

val sc: SparkContext                       // SparkContext is the only channel for submitting jobs in Spark
val sqlContext = new SQLContext(sc)        // create a SQLContext from sc; it is the object that handles Spark SQL
import sqlContext._                        // import all methods in SQLContext, the basis for handling SQL statements
case class Person(name: String, age: Int)  // define a Person class
val people: RDD[Person] = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt)) // read people.txt and, after splitting each line, build a Person object per record
people.registerAsTable("people")           // register the resulting RDD as the table "people"
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") // define the SQL statement to execute
teenagers.map(t => "Name: " + t(0)).collect().foreach(println) // print the name of each row in teenagers
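Since the snippet above needs a running Spark cluster, here is a plain-Scala sketch of what the same split/map/filter pipeline computes, with no Spark required. The sample rows (Michael, Andy, Justin) and the 13-19 age bounds are illustrative assumptions, not data from the article:

```scala
// Plain-Scala sketch of the RDD pipeline above; List stands in for RDD.
case class Person(name: String, age: Int)

object TeenagerDemo {
  // split each CSV line into a Person, like the map(_.split(",")).map(...) step
  def parse(lines: List[String]): List[Person] =
    lines.map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))

  // equivalent of SELECT name FROM people WHERE age >= 13 AND age <= 19
  def teenagerNames(people: List[Person]): List[String] =
    people.filter(p => p.age >= 13 && p.age <= 19).map("Name: " + _.name)

  def main(args: Array[String]): Unit = {
    val people = parse(List("Michael,29", "Andy,30", "Justin,19"))
    teenagerNames(people).foreach(println)  // prints: Name: Justin
  }
}
```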
