A Preliminary Look at the DataFrame Programming Model in Spark SQL


Spark SQL provides structured data processing on top of Spark Core. As of Spark 1.3, Spark SQL not only serves as a distributed SQL query engine but also introduces a new DataFrame programming model.

In the Spark 1.3 release, Spark SQL graduated from alpha status and, in addition to better SQL standard compatibility, introduced the new DataFrame component. At the same time, the Spark SQL Data Source API was extended to interoperate with DataFrames, allowing users to create DataFrames directly from Hive tables, Parquet files, and other data sources, and to mix SQL and DataFrame operators on the same dataset. The release also added the ability to read and write tables over JDBC, with native support for Postgres, MySQL, and other RDBMSs.
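For example, here is a minimal sketch of loading DataFrames from several sources and mixing SQL with DataFrame operators. The paths, table name, and connection settings are hypothetical, an existing SparkContext named sc is assumed, and the JDBC read requires the MySQL driver on the classpath:

import java.util.Properties
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Parquet file: the schema is read from the file itself.
val parquetDF = sqlContext.read.parquet("hdfs://spark1:9000/users.parquet")

// JDBC table from MySQL (placeholder URL and credentials).
val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "secret")
val jdbcDF = sqlContext.read.jdbc("jdbc:mysql://spark1:3306/testdb", "students", props)

// Mix SQL and DataFrame operators on the same dataset.
parquetDF.registerTempTable("users")
val adults = sqlContext.sql("SELECT name, age FROM users WHERE age >= 18")
adults.groupBy("age").count().show()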

  The entry point for all Spark SQL functionality is SQLContext, or one of its subclasses. Only a SparkContext instance is needed to build a basic SQLContext.

package cn.spark.study.sql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

/**
 * @author Administrator
 */
object DataFrameCreate {

  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("DataFrameCreate")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read a JSON file into a DataFrame, here from HDFS ...
    val df = sqlContext.read.json("hdfs://spark1:9000/students.json")
    // ... or, alternatively, from a local path:
    // val df = sqlContext.read.json("./data/people.json")
    // val df = sqlContext.read.json("./data/aa.json")

    df.show()
  }
}

Creating a DataFrame essentially packages the data together with its schema to form a data table. For further reading:

Spark DataFrame first try, see 1190000002614456
Spark DataFrames getting started guide: creating and manipulating DataFrames, see http://blog.csdn.net/lw_ghy/article/details/51480358
Spark DataFrame usage, see http://blog.csdn.net/dreamer2020/article/details/51284789
Converting between RDD and DataFrame, see http://www.cnblogs.com/namhwik/p/5967910.html

Input
{"name": "Michael"}
{"name": "Andy", "age": 30}
{"name": "Justin", "age": 19}


Output
//+----+-------+
//| age|   name|
//+----+-------+
//|null|Michael|
//|  30|   Andy|
//|  19| Justin|
//+----+-------+
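Once the data is loaded, the usual column-oriented operators apply directly. A brief sketch against the people.json DataFrame above:

df.printSchema()                 // age: long (nullable), name: string (nullable)
df.select("name").show()         // project only the name column
df.filter(df("age") > 20).show() // rows where age > 20
df.groupBy("age").count().show() // number of people per age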


Input
{"name": "China", "provinces": [{"name": "Heilongjiang", "citys": ["Jiamusi", "Daqing", "Harbin", "Qiqihar", "Mudanjiang"]}, {"name": "Liaoning", "citys": ["Shenyang", "Dalian", "Panjin"]}, {"name": "Jilin", "citys": ["Jilin", "Changchun", "Siping"]}]}


Output
//+-----+--------------------+
//| name|           provinces|
//+-----+--------------------+
//|China|[[WrappedArray(Ji...|
//+-----+--------------------+
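Nested columns can be addressed with dot notation. A small sketch, assuming the nested schema shown above (the exact behavior on array-of-struct columns may vary slightly across Spark versions):

df.printSchema()                   // provinces: array of structs with citys and name
df.select("provinces.name").show() // the province names, as one array per row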

Note that spark-shell, in addition to building the SQLContext for us, also imports the implicit conversions via import sqlContext.implicits._. In an application submitted with spark-submit, you need to import the implicit conversions manually to access some APIs.
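Here is a minimal sketch of doing this manually, converting an RDD to a DataFrame with toDF; the Person case class and the sample data are assumptions for illustration:

import org.apache.spark.sql.SQLContext

// Case classes used with toDF must be defined outside the method that uses them.
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
// Bring toDF and the other implicit conversions into scope manually.
import sqlContext.implicits._

val peopleDF = sc.parallelize(Seq(Person("Andy", 30), Person("Justin", 19))).toDF()
peopleDF.show()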

The DataFrame programming model greatly reduces the complexity of programming with Spark SQL.

Spark SQL allows Spark to execute relational queries expressed in SQL, HiveQL, or Scala. Before Spark 1.3, the core type of the module was SchemaRDD, which consists of Row objects together with a schema that describes the data type of each column in a row.

In Spark 1.3, SchemaRDD was renamed to DataFrame. A DataFrame is a distributed collection of data organized into named columns. It is conceptually similar to a table in a relational database and equivalent to the data frames in R/Python. A DataFrame can be constructed from a structured data file, a Hive table, an external database, or an existing RDD.

  The DataFrame programming model has the following features:

1. Supports data volumes from kilobytes to petabytes.

2. Supports a wide range of data formats and storage systems.

3. Provides advanced optimization and code generation through the Spark SQL Catalyst optimizer (see the sketch after this list).

4. Offers APIs for Python, Java, Scala, and R (SparkR).
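To see the Catalyst optimizer at work, you can print a DataFrame's logical and physical plans; a quick illustration using the df from the people.json example above:

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df.filter(df("age") > 20).select("name").explain(true)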
