Summary of SparkSQL and DataFrame

1. DataFrame
A distributed dataset organized into named columns. Conceptually, it is equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. Before Spark 1.3 this abstraction was exposed as SchemaRDD (an RDD with an attached schema); in Spark 1.3 it was renamed DataFrame. Spark works with a large number of data sources through DataFrame, including external files (such as JSON, Avro, Parquet, SequenceFile, etc.), Hive, relational databases, and Cassandra.

Differences between DataFrame and RDD:
An RDD is record-based: during optimization Spark cannot see the internal structure of a record, so it cannot optimize further, which limits the performance gains Spark SQL can achieve. A DataFrame carries schema metadata for every record, so Spark SQL can apply column-level optimizations.
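For illustration, a minimal sketch of the difference, assuming an existing SparkContext sc and SQLContext sqlContext (the Person case class and the sample data are hypothetical): the filter on the RDD is an opaque Scala closure, while the filter on the DataFrame is a column expression that Spark SQL can analyze.

case class Person(name: String, age: Int)
val rdd = sc.parallelize(Seq(Person("Andy", 30), Person("Justin", 19)))
val adultsRdd = rdd.filter(p => p.age > 21)   // opaque closure: Spark cannot look inside the record

import sqlContext.implicits._
val df = rdd.toDF()                           // schema is inferred from the case class
val adultsDf = df.filter(df("age") > 21)      // column expression: optimizable by Spark SQL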

(1) create a DataFrame
The entry point of all related functions in Spark is the SQLContext class or its subclass. To create a SQLContext, you only need a SparkContext.

val sc: SparkContext   // an existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
Apart from the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality of the basic SQLContext. Its additional features include the ability to write queries with the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. Using a HiveContext does not require an existing Hive installation, and every data source available to SQLContext is also available to HiveContext.
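For example, a minimal sketch of creating a HiveContext, assuming Spark 1.x with the Hive module on the classpath:

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)      // sc is an existing SparkContext
hiveContext.sql("SHOW TABLES").show()      // queries are parsed with the HiveQL parser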

With a SQLContext, an application can create a DataFrame from an existing RDD, a Hive table, or external data sources.
Example: create a DataFrame from a local JSON file

val df = sqlContext.jsonFile("file:///home/hdfs/people.json")
df.show()
age  name
null Michael
30   Andy
19   Justin
df.printSchema()
|-- age: long (nullable = true)
|-- name: string (nullable = true)
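The same API covers the other file-based sources listed earlier; for example, a hedged sketch of loading a Parquet file in Spark 1.x (the path and table name are hypothetical):

val parquetDf = sqlContext.parquetFile("file:///home/hdfs/people.parquet")
parquetDf.registerTempTable("people_parquet")   // register so it can be queried with SQL
sqlContext.sql("SELECT name FROM people_parquet WHERE age > 21").show()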

(2) DataFrame operations
DataFrame supports RDD-style operations, such as filtering and joining across multiple tables.

Df. select ("name"). show ()
Name
Michael
Andy
Justin
Df. select (df ("name"), df ("age") + 1). show ()
Name (age + 1)
Michael null
Andy 31
Justin 20
Df. filter (df ("age")> 21). select ("name"). show ()
Name
Andy
Df. groupBy ("age"). count (). show ()
Age count
Null 1
19 1
30 1
Table join, using three equal signs (===):
df.join(df2, df("name") === df2("name"), "left").show()

df.filter("age > 30")
  .join(department, df("deptId") === department("id"))
  .groupBy(department("name"), "gender")
  .agg(avg(df("salary")), max(df("age")))
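Note that avg and max come from org.apache.spark.sql.functions. As a sketch, the same query can also be written in SQL once the DataFrames are registered as temporary tables (the df and department DataFrames are the ones from the example above; the table names employees and department are hypothetical):

import org.apache.spark.sql.functions.{avg, max}   // needed for the agg(...) call above

df.registerTempTable("employees")
department.registerTempTable("department")
sqlContext.sql(
  "SELECT d.name, e.gender, AVG(e.salary), MAX(e.age) " +
  "FROM employees e JOIN department d ON e.deptId = d.id " +
  "WHERE e.age > 30 " +
  "GROUP BY d.name, e.gender").show()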

2. Data Sources in SparkSQL

Spark SQL supports operations on a variety of data sources through the SchemaRDD interface. A SchemaRDD can be operated on like a normal RDD or registered as a temporary table. Registering a SchemaRDD as a table allows you to run SQL queries over its data.
Several kinds of data sources can be loaded as a SchemaRDD, including RDDs, Parquet files (columnar storage), JSON datasets, and Hive tables. The following describes how to convert an RDD into a SchemaRDD.
(1) Use reflection inference mode
Reflection is used to infer the schema of an RDD that contains objects of a specific type. This approach fits the case where you already know the schema while writing the Spark program, and reflection keeps the code concise. The case class field names are read via reflection and become the column names. Such an RDD can be implicitly converted into a SchemaRDD and then registered as a table, which can be used in subsequent SQL statements.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(name: String, age: Int)
val people = sc.textFile("file:///home/hdfs/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 19 AND age <= 30")
teenagers.map(t => "Name:" + t(0)).collect().foreach(println)
teenagers.map(t => "Name:" + t.getAs[String]("name")).collect().foreach(println)
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)

(2) Programmatic mode
The schema is constructed through a programmatic interface and then applied to an existing RDD. This approach fits the case where the schema is not known until runtime.
A SchemaRDD can be created in three steps:

Create an RDD of Rows from the original RDD
Create the schema, represented by a StructType, that matches the structure of the Rows created in step 1
Apply the schema to the RDD of Rows via applySchema

val people = sc.textFile("file:///home/hdfs/people.txt")
val schemaString = "name age"
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
peopleSchemaRDD.registerTempTable("people")
val results = sqlContext.sql("SELECT name FROM people")   // a DataFrame; supports all the normal RDD operations
results.map(t => "Name:" + t(0)).collect().foreach(println)

Result output

Name: Andy
Name: Justin
Name: JohnSmith
Name: Bob

3. Performance Tuning
Performance can be improved and workload reduced mainly by caching data in memory or by setting experimental options.
(1) cache data in memory
Spark SQL can cache tables in an in-memory columnar format by calling sqlContext.cacheTable("tableName"). Spark SQL then scans only the required columns and automatically tunes compression to minimize memory usage and garbage-collection pressure.
You can also configure the in-memory cache with the setConf method on SQLContext or by running a SET key=value command in SQL.
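A minimal sketch, reusing the "people" temporary table registered earlier:

sqlContext.cacheTable("people")                          // cache in the in-memory columnar format
sqlContext.sql("SELECT COUNT(*) FROM people").show()     // served from the cache
sqlContext.uncacheTable("people")                        // release the memory when done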
(2) configuration options
Query execution performance can be tuned with options such as spark.sql.shuffle.partitions and spark.sql.codegen.
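For example, both ways of setting an option mentioned above (the values are illustrative only):

sqlContext.setConf("spark.sql.shuffle.partitions", "200")   // programmatic configuration
sqlContext.sql("SET spark.sql.codegen=true")                // the same via a SQL SET command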

4. Others
Spark SQL also provides an interface for running SQL queries directly, without writing any code. Run the following command from the Spark directory to start the Spark SQL CLI:

./bin/spark-sql
