DataFrame, one of the most important new features introduced in Spark 1.3, is similar to the data frame in the R language, and it makes Spark SQL more stable and efficient.
1. DataFrame Introduction:
In Spark, a DataFrame is an RDD-based distributed data set, similar to a two-dimensional table in a traditional relational database. A DataFrame carries schema metadata, that is, each column of the two-dimensional table represented by the DataFrame has a name and a type.
Similar to this:

root
 |-- age: long (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
2. Prepare a structured test data set
People.json
{"id": 1, "name": "Ganymede", "Age": 32} {"id": 2, "name": "Lilei", "Age": 19} {"id": 3, "name": "Lily", "Age": 25} {"id": 4, "name": "Hanmeimei", "Age": 25} {"id": 5, "name": "Lucy", "Age": 37} {"id": 6, "name": "Tom", "Age": 27}
3. Understanding DataFrame programmatically
1) Manipulating the data via the DataFrame API
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger

object DataFrameTest {
  def main(args: Array[String]): Unit = {
    // Log display level
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)

    // Initialize
    val conf = new SparkConf().setAppName("DataFrameTest")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read.json("People.json")

    // View the data in the DataFrame
    df.show()
    // View the schema
    df.printSchema()
    // View a single field
    df.select("name").show()
    // View multiple fields, adding 1 to one of the values
    df.select(df.col("name"), df.col("age").plus(1)).show()
    // Filter on the value of a field
    df.filter(df.col("age").gt(25)).show()
    // Group by a field and count
    df.groupBy("age").count().show()

    // foreach: process each returned row
    df.select(df.col("id"), df.col("name"), df.col("age")).foreach { x =>
      // Get data by index
      println("col1: " + x.get(0) + ", col2: " + x.get(1) + ", col3: " + x.get(2))
    }

    // foreachPartition: process the returned rows per partition, the approach commonly used in production
    df.select(df.col("id"), df.col("name"), df.col("age")).foreachPartition { iterator =>
      iterator.foreach { x =>
        // Get data by field name
        println("id: " + x.getAs[Long]("id") + ", name: " + x.getAs[String]("name") + ", age: " + x.getAs[Long]("age"))
      }
    }
  }
}
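A practical note on the two traversal examples above: the println statements inside foreach and foreachPartition execute on the executors, so when running on a cluster their output appears in the executor logs rather than on the driver console. For quickly inspecting a small result on the driver, a collect()-based sketch like the following (using the same df as above) can be more convenient:

// Sketch: bring a small result back to the driver and print it there.
// collect() materializes all rows in driver memory, so only use it for small DataFrames.
df.select(df.col("id"), df.col("name"), df.col("age"))
  .collect()
  .foreach(row => println(row.mkString(", ")))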
2) Manipulating the data by registering a temporary table and using SQL
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger

/**
 * @author Administrator
 */
object DataFrameTest2 {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("DataFrameTest2")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read.json("People.json")
    df.registerTempTable("people")

    df.show()
    df.printSchema()

    // View a single field
    sqlContext.sql("select name from people").show()
    // View multiple fields
    sqlContext.sql("select name, age + 1 from people").show()
    // Filter on the value of a field
    sqlContext.sql("select age from people where age >= 25").show()
    // Group by a field and count
    sqlContext.sql("select age, count(*) cnt from people group by age").show()

    // foreach: process each returned row
    sqlContext.sql("select id, name, age from people").foreach { x =>
      // Get data by index
      println("col1: " + x.get(0) + ", col2: " + x.get(1) + ", col3: " + x.get(2))
    }

    // foreachPartition: process the returned rows per partition, the approach commonly used in production
    sqlContext.sql("select id, name, age from people").foreachPartition { iterator =>
      iterator.foreach { x =>
        // Get data by field name
        println("id: " + x.getAs[Long]("id") + ", name: " + x.getAs[String]("name") + ", age: " + x.getAs[Long]("age"))
      }
    }
  }
}
The two approaches produce the same results; the first suits programmers, while the second suits people who are familiar with SQL.
4. For unstructured data
People.txt
1,Ganymede,32
2,Lilei,19
3,Lily,25
4,Hanmeimei,25
5,Lucy,37
6,WCC,4
1) Registering a temporary table by explicitly mapping fields to a schema (StructType)
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.Row

/**
 * @author Administrator
 */
object DataFrameTest3 {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("DataFrameTest3")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val people = sc.textFile("People.txt")
    // Turn each text line into a Row with typed fields
    val peopleRowRDD = people.map { x => x.split(",") }.map { data =>
      val id = data(0).trim().toInt
      val name = data(1).trim()
      val age = data(2).trim().toInt
      Row(id, name, age)
    }

    // Explicitly describe the schema of the two-dimensional table
    val structType = StructType(Array(
      StructField("id", IntegerType, true),
      StructField("name", StringType, true),
      StructField("age", IntegerType, true)))

    val df = sqlContext.createDataFrame(peopleRowRDD, structType)
    df.registerTempTable("people")
    df.show()
    df.printSchema()
  }
}
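After registerTempTable, the DataFrame built from the explicit schema can be queried with SQL in the same way as in section 3. A small illustrative query (not part of the original listing) might look like this:

// Query the temporary table registered by DataFrameTest3
sqlContext.sql("select name, age from people where age >= 25").show()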
2) Registering a temporary table through case class reflection
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger

/**
 * @author Administrator
 */
object DataFrameTest4 {
  case class People(id: Int, name: String, age: Int)

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("DataFrameTest4")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val people = sc.textFile("People.txt")
    // Map each text line onto the People case class
    val peopleRDD = people.map { x => x.split(",") }.map { data =>
      People(data(0).trim().toInt, data(1).trim(), data(2).trim().toInt)
    }

    // An implicit conversion is needed here to enable toDF() on the RDD
    import sqlContext.implicits._
    val df = peopleRDD.toDF()
    df.registerTempTable("people")
    df.show()
    df.printSchema()
  }
}
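Here the column names are derived from the fields of the People case class, so both the DataFrame API and SQL can refer to them by name. For illustration (again, not part of the original listing):

// Columns id, name and age come from the People case class fields
df.select("name", "age").show()
// The registered temporary table works for SQL as well
sqlContext.sql("select count(*) cnt from people").show()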
5. Summary:
Spark SQL is the Spark module used primarily for processing structured data. Its core programming abstraction is the DataFrame, and it can also act as a distributed SQL query engine. One of the most important features of Spark SQL is the ability to query data from Hive.
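As an aside on the Hive point: in this version of Spark, Hive queries go through HiveContext rather than the plain SQLContext. A minimal sketch, assuming a Spark build with Hive support, a reachable Hive metastore, and a hypothetical Hive table named people_hive:

import org.apache.spark.sql.hive.HiveContext

// Requires Spark compiled with Hive support and an existing Hive metastore
val hiveContext = new HiveContext(sc)
// people_hive is a hypothetical table name used only for illustration
hiveContext.sql("select * from people_hive limit 10").show()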
A DataFrame can be understood as a distributed collection of data organized into columns. It is very similar to a table in a relational database, but the underlying layer applies many optimizations. A DataFrame can be built from many sources, including structured data files, Hive tables, external relational databases, and RDDs.
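As a rough sketch of building DataFrames from those different sources (the Parquet path and the JDBC settings below are placeholders, and the JDBC reader assumes a Spark version that ships the DataFrameReader API used elsewhere in this article):

// Structured data file (Parquet); the path is a placeholder
val parquetDF = sqlContext.read.parquet("people.parquet")

// External relational database via JDBC; URL, table and driver are placeholders
val jdbcDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "people")
  .option("driver", "com.mysql.jdbc.Driver")
  .load()

// Existing RDD: via a case class and toDF(), as shown in DataFrameTest4 above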
The above is a practical walkthrough of Spark SQL's DataFrame.