Spark SQL's DataFrame: a practical explanation

Source: Internet
Author: User
Tags: reflection, log4j

1. DataFrame Introduction

In Spark, a DataFrame is an RDD-based distributed data set, similar to a two-dimensional table in a traditional database. A DataFrame carries schema meta-information, that is, each column of the two-dimensional table represented by the DataFrame has a name and a type.

Its schema looks similar to this:

root
 |-- age: long (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

2. Preparing a structured test data set

people.json

{"id": 1, "name": "Ganymede", "Age": +}  {"id": 2, "name": "Lilei", "Age": +}  {"id": 3, "name": "Lily", "Age"  : {"id": 4, "name": "Hanmeimei", "Age"  : {"id": 5, "name": "Lucy", "Age": PNs}  {"id": 6, "name": "Tom", "Age": 27}  

3. Understanding the DataFrame programmatically

1) Manipulating the data via the DataFrame API

import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger

object DataFrameTest {
  def main(args: Array[String]): Unit = {
    // Log display level
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)

    // Initialize
    val conf = new SparkConf().setAppName("DataFrameTest")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.json("people.json")

    // View the data in df
    df.show()
    // View the schema
    df.printSchema()
    // View a single field
    df.select("name").show()
    // View multiple fields, adding 1 to one of the values
    df.select(df.col("name"), df.col("age").plus(1)).show()
    // Filter on the value of a field
    df.filter(df.col("age").gt(25)).show()
    // Count, grouped by a field's value
    df.groupBy("age").count().show()
    // foreach: handle the field values of each row
    df.select(df.col("id"), df.col("name"), df.col("age")).foreach { x =>
      // Get data by index
      println("col1: " + x.get(0) + ", col2: " + x.get(1) + ", col3: " + x.get(2))
    }
    // foreachPartition: handle rows partition by partition, the usual way in production
    df.select(df.col("id"), df.col("name"), df.col("age")).foreachPartition { iterator =>
      iterator.foreach { x =>
        // Get data by field name
        println("id: " + x.getAs("id") + ", name: " + x.getAs("name") + ", age: " + x.getAs("age"))
      }
    }
  }
}
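
The same steps can also be tried interactively. A minimal sketch, assuming a Spark 1.x spark-shell (where sc and sqlContext are already provided) and people.json in the working directory:

// Interactive variant of the steps above, run inside spark-shell
val df = sqlContext.read.json("people.json")
df.select(df.col("name"), df.col("age").plus(1)).show()
df.filter(df.col("age").gt(25)).show()
df.groupBy("age").count().show()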


2) Manipulating the data by registering a temporary table and running SQL

import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger

/**
 * @author Administrator
 */
object DataFrameTest2 {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("DataFrameTest2")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.json("people.json")
    df.registerTempTable("people")

    df.show()
    df.printSchema()

    // View a single field
    sqlContext.sql("select name from people").show()
    // View multiple fields
    sqlContext.sql("select name, age + 1 from people").show()
    // Filter on the value of a field
    sqlContext.sql("select age from people where age >= 25").show()
    // Count, grouped by a field's value
    sqlContext.sql("select age, count(*) cnt from people group by age").show()
    // foreach: handle the field values of each row
    sqlContext.sql("select id, name, age from people").foreach { x =>
      // Get data by index
      println("col1: " + x.get(0) + ", col2: " + x.get(1) + ", col3: " + x.get(2))
    }
    // foreachPartition: handle rows partition by partition, the usual way in production
    sqlContext.sql("select id, name, age from people").foreachPartition { iterator =>
      iterator.foreach { x =>
        // Get data by field name
        println("id: " + x.getAs("id") + ", name: " + x.getAs("name") + ", age: " + x.getAs("age"))
      }
    }
  }
}
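
Because the data is exposed as a temporary table, any SQL that Spark SQL supports can be run against it. A few illustrative queries that are not in the original, assuming the same people table registered above:

// Additional queries against the registered temporary table (illustrative only)
sqlContext.sql("select name, age from people order by age desc limit 3").show()
sqlContext.sql("select avg(age) as avg_age from people").show()

// The temporary table is scoped to this SQLContext; it can be listed and dropped when done
sqlContext.tableNames().foreach(println)
sqlContext.dropTempTable("people")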

Both approaches produce the same results; the first suits programmers working with the API, while the second suits people who are already familiar with SQL.

4. For unstructured data

people.txt

1,Ganymede,32
2,Lilei,19
3,Lily,25
4,Hanmeimei,25
5,Lucy,37
6,WCC,4

1) Registering a temporary table by specifying the schema programmatically (StructType)


import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.Row

/**
 * @author Administrator
 */
object DataFrameTest3 {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("DataFrameTest3")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val people = sc.textFile("people.txt")
    // Parse each line into a Row
    val peopleRowRDD = people.map { x => x.split(",") }.map { data =>
      val id = data(0).trim().toInt
      val name = data(1).trim()
      val age = data(2).trim().toInt
      Row(id, name, age)
    }
    // Build the schema explicitly
    val structType = StructType(Array(
      StructField("id", IntegerType, true),
      StructField("name", StringType, true),
      StructField("age", IntegerType, true)))

    val df = sqlContext.createDataFrame(peopleRowRDD, structType)
    df.registerTempTable("people")
    df.show()
    df.printSchema()
  }
}
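
The main reason to build the schema programmatically like this is that the column layout does not have to be known at compile time. A hedged sketch of that variation, reusing peopleRowRDD and sqlContext from above with a hypothetical columns string:

// Hypothetical variation: the schema is derived from a string that could come
// from configuration or user input at runtime
val columns = "id,name,age"
val runtimeSchema = StructType(columns.split(",").map { field =>
  StructField(field, if (field == "name") StringType else IntegerType, true)
})
val df2 = sqlContext.createDataFrame(peopleRowRDD, runtimeSchema)
df2.show()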

2) Registering a temporary table by inferring the schema with a case class (reflection)

import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Level
import org.apache.log4j.Logger

/**
 * @author Administrator
 */
object DataFrameTest4 {
  case class People(id: Int, name: String, age: Int)

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("DataFrameTest4")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val people = sc.textFile("people.txt")
    // Parse each line into a case class instance; the schema is inferred by reflection
    val peopleRDD = people.map { x => x.split(",") }.map { data =>
      People(data(0).trim().toInt, data(1).trim(), data(2).trim().toInt)
    }
    // An implicit conversion is needed here for toDF()
    import sqlContext.implicits._
    val df = peopleRDD.toDF()

    df.registerTempTable("people")
    df.show()
    df.printSchema()
  }
}
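
Going the other way is sometimes useful as well. A small sketch, assuming the People case class and the df from the example above, that turns each Row back into a typed object:

// Hypothetical follow-up: convert the DataFrame rows back into case class instances
val typedPeople = df.map { row =>
  People(row.getAs[Int]("id"), row.getAs[String]("name"), row.getAs[Int]("age"))
}
typedPeople.take(3).foreach(println)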

5. Summary:

Spark SQL is a module in Spark that is used primarily for processing structured data. Its core programming abstraction is the DataFrame. Spark SQL can also act as a distributed SQL query engine, and one of its most important capabilities is querying data from Hive.

A DataFrame can be understood as a distributed collection of data organized into columns. It is very similar to a table in a relational database, but the underlying engine performs many optimizations. A DataFrame can be built from a number of sources, including structured data files, Hive tables, external relational databases, and RDDs.
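
To make the point about sources concrete, here is a minimal sketch of reading from two of them with the Spark 1.x API; the Hive database/table name and the JDBC connection string are assumptions for illustration, and the Hive example additionally requires Spark to be built with Hive support:

// Hive table (assumed database and table name)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val fromHive = hiveContext.sql("select * from some_db.people")

// External relational database over JDBC (assumed URL; the JDBC driver must be on the classpath)
val fromJdbc = sqlContext.read.format("jdbc")
  .options(Map("url" -> "jdbc:mysql://localhost:3306/test", "dbtable" -> "people"))
  .load()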
