Spark SQL data loading and saving explained with examples


First, prerequisite knowledge
The most important operations in Spark SQL are on the DataFrame, and the DataFrame itself provides save and load operations.
Load: creates a DataFrame from input data; with format we indicate the type of file we want to read.
Save: saves the data in the DataFrame to a file, in a specific format that indicates the type of file we want to write.

Second, Spark SQL read and write data in practice

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.ArrayList;
import java.util.List;

public class SparkSQLLoadSaveOps {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local").setAppName("SparkSQLLoadSaveOps");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    /**
     * read() returns a DataFrameReader; load() can then read the data in as a DataFrame.
     */
    DataFrame peopleDF = sqlContext.read().format("json")
        .load("E:\\Spark\\sparkinstanll_package\\big_data_software\\spark-1.6.0-bin-hadoop2.6\\examples\\src\\main\\resources\\people.json");

    /**
     * Operate on the DataFrame directly.
     * JSON is a self-describing format, so how is its schema judged when it is read?
     * By scanning the entire JSON file; the metadata is only known after the scan.
     */
    // SaveMode.Append creates a new file in order to append the data.
    peopleDF.select("name").write().mode(SaveMode.Append).save("E:\\personNames");
  }
}
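Since the JSON schema is only known after the scan, a quick way to see what Spark inferred is to call printSchema on the loaded DataFrame. Below is a minimal Scala sketch of the same idea (the relative path points at the people.json sample that ships with the Spark distribution; adjust it to your own environment):

// Load the JSON file and inspect the schema Spark inferred from it.
val peopleDF = sqlContext.read.format("json")
  .load("examples/src/main/resources/people.json")
peopleDF.printSchema()
// Expected output, roughly:
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)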

The source code analysis of the read process is as follows:
1. The read method returns a DataFrameReader, which is used to read data.

/**
 * :: Experimental ::
 * Returns a [[DataFrameReader]] that can be used to read data in as a [[DataFrame]].
 * {{{
 *   sqlContext.read.parquet("/path/to/file.parquet")
 *   sqlContext.read.schema(schema).json("/path/to/file.json")
 * }}}
 *
 * @group genericdata
 * @since 1.4.0
 */
@Experimental
// Creates a DataFrameReader instance and hands back the reference
def read: DataFrameReader = new DataFrameReader(this)
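Because read builds a new DataFrameReader on every call, the reader can be kept in a variable, configured step by step, and reused; load() is what actually reads the data. A minimal sketch (the paths are hypothetical):

// Each call to sqlContext.read returns a fresh DataFrameReader.
val jsonReader = sqlContext.read.format("json")
// The same configured reader can load several files; load() does the actual read.
val df1 = jsonReader.load("/tmp/a.json")
val df2 = jsonReader.load("/tmp/b.json")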

2. The format method of the DataFrameReader class is then called to indicate how the file should be read.

/**
 * Specifies the input data source format.
 *
 * @since 1.4.0
 */
def format(source: String): DataFrameReader = {
  this.source = source
  this
}

3. The load method of the DataFrameReader then turns the input at the given path into a DataFrame.

/**
 * Loads input in as a [[DataFrame]], for data sources that require a path (e.g. data backed by
 * a local or distributed file system).
 *
 * @since 1.4.0
 */
// TODO: Remove this one in Spark 2.0.
def load(path: String): DataFrame = {
  option("path", path).load()
}

At this point the data reading work is complete; what follows is the operation on the DataFrame.
Below is the write operation.

1. Call the select function on the DataFrame to filter the columns.

/**
 * Selects a set of columns. This is a variant of `select` that can only select
 * existing columns using column names (i.e. cannot construct expressions).
 *
 * {{{
 *   // The following two are equivalent:
 *   df.select("colA", "colB")
 *   df.select($"colA", $"colB")
 * }}}
 * @group dfops
 * @since 1.3.0
 */
@scala.annotation.varargs
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)): _*)

2. Then write the results to the external storage system.

/**
 * :: Experimental ::
 * Interface for saving the content of the [[DataFrame]] out into external storage.
 *
 * @group output
 * @since 1.4.0
 */
@Experimental
def write: DataFrameWriter = new DataFrameWriter(this)

3. mode specifies the save behavior when the target already exists, for example appending to existing files.

/**
 * Specifies the behavior when data or table already exists. Options include:
 *   // Overwrite means the existing data is replaced
 *   - `SaveMode.Overwrite`: overwrite the existing data.
 *   // Append creates a new file and appends the data
 *   - `SaveMode.Append`: append the data.
 *   - `SaveMode.Ignore`: ignore the operation (i.e. no-op).
 *   - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime.
 *
 * @since 1.4.0
 */
def mode(saveMode: SaveMode): DataFrameWriter = {
  this.mode = saveMode
  this
}

4. Finally, the save() method triggers the action and writes the output to the specified path.

/**
 * Saves the content of the [[DataFrame]] at the specified path.
 *
 * @since 1.4.0
 */
def save(path: String): Unit = {
  this.extraOptions += ("path" -> path)
  save()
}

Third, the overall flow of Spark SQL reading and writing

Fourth, detailed source code for some of the functions in this process

DataFrameReader.load()

1. load() returns a data collection of type DataFrame, read from the given path using the default data source.

/**
 * Returns the dataset stored at path as a DataFrame,
 * using the default data source configured by spark.sql.sources.default.
 *
 * @group genericdata
 * @deprecated As of 1.4.0, replaced by `read().load(path)`. This will be removed in Spark 2.0.
 */
@deprecated("Use read.load(path). This will be removed in Spark 2.0.", "1.4.0")
def load(path: String): DataFrame = {
  // At this point, read is the DataFrameReader
  read.load(path)
}

2. Tracing into the load source code, it is as follows:
The load() method in DataFrameReader turns the input at the given path into a DataFrame.

/**
 * Loads input in as a [[DataFrame]], for data sources that require a path (e.g. data backed by
 * a local or distributed file system).
 *
 * @since 1.4.0
 */
// TODO: Remove this one in Spark 2.0.
def load(path: String): DataFrame = {
  option("path", path).load()
}

3. Tracing further into the parameterless load(), its source code is as follows:

/**
 * Loads input in as a [[DataFrame]], for data sources that don't require a path (e.g. external
 * key-value stores).
 *
 * @since 1.4.0
 */
def load(): DataFrame = {
  // Resolve the data source that was passed in
  val resolved = ResolvedDataSource(
    sqlContext,
    userSpecifiedSchema = userSpecifiedSchema,
    partitionColumns = Array.empty[String],
    provider = source,
    options = extraOptions.toMap)
  DataFrame(sqlContext, LogicalRelation(resolved.relation))
}

DataFrameReader.format()

1. format specifies the file format, and this is an important insight: it means that data read as JSON can, for example, be saved as Parquet or another format (see the sketch after the source snippet below).
When reading a file, Spark SQL can specify the type of file to read, for example JSON or Parquet.

/**
 * Specifies the input data source format. Built-in options include "parquet", "json", etc.
 *
 * @since 1.4.0
 */
def format(source: String): DataFrameReader = {
  this.source = source  // the file type
  this
}
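As a concrete illustration, the sketch below reads a JSON file and persists the same data as Parquet simply by choosing a different format on the writer (the paths are placeholders):

// The input format is chosen on the reader, the output format on the writer,
// so JSON data can be rewritten as Parquet without any extra conversion step.
val peopleDF = sqlContext.read.format("json").load("/tmp/people.json")
peopleDF.write.format("parquet").save("/tmp/people_parquet")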

DataFrame.write()

1. Creates a DataFrameWriter instance.

/**
 * :: Experimental ::
 * Interface for saving the content of the [[DataFrame]] out into external storage.
 *
 * @group output
 * @since 1.4.0
 */
@Experimental
def write: DataFrameWriter = new DataFrameWriter(this)

2. Tracing into the DataFrameWriter source code, it is as follows:
It writes the data in the DataFrame out to an external storage system.

/**
 * :: Experimental ::
 * Interface used to write a [[DataFrame]] to external storage systems (e.g. file systems,
 * key-value stores, etc). Use [[DataFrame.write]] to access this.
 *
 * @since 1.4.0
 */
@Experimental
final class DataFrameWriter private[sql](df: DataFrame) {

DataFrameWriter.mode()

1. Overwrite means overwriting: all of the previously written data is replaced.
Append means appending: for ordinary files the data is appended within a file, but for Parquet-format files a new file is created for the appended data (see the sketch after the source snippet below).

/**
 * Specifies the behavior when data or table already exists. Options include:
 *   - `SaveMode.Overwrite`: overwrite the existing data.
 *   - `SaveMode.Append`: append the data.
 *   - `SaveMode.Ignore`: ignore the operation (i.e. no-op).
 *   // the default action
 *   - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime.
 *
 * @since 1.4.0
 */
def mode(saveMode: SaveMode): DataFrameWriter = {
  this.mode = saveMode
  this
}
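A minimal sketch of the four save modes in use; df stands for any existing DataFrame and the target paths are placeholders:

import org.apache.spark.sql.SaveMode

// ErrorIfExists is the default: writing to a path that already holds data throws.
df.write.mode(SaveMode.ErrorIfExists).save("/tmp/out_error")
// Overwrite replaces whatever is already at the path.
df.write.mode(SaveMode.Overwrite).save("/tmp/out_overwrite")
// Append adds new files next to the existing ones.
df.write.mode(SaveMode.Append).save("/tmp/out_append")
// Ignore silently does nothing if data already exists at the path.
df.write.mode(SaveMode.Ignore).save("/tmp/out_ignore")
// The string overload shown below accepts "overwrite", "append", "ignore" and "error".
df.write.mode("append").save("/tmp/out_append")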

2. The string form of the parameter is mapped to a SaveMode through pattern matching.

/**
 * Specifies the behavior when data or table already exists. Options include:
 *   - `overwrite`: overwrite the existing data.
 *   - `append`: append the data.
 *   - `ignore`: ignore the operation (i.e. no-op).
 *   - `error`: default option, throw an exception at runtime.
 *
 * @since 1.4.0
 */
def mode(saveMode: String): DataFrameWriter = {
  this.mode = saveMode.toLowerCase match {
    case "overwrite" => SaveMode.Overwrite
    case "append" => SaveMode.Append
    case "ignore" => SaveMode.Ignore
    case "error" | "default" => SaveMode.ErrorIfExists
    case _ => throw new IllegalArgumentException(s"Unknown save mode: $saveMode. " +
      "Accepted modes are 'overwrite', 'append', 'ignore', 'error'.")
  }
  this
}

DataFrameWriter.save()

1. save saves the result to the path that is passed in.

/**
 * Saves the content of the [[DataFrame]] at the specified path.
 *
 * @since 1.4.0
 */
def save(path: String): Unit = {
  this.extraOptions += ("path" -> path)
  save()
}

2. Tracing into the save() method:

/**
 * Saves the content of the [[DataFrame]] as the specified table.
 *
 * @since 1.4.0
 */
def save(): Unit = {
  ResolvedDataSource(
    df.sqlContext,
    source,
    partitioningColumns.map(_.toArray).getOrElse(Array.empty[String]),
    mode,
    extraOptions.toMap,
    df)
}

3. Here source is SQLConf's defaultDataSourceName:

private var source: String = df.sqlContext.conf.defaultDataSourceName

where the default value of DEFAULT_DATA_SOURCE_NAME is Parquet:

// This is used to set the default data source
val DEFAULT_DATA_SOURCE_NAME = stringConf("spark.sql.sources.default",
  defaultValue = Some("org.apache.spark.sql.parquet"),
  doc = "The default data source to use in input/output.")

Detailed explanation of some of the functions in DataFrame.scala:

1. The toDF function converts an RDD into a DataFrame; the no-argument overload on DataFrame itself simply returns the object, as the source below shows.

/**
 * Returns the object itself.
 * @group basic
 * @since 1.3.0
 */
// This is declared with parentheses to prevent the Scala compiler from treating
// `rdd.toDF("1")` as invoking this toDF and then apply on the returned DataFrame.
def toDF(): DataFrame = this
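The more common use is converting an RDD of case-class objects, which requires the SQLContext implicits to be in scope. A minimal sketch (the Person case class is made up for illustration):

// Illustrative case class; toDF on an RDD needs the implicits import below.
case class Person(name: String, age: Int)
import sqlContext.implicits._
val peopleRDD = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
val peopleDF = peopleRDD.toDF()          // RDD[Person] -> DataFrame
val renamedDF = peopleRDD.toDF("n", "a") // the overload that renames the columns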

2. The show() method displays the results.

/**
 * Displays the [[DataFrame]] in a tabular form. For example:
 * {{{
 *   year  month AVG('Adj Close) MAX('Adj Close)
 *   1980  12    0.503218        0.595103
 *   1981  01    0.523289        0.570307
 *   1982  02    0.436504        0.475256
 *   1983  03    0.410516        0.442194
 *   1984  04    0.450090        0.483521
 * }}}
 * @param numRows Number of rows to show
 * @param truncate Whether truncate long strings. If true, strings more than 20 characters will
 *                 be truncated and all cells will be aligned right
 *
 * @group action
 * @since 1.5.0
 */
// scalastyle:off println
def show(numRows: Int, truncate: Boolean): Unit = println(showString(numRows, truncate))
// scalastyle:on println

Tracing the showString source code: showString triggers the action that collects the data.

/**
 * Compose the string representing rows for output
 * @param _numRows Number of rows to show
 * @param truncate Whether truncate long strings and align cells right
 */
private[sql] def showString(_numRows: Int, truncate: Boolean = true): String = {
  val numRows = _numRows.max(0)
  val sb = new StringBuilder
  val takeResult = take(numRows + 1)
  val hasMoreData = takeResult.length > numRows
  val data = takeResult.take(numRows)
  val numCols = schema.fieldNames.length
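showString is what show ultimately prints, so in day-to-day use only show itself is called. A short usage sketch:

// show() defaults to 20 rows with long strings truncated; both can be overridden.
peopleDF.show()
peopleDF.show(5, false)   // first 5 rows, no truncation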

That is the entire content of this article. I hope it helps you with your study, and I hope you will continue to support the Cloud Habitat Community.
