Introduction to Spark SQL data loading and saving, with a worked example


I. Prerequisite knowledge
Spark SQL mainly operates on DataFrames, and a DataFrame provides load and save operations.
Load: reads data from a source and creates a DataFrame.
Save: writes the data in a DataFrame out to a file; format() is used to specify the type of the file to be read or the format of the file to be written.
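For orientation, here is a minimal Scala sketch of the same load/save pattern; the paths are hypothetical placeholders and an existing SQLContext named sqlContext is assumed. The full Java example follows in the next section.

import org.apache.spark.sql.SaveMode

// load: build a DataFrame from a JSON file (path is a placeholder)
val df = sqlContext.read.format("json").load("/path/to/input.json")
// save: write the selected column back out, appending to any existing output
df.select("name").write.mode(SaveMode.Append).save("/path/to/output")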

II. Spark SQL data read/write code practice

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

public class SparkSQLLoadSaveOps {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local").setAppName("SparkSQLLoadSaveOps");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    /*
     * read() is of the DataFrameReader type; load() reads the data in.
     * Json is a self-describing format: how is the schema determined when reading Json?
     * The whole Json file is scanned, and the metadata is inferred from it.
     */
    DataFrame peopleDF = sqlContext.read().format("json")
        .load("E:\\Spark\\Sparkinstanll_package\\Big_Data_Software\\spark-1.6.0-bin-hadoop2.6\\examples\\src\\main\\resources\\people.json");

    // Operate on the DataFrame directly: select the "name" column,
    // then append the result to the output path (SaveMode.Append appends to existing output).
    peopleDF.select("name").write().mode(SaveMode.Append).save("E:\\personNames");
  }
}

The source code of the read process is analyzed as follows:
1. The read method returns DataFrameReader, which is used to read data.

/**
 * :: Experimental ::
 * Returns a [[DataFrameReader]] that can be used to read data in as a [[DataFrame]].
 * {{{
 *   sqlContext.read.parquet("/path/to/file.parquet")
 *   sqlContext.read.schema(schema).json("/path/to/file.json")
 * }}}
 *
 * @group genericdata
 * @since 1.4.0
 */
@Experimental
// Creates a DataFrameReader instance and returns a reference to it
def read: DataFrameReader = new DataFrameReader(this)

2. Call format in the DataFrameReader class to indicate the format of the file to be read.

/**
 * Specifies the input data source format.
 *
 * @since 1.4.0
 */
def format(source: String): DataFrameReader = {
  this.source = source
  this
}

3. The load method in DataFrameReader converts the input at the given path into a DataFrame.

/**
 * Loads input in as a [[DataFrame]], for data sources that require a path (e.g. data backed by
 * a local or distributed file system).
 *
 * @since 1.4.0
 */
// TODO: Remove this one in Spark 2.0.
def load(path: String): DataFrame = {
  option("path", path).load()
}

At this point the read is complete and the DataFrame can be operated on.
Next comes the write operation.

1. Call the select function on the DataFrame to select the desired columns.

/**
 * Selects a set of columns. This is a variant of `select` that can only select
 * existing columns using column names (i.e. cannot construct expressions).
 *
 * {{{
 *   // The following two are equivalent:
 *   df.select("colA", "colB")
 *   df.select($"colA", $"colB")
 * }}}
 * @group dfops
 * @since 1.3.0
 */
@scala.annotation.varargs
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)

2. Write the results to an external storage system.

/**
 * :: Experimental ::
 * Interface for saving the content of the [[DataFrame]] out into external storage.
 *
 * @group output
 * @since 1.4.0
 */
@Experimental
def write: DataFrameWriter = new DataFrameWriter(this)

3. Specify the append mode when saving files.

/**
 * Specifies the behavior when data or table already exists. Options include:
 *   - `SaveMode.Overwrite`: overwrite the existing data.
 *   - `SaveMode.Append`: append the data (for file sources, new files are created and added to the output).
 *   - `SaveMode.Ignore`: ignore the operation (i.e. no-op).
 *   - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime.
 *
 * @since 1.4.0
 */
def mode(saveMode: SaveMode): DataFrameWriter = {
  this.mode = saveMode
  this
}

4. Finally, the save() method triggers the job that writes the output to the specified path.

/**
 * Saves the content of the [[DataFrame]] at the specified path.
 *
 * @since 1.4.0
 */
def save(path: String): Unit = {
  this.extraOptions += ("path" -> path)
  save()
}

III. The entire Spark SQL read/write process
In summary: sqlContext.read().format(...).load(...) builds a DataFrame, and dataFrame.select(...).write().mode(...).save(...) writes it back out.

IV. Detailed explanation of some function source code in the process

DataFrameReader.load()

1. load(path) returns a dataset of the DataFrame type, read from the given path using the default data source configured by spark.sql.sources.default.

/**
 * Returns the dataset stored at path as a DataFrame,
 * using the default data source configured by spark.sql.sources.default.
 *
 * @group genericdata
 * @deprecated As of 1.4.0, replaced by `read().load(path)`. This will be removed in Spark 2.0.
 */
@deprecated("Use read.load(path). This will be removed in Spark 2.0.", "1.4.0")
def load(path: String): DataFrame = {
  // read is a DataFrameReader
  read.load(path)
}
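As a quick illustration, the deprecated call and its 1.4+ replacement are equivalent for the default data source. This is a minimal sketch assuming an existing sqlContext; the path is a hypothetical placeholder.

// old style, deprecated since 1.4.0
val df1 = sqlContext.load("/path/to/people.parquet")
// new style: go through the DataFrameReader returned by sqlContext.read
val df2 = sqlContext.read.load("/path/to/people.parquet")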

2. Tracing into the load call: the load(path) method in DataFrameReader converts the input at the given path into a DataFrame.

/**
 * Loads input in as a [[DataFrame]], for data sources that require a path (e.g. data backed by
 * a local or distributed file system).
 *
 * @since 1.4.0
 */
// TODO: Remove this one in Spark 2.0.
def load(path: String): DataFrame = {
  option("path", path).load()
}

3. Tracing further into the parameterless load():

/**
 * Loads input in as a [[DataFrame]], for data sources that don't require a path (e.g. external
 * key-value stores).
 *
 * @since 1.4.0
 */
def load(): DataFrame = {
  // Resolve the input source.
  val resolved = ResolvedDataSource(
    sqlContext,
    userSpecifiedSchema = userSpecifiedSchema,
    partitionColumns = Array.empty[String],
    provider = source,
    options = extraOptions.toMap)
  DataFrame(sqlContext, LogicalRelation(resolved.relation))
}
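Since load(path) simply stores the path as an option and delegates to the parameterless load(), the two calls below produce the same DataFrame. A minimal sketch assuming an existing sqlContext; the path is hypothetical.

val a = sqlContext.read.format("json").load("/data/people.json")
// equivalent: the path is passed explicitly as the "path" option
val b = sqlContext.read.format("json").option("path", "/data/people.json").load()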

DataFrameReader.format()

1. format specifies the file format, and this is quite flexible: the same code that reads Json can read Parquet simply by changing the format string.
When reading files, Spark SQL lets you specify the type of the file to be read, for example Json or Parquet.

/**
 * Specifies the input data source format. Built-in options include "parquet", "json", etc.
 *
 * @since 1.4.0
 */
def format(source: String): DataFrameReader = {
  this.source = source  // the file type
  this
}
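To illustrate the flexibility mentioned above, switching between formats only changes the format string. A minimal sketch assuming an existing sqlContext; the paths are hypothetical.

val jsonDF    = sqlContext.read.format("json").load("/data/people.json")
val parquetDF = sqlContext.read.format("parquet").load("/data/people.parquet")
// shortcut methods exist as well, e.g. read.json(...)
val jsonDF2   = sqlContext.read.json("/data/people.json")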

DataFrame.write()

1. Create a DataFrameWriter instance

/**
 * :: Experimental ::
 * Interface for saving the content of the [[DataFrame]] out into external storage.
 *
 * @group output
 * @since 1.4.0
 */
@Experimental
def write: DataFrameWriter = new DataFrameWriter(this)

2. Tracing into DataFrameWriter: it is the interface used to write a DataFrame out to external storage systems.

/**
 * :: Experimental ::
 * Interface used to write a [[DataFrame]] to external storage systems (e.g. file systems,
 * key-value stores, etc). Use [[DataFrame.write]] to access this.
 *
 * @since 1.4.0
 */
@Experimental
final class DataFrameWriter private[sql](df: DataFrame) {

DataFrameWriter.mode()

1. Overwrite overwrites all previously written data.
Append appends to existing output: for plain text files the data is appended to the file, while for formats such as parquet new files are created.

/**
 * Specifies the behavior when data or table already exists. Options include:
 *   - `SaveMode.Overwrite`: overwrite the existing data.
 *   - `SaveMode.Append`: append the data.
 *   - `SaveMode.Ignore`: ignore the operation (i.e. no-op).
 *   - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime.
 *
 * @since 1.4.0
 */
def mode(saveMode: SaveMode): DataFrameWriter = {
  this.mode = saveMode
  this
}

2. The String overload of mode receives the external parameter and pattern-matches it to a SaveMode.

/**
 * Specifies the behavior when data or table already exists. Options include:
 *   - `overwrite`: overwrite the existing data.
 *   - `append`: append the data.
 *   - `ignore`: ignore the operation (i.e. no-op).
 *   - `error`: default option, throw an exception at runtime.
 *
 * @since 1.4.0
 */
def mode(saveMode: String): DataFrameWriter = {
  this.mode = saveMode.toLowerCase match {
    case "overwrite" => SaveMode.Overwrite
    case "append" => SaveMode.Append
    case "ignore" => SaveMode.Ignore
    case "error" | "default" => SaveMode.ErrorIfExists
    case _ => throw new IllegalArgumentException(s"Unknown save mode: $saveMode. " +
      "Accepted modes are 'overwrite', 'append', 'ignore', 'error'.")
  }
  this
}
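Both the SaveMode enum and the string form can be used; the two writes below behave identically. A minimal sketch assuming df is an existing DataFrame (such as peopleDF from the example above); the output path is hypothetical.

import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Append).save("/output/personNames")
df.write.mode("append").save("/output/personNames")
// an unrecognized string such as "appendd" would throw IllegalArgumentException here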

DataFrameWriter.save()

1. save saves the result to the input path.

/**
 * Saves the content of the [[DataFrame]] at the specified path.
 *
 * @since 1.4.0
 */
def save(path: String): Unit = {
  this.extraOptions += ("path" -> path)
  save()
}

2. Trace the save method.

/**
 * Saves the content of the [[DataFrame]] as the specified table.
 *
 * @since 1.4.0
 */
def save(): Unit = {
  ResolvedDataSource(
    df.sqlContext,
    source,
    partitioningColumns.map(_.toArray).getOrElse(Array.empty[String]),
    mode,
    extraOptions.toMap,
    df)
}
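The partitioningColumns passed to ResolvedDataSource come from DataFrameWriter.partitionBy. The sketch below shows how it is typically combined with mode and save; df, the column name, and the path are hypothetical placeholders.

df.write
  .format("parquet")
  .partitionBy("age")            // becomes the partitioningColumns seen above
  .mode(SaveMode.Overwrite)
  .save("/output/people_by_age")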

3. Here source defaults to the defaultDataSourceName of SQLConf:

private var source: String = df.sqlContext.conf.defaultDataSourceName

The default value of DEFAULT_DATA_SOURCE_NAME is parquet.

// This is used to set the default data source
val DEFAULT_DATA_SOURCE_NAME = stringConf("spark.sql.sources.default",
  defaultValue = Some("org.apache.spark.sql.parquet"),
  doc = "The default data source to use in input/output.")
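Because the default is parquet, load and save calls that do not specify format() read and write parquet. The default can be changed through the spark.sql.sources.default configuration; a minimal sketch assuming an existing sqlContext, with hypothetical paths.

// change the default data source from parquet to json for this SQLContext
sqlContext.setConf("spark.sql.sources.default", "json")
val df = sqlContext.read.load("/data/people.json")   // parsed as json, no format() needed
df.write.save("/output/people_copy")                 // written as json as well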

Some functions in DataFrame.scala are described as follows:

1. The toDF function: on a DataFrame it simply returns the object itself; the toDF that actually converts an RDD to a DataFrame comes from the implicits in SQLContext.

/**
 * Returns the object itself.
 * @group basic
 * @since 1.3.0
 */
// This is declared with parentheses to prevent the Scala compiler from treating
// `rdd.toDF("1")` as invoking this toDF and then apply on the returned DataFrame.
def toDF(): DataFrame = this
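A minimal Scala sketch of converting an RDD into a DataFrame via the implicits on SQLContext; sc and sqlContext are assumed to exist, and the data and column names are hypothetical.

import sqlContext.implicits._   // brings the implicit toDF conversion into scope

val rdd = sc.parallelize(Seq(("Andy", 30), ("Justin", 19)))
val peopleDF = rdd.toDF("name", "age")   // RDD[(String, Int)] -> DataFrame with two columns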

2. The show() method displays the result.

/**
 * Displays the [[DataFrame]] in a tabular form. For example:
 * {{{
 *   year  month  AVG('Adj Close)  MAX('Adj Close)
 *   1980  12     0.503218         0.595103
 *   1981  01     0.523289         0.570307
 *   1982  02     0.436504         0.475256
 *   1983  03     0.410516         0.442194
 *   1984  04     0.450090         0.483521
 * }}}
 * @param numRows Number of rows to show
 * @param truncate Whether truncate long strings. If true, strings more than 20 characters will
 *                 be truncated and all cells will be aligned right
 *
 * @group action
 * @since 1.5.0
 */
// scalastyle:off println
def show(numRows: Int, truncate: Boolean): Unit = println(showString(numRows, truncate))
// scalastyle:on println

Tracing into showString: an action is triggered inside showString to collect the data (via take).

/**
 * Compose the string representing rows for output
 * @param _numRows Number of rows to show
 * @param truncate Whether truncate long strings and align cells right
 */
private[sql] def showString(_numRows: Int, truncate: Boolean = true): String = {
  val numRows = _numRows.max(0)
  val sb = new StringBuilder
  val takeResult = take(numRows + 1)
  val hasMoreData = takeResult.length > numRows
  val data = takeResult.take(numRows)
  val numCols = schema.fieldNames.length
  // ... (the remainder of the method formats these rows into the printed table)
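Because showString calls take(), invoking show launches a Spark job to collect the rows. Typical usage, assuming df is an existing DataFrame:

df.show()          // first 20 rows, long cells truncated to 20 characters
df.show(5, false)  // first 5 rows, no truncation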

The above is all the content of this article. I hope it is helpful for your learning.
