1. Generic load and save operations
Spark SQL provides generic load and save operations that work on DataFrames created from any data source. The load operation reads data from a source and creates a DataFrame; the save operation writes the data in a DataFrame out to storage, such as a file.
Java version
DataFrame df = sqlContext.read().load("users.parquet");
df.select("name", "favorite_color").write().save("namesAndFavColors.parquet");
Scala version
val df = sqlContext.read.load("users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
A complete Java example:
package swy.study.spark.sql;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

/**
 * Generic load and save operations
 * @author swy
 */
public class GenericLoadSave {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("GenericLoadSave");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // load() with no format specified defaults to Parquet
        DataFrame usersDF = sqlContext.read().load("hdfs://spark1:9000/users.parquet");

        // Select two columns and save them back to HDFS as Parquet
        usersDF.select("name", "favorite_color").write()
                .save("hdfs://spark1:9000/namesAndFavColors.parquet");
    }
}
Manually specifying a data source type
You can also manually specify the data source type to use for the operation. A data source normally needs to be specified by its fully qualified name; for Parquet, that is org.apache.spark.sql.parquet. However, Spark SQL has built-in short names for common data source types, such as json, parquet, and jdbc. In fact, this feature lets you convert between data source types: for example, data read from a JSON file can be saved to a Parquet file. By default, if no data source type is specified, Parquet is used.
Java version
DataFrame df = sqlContext.read().format("json").load("people.json");
df.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
Scala version
val df = sqlContext.read.format("json").load("people.json")
df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
2. Data source Parquet: loading data programmatically
Parquet is a columnar storage format designed for analytic workloads. Developed by Twitter and Cloudera, it graduated from the Apache Incubator in May 2015 to become a top-level Apache project.
What are the advantages of columnar storage over row-oriented storage?
1. It can skip data that does not match the query and read only the required data, reducing the amount of I/O.
2. Compression encodings reduce disk storage space. Because all values in a column share the same data type, more space-efficient encodings such as run-length encoding and delta encoding can be applied (see the toy sketch after this list).
3. Reading only the required columns enables vectorized operations and yields better scan performance.
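To make point 2 concrete, here is a toy run-length encoder in Java. It is illustrative only, not Parquet's actual codec; the class name RunLengthDemo and the sample ages column are made up for this sketch.

import java.util.ArrayList;
import java.util.List;

// Toy run-length encoder: illustrative only, not Parquet's real implementation.
public class RunLengthDemo {

    // Encode a column of values as (value, runLength) pairs.
    static List<int[]> encode(int[] column) {
        List<int[]> runs = new ArrayList<int[]>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j] == column[i]) {
                j++;
            }
            runs.add(new int[] { column[i], j - i });
            i = j;
        }
        return runs;
    }

    public static void main(String[] args) {
        // A sorted or low-cardinality column compresses extremely well.
        int[] ages = { 25, 25, 25, 25, 30, 30, 42, 42, 42 };
        for (int[] run : encode(ages)) {
            System.out.println("value=" + run[0] + " count=" + run[1]);
        }
        // 9 stored values shrink to 3 (value, count) pairs.
    }
}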
Example: query the user names from the users data.
package swy.study.spark.sql;

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class ParquetLoadData {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ParquetLoadData");
        // .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the Parquet file as a DataFrame
        DataFrame userDF = sqlContext.read().parquet(
                // "E://swy//resource//workspace-neno//spark-study-java//txt//users.parquet");
                "hdfs://spark1:9000/spark-study/users.parquet");

        // Register the DataFrame as a temporary table so it can be queried with SQL
        userDF.registerTempTable("users");
        DataFrame nameDF = sqlContext.sql("select name from users");

        // Convert the result rows to strings and collect them to the driver
        List<String> names = nameDF.javaRDD().map(new Function<Row, String>() {

            private static final long serialVersionUID = 1L;

            public String call(Row row) throws Exception {
                return "name: " + row.getString(0);
            }
        }).collect();

        for (String s : names) {
            System.out.println(s);
        }
    }
}
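For comparison, the same query can be written with the DataFrame API directly, without registering a temporary table or building a SQL string. This is a minimal sketch against the same Spark 1.x API, assuming the sqlContext and HDFS path from the example above.

// Same query via the DataFrame API, no temp table or SQL string needed.
DataFrame userDF = sqlContext.read().parquet("hdfs://spark1:9000/spark-study/users.parquet");
for (Row row : userDF.select("name").collectAsList()) {
    System.out.println("name: " + row.getString(0));
}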