1. Generic load and save operations
Spark SQL provides generic load and save operations that work on DataFrames created from any data source. The load operation reads data from a source and creates a DataFrame; the save operation writes the data in a DataFrame out to storage, such as a file.
Java version
DataFrame df = sqlContext.read().load("users.parquet");
df.select("name", "favorite_color").write().save("namesAndFavColors.parquet");
Scala version
val df = sqlContext.read.load("users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
A complete Java example:
package swy.study.spark.sql;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

/**
 * Generic load and save operations
 * @author swy
 */
public class GenericLoadSave {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("GenericLoadSave");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // load() with no format specified defaults to Parquet
        DataFrame usersDF = sqlContext.read().load("hdfs://spark1:9000/users.parquet");

        // Select two columns and save them back to HDFS as Parquet
        usersDF.select("name", "favorite_color").write()
                .save("hdfs://spark1:9000/namesAndFavColors.parquet");
    }
}
Manually specifying a data source type
You can also manually specify the data source type to use for the operation. A data source normally needs to be specified by its fully qualified name; for Parquet, that is org.apache.spark.sql.parquet. However, Spark SQL has built-in short names for common data source types, such as json, parquet, and jdbc. In fact, this feature lets you convert between data source types: for example, data read from a JSON file can be saved to a Parquet file. By default, if no data source type is specified, Parquet is used.
Java version
DataFrame df = sqlContext.read().format("json").load("people.json");
df.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
Scala version
val df = sqlContext.read.format("json").load("people.json")
df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
2. Data source Parquet: loading data programmatically
Parquet is a columnar storage format designed for analytic workloads. Developed by Twitter and Cloudera, it graduated from the Apache Incubator in May 2015 to become a top-level Apache project.
What are the advantages of columnar storage over row-oriented storage?
1. It can skip data that does not match the query and read only the required data, reducing the amount of I/O.
2. Compression encodings reduce disk storage space. Because all values in a column share the same data type, more space-efficient encodings such as run-length encoding and delta encoding can be applied (see the toy sketch after this list).
3. Reading only the required columns enables vectorized operations and yields better scan performance.
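To make point 2 concrete, here is a toy run-length encoder in Java. It is illustrative only, not Parquet's actual codec; the class name RunLengthDemo and the sample ages column are made up for this sketch.

import java.util.ArrayList;
import java.util.List;

// Toy run-length encoder: illustrative only, not Parquet's real implementation.
public class RunLengthDemo {

    // Encode a column of values as (value, runLength) pairs.
    static List<int[]> encode(int[] column) {
        List<int[]> runs = new ArrayList<int[]>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j] == column[i]) {
                j++;
            }
            runs.add(new int[] { column[i], j - i });
            i = j;
        }
        return runs;
    }

    public static void main(String[] args) {
        // A sorted or low-cardinality column compresses extremely well.
        int[] ages = { 25, 25, 25, 25, 30, 30, 42, 42, 42 };
        for (int[] run : encode(ages)) {
            System.out.println("value=" + run[0] + " count=" + run[1]);
        }
        // 9 stored values shrink to 3 (value, count) pairs.
    }
}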
Example: query the user names from the users data.
package swy.study.spark.sql;

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class ParquetLoadData {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ParquetLoadData");
        // .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the Parquet file as a DataFrame
        DataFrame userDF = sqlContext.read().parquet(
                // "E://swy//resource//workspace-neno//spark-study-java//txt//users.parquet");
                "hdfs://spark1:9000/spark-study/users.parquet");

        // Register the DataFrame as a temporary table so it can be queried with SQL
        userDF.registerTempTable("users");
        DataFrame nameDF = sqlContext.sql("select name from users");

        // Convert the result rows to strings and collect them to the driver
        List<String> names = nameDF.javaRDD().map(new Function<Row, String>() {

            private static final long serialVersionUID = 1L;

            public String call(Row row) throws Exception {
                return "name: " + row.getString(0);
            }
        }).collect();

        for (String s : names) {
            System.out.println(s);
        }
    }
}
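For comparison, the same query can be written with the DataFrame API directly, without registering a temporary table or building a SQL string. This is a minimal sketch against the same Spark 1.x API, assuming the sqlContext and HDFS path from the example above.

// Same query via the DataFrame API, no temp table or SQL string needed.
DataFrame userDF = sqlContext.read().parquet("hdfs://spark1:9000/spark-study/users.parquet");
for (Row row : userDF.select("name").collectAsList()) {
    System.out.println("name: " + row.getString(0));
}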