Spark SQL: Loading and Saving of the Parquet Data Source


1. The general load and save operations

For Spark SQL DataFrames, there are common load and save operations, regardless of the data source a DataFrame was created from. The load operation is used to load data and create a DataFrame; the save operation is used to save the data in a DataFrame to a file.

Java version
DataFrame df = sqlContext.read().load("users.parquet");
df.select("name", "favorite_color").write().save("namesAndFavColors.parquet");

Scala version
val df = sqlContext.read.load("users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

Java version:

package swy.study.spark.sql;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

/**
 * General load and save operations
 * @author swy
 */
public class GenericLoadSave {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("GenericLoadSave");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load users.parquet from HDFS; parquet is the default source type
        DataFrame usersDF = sqlContext.read().load("hdfs://spark1:9000/users.parquet");

        // Select two columns and save them to a new Parquet file
        usersDF.select("name", "favorite_color").write()
                .save("hdfs://spark1:9000/namesAndFavColors.parquet");
    }
}

Manually specifying a data source type
You can also manually specify the type of data source to use for an operation. A data source is usually specified by its fully qualified name; for Parquet this is org.apache.spark.sql.parquet. For the data source types built into Spark SQL, such as json, parquet, and jdbc, a short name is enough. With this feature, you can also convert between different types of data sources, for example saving the data in a JSON file to a Parquet file. If you do not specify a data source type, the default is parquet (illustrated after the examples below).

Java version
DataFrame df = sqlContext.read().format("json").load("people.json");
df.select("name", "age").write().format("parquet").save("namesAndAges.parquet");

Scala version
val df = sqlContext.read.format("json").load("people.json")
df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

2. The Parquet data source: loading data programmatically
Parquet is a columnar storage format for the analytics ecosystem. Developed by Twitter and Cloudera, it graduated from the Apache Incubator in May 2015 to become a top-level Apache project.

What are the advantages of columnar storage over row-oriented storage?
1. Data that does not match the query can be skipped, so only the required data is read, reducing the amount of I/O.
2. Compression encodings can reduce disk storage space. Because all values in a column share the same data type, more efficient compression encodings such as Run Length Encoding and Delta Encoding can save further storage space (a toy illustration follows this list).
3. Reading only the required columns also enables vectorized operations and yields better scan performance.
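To make point 2 concrete, here is a toy run-length encoder in Scala. It is only an illustration of the idea, not Parquet's actual implementation: a column whose values repeat in long runs, as sorted columnar data often does, collapses into a short list of (value, run length) pairs.

// Toy run-length encoding, illustration only (not Parquet's implementation):
// Seq("CN","CN","CN","US","US","US","US","DE")
// becomes List(("CN",3), ("US",4), ("DE",1)) -- 8 stored values shrink to 3 runs.
def runLengthEncode[A](column: Seq[A]): List[(A, Int)] =
  column.foldLeft(List.empty[(A, Int)]) {
    case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest // extend the current run
    case (acc, x)                      => (x, 1) :: acc      // start a new run
  }.reverse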

Example: query the users' names from the user data.

package swy.study.spark.sql;

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class ParquetLoadData {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ParquetLoadData");
                // .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the Parquet file into a DataFrame
        DataFrame userDF = sqlContext.read().parquet(
                // "E://swy//resource//workspace-neno//spark-study-java//txt//users.parquet");
                "hdfs://spark1:9000/spark-study/users.parquet");

        // Register the DataFrame as a temporary table so it can be queried with SQL
        userDF.registerTempTable("users");
        DataFrame nameDF = sqlContext.sql("select name from users");

        // Collect the names back to the driver
        List<String> names = nameDF.javaRDD().map(new Function<Row, String>() {

            private static final long serialVersionUID = 1L;

            public String call(Row row) throws Exception {
                return "name: " + row.getString(0);
            }

        }).collect();

        for (String s : names) {
            System.out.println(s);
        }
    }
}
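The article shows only the Java version of this example. For symmetry with the earlier sections, a Scala equivalent might look like the following sketch, assuming the same HDFS path as above:

package swy.study.spark.sql

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetLoadData {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ParquetLoadData")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Load the Parquet file into a DataFrame
    val userDF = sqlContext.read.parquet("hdfs://spark1:9000/spark-study/users.parquet")

    // Register a temporary table and query it with SQL
    userDF.registerTempTable("users")
    val nameDF = sqlContext.sql("select name from users")

    // Collect the names back to the driver and print them
    nameDF.rdd.map(row => "name: " + row.getString(0)).collect().foreach(println)
  }
}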
