Spark SQL and DataFrame Guide (1.4.1) -- Data Sources


Data Sources

Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on as a normal RDD, or it can be registered as a temporary table.

1. General-Purpose Load/Save Functions
In the simplest form, the default data source is used for all operations; it can be changed with the spark.sql.sources.default property (parquet by default).
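A minimal sketch, assuming the users.parquet sample file that ships with the Spark distribution; saving the selected columns with the default data source produces the namesAndFavColors.parquet output referenced below:

df = sqlContext.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")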

Afterwards, we can run hadoop fs -ls /user/hadoopuser/ and find the namesAndFavColors.parquet file in that directory.

    • Manually Specifying Options
      The data source can also be specified manually using its fully qualified name (for example, org.apache.spark.sql.parquet); for built-in data sources you can also use the short names (json, parquet, jdbc).
df = sqlContext.read.load("examples/src/main/resources/people.json", format="json")
df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")

    • Save Modes
      Save operations can optionally take a SaveMode, which specifies how data that already exists should be handled. These save modes do not use any locking and are not atomic, so it is not safe for multiple writers to write to the same location at the same time. In addition, when an Overwrite is performed, the existing data is deleted before the new data is written. The available modes are listed below, followed by a short usage sketch.
Scala/Java    Python    Meaning
SaveMode.ErrorIfExists (default)    "error" (default)    When saving a DataFrame to a data source, an error is thrown if the data already exists.
SaveMode.Append    "append"    When saving a DataFrame to a data source, if the data/table already exists, the contents of the DataFrame are appended to the existing data.
SaveMode.Overwrite    "overwrite"    Overwrite mode means that when saving a DataFrame to a data source, if the data/table already exists, the existing data is overwritten by the contents of the DataFrame.
SaveMode.Ignore    "ignore"    Ignore mode means that when saving a DataFrame to a data source, if the data/table already exists, the contents of the DataFrame are not saved and the existing data is left unchanged; this is similar to CREATE TABLE IF NOT EXISTS in SQL.
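A minimal sketch of specifying a save mode in Python, reusing the DataFrame and output path from the earlier example:

# Append to the existing data instead of failing when the target already exists.
df.write.mode("append").save("namesAndAges.parquet", format="parquet")
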
    • Saving to Persistent Tables
      When working with a HiveContext, DataFrames can be saved as persistent tables with the saveAsTable command, instead of registerTempTable. saveAsTable materializes the contents of the DataFrame and creates a pointer to the data in the Hive metastore. As long as you connect to the same metastore, the persistent table will still exist after your Spark program restarts.
      A DataFrame for a persistent table can be created by calling the table method on a SQLContext, with the name of the table as the parameter. Here we first use a temporary table to simulate this and compare the two results; a sketch follows this list.

      The comparison clearly evaluates to True.
      By default, saveAsTable creates a "managed table", which means that the location of the data is controlled by the metastore. Managed tables also have their data deleted automatically when the table is dropped.
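
A minimal sketch of this flow, assuming a HiveContext built from the existing SparkContext and hypothetical table names people_temp and people_saved:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
df = sqlContext.read.json("examples/src/main/resources/people.json")

# Register a temporary table and also persist the data into the metastore.
df.registerTempTable("people_temp")
df.write.saveAsTable("people_saved")

# table() returns a DataFrame for either kind of table.
temp_df = sqlContext.table("people_temp")
saved_df = sqlContext.table("people_saved")

# Both contain the same rows, so the comparison is True.
print(sorted(temp_df.collect()) == sorted(saved_df.collect()))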

2. Parquet Files
Parquet is a columnar storage format. Spark SQL supports reading and writing Parquet files and automatically preserves the schema of the original data.

    • Loading Data Programmatically
# sqlContext from the previous example is used in this example.
# schemaPeople is the DataFrame from the previous example.

# DataFrames can be saved as Parquet files, maintaining the schema information.
schemaPeople.write.parquet("people.parquet")

# Read in the Parquet file created above. Parquet files are self-describing, so the schema is preserved.
# The result of loading a Parquet file is also a DataFrame.
parquetFile = sqlContext.read.parquet("people.parquet")

# Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
    print(teenName)
    • Partition Discovery
      Partitioning tables is a common optimization approach in systems like Hive. In a partitioned table, the data is stored in different directories, with the values of the partitioning columns encoded in the path of each partition directory. The Parquet data source can automatically discover and infer the partitioning information. For example, we can store the previously used data in a partitioned table using the following directory structure, with two extra columns, gender and country, as the partitioning columns:
path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...

By passing path/to/table to either SQLContext.read.parquet or SQLContext.read.load, Spark SQL automatically extracts the partition information from the paths. The schema of the returned DataFrame becomes:

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)

Note: the data types of the partitioning columns are automatically inferred; currently, only string and numeric types are supported.
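
A minimal sketch of reading such a partitioned table (path/to/table is the hypothetical layout shown above):

# gender and country are discovered from the directory names and appended
# to the schema read from the Parquet files themselves.
df = sqlContext.read.parquet("path/to/table")
df.printSchema()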

    • Schema Merging
      Like ProtocolBuffer, Avro, and Thrift, Parquet supports schema evolution. Users can start with a simple schema and gradually add more columns as needed. In this way, they may end up with multiple Parquet files that have different but mutually compatible schemas. The Parquet data source can now automatically detect this case and merge the schemas of all the files.
# sqlContext from the previous example is used in this example.
from pyspark.sql import Row

# Create a simple DataFrame, stored into a partition directory
df1 = sqlContext.createDataFrame(sc.parallelize(range(1, 6))
                                 .map(lambda i: Row(single=i, double=i * 2)))
df1.save("data/test_table/key=1", "parquet")

# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing column
df2 = sqlContext.createDataFrame(sc.parallelize(range(6, 11))
                                 .map(lambda i: Row(single=i, triple=i * 3)))
df2.save("data/test_table/key=2", "parquet")

# Read the partitioned table
df3 = sqlContext.load("data/test_table", "parquet")
df3.printSchema()

# The final schema consists of all 3 columns in the Parquet files together
# with the partitioning column that appears in the partition directory paths.
# root
# |-- single: int (nullable = true)
# |-- double: int (nullable = true)
# |-- triple: int (nullable = true)
# |-- key: int (nullable = true)
    • Configuration
      Parquet can be configured using the setConf method on SQLContext or by running SET key=value commands in SQL. A short sketch follows the table below.
Property Name    Default Value    Meaning
spark.sql.parquet.binaryAsString    false    Some systems that produce Parquet, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.
spark.sql.parquet.int96AsTimestamp    true    Some systems that produce Parquet, in particular Impala, store timestamps as INT96. Spark also stores timestamps as INT96 to avoid losing nanosecond precision. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
spark.sql.parquet.cacheMetadata    true    Turns on caching of Parquet schema metadata. Can speed up queries over static data.
spark.sql.parquet.compression.codec    gzip    Sets the compression codec used when writing Parquet files. Can be set to: uncompressed, snappy, gzip, lzo.
spark.sql.parquet.filterPushdown    false    Turns on the Parquet filter pushdown optimization. It is off by default because of a known bug in Parquet 1.6.0rc3 (PARQUET-136). However, if your table contains no nullable string or binary columns, it is still safe to turn it on.
spark.sql.parquet.convertMetastoreParquet    true    When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support.
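
A minimal sketch of both configuration styles, using keys from the table above:

# Programmatically, through the SQLContext:
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

# Or with a SQL SET command:
sqlContext.sql("SET spark.sql.parquet.compression.codec=snappy")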
