Spark SQL Read-write method


DataFrame: an RDD with named columns

First, we know that the purpose of Spark SQL is to manipulate an RDD with SQL statements, similar to Hive. The core structure of Spark SQL is the DataFrame: if we know the fields inside the RDD and their data types, it behaves like a table in a relational database, and we can write SQL against it. So here we do not really program with object-oriented thinking; the better approach is to abstract the data as a table and then manipulate it with SQL statements.

DataFrame storage: a DataFrame is stored in a form similar to a database table. A database table consists of several parts: (1) the data itself, stored row by row, one record per row; (2) the data dictionary, i.e. metadata such as the table name, the field names, and the field types. A DataFrame is likewise stored as rows (the Row type), one row of data after another. In general the processing granularity is the row, and it is rarely necessary to manipulate the data inside a row individually.
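As a small illustration (a sketch only; it assumes a DataFrame named df has already been created, for example from JSON as in the code later in this article), the schema plays the role of the data dictionary and the data itself comes back as Row objects:

    // Sketch: inspect the schema (the "data dictionary") and the row-oriented data of an
    // existing DataFrame `df` (assumed to exist, e.g. from sqlContext.read.json(...)).
    df.printSchema()                                        // field names and types
    val rows: Array[org.apache.spark.sql.Row] = df.collect() // each record is one Row
    rows.foreach(r => println(r.getString(0)))              // per-row access is possible (assumes the first field is a String) but rarely needed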

Second, the program entry point of Spark SQL:

Before Spark 2.0 there were two concepts, SQLContext and HiveContext. Because the distinction between them was confusing, they were unified in Spark 2.0 as SparkSession. In addition, SparkSession also encapsulates SparkConf and SparkContext.
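A small sketch of what that encapsulation means in practice (the application name here is made up): both the underlying SparkContext and the runtime configuration are reachable from the SparkSession.

    // Sketch (Spark 2.x): SparkSession wraps the SparkContext and the configuration
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("entry-point-demo")   // illustrative name
      .master("local")
      .getOrCreate()

    val sc = spark.sparkContext                             // the underlying SparkContext
    spark.conf.set("spark.sql.shuffle.partitions", "4")     // runtime SQL configuration
    println(spark.conf.get("spark.sql.shuffle.partitions"))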

It is important to note that Hive has many dependency packages, and these dependencies are not included in the default Spark distribution. If the Hive dependency packages can be found on the classpath, Spark will load them automatically. These Hive dependencies must be present on all worker nodes, because the workers invoke Hive's serialization and deserialization (SerDes) packages in order to access the data stored in Hive. The Hive configuration files hive-site.xml, core-site.xml (security configuration), and hdfs-site.xml (HDFS configuration) are placed in the conf directory.
When using Hive, a Hive-enabled SparkSession must be initialized; users can still use Hive features even if no Hive deployment exists in the environment. When hive-site.xml is not configured, Spark automatically creates metastore_db in the current application directory and creates the directory configured by spark.sql.warehouse.dir; if that is not configured either, the default is the spark-warehouse directory under the current application directory.

Note: starting with the Spark 2.0.0 release, the hive.metastore.warehouse.dir property in hive-site.xml is superseded by spark.sql.warehouse.dir, which specifies the default location of the warehouse (it must be writable).
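A quick way to check which warehouse location is actually in effect (a sketch; it assumes a SparkSession named spark, such as the one built just below) is to read the property back at runtime:

    // Sketch: read the effective warehouse location; with nothing configured it falls back
    // to the spark-warehouse directory under the application directory.
    println(spark.conf.get("spark.sql.warehouse.dir"))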

So when Spark SQL needs to interact with Hive, Hive support must be enabled explicitly:

    val conf = new SparkConf().setAppName(s"${this.getClass.getSimpleName}")
    val spark = SparkSession.builder()
      .config(conf)
      .config("spark.sql.warehouse.dir", "hdfs://hadoop1:9000/user/hive/warehouse")
      .enableHiveSupport()
      .getOrCreate()
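Once Hive support is enabled, tables registered in the Hive metastore can be queried directly with SQL. Illustrative usage only (the table name person is assumed to exist in the metastore):

    spark.sql("show tables").show()
    spark.sql("select * from person where age > 21").show()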

Back to the point: the program entry point.

Version 1.6:

    val conf = new SparkConf()
    conf.setAppName(s"${this.getClass.getSimpleName}").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

Version 2.0:

In Spark 2.0 the program entry point is reduced to a single statement:

    val sparkSession = SparkSession.builder()
      .appName(s"${this.getClass.getSimpleName}")
      .master("local")
      .getOrCreate()
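For comparison, here is a sketch of creating a DataFrame through this 2.0 entry point: the implicits now come from the SparkSession rather than from a SQLContext. The file path and the column names are only illustrative.

    // Sketch: DataFrame creation through SparkSession (Spark 2.x)
    import sparkSession.implicits._

    val df = sparkSession.sparkContext
      .textFile("c:\\users\\wangyongxiang\\desktop\\plan\\person.txt")
      .map(_.split(","))
      .map(p => (p(0), p(1).trim.toInt))
      .toDF("name", "age")
    df.show()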

The difference between the two versions: one obtains a SQLContext (or HiveContext), the other obtains a SparkSession.

Third, let's put it all together.

    import java.util.Properties

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row, SQLContext, SaveMode}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    case class Person(var name: String, var age: Int)

    object Test {
      def main(args: Array[String]): Unit = {
        // 1.6-style entry point
        val conf = new SparkConf()
        conf.setAppName(s"${this.getClass.getSimpleName}").setMaster("local")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

        // The first way to create a DataFrame: read a self-describing, column-oriented
        // format (JSON here) directly into a DataFrame
        val df: DataFrame = sqlContext.read.json("")

        // The second way: an RDD has no toDF() method by itself, so an implicit conversion
        // is required; split each line, map it to the Person case class, then call toDF()
        import sqlContext.implicits._
        val df2: DataFrame = sc.textFile("c:\\users\\wangyongxiang\\desktop\\plan\\person.txt")
          .map(_.split(","))
          .map(p => Person(p(0), p(1).trim.toInt))
          .toDF()

        // Another form of the second way: createDataFrame() on the SQLContext (or SparkSession),
        // which is essentially identical to toDF()
        val rdd: RDD[Person] = sc.textFile("c:\\users\\wangyongxiang\\desktop\\plan\\person.txt")
          .map(_.split(","))
          .map(p => Person(p(0), p(1).trim.toInt))
        val df3: DataFrame = sqlContext.createDataFrame(rdd)

        // The third way: build an RDD[Row], then supply a schema describing its structure
        val lines = sc.textFile("c:\\users\\wangyongxiang\\desktop\\plan\\person.txt")
        val rowRDD: RDD[Row] = lines.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
        val schema = StructType(
          StructField("name", StringType, true) ::
          StructField("age", IntegerType, true) :: Nil)
        val df4: DataFrame = sqlContext.createDataFrame(rowRDD, schema)

        // From here you can register a temporary table and query it, or interoperate with MySQL.
        // With a HiveContext the table can be created directly in Hive and the data loaded into
        // it, so it can be queried without a temporary table; Hive integration is covered later.
        df4.registerTempTable("person")
        sqlContext.sql("select * from person where age > 21").show()

        // Save the result to a MySQL table via JDBC. Note: use the key "user" rather than
        // "username", because the system already has a "user.name" property that would
        // otherwise override your user name.
        val properties = new Properties()
        properties.put("user", "root")
        properties.put("password", "root")
        df4.write.mode(SaveMode.Overwrite).jdbc("jdbc:mysql://localhost:3306/test", "test", properties)
      }
    }
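For reference, in the 2.0 API the temporary-table step is usually written with createOrReplaceTempView, since registerTempTable was deprecated in Spark 2.0. A short sketch, reusing df4 from above and a SparkSession named sparkSession:

    // Sketch: the Spark 2.0 equivalent of the temporary-table step above
    df4.createOrReplaceTempView("person")
    sparkSession.sql("select * from person where age > 21").show()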

Fourth, load and save operations.

    object SaveAndLoadTest {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("").setMaster("local")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

        // read / load: reading data
        sqlContext.read.json("")
        // sqlContext.read.jdbc("url", "table", properties)
        sqlContext.read.load("parquet path")
        sqlContext.read.format("json").load("path")
        val df: DataFrame = sqlContext.read.format("parquet").load("path")

        // write / save: saving data
        df.write.parquet("path.parquet")
        df.write.json("path.json")
        // df.write.jdbc("url", "table", properties)
        df.write.format("parquet").save("path.parquet")
        df.write.format("json").save("path.json")

        // The save mode can be Overwrite, Append, and so on
        df.write.mode(SaveMode.Overwrite).save("")
      }
    }

My personal understanding is that read and load both perform reading, and write and save both perform saving. With the code above we can convert between file formats, turning less efficient formats into Parquet, the file format natively supported by Spark SQL.
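As a concrete sketch of that conversion (reusing the sqlContext from the example above; the paths are illustrative), a JSON file can be rewritten as Parquet like this:

    // Sketch: convert a less efficient format (JSON) into Parquet
    val jsonDF = sqlContext.read.format("json").load("c:\\data\\input.json")
    jsonDF.write.mode(SaveMode.Overwrite).format("parquet").save("c:\\data\\output.parquet")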
