Converting between Spark DataFrames and databases


In Spark, a DataFrame can loosely be thought of as a text file held in memory.

Working with one is as simple as working with TXT, CSV, and JSON files on your computer.

val sparkConf = new SparkConf().setAppName("df2db").setMaster("local[1]")
val sc = new SparkContext(sparkConf)
val sqlContext: SQLContext = new SQLContext(sc)
val df = sqlContext.read.format("csv").option("header", "true").load("D:\\spark test\\123")
val snapTable = "env0y"
df.registerTempTable(snapTable)

The few lines above convert a CSV file into a DataFrame and register it as a temporary table. From this point on, you can query snapTable just as you would a database table:

val sql = "SELECT * FROM " + snapTable
val dfTmp = sqlContext.sql(sql)
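Any SQL the engine supports can be run against the temporary table the same way; the quick row count below is just an illustrative sanity check, not part of the original pipeline:

val rowCount = sqlContext.sql("SELECT COUNT(*) AS cnt FROM " + snapTable)
rowCount.show()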

This code is easy to write, but unfortunately the DataFrame exists only in memory, and our business code outputs only the final results of the algorithm.

In other words, when a result is wrong, it is hard to tell whether the DataFrame's underlying data is bad or the SQL was written incorrectly ...

What if you could inspect the DataFrame from anywhere? Here is how.

Save the DataFrame to a real database:

import java.util.Properties

val connectionUrl = "jdbc:sqlserver://10.119.46.153:1433"
val table = "Nettransmit.dbo.df2mssql"
val prop = new Properties()
prop.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
prop.put("user", "sa")
prop.put("password", "ibas.1597")
df.write.mode(SaveMode.Overwrite).jdbc(connectionUrl, table, prop)

Now, if a calculation goes wrong, we can do some ad-hoc debugging simply by connecting to the SQL Server database and inspecting the table.

The only caveat: writing the DataFrame to a database is not a business requirement here, so code like this belongs only in development or test builds and should not appear in an official release. A sketch of how it might be gated is below.
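As a minimal sketch, assuming a hypothetical df2db.debug system property (not from the original article), the debug dump could be gated like this:

// Hypothetical switch; a real project might read it from build config instead.
val debugDump = sys.props.getOrElse("df2db.debug", "false").toBoolean
if (debugDump) {
  // Only development/test runs dump the intermediate DataFrame.
  df.write.mode(SaveMode.Overwrite).jdbc(connectionUrl, table, prop)
}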

Since you can write it, naturally you can read it back:

SQL Server to DataFrame:

val dfViaTable = sqlContext.read.jdbc(connectionUrl, table, prop)
dfViaTable.show(10)
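A quick round-trip check is also possible; this is merely illustrative, since equal counts alone do not prove the contents match:

// Compare the in-memory DataFrame with what was read back from the database.
assert(df.count() == dfViaTable.count())
dfViaTable.printSchema()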

Above, the minimal interaction between a DataFrame and a database is complete. But if the business genuinely needs to read and write databases, performance can become a bottleneck and deserves attention; one common mitigation, partitioning the JDBC read, is sketched below.
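This sketch assumes the table has a numeric column named id whose values span roughly 0 to 1,000,000; both the column and the bounds are assumptions for illustration. Spark then issues one query per partition instead of a single large scan:

// Partitioned JDBC read: 8 parallel queries over ranges of the "id" column.
val dfPartitioned = sqlContext.read.jdbc(
  connectionUrl,
  table,
  "id",      // partition column (assumed to exist)
  0L,        // lower bound of "id" (assumed)
  1000000L,  // upper bound of "id" (assumed)
  8,         // number of partitions
  prop)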

Next, a small optimization of the schema types.

After loading the CSV into a DataFrame, the df.printSchema() statement prints output similar to the following in the console:

root
 |-- imsi: string (nullable = true)
 |-- usertype: string (nullable = true)
 |-- Total PS Traffic (KB): string (nullable = true)
 |-- Total Online time (s): string (nullable = true)
 |-- Total CS Traffic (ERL): string (nullable = true)
 |-- brand: string (nullable = true)
 |-- series: string (nullable = true)
 |-- os: string (nullable = true)
 |-- type: string (nullable = true)
 |-- FDD lte: string (nullable = true)
 |-- TDD lte: string (nullable = true)
 |-- only 3G capability: string (nullable = true)

That is, every field is inferred as a string, so every column is written to the database as a string type, which is clearly wasteful: many of these values would fit an int, double, or boolean.

What do we do? Register a custom "dialect" for the database, much as std::locale in C++ establishes local rules.

For convenience, encapsulate it:

import java.io.{File, FileInputStream}
import java.util.Properties

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, SaveMode}

/**
  * Created by env0y on 2017/11/24.
  */
object DataFrame2DB {
  def df2db(df: DataFrame, table: String, properties: String) = {
    try {
      // Load JDBC connection settings from the given properties file.
      val is = new FileInputStream(new File(properties))
      val prop = new Properties()
      prop.load(is)
      val url = String.valueOf(prop.get("url"))
      // Register the custom dialect before writing so Spark maps types with it.
      JdbcDialects.registerDialect(SQLServerDialect)
      df.write.mode(SaveMode.Overwrite).jdbc(url, table, prop)
      is.close()
    }
  }

  val SQLServerDialect = new JdbcDialect {
    override def canHandle(url: String): Boolean =
      url.startsWith("jdbc:jtds:sqlserver") || url.contains("sqlserver")

    // Map each Spark SQL type to the column type used in the generated DDL.
    override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
      case StringType => Some(JdbcType("NVARCHAR(128)", java.sql.Types.VARCHAR))
      case BooleanType => Some(JdbcType("BIT(1)", java.sql.Types.BIT))
      case IntegerType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
      case LongType => Some(JdbcType("BIGINT", java.sql.Types.BIGINT))
      case DoubleType => Some(JdbcType("DOUBLE PRECISION", java.sql.Types.DOUBLE))
      case FloatType => Some(JdbcType("REAL", java.sql.Types.REAL))
      case ShortType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
      case ByteType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
      case BinaryType => Some(JdbcType("BINARY", java.sql.Types.BINARY))
      case DateType => Some(JdbcType("DATE", java.sql.Types.DATE))
      case TimestampType => Some(JdbcType("DATE", java.sql.Types.DATE))
      case t: DecimalType =>
        Some(JdbcType(s"DECIMAL(${t.precision},${t.scale})", java.sql.Types.DECIMAL))
      case _ => throw new IllegalArgumentException(s"Don't know how to save ${dt.json} to JDBC")
    }
  }
}

Then call it like this:

DataFrame2DB.df2db(df, "Nettransmit.dbo.df2dbff", "d:\\database.properties")

The third parameter is the path to the database properties file, which looks something like this:

# development database
driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
url=jdbc:sqlserver://10.119.46.153:1433;databaseName=tspmanagement
username=sa
password=ibas.1597

Now inspect the table written from the DataFrame: the string columns' types have become NVARCHAR(128).
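One way to verify is a sketch using plain JDBC; the database name Nettransmit and table name df2dbff are taken from the examples above, so adjust them to your environment:

import java.sql.DriverManager

// Query SQL Server's catalog for the column types of the written table.
val conn = DriverManager.getConnection(
  "jdbc:sqlserver://10.119.46.153:1433;databaseName=Nettransmit", "sa", "ibas.1597")
val rs = conn.createStatement().executeQuery(
  "SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH " +
  "FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'df2dbff'")
while (rs.next()) {
  println(rs.getString(1) + "\t" + rs.getString(2) + "\t" + rs.getString(3))
}
conn.close()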

Alternatively, it is simple to change a column's type in the DataFrame schema directly:

val df1 = df.withColumn("only 3G capability", col("only 3G capability").cast(DataTypes.FloatType))
df1.printSchema()
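If several columns need new types, one idiomatic option is to fold over a name-to-type map; the column names below come from the schema printed earlier, while the target types are illustrative guesses:

import org.apache.spark.sql.functions.col

// Cast each listed column in turn, threading the DataFrame through foldLeft.
val newTypes = Map(
  "Total PS Traffic (KB)" -> DataTypes.DoubleType,
  "Total Online time (s)" -> DataTypes.IntegerType)
val dfTyped = newTypes.foldLeft(df) { case (acc, (name, dt)) =>
  acc.withColumn(name, col(name).cast(dt))
}
dfTyped.printSchema()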

For reference, the Spark version used above is 1.6 and the database is SQL Server.
