Converting between Spark DataFrames and databases


In Spark, a DataFrame can loosely be thought of as a text file held in memory.

Working with one is as simple as working with TXT, CSV, and JSON files on your computer.

val sparkConf = new SparkConf().setAppName("df2db").setMaster("local[1]")
val sc = new SparkContext(sparkConf)
val sqlContext: SQLContext = new SQLContext(sc)
val df = sqlContext.read.format("csv").option("header", "true").load("D:\\spark test\\123")
val snapTable = "env0y"
df.registerTempTable(snapTable)

The few lines above convert a CSV file into a DataFrame and register it as a temporary table. From this point on, you can query snapTable just as you would a database table:

val sql = "SELECT * FROM " + snapTable
val dfTmp = sqlContext.sql(sql)
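Any SQL the engine supports can be run against the temporary table the same way; the quick row count below is just an illustrative sanity check, not part of the original pipeline:

val rowCount = sqlContext.sql("SELECT COUNT(*) AS cnt FROM " + snapTable)
rowCount.show()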

This code is easy to write, but unfortunately the DataFrame exists only in memory, and our business code outputs only the final results of the algorithm.

In other words, when a result is wrong, it is hard to tell whether the DataFrame's underlying data is bad or the SQL was written incorrectly ...

What if you could inspect the DataFrame from anywhere? Here is how.

Save the DataFrame to a real database:

import java.util.Properties

val connectionUrl = "jdbc:sqlserver://10.119.46.153:1433"
val table = "Nettransmit.dbo.df2mssql"
val prop = new Properties()
prop.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
prop.put("user", "sa")
prop.put("password", "ibas.1597")
df.write.mode(SaveMode.Overwrite).jdbc(connectionUrl, table, prop)

Now, if a calculation goes wrong, we can do some ad-hoc debugging simply by connecting to the SQL Server database and inspecting the table.

The only caveat: writing the DataFrame to a database is not a business requirement here, so code like this belongs only in development or test builds and should not appear in an official release. A sketch of how it might be gated is below.
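As a minimal sketch, assuming a hypothetical df2db.debug system property (not from the original article), the debug dump could be gated like this:

// Hypothetical switch; a real project might read it from build config instead.
val debugDump = sys.props.getOrElse("df2db.debug", "false").toBoolean
if (debugDump) {
  // Only development/test runs dump the intermediate DataFrame.
  df.write.mode(SaveMode.Overwrite).jdbc(connectionUrl, table, prop)
}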

Since you can write it, naturally you can read it back:

SQL Server to DataFrame:

val dfViaTable = sqlContext.read.jdbc(connectionUrl, table, prop)
dfViaTable.show(10)
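A quick round-trip check is also possible; this is merely illustrative, since equal counts alone do not prove the contents match:

// Compare the in-memory DataFrame with what was read back from the database.
assert(df.count() == dfViaTable.count())
dfViaTable.printSchema()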

Above, the minimal interaction between a DataFrame and a database is complete. But if the business genuinely needs to read and write databases, performance can become a bottleneck and deserves attention; one common mitigation, partitioning the JDBC read, is sketched below.
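This sketch assumes the table has a numeric column named id whose values span roughly 0 to 1,000,000; both the column and the bounds are assumptions for illustration. Spark then issues one query per partition instead of a single large scan:

// Partitioned JDBC read: 8 parallel queries over ranges of the "id" column.
val dfPartitioned = sqlContext.read.jdbc(
  connectionUrl,
  table,
  "id",      // partition column (assumed to exist)
  0L,        // lower bound of "id" (assumed)
  1000000L,  // upper bound of "id" (assumed)
  8,         // number of partitions
  prop)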

Next, a small optimization of the schema types.

After loading the CSV into a DataFrame, the df.printSchema() statement prints output similar to the following in the console:

root
 |-- imsi: string (nullable = true)
 |-- usertype: string (nullable = true)
 |-- Total PS Traffic (KB): string (nullable = true)
 |-- Total Online time (s): string (nullable = true)
 |-- Total CS Traffic (ERL): string (nullable = true)
 |-- brand: string (nullable = true)
 |-- series: string (nullable = true)
 |-- os: string (nullable = true)
 |-- type: string (nullable = true)
 |-- FDD lte: string (nullable = true)
 |-- TDD lte: string (nullable = true)
 |-- only 3G capability: string (nullable = true)

That is, every field is inferred as a string, so every column is written to the database as a string type, which is clearly wasteful: many of these values would fit an int, double, or boolean.

What do we do? Register a custom "dialect" for the database, much as std::locale in C++ establishes local rules.

For convenience, encapsulate it:

import java.io.{File, FileInputStream}
import java.util.Properties

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, SaveMode}

/**
  * Created by env0y on 2017/11/24.
  */
object DataFrame2DB {
  def df2db(df: DataFrame, table: String, properties: String) = {
    try {
      // Load JDBC connection settings from the given properties file.
      val is = new FileInputStream(new File(properties))
      val prop = new Properties()
      prop.load(is)
      val url = String.valueOf(prop.get("url"))
      // Register the custom dialect before writing so Spark maps types with it.
      JdbcDialects.registerDialect(SQLServerDialect)
      df.write.mode(SaveMode.Overwrite).jdbc(url, table, prop)
      is.close()
    }
  }

  val SQLServerDialect = new JdbcDialect {
    override def canHandle(url: String): Boolean =
      url.startsWith("jdbc:jtds:sqlserver") || url.contains("sqlserver")

    // Map each Spark SQL type to the column type used in the generated DDL.
    override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
      case StringType => Some(JdbcType("NVARCHAR(128)", java.sql.Types.VARCHAR))
      case BooleanType => Some(JdbcType("BIT(1)", java.sql.Types.BIT))
      case IntegerType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
      case LongType => Some(JdbcType("BIGINT", java.sql.Types.BIGINT))
      case DoubleType => Some(JdbcType("DOUBLE PRECISION", java.sql.Types.DOUBLE))
      case FloatType => Some(JdbcType("REAL", java.sql.Types.REAL))
      case ShortType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
      case ByteType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
      case BinaryType => Some(JdbcType("BINARY", java.sql.Types.BINARY))
      case DateType => Some(JdbcType("DATE", java.sql.Types.DATE))
      case TimestampType => Some(JdbcType("DATE", java.sql.Types.DATE))
      case t: DecimalType =>
        Some(JdbcType(s"DECIMAL(${t.precision},${t.scale})", java.sql.Types.DECIMAL))
      case _ => throw new IllegalArgumentException(s"Don't know how to save ${dt.json} to JDBC")
    }
  }
}

Then call it like this:

DataFrame2DB.df2db(df, "Nettransmit.dbo.df2dbff", "d:\\database.properties")

The third parameter is the path to the database properties file, which looks something like this:

# development database
driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
url=jdbc:sqlserver://10.119.46.153:1433;databaseName=tspmanagement
username=sa
password=ibas.1597

Now inspect the table written from the DataFrame: the string columns' types have become NVARCHAR(128).
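One way to verify is a sketch using plain JDBC; the database name Nettransmit and table name df2dbff are taken from the examples above, so adjust them to your environment:

import java.sql.DriverManager

// Query SQL Server's catalog for the column types of the written table.
val conn = DriverManager.getConnection(
  "jdbc:sqlserver://10.119.46.153:1433;databaseName=Nettransmit", "sa", "ibas.1597")
val rs = conn.createStatement().executeQuery(
  "SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH " +
  "FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'df2dbff'")
while (rs.next()) {
  println(rs.getString(1) + "\t" + rs.getString(2) + "\t" + rs.getString(3))
}
conn.close()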

Alternatively, it is simple to change a column's type in the DataFrame schema directly:

val df1 = df.withColumn("only 3G capability", col("only 3G capability").cast(DataTypes.FloatType))
df1.printSchema()
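If several columns need new types, one idiomatic option is to fold over a name-to-type map; the column names below come from the schema printed earlier, while the target types are illustrative guesses:

import org.apache.spark.sql.functions.col

// Cast each listed column in turn, threading the DataFrame through foldLeft.
val newTypes = Map(
  "Total PS Traffic (KB)" -> DataTypes.DoubleType,
  "Total Online time (s)" -> DataTypes.IntegerType)
val dfTyped = newTypes.foldLeft(df) { case (acc, (name, dt)) =>
  acc.withColumn(name, col(name).cast(dt))
}
dfTyped.printSchema()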

For reference, the Spark version used above is 1.6 and the database is SQL Server.
