Spark SQL Access PostgreSQL

Source: Internet
Author: User
Tags: postgresql

With the official release of Spark SQL and its support for DataFrames, it may replace Hive as an increasingly important platform for analyzing structured data. In the blog post What's New for Spark SQL in Spark 1.3, Databricks engineer Michael Armbrust highlights the improved Data Sources API.

When we analyze structured data, we inevitably encounter multiple data sources: JSON, CSV, Parquet, relational databases, and NoSQL databases. Naturally, we want to access these heterogeneous data sources through a unified interface.

In our product's application scenario, we need to access PostgreSQL data for analysis. We can do so through the JDBC support provided by Spark SQL; all that is additionally required is the PostgreSQL driver. Simply add the corresponding driver dependency to build.sbt. For example:

libraryDependencies ++= {
  val sparkVersion = "1.3.0"
  Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion,
    "org.apache.spark" %% "spark-sql"  % sparkVersion,
    "org.postgresql"    % "postgresql" % "9.4-1201-jdbc41"
  )
}

According to the official Spark SQL documentation, when invoking the Data Sources API you can load a remote database table as a DataFrame or as a Spark SQL temporary table via SQLContext. The options that can be passed at load time include: url, dbtable, driver, partitionColumn, lowerBound, upperBound, and numPartitions.
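Conceptually, the four partitioning options tell Spark to split the table read into numPartitions parallel JDBC queries, each restricted by a WHERE predicate on partitionColumn. The following is a minimal plain-Scala sketch of that stride logic (not Spark's actual implementation; the function name partitionWhereClauses is hypothetical), just to illustrate what the options mean:

```scala
// Sketch: how partitionColumn/lowerBound/upperBound/numPartitions could be
// turned into one WHERE predicate per partition. The first and last
// partitions are left open-ended so rows outside [lower, upper) are not lost.
def partitionWhereClauses(column: String, lower: Long, upper: Long,
                          numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = lo + stride
    if (i == 0)                       s"$column < $hi"
    else if (i == numPartitions - 1)  s"$column >= $lo"
    else                              s"$column >= $lo AND $column < $hi"
  }
}
```

For example, partitionWhereClauses("id", 0L, 100L, 4) yields four predicates covering id < 25, the two middle ranges, and id >= 75, so each of the four partitions issues its own bounded SELECT against the database.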

The class name of the PostgreSQL driver is org.postgresql.Driver. Because the options map has no user and password keys, they are passed as part of the URL. Suppose we want to connect to a database server at IP 192.168.1.110, port 5432, with user name and password both test, the database demo, and the table to query tab_users; the code to access PostgreSQL is as follows:
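Since user and password ride in the URL's query string, any special characters in them (such as & or @) must be URL-encoded or the URL will be misparsed. A small helper (the name pgUrl is hypothetical, not part of any API) that builds the URL safely might look like:

```scala
import java.net.URLEncoder

// Builds a PostgreSQL JDBC URL with user/password as query parameters,
// URL-encoding both so characters like '&' or '@' cannot break the URL.
def pgUrl(host: String, port: Int, db: String,
          user: String, password: String): String = {
  val u = URLEncoder.encode(user, "UTF-8")
  val p = URLEncoder.encode(password, "UTF-8")
  s"jdbc:postgresql://$host:$port/$db?user=$u&password=$p"
}
```

With the credentials from this article, pgUrl("192.168.1.110", 5432, "demo", "test", "test") produces exactly the URL string used in the code below.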

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PostgreSqlApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("FromPostgreSql").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    // The query is wrapped in parentheses and aliased so it can stand
    // wherever a table name is expected.
    val query = "(SELECT * FROM tab_users) AS users"
    val url = "jdbc:postgresql://192.168.1.110:5432/demo?user=test&password=test"
    val users = sqlContext.load("jdbc", Map(
      "url" -> url,
      "driver" -> "org.postgresql.Driver",
      "dbtable" -> query
    ))

    users.foreach(println)
  }
}

The code above puts the query statement directly in the query variable and passes it to sqlContext.load. Another way is to pass the table name itself, register a temporary table by calling registerTempTable(), and then invoke the sql() method to execute the query:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PostgreSqlApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("FromPostgreSql")
      .set("spark.driver.allowMultipleContexts", "true")
      .set("spark.sql.shuffle.partitions", "10") // example value; tune for your workload
      .setMaster("local[2]") // local run

    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    val url = "jdbc:postgresql://192.168.1.110:5432/demo?user=test&password=test"
    val dataFrame = sqlContext.load("jdbc", Map(
      "url" -> url,
      "driver" -> "org.postgresql.Driver",
      "dbtable" -> "tab_users"
    ))

    dataFrame.registerTempTable("users")
    val users = sqlContext.sql("SELECT * FROM users")
    users.foreach(println)
  }
}

From a performance perspective, you can also set some configuration options on the SQLContext after creating it, for example:

val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
