With the official release of Spark SQL and its DataFrame support, it may gradually replace Hive as an increasingly important platform for analyzing structured data. In the blog post "What's new for Spark SQL in Spark 1.3", Databricks engineer Michael Armbrust highlights the improved Data Sources API.
When we analyze structured data, we inevitably encounter multiple data sources, such as JSON, CSV, Parquet, relational databases, and NoSQL databases. Naturally, we want to access these heterogeneous data sources through a unified interface.
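In Spark 1.3, that unified interface is exposed through SQLContext.load. As a minimal sketch (the file paths here are placeholders, not from the original post), the same entry point handles JSON and Parquet alike:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("LoadDemo").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)

// Each call returns a DataFrame, regardless of the underlying format.
val jsonDF    = sqlContext.load("/path/to/users.json", "json")
val parquetDF = sqlContext.load("/path/to/users.parquet", "parquet")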
In our product's application scenario, we need to access PostgreSQL data for analysis. We can access it through the JDBC support provided by Spark SQL; all that is required is the PostgreSQL driver. The way to do this is to add the corresponding driver dependency to build.sbt, for example:
libraryDependencies ++= {
  val sparkVersion = "1.3.0"
  Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion,
    "org.apache.spark" %% "spark-sql" % sparkVersion,
    "org.postgresql" % "postgresql" % "9.4-1201-jdbc41"
  )
}
According to the official Spark SQL documentation, when invoking the Data Sources API you can load a remote database as a DataFrame or a Spark SQL temporary table via SQLContext. The options (properties) that can be passed at load time include: url, dbtable, driver, partitionColumn, lowerBound, upperBound, and numPartitions.
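The partitioning options are not used in the examples below. As a hedged sketch, assuming the sqlContext created as in the examples that follow and a numeric id column in tab_users (the column name and bounds are assumptions), a partitioned load might look like this:
// Sketch of a partitioned JDBC load in Spark 1.3; the id column and its bounds are assumed.
val partitionedUsers = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://192.168.1.110:5432/demo?user=test&password=test",
  "driver" -> "org.postgresql.Driver",
  "dbtable" -> "tab_users",
  "partitionColumn" -> "id",   // must be a numeric column
  "lowerBound" -> "1",
  "upperBound" -> "100000",
  "numPartitions" -> "4"
))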
The class name of the PostgreSQL driver is org.postgresql.Driver. Since the options do not include user and password entries, they are passed as part of the URL. Suppose we want to connect to a database server at IP 192.168.1.110 and port 5432, with user name and password both test, the database demo, and the table to query tab_users; the code to access PostgreSQL is as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PostgreSQLApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("FromPostgreSQL").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    val query = "(SELECT * FROM tab_users) AS users"
    val url = "jdbc:postgresql://192.168.1.110:5432/demo?user=test&password=test"
    val users = sqlContext.load("jdbc", Map(
      "url" -> url,
      "driver" -> "org.postgresql.Driver",
      "dbtable" -> query
    ))

    users.foreach(println)
  }
}
The code above puts the query statement directly in the query variable and passes it to sqlContext.load. Another way is to pass the table name directly, register a temporary table by calling registerTempTable(), and then invoke the sql() method to execute the query:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PostgreSQLApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("sparksql_select_table")
      .set("spark.driver.allowMultipleContexts", "true")
      .set("spark.sql.shuffle.partitions", "1")
      // local boot
      .setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    val url = "jdbc:postgresql://192.168.1.110:5432/demo?user=test&password=test"
    val dataFrame = sqlContext.load("jdbc", Map(
      "url" -> url,
      "driver" -> "org.postgresql.Driver",
      "dbtable" -> "tab_users"
    ))

    dataFrame.registerTempTable("users")
    val users = sqlContext.sql("SELECT * FROM users")
    users.foreach(println)
  }
}
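As a side note not covered in the original examples: starting with Spark 1.4, SQLContext.load was deprecated in favor of the DataFrameReader API. Under that later API, the equivalent load would look roughly like this:
// Rough equivalent using the DataFrameReader API available from Spark 1.4 onward (not applicable to 1.3).
val dataFrame = sqlContext.read
  .format("jdbc")
  .option("url", url)
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "tab_users")
  .load()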
From a performance perspective, you can also set some configuration options on the SQLContext after creating it, for example:
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
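The spark.sql.inMemoryColumnarStorage.batchSize setting only takes effect when a table is cached in Spark SQL's in-memory columnar store. A minimal sketch, reusing the users temporary table registered above:
// Cache the registered temporary table so the columnar batchSize setting applies.
sqlContext.cacheTable("users")
val cachedUsers = sqlContext.sql("SELECT * FROM users")  // subsequent queries read from the cache
cachedUsers.foreach(println)
// Release the cache when done.
sqlContext.uncacheTable("users")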