If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write the SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr. The dplyr package has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase, and Spark.

There are two methods for accessing data in Hadoop using dplyr and SQL.
ODBC
You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a data-source-specific driver (e.g., Hive, Impala, HBase) installed on your desktop or your server. You will also need a few R packages. We recommend using these R packages: DBI, dplyr, and odbc. Note that the dplyr package may also reference the dbplyr package to help translate R into specific variants of SQL. You can use the odbc package to create a connection with Hadoop and run queries:
library(DBI)
library(dplyr)
con <- dbConnect(odbc::odbc(), driver = "<driver>", host = "<host>")
tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL
dbDisconnect(con)
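To see the SQL that dplyr generates before it is sent to Hadoop, you can inspect a query with show_query(). The sketch below uses dbplyr's simulation helpers (lazy_frame() and simulate_hive()) so no cluster connection is needed; the table and column names are illustrative.

```r
library(dplyr)
library(dbplyr)

# Simulate a remote Hive table so we can see the generated SQL
# without connecting to a cluster. Column names are made up.
flights <- lazy_frame(carrier = "AA", distance = 100, con = simulate_hive())

flights %>%
  group_by(carrier) %>%
  summarise(avg_distance = mean(distance, na.rm = TRUE)) %>%
  show_query()
```

Against a live connection created with dbConnect(), the same pipeline would run remotely, and collect() would pull the result into R.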
Spark
If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connection with Spark and run queries:
library(sparklyr)
library(dplyr)
library(DBI)
con <- spark_connect(master = "yarn-client")
tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL
spark_disconnect(con)
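The dplyr backend means you can also push an R data frame into Spark and manipulate it with ordinary dplyr verbs, which sparklyr translates to Spark SQL. A minimal sketch, assuming a local Spark installation (master = "local" avoids needing a YARN cluster; the table name "mtcars_tbl" is illustrative):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance; on a Hadoop cluster you would
# use master = "yarn-client" as shown above.
con <- spark_connect(master = "local")

# Copy a small R data frame into Spark as a temporary table.
mtcars_tbl <- copy_to(con, mtcars, "mtcars_tbl", overwrite = TRUE)

# dplyr verbs run remotely in Spark; collect() brings the result to R.
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(con)
```

Until collect() is called, the data stays in Spark, so this pattern scales to tables far larger than local memory.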
Transferred from: https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL
Accessing data in Hadoop using dplyr and SQL