Accessing data in Hadoop using dplyr and SQL

Source: Internet
Author: User
Tags hadoop ecosystem sparklyr

If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write the SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr. The dplyr package has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase and Spark.
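
To make the implicit route concrete, here is a minimal sketch of how dplyr code is turned into SQL. It assumes the dbplyr package is installed and uses its simulated Hive backend, so no cluster or driver is needed. The column names (carrier, dep_delay) and the sample row are made up for illustration.

```r
library(dplyr)
library(dbplyr)

# A lazy table tied to a simulated Hive connection. Nothing is executed;
# dbplyr only generates the SQL that would be sent to Hive.
flights <- lazy_frame(carrier = "AA", dep_delay = 75, con = simulate_hive())

# Ordinary dplyr verbs (hypothetical columns, see above).
query <- flights %>%
  filter(dep_delay > 60) %>%
  group_by(carrier) %>%
  summarise(n = n(), mean_delay = mean(dep_delay, na.rm = TRUE))

# Print the SQL that dplyr wrote on your behalf.
show_query(query)
```

Swapping simulate_hive() for simulate_impala() shows the same pipeline rendered for Impala. With a real connection, the same verbs are applied to tbl(con, "mytable") instead of a simulated table.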

There are two methods for accessing data in Hadoop using dplyr and SQL.

ODBC

You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You'll need a data-source-specific driver (e.g., Hive, Impala, HBase) installed on your desktop or your server. You'll also need a few R packages. We recommend using these R packages: DBI, dplyr, and odbc. Note that the dplyr package may also reference the dbplyr package to help translate R into specific variants of SQL. You can use the odbc package to create a connection with Hadoop and run queries:

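library(DBI)
library(dplyr)
library(odbc)

# Replace the <...> placeholders with the values for your driver; most drivers
# also need further arguments here (e.g. port and credentials).
con <- dbConnect(odbc::odbc(), driver = <driver>, host = <host>)

tbl(con, "mytable") # dplyr: lazy reference to the remote table
dbGetQuery(con, "SELECT * FROM mytable") # SQL: runs the query and returns a data frame

dbDisconnect(con)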
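
The calls above only browse the table. As a rough sketch of the manipulate-and-extract step, the example below does the aggregation in the database and pulls just the result into R with collect(). It is an illustration under stated assumptions, not the Hadoop setup itself: an in-memory SQLite database (via the RSQLite package, with dbplyr installed) stands in for the ODBC connection, and the columns grp and value are hypothetical.

```r
library(DBI)
library(dplyr)

# Stand-in connection (assumption): in-memory SQLite instead of a Hadoop driver.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mytable",
             data.frame(grp = c("a", "a", "b", "b"), value = c(10, 20, 30, 40)))

# Lazy dplyr pipeline: this only builds SQL, no rows are fetched yet.
summary_tbl <- tbl(con, "mytable") %>%
  group_by(grp) %>%
  summarise(total = sum(value, na.rm = TRUE)) %>%
  arrange(grp)

# collect() runs the query in the database and returns a local tibble.
local_dplyr <- collect(summary_tbl)

# The same extraction written as explicit SQL.
local_sql <- dbGetQuery(
  con,
  "SELECT grp, SUM(value) AS total FROM mytable GROUP BY grp ORDER BY grp"
)

dbDisconnect(con)
```

Against a real Hadoop source, only the dbConnect() line should need to change. Keeping the pipeline lazy until collect() means the heavy work stays on the cluster and only the summarised rows travel to R.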

Spark

If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connection with Spark and run queries:

library(sparklyr)
library(dplyr)
library(DBI)

con <- spark_connect(master = "yarn-client")

tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL

spark_disconnect(con)

Transferred from: https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL
