SparkR installation steps and problems encountered

1. Installation configuration for SparkR
1.1. Installation of R and RStudio
1.1.1. Installation of R

Our working environment runs on Ubuntu, so we only cover installing R on Ubuntu:

1) Add the CRAN source to /etc/apt/sources.list:

deb http://mirror.bjtu.edu.cn/cran/bin/linux/ubuntu precise/

Then update the package index with apt-get update.

2) Install via apt-get:

sudo apt-get install r-base
1.1.2. Installation of RStudio

The official website has a detailed introduction:

http://www.rstudio.com/products/rstudio/download-server/

sudo apt-get install gdebi-core

sudo apt-get install libapparmor1 # Required only for Ubuntu, not Debian

wget http://download2.rstudio.org/rstudio-server-0.97.551-amd64.deb

sudo gdebi rstudio-server-0.97.551-amd64.deb
1.2. rJava Installation
1.2.1. rJava Introduction

rJava is a communication interface between R and Java. It is implemented on top of JNI and allows Java objects and methods to be invoked directly from R.

rJava also provides the reverse capability of calling R from Java, implemented through JRI (Java/R Interface). JRI is now bundled inside the rJava package, though it can also be tried on its own. The rJava package has become a basic building block for many Java-based projects that work with R.

Because rJava is a low-level interface that calls through JNI, it is very efficient. In the JRI scenario, the JVM loads the R virtual machine directly into memory, and the call path is virtually lossless in performance, making it a very efficient connection channel and a preferred package for R-Java communication.
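To make this concrete, here is a minimal sketch of calling Java from R through rJava (it assumes rJava has already been installed as described in section 1.2.2 below; java.lang.String is used purely as an illustration):

library(rJava)
.jinit()                                  # start (or attach to) the embedded JVM
s <- .jnew("java/lang/String", "hello")   # instantiate java.lang.String("hello")
.jcall(s, "I", "length")                  # call String#length(); "I" declares an int return, so this yields 5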
1.2.2. rJava Installation

1) Configure the rJava environment

Execute R CMD javareconf:

root@testnode4:/home/payton# R CMD javareconf

2) Start R and install rJava

root@testnode4:/home/payton# R

install.packages("rJava")

1.3. Installation of SparkR
1.3.1. SparkR Code Download

Download the code archive sparkr-pkg-master.zip from https://github.com/amplab-extras/SparkR-pkg
1.3.2. SparkR Code Compilation

1) Unzip sparkr-pkg-master.zip, then cd sparkr-pkg-master/

2) Specify the Hadoop version and the Spark version when compiling:

SPARK_HADOOP_VERSION=2.4.1 SPARK_VERSION=1.2.0 ./install-dev.sh

At this point, the standalone version of SparkR is installed.
1.3.3. Deployment configuration for distributed SparkR

1) A lib folder is generated after a successful compilation. Enter the lib folder and package SparkR as SparkR.tar.gz; this archive is the key to the distributed SparkR deployment. A sketch of the packaging step follows below.
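For example (an illustrative sketch; the exact layout of the lib folder depends on the SparkR-pkg build, so adjust paths as needed):

cd lib
tar -czf SparkR.tar.gz SparkR   # package the built SparkR library for distribution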

2) Install SparkR on each cluster node from the packaged SparkR.tar.gz:

R CMD INSTALL SparkR.tar.gz
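A hypothetical sketch of rolling the package out to every node (the node names are placeholders for your cluster's hosts):

for node in node1 node2 node3; do
  scp SparkR.tar.gz $node:/tmp/                  # copy the archive to the worker
  ssh $node "R CMD INSTALL /tmp/SparkR.tar.gz"   # install it into the node's R library
done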

The distributed SparkR deployment is now complete.
2. Operation of SparkR
2.1. Operation mechanism of SparkR

SparkR is an R package released by AMPLab that provides a lightweight frontend for Apache Spark. SparkR exposes Spark's Resilient Distributed Dataset (RDD) API, allowing users to run jobs interactively from the R shell on a cluster. SparkR combines the benefits of Spark and R; the original article illustrates how SparkR works with three diagrams.

2.2. Using SparkR for data analysis
2.2.1. SparkR Basic Operations

First, let's go over the basic operations of SparkR:

The first step is to load the SparkR package:

library(SparkR)

The second step is to initialize the Spark context:

sc <- sparkR.init(master="spark://localhost:7077",
                  sparkEnvir=list(spark.executor.memory="1g", spark.cores.max="10"))

The third step is reading in data. The core abstraction of Spark is the Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (for example, HDFS files) or by transforming other RDDs. For example, the following reads data directly from HDFS as an RDD:

lines <- textFile(sc, "hdfs://sparkr_test.txt")

Alternatively, you can create an RDD from a vector or list using the parallelize function, for example:

rdd <- parallelize(sc, 1:10, 2)

At this point, we can operate on the RDD with actions and transformations to produce new RDDs, and we can also easily invoke R packages. Just load the R package with includePackage before using it on the cluster (for example, includePackage(sc, Matrix)); alternatively, the RDD can be converted into R's native data formats for manipulation. A small sketch follows below.
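As a minimal sketch of this workflow, using the SparkR-pkg API introduced above (here lapply is a transformation and collect is an action):

rdd <- parallelize(sc, 1:10, 2)           # an RDD of 1..10 split into 2 partitions
squares <- lapply(rdd, function(x) x^2)   # transformation: lazily squares each element
collect(squares)                          # action: returns the results to R as a list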

Refer to the following two links for details:

http://amplab-extras.github.io/SparkR-pkg/

https://github.com/amplab-extras/SparkR-pkg/wiki/SparkR-Quick-Start

So let's take a look at the two examples below to see how SparkR works.
Article reprinted from: http://www.cnblogs.com/payton/p/4227770.html
When I followed these steps, the following problem occurred:

  Invalid or corrupt jarfile sbt/sbt-launch-0.13.5.jar

My solution was to download an already-compiled sbt-launch-0.13 jar and use it to replace sbt-launch-0.13.6.jar under /opt/sparkr-pkg-master/pkg/src/sbt/. I have uploaded this file to my resources; please find it there.
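If you cannot find that upload, a sketch like the following should work (the download URL and the 0.13.6 version are assumptions; match whatever version your checkout's sbt script expects):

wget http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.6/sbt-launch.jar
cp sbt-launch.jar /opt/sparkr-pkg-master/pkg/src/sbt/sbt-launch-0.13.6.jar   # replace the corrupt launcher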
