SparkR installation steps and problems encountered

1. Installation configuration for SparkR
1.1. Installation of R and RStudio
1.1.1. Installation of R

Our working environment runs on Ubuntu, so we only cover installing R on Ubuntu:

1) Add the CRAN source to /etc/apt/sources.list:

deb http://mirror.bjtu.edu.cn/cran/bin/linux/ubuntu precise/

Then update the package index with apt-get update.

2) Install via apt-get:

sudo apt-get install r-base
1.1.2. Installation of RStudio

The official website has a detailed introduction:

http://www.rstudio.com/products/rstudio/download-server/

sudo apt-get install gdebi-core

sudo apt-get install libapparmor1 # Required only for Ubuntu, not Debian

wget http://download2.rstudio.org/rstudio-server-0.97.551-amd64.deb

sudo gdebi rstudio-server-0.97.551-amd64.deb
1.2. rJava Installation
1.2.1. rJava Introduction

rJava is a communication interface between R and Java. It is implemented on top of JNI and allows Java objects and methods to be invoked directly from R.

rJava also provides the reverse capability of calling R from Java, implemented through JRI (Java/R Interface). JRI is now bundled inside the rJava package, though it can also be tried on its own. The rJava package has become a basic building block for many Java-based projects that work with R.

Because rJava is a low-level interface that calls through JNI, it is very efficient. In the JRI scenario, the JVM loads the R virtual machine directly into memory, and the call path is virtually lossless in performance, making it a very efficient connection channel and a preferred package for R-Java communication.
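To make this concrete, here is a minimal sketch of calling Java from R through rJava (it assumes rJava has already been installed as described in section 1.2.2 below; java.lang.String is used purely as an illustration):

library(rJava)
.jinit()                                  # start (or attach to) the embedded JVM
s <- .jnew("java/lang/String", "hello")   # instantiate java.lang.String("hello")
.jcall(s, "I", "length")                  # call String#length(); "I" declares an int return, so this yields 5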
1.2.2. rJava Installation

1) Configure the rJava environment

Execute R CMD javareconf:

root@testnode4:/home/payton# R CMD javareconf

2) Start R and install rJava

root@testnode4:/home/payton# R

install.packages("rJava")

1.3. Installation of SparkR
1.3.1. SparkR Code Download

Download the code archive sparkr-pkg-master.zip from https://github.com/amplab-extras/SparkR-pkg
1.3.2. SparkR Code Compilation

1) Unzip sparkr-pkg-master.zip, then cd sparkr-pkg-master/

2) Specify the Hadoop version and the Spark version when compiling:

SPARK_HADOOP_VERSION=2.4.1 SPARK_VERSION=1.2.0 ./install-dev.sh

At this point, the standalone version of SparkR is installed.
1.3.3. Deployment configuration for distributed SparkR

1) A lib folder is generated after a successful compilation. Enter the lib folder and package SparkR as SparkR.tar.gz; this archive is the key to the distributed SparkR deployment. A sketch of the packaging step follows below.
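For example (an illustrative sketch; the exact layout of the lib folder depends on the SparkR-pkg build, so adjust paths as needed):

cd lib
tar -czf SparkR.tar.gz SparkR   # package the built SparkR library for distribution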

2) Install SparkR on each cluster node from the packaged SparkR.tar.gz:

R CMD INSTALL SparkR.tar.gz
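A hypothetical sketch of rolling the package out to every node (the node names are placeholders for your cluster's hosts):

for node in node1 node2 node3; do
  scp SparkR.tar.gz $node:/tmp/                  # copy the archive to the worker
  ssh $node "R CMD INSTALL /tmp/SparkR.tar.gz"   # install it into the node's R library
done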

The distributed SparkR deployment is now complete.
2. Operation of SparkR
2.1. Operation mechanism of SparkR

SparkR is an R package released by AMPLab that provides a lightweight frontend for Apache Spark. SparkR exposes Spark's Resilient Distributed Dataset (RDD) API, allowing users to run jobs interactively from the R shell on a cluster. SparkR combines the benefits of Spark and R; the original article illustrates how SparkR works with three diagrams.

2.2. Using SparkR for data analysis
2.2.1. SparkR Basic Operations

First, let's go over the basic operations of SparkR:

The first step is to load the SparkR package:

library(SparkR)

The second step is to initialize the Spark context:

sc <- sparkR.init(master="spark://localhost:7077",
                  sparkEnvir=list(spark.executor.memory="1g", spark.cores.max="10"))

The third step is reading in data. The core abstraction of Spark is the Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (for example, HDFS files) or by transforming other RDDs. For example, the following reads data directly from HDFS as an RDD:

lines <- textFile(sc, "hdfs://sparkr_test.txt")

Alternatively, you can create an RDD from a vector or list using the parallelize function, for example:

rdd <- parallelize(sc, 1:10, 2)

At this point, we can operate on the RDD with actions and transformations to produce new RDDs, and we can also easily invoke R packages. Just load the R package with includePackage before using it on the cluster (for example, includePackage(sc, Matrix)); alternatively, the RDD can be converted into R's native data formats for manipulation. A small sketch follows below.
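As a minimal sketch of this workflow, using the SparkR-pkg API introduced above (here lapply is a transformation and collect is an action):

rdd <- parallelize(sc, 1:10, 2)           # an RDD of 1..10 split into 2 partitions
squares <- lapply(rdd, function(x) x^2)   # transformation: lazily squares each element
collect(squares)                          # action: returns the results to R as a list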

Refer to the following two links for details:

http://amplab-extras.github.io/SparkR-pkg/

https://github.com/amplab-extras/SparkR-pkg/wiki/SparkR-Quick-Start

So let's take a look at the two examples below to see how SparkR works.
Article reprinted from: http://www.cnblogs.com/payton/p/4227770.html
When I followed these steps, the following problem occurred:

  Invalid or corrupt jarfile sbt/sbt-launch-0.13.5.jar

My solution was to download an already-compiled sbt-launch-0.13 jar and use it to replace sbt-launch-0.13.6.jar under /opt/sparkr-pkg-master/pkg/src/sbt/. I have uploaded this file to my resources; please find it there.
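If you cannot find that upload, a sketch like the following should work (the download URL and the 0.13.6 version are assumptions; match whatever version your checkout's sbt script expects):

wget http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.6/sbt-launch.jar
cp sbt-launch.jar /opt/sparkr-pkg-master/pkg/src/sbt/sbt-launch-0.13.6.jar   # replace the corrupt launcher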
