Apache Spark Source Code Reading 10: Run SparkPi on YARN

Source: Internet
Author: User
Tags: hdfs, dfs, arch linux

You are welcome to repost this article; please indicate the source, huichiro.

Summary

"Spark is a headache, and we need to run it on yarn. What is yarn? I have no idea at all. What should I do. Don't tell me how it works. Can you tell me how to run spark on yarn? I'm a dummy, just told me how to do it ."

If, like me, you are not too interested in the theory behind it and just want to know how to get it done, reading this guide will not disappoint you. :)

Preparations

All the operations in this article are based on Arch Linux. Make sure the following software is installed:

  1. JDK
  2. Scala
  3. Maven
Build hadoop

Like its logo, Hadoop really is a huge elephant. If you start with it from scratch, it can leave you dizzy for quite a while. Fortunately, coming to it from Storm, things went fairly smoothly for me.

Hadoop mainly involves HDFS and the MapReduce framework. In the second generation of Hadoop, that is, Hadoop 2, the resource-management part of this framework has become the very popular YARN; if you have never heard of YARN, you would be embarrassed to say you have played with Hadoop.

Joking aside, the most important information in the paragraph above is HDFS and the MapReduce framework. All of our subsequent configuration centers on these two topics.

Create user

Add the user group hadoop and the user hduser:

groupadd hadoop
useradd -b /home -m -g hadoop hduser
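
The article itself only switches to hduser with su from root, but if you want to log in as hduser directly, you will probably also want to give the account a password (an optional extra step):

passwd hduser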
Download the Hadoop binary release

Assuming you are currently logged in as root, switch to hduser:

su - hduser
id   # Check whether the switch succeeded. If everything is OK, output like uid=1000(hduser) gid=1000(hadoop) groups=1000(hadoop) is displayed

Download and decompress hadoop 2.4

cd /home/hduser
wget http://mirror.esocc.com/apache/hadoop/common/hadoop-2.4.0/hadoop-2.4.0.tar.gz
tar zvxf hadoop-2.4.0.tar.gz
Set Environment Variables
export HADOOP_HOME=$HOME/hadoop-2.4.0
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.4.0
export HADOOP_COMMON_HOME=$HOME/hadoop-2.4.0
export HADOOP_HDFS_HOME=$HOME/hadoop-2.4.0
export HADOOP_YARN_HOME=$HOME/hadoop-2.4.0
export HADOOP_CONF_DIR=$HOME/hadoop-2.4.0/etc/hadoop

To avoid setting these variables by hand every time, you can add the statements above to the .bashrc file.
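
For example, a minimal sketch that appends the exports above to hduser's ~/.bashrc and reloads it:

cat >> ~/.bashrc <<'EOF'
export HADOOP_HOME=$HOME/hadoop-2.4.0
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.4.0
export HADOOP_COMMON_HOME=$HOME/hadoop-2.4.0
export HADOOP_HDFS_HOME=$HOME/hadoop-2.4.0
export HADOOP_YARN_HOME=$HOME/hadoop-2.4.0
export HADOOP_CONF_DIR=$HOME/hadoop-2.4.0/etc/hadoop
EOF
source ~/.bashrc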

Create directory

The directories created here are used by the HDFS NameNode and DataNode in Hadoop.

mkdir -p $HOME/yarn_data/hdfs/namenode
mkdir -p $HOME/yarn_data/hdfs/datanode
Modify the hadoop configuration file

The following files need to be configured:

  1. yarn-site.xml
  2. core-site.xml
  3. hdfs-site.xml
  4. mapred-site.xml

Switch to the hadoop installation directory

$ cd $HADOOP_HOME

Modify etc/hadoop/yarn-site.xml and add the following content between <configuration> and </configuration>; the other files below are edited in the same place.

<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

etc/hadoop/core-site.xml

<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
   <!-- The YARN client uses this configuration item -->
</property>

etc/hadoop/hdfs-site.xml

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>
<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/home/hduser/yarn_data/hdfs/namenode</value>
   <!-- Used when formatting the NameNode -->
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/home/hduser/yarn_data/hdfs/datanode</value>
</property>

etc/hadoop/mapred-site.xml

<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
Format namenode
$ bin/hadoop namenode -format
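
Note that in Hadoop 2.x the hadoop namenode command is deprecated in favor of the hdfs command; the equivalent form is:

$ bin/hdfs namenode -format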
Start HDFS-related processes

Start namenode
$ sbin/hadoop-daemon.sh start namenode
Start datanode
$ sbin/hadoop-daemon.sh start datanode
Start MapReduce framework processes

Start Resource Manager
sbin/yarn-daemon.sh start resourcemanager
Start Node Manager
sbin/yarn-daemon.sh start nodemanager
Start job history Server
sbin/mr-jobhistory-daemon.sh start historyserver
Verify deployment
$ jps
18509 Jps
17107 NameNode
17170 DataNode
17252 ResourceManager
17309 NodeManager
17626 JobHistoryServer
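
As an optional extra check, the NameNode and ResourceManager web UIs should now respond as well (assuming the default Hadoop 2.x ports of 50070 and 8088, and that curl is installed):

# Print the HTTP status code of each web UI; any response at all means the daemon is up
curl -s -o /dev/null -w "NameNode UI: %{http_code}\n" http://localhost:50070/
curl -s -o /dev/null -w "ResourceManager UI: %{http_code}\n" http://localhost:8088/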
Run wordcount

The best way to verify that hadoop is successfully built is to run a wordcount on it.

$ mkdir in
$ cat > in/file
This is one line
This is another line

Copy the file to HDFS

$ bin/hdfs dfs -copyFromLocal in /in
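
To confirm that the upload worked, you can list the target directory:

$ bin/hdfs dfs -ls /in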

Run wordcount

bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar wordcount /in /out

View running results

bin/hdfs dfs -cat /out/*
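
For the two-line sample input created above, the output should look roughly like the following (one word and its count per line, sorted by key; treat this as an illustration rather than verbatim output):

This       2
another    1
is         2
line       2
one        1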

Take a break here; getting the configuration to this point is already sweaty work. Next we run Spark on YARN, so hang in there a little longer.

Run SparkPi on YARN

Download Spark

Download the Spark release built for Hadoop 2 (the example below uses Spark 0.9.1).
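
One possible route (a sketch, not necessarily the author's exact steps) is to build Spark 0.9.1 from source against Hadoop 2 with YARN support, which also matches the assembly/target paths used in the command below; the download URL and build flags are assumptions based on the Spark 0.9.x documentation:

cd /home/hduser
wget http://archive.apache.org/dist/spark/spark-0.9.1/spark-0.9.1.tgz
tar zvxf spark-0.9.1.tgz
cd spark-0.9.1
# Build the assembly against Hadoop 2.2.0 with YARN support enabled
SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly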

Run SparkPi

Continue under the hduser identity. The main point is to set the YARN_CONF_DIR or HADOOP_CONF_DIR environment variable.

export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar \
  ./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar ./examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar \
  --class org.apache.spark.examples.JavaSparkPi \
  --args yarn-standalone \
  --num-workers 1 \
  --master-memory 512m \
  --worker-memory 512m \
  --worker-cores 1
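
While the application is running, you can also watch it from the YARN side (a quick sanity check; by default this lists currently running applications):

$HADOOP_HOME/bin/yarn application -list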
Check running results

The result is written to the stdout file in the log directory of the corresponding application container. You can find it with the following command:

cd $HADOOP_HOME
find . -name "*stdout"

Assume that the file is ./logs/userlogs/application_1400479924971_0002/container_1400479924971_0002_01_000001/stdout. Use cat to view the result:

cat ./logs/userlogs/application_1400479924971_0002/container_1400479924971_0002_01_000001/stdout
Pi is roughly 3.14028
