Apache Spark Source Code Reading 10: Run SparkPi on YARN

Source: Internet
Author: User
Tags: hdfs, dfs, arch linux

You are welcome to repost this article; please indicate the source, huichiro.

Summary

"Spark is a headache, and we need to run it on yarn. What is yarn? I have no idea at all. What should I do. Don't tell me how it works. Can you tell me how to run spark on yarn? I'm a dummy, just told me how to do it ."

If, like me, you are not too interested in the theory behind it and just want to know how to get it done, reading this guide will not disappoint you. :)

Preparations

All the operations in this article are based on Arch Linux. Make sure the following software is installed:

  1. JDK
  2. Scala
  3. Maven
Build hadoop

Like its logo, Hadoop really is a huge elephant. If you start with it from scratch, it can leave you dizzy for quite a while. Fortunately, coming to it from Storm, things went fairly smoothly for me.

Hadoop mainly involves HDFS and the MapReduce framework. In the second generation of Hadoop, that is, Hadoop 2, the resource-management part of this framework has become the very popular YARN; if you have never heard of YARN, you would be embarrassed to say you have played with Hadoop.

Joking aside, the most important information in the paragraph above is HDFS and the MapReduce framework. All of our subsequent configuration centers on these two topics.

Create user

Add the user group hadoop and the user hduser:

groupadd hadoop
useradd -b /home -m -g hadoop hduser
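
The article itself only switches to hduser with su from root, but if you want to log in as hduser directly, you will probably also want to give the account a password (an optional extra step):

passwd hduser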
Download the Hadoop binary release

Assuming you are currently logged in as root, switch to hduser:

su - hduser
id   # Check whether the switch succeeded. If everything is OK, output like uid=1000(hduser) gid=1000(hadoop) groups=1000(hadoop) is displayed

Download and decompress hadoop 2.4

cd /home/hduser
wget http://mirror.esocc.com/apache/hadoop/common/hadoop-2.4.0/hadoop-2.4.0.tar.gz
tar zvxf hadoop-2.4.0.tar.gz
Set Environment Variables
export HADOOP_HOME=$HOME/hadoop-2.4.0
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.4.0
export HADOOP_COMMON_HOME=$HOME/hadoop-2.4.0
export HADOOP_HDFS_HOME=$HOME/hadoop-2.4.0
export HADOOP_YARN_HOME=$HOME/hadoop-2.4.0
export HADOOP_CONF_DIR=$HOME/hadoop-2.4.0/etc/hadoop

To avoid setting these variables by hand every time, you can add the statements above to the .bashrc file.
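
For example, a minimal sketch that appends the exports above to hduser's ~/.bashrc and reloads it:

cat >> ~/.bashrc <<'EOF'
export HADOOP_HOME=$HOME/hadoop-2.4.0
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.4.0
export HADOOP_COMMON_HOME=$HOME/hadoop-2.4.0
export HADOOP_HDFS_HOME=$HOME/hadoop-2.4.0
export HADOOP_YARN_HOME=$HOME/hadoop-2.4.0
export HADOOP_CONF_DIR=$HOME/hadoop-2.4.0/etc/hadoop
EOF
source ~/.bashrc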

Create directory

The directories created here are used by the HDFS NameNode and DataNode in Hadoop.

mkdir -p $HOME/yarn_data/hdfs/namenode
mkdir -p $HOME/yarn_data/hdfs/datanode
Modify the hadoop configuration file

The following files need to be configured:

  1. yarn-site.xml
  2. core-site.xml
  3. hdfs-site.xml
  4. mapred-site.xml

Switch to the hadoop installation directory

$ cd $HADOOP_HOME

Modify etc/hadoop/yarn-site.xml and add the following content between <configuration> and </configuration>; the other files below are edited in the same place.

<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

etc/hadoop/core-site.xml

<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
   <!-- The YARN client uses this configuration item -->
</property>

etc/hadoop/hdfs-site.xml

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>
<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/home/hduser/yarn_data/hdfs/namenode</value>
   <!-- Used when formatting the NameNode -->
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/home/hduser/yarn_data/hdfs/datanode</value>
</property>

etc/hadoop/mapred-site.xml

<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
Format namenode
$ bin/hadoop namenode -format
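
Note that in Hadoop 2.x the hadoop namenode command is deprecated in favor of the hdfs command; the equivalent form is:

$ bin/hdfs namenode -format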
Start HDFS-related processes

Start namenode
$ sbin/hadoop-daemon.sh start namenode
Start datanode
$ sbin/hadoop-daemon.sh start datanode
Start MapReduce framework processes

Start Resource Manager
sbin/yarn-daemon.sh start resourcemanager
Start Node Manager
sbin/yarn-daemon.sh start nodemanager
Start job history Server
sbin/mr-jobhistory-daemon.sh start historyserver
Verify deployment
$ jps
18509 Jps
17107 NameNode
17170 DataNode
17252 ResourceManager
17309 NodeManager
17626 JobHistoryServer
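
As an optional extra check, the NameNode and ResourceManager web UIs should now respond as well (assuming the default Hadoop 2.x ports of 50070 and 8088, and that curl is installed):

# Print the HTTP status code of each web UI; any response at all means the daemon is up
curl -s -o /dev/null -w "NameNode UI: %{http_code}\n" http://localhost:50070/
curl -s -o /dev/null -w "ResourceManager UI: %{http_code}\n" http://localhost:8088/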
Run wordcount

The best way to verify that hadoop is successfully built is to run a wordcount on it.

$ mkdir in
$ cat > in/file
This is one line
This is another line

Copy the file to HDFS

$ bin/hdfs dfs -copyFromLocal in /in
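
To confirm that the upload worked, you can list the target directory:

$ bin/hdfs dfs -ls /in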

Run wordcount

bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar wordcount /in /out

View running results

bin/hdfs dfs -cat /out/*
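
For the two-line sample input created above, the output should look roughly like the following (one word and its count per line, sorted by key; treat this as an illustration rather than verbatim output):

This       2
another    1
is         2
line       2
one        1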

Take a break here; getting the configuration to this point is already sweaty work. Next we run Spark on YARN, so hang in there a little longer.

Run SparkPi on YARN

Download Spark

Download the Spark release built for Hadoop 2 (the example below uses Spark 0.9.1).
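
One possible route (a sketch, not necessarily the author's exact steps) is to build Spark 0.9.1 from source against Hadoop 2 with YARN support, which also matches the assembly/target paths used in the command below; the download URL and build flags are assumptions based on the Spark 0.9.x documentation:

cd /home/hduser
wget http://archive.apache.org/dist/spark/spark-0.9.1/spark-0.9.1.tgz
tar zvxf spark-0.9.1.tgz
cd spark-0.9.1
# Build the assembly against Hadoop 2.2.0 with YARN support enabled
SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly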

Run SparkPi

Continue under the hduser identity. The main point is to set the YARN_CONF_DIR or HADOOP_CONF_DIR environment variable.

export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar \
  ./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar ./examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar \
  --class org.apache.spark.examples.JavaSparkPi \
  --args yarn-standalone \
  --num-workers 1 \
  --master-memory 512m \
  --worker-memory 512m \
  --worker-cores 1
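
While the application is running, you can also watch it from the YARN side (a quick sanity check; by default this lists currently running applications):

$HADOOP_HOME/bin/yarn application -list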
Check running results

The result is written to the stdout file in the log directory of the corresponding application container. You can find it with the following command:

cd $HADOOP_HOME
find . -name "*stdout"

Assume that the file is ./logs/userlogs/application_1400479924971_0002/container_1400479924971_0002_01_000001/stdout. Use cat to view the result:

cat ./logs/userlogs/application_1400479924971_0002/container_1400479924971_0002_01_000001/stdout
Pi is roughly 3.14028
