This is an introductory tutorial for Giraph. It is intended for running a small number of example Giraph programs and is not meant for production environments.
In this tutorial, we will deploy a single-node, pseudo-distributed Hadoop cluster on one physical machine. This node acts as both master and slave: it runs the NameNode, SecondaryNameNode, JobTracker, DataNode, and TaskTracker Java processes. We will also deploy Giraph to this machine, using the software and configuration below.
Ubuntu Server 14.04 LTS
Apache Hadoop 0.20.203.0-rc1
Apache Giraph 1.2.0-SNAPSHOT
To deploy the single-node, pseudo-distributed Hadoop cluster, first install Java (1.6 or later; we use OpenJDK 7 here) and verify the installation:
sudo apt-get install openjdk-7-jdk
java -version
You should see your Java version information. The full Java installation lives in /usr/lib/jvm/java-7-openjdk-amd64; inside this directory you can find Java's bin and lib directories.
After doing this, create a dedicated group hadoop and a new user hduser, then add the user hduser to the hadoop group:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
The adduser command creates the home directory /home/hduser and prompts you for a password and some optional user information.
Now download and extract hadoop-0.20.203.0rc1 from the Apache archives (this is the default Hadoop version for Giraph):
su - xxx
cd /usr/local
sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz
sudo tar xzf hadoop-0.20.203.0rc1.tar.gz
sudo mv hadoop-0.20.203.0 hadoop
sudo chown -R hduser:hadoop hadoop
When the installation is complete, switch to user hduser and edit that user's $HOME/.bashrc with the following content:
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
This sets the Hadoop- and Java-related environment variables. When finished, edit the $HADOOP_HOME/conf/hadoop-env.sh file with the following:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
The second line forces Hadoop to use IPv4, even if IPv6 is configured on the machine. Because Hadoop needs to store temporary files while computing, create a temporary directory for the FS and HDFS files with the following commands:
su - xxx
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp
Make sure the /etc/hosts file contains the following two lines (add or modify them if they are missing):
127.0.0.1       localhost
xxx.xxx.xxx.xxx hdnode01
Although we could use localhost for connectivity in this single-node cluster, using the hostname is usually the better choice (for example, you might later add new nodes and turn your single-node, pseudo-distributed cluster into a multi-node, distributed cluster).
Now edit the Hadoop configuration files core-site.xml, mapred-site.xml, and hdfs-site.xml, which are in the $HADOOP_HOME/conf directory. In each file, add the content between the <configuration> and </configuration> tags. First, in core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://hdnode01:54310</value>
</property>
Then, in mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>hdnode01:54311</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>
By default, Hadoop allows 2 mappers to run simultaneously, but Giraph expects to be able to run 4 mappers at the same time. For this single-node, pseudo-distributed deployment, we need to add the latter two properties to mapred-site.xml to meet this requirement; otherwise some of Giraph's unit tests will fail.
Edit the hdfs-site.xml file:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
This sets HDFS to store only one replica of each file, because we have only one DataNode. The default value is 3, and if you do not change it you will get runtime exceptions.
The next step is to set up SSH for the hduser user so that you do not need to enter a password each time an SSH connection is opened:
su - hduser
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Then use SSH to connect to hdnode01 as the hduser user (you must use hdnode01 here, because we used this node's hostname in the Hadoop configuration files). On the first connection you will be asked to confirm the host's authenticity; once you accept, the host's public RSA key is stored in $HOME/.ssh/known_hosts, and later SSH connections will proceed without prompting.
Now edit $HADOOP_HOME/conf/masters with the following content:
hdnode01
Similarly, edit $HADOOP_HOME/conf/slaves with the following content:
hdnode01
These edits set up a single-node, pseudo-distributed Hadoop cluster with one master and one slave on the same physical machine. If you later want to deploy a multi-node, distributed Hadoop cluster, add the additional data nodes (for example hdnode02, hdnode03) to the $HADOOP_HOME/conf/slaves file after completing the steps above on each of those nodes.
To initialize HDFS, format it by executing the following command:
$HADOOP_HOME/bin/hadoop namenode -format
Then start the HDFS and map/reduce daemons, in the following order:
$HADOOP_HOME/bin/start-dfs.sh
$HADOOP_HOME/bin/start-mapred.sh
To make sure all the Java processes are running, execute the jps command:
jps
It will output something like the following:
9079 NameNode
9560 JobTracker
9263 DataNode
9453 SecondaryNameNode
16316 Jps
9745 TaskTracker
To stop the daemons, run the $HADOOP_HOME/bin/stop-*.sh scripts in the reverse of the start order; this matters so that you do not lose your data. You have now completed a single-node, pseudo-distributed Hadoop cluster.
Run a map/reduce task
Now that we have a running Hadoop cluster, we can run map/reduce jobs. We will use the WordCount example, which reads text files and counts how many times each word occurs. The input is a text file, and the output is also a text file in which each line contains a word and its number of occurrences, separated by a tab. The example is packaged in $HADOOP_HOME/hadoop-examples-0.20.203.0.jar. Let's get started: download a fairly large UTF-8 file into the temp directory, copy it to HDFS, and verify that the copy succeeded:
cd /tmp/
wget http://www.gutenberg.org/cache/epub/132/pg132.txt
$HADOOP_HOME/bin/hadoop dfs -copyFromLocal /tmp/pg132.txt /user/hduser/input/pg132.txt
$HADOOP_HOME/bin/hadoop dfs -ls /user/hduser/input
After that is done, you can run the WordCount example. To start a map/reduce job, use the $HADOOP_HOME/bin/hadoop jar command:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-0.20.203.0.jar wordcount /user/hduser/input/pg132.txt /user/hduser/output/wordcount
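The logic the WordCount job implements can be sketched in a few lines of plain Python. This is a single-machine illustration only, not the MapReduce implementation; it simply shows what shape of output to expect from the job:

```python
from collections import Counter

def word_count(text):
    """Count word occurrences and emit one 'word<TAB>count' line per word,
    sorted by word, mimicking the shape of the WordCount job's output."""
    counts = Counter(text.split())
    return "\n".join(f"{word}\t{n}" for word, n in sorted(counts.items()))

print(word_count("to be or not to be"))
```

Each reducer in the real job emits its keys in sorted order, which is why the sketch sorts its words before printing.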
You can monitor the progress of the job using the JobTracker web UI (by default it listens on port 50030):
Once the job is complete, you can inspect the output with the following command:
$HADOOP_HOME/bin/hadoop dfs -cat /user/hduser/output/wordcount/p* | less
Next we will deploy Giraph. In order to build Giraph from the repository, you first need to install Git and Maven 3 by running the following commands:
su - hdadmin
sudo apt-get install git
sudo apt-get install maven
mvn -version
Make sure you installed Maven 3 or a higher version: to support multiple versions of Hadoop, Giraph uses the Munge plugin, which requires Maven 3, and the web site plugin also requires Maven 3. Then clone Giraph from GitHub:
cd /usr/local/
sudo git clone https://github.com/apache/giraph.git
sudo chown -R hduser:hadoop giraph
su - hduser
Then edit the user hduser's $HOME/.bashrc file with the following content:
export GIRAPH_HOME=/usr/local/giraph
Save and close the file, then validate, compile, test, and package Giraph into JAR files by running the following commands:
source $HOME/.bashrc
cd $GIRAPH_HOME
mvn package -DskipTests
The -DskipTests parameter skips the test phase. The first run will take some time, because Maven downloads many files into your local repository, and you may have to run it several times before it succeeds, since the remote servers can time out. Once packaging succeeds, you will have the Giraph core JAR at $GIRAPH_HOME/giraph-core/target/giraph-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar and the Giraph examples JAR at $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar. You have now finished deploying Giraph.
Run a Giraph task
With Giraph and Hadoop deployed, we can run our first Giraph job. We will use the SimpleShortestPathsComputation example, which reads a graph input file in one of the supported formats and computes the shortest path from the source vertex to every other vertex. The source vertex is always the first vertex in the input file, and we will use the JsonLongDoubleFloatDoubleVertexInputFormat input format. First, create an example graph at /tmp/tiny_graph.txt with the following content:
[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
Save and close the file. Each line has the format [source_id, source_value, [[dest_id, edge_value], ...]]. This graph has 5 vertices and 12 edges. Copy the input file into HDFS:
$HADOOP_HOME/bin/hadoop dfs -copyFromLocal /tmp/tiny_graph.txt /user/hduser/input/tiny_graph.txt
$HADOOP_HOME/bin/hadoop dfs -ls /user/hduser/input
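Each line of the input file is a valid JSON array, so the [source_id, source_value, [[dest_id, edge_value], ...]] format can be illustrated with a small standalone Python sketch (for illustration only; Giraph's actual parser is the Java input format class):

```python
import json

def parse_vertex_line(line):
    """Parse one '[source_id, source_value, [[dest_id, edge_value], ...]]'
    line into the vertex id, its value, and an edge dict."""
    source_id, source_value, edges = json.loads(line)
    return source_id, source_value, {dest: weight for dest, weight in edges}

vid, value, edges = parse_vertex_line("[1,0,[[0,1],[2,2],[3,1]]]")
print(vid, value, edges)  # 1 0 {0: 1, 2: 2, 3: 1}
```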
We will use the IdWithValueTextOutputFormat output format: each output line contains a vertex id and the shortest distance to that vertex (the source vertex has distance 0). You can run the example with the following command:
$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/hduser/input/tiny_graph.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/hduser/output/shortestpaths -w 1
Note that this job runs with a single worker, specified by the -w parameter. To get more information about running Giraph jobs, run the following command:
$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -h
You can monitor the progress of your Giraph job from the JobTracker web GUI. Once the job is finished, you can view the results with the following command:
$HADOOP_HOME/bin/hadoop dfs -cat /user/hduser/output/shortestpaths/p* | less
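As a sanity check on the job's result, the shortest distances over the tiny graph can be recomputed independently. Below is a minimal Dijkstra sketch in Python over the five-vertex graph from tiny_graph.txt, assuming vertex 0 (the first vertex in the input file, as noted above) is the source; under that assumption these are the distances to expect in the output:

```python
import heapq

# Adjacency list of the five-vertex graph from /tmp/tiny_graph.txt.
GRAPH = {
    0: {1: 1, 3: 3},
    1: {0: 1, 2: 2, 3: 1},
    2: {1: 2, 4: 4},
    3: {0: 3, 1: 1, 4: 4},
    4: {3: 4, 2: 4},
}

def dijkstra(graph, source):
    """Single-source shortest path lengths via Dijkstra's algorithm."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already settled with a shorter path
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

print(dijkstra(GRAPH, 0))  # {0: 0, 1: 1, 3: 2, 2: 3, 4: 6}
```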