Giraph Getting Started

Overview

This is an introductory tutorial for Giraph. The deployment described here is meant for running a small number of Giraph programs on small inputs and is not intended for production environments.

In this tutorial, we will deploy a single-node, pseudo-distributed Hadoop cluster on one physical machine. This node acts as both master and slave; that is, it will run the NameNode, SecondaryNameNode, JobTracker, DataNode, and TaskTracker Java processes. We will also deploy Giraph to this machine, using the software and configuration below.

Ubuntu Server 14.04 LTS

Hardware:

Admin Account:

IP address: 12.1.62.152

Network mask: 255.255.255.0

Apache Hadoop 0.20.203.0-rc1

Apache Giraph 1.2.0-SNAPSHOT

Deploying Hadoop

We will deploy a single-node, pseudo-distributed Hadoop cluster. First, install Java 1.6 or later and verify the installation:

sudo apt-get install openjdk-7-jdk
java -version

You should see your Java version information. The full JDK is installed in /usr/lib/jvm/java-7-openjdk-amd64; in that directory you can find Java's bin and lib directories.
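As a quick sanity check (a minimal sketch, assuming the default OpenJDK 7 install path mentioned above), you can list that directory and confirm the bin and lib subdirectories are present:

ls /usr/lib/jvm/java-7-openjdk-amd64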

After doing this, create a dedicated hadoop group and a new user hduser, then add hduser to the hadoop group.

sudo addgroup hadoop
Adding group `hadoop' (GID 1004) ...
Done.

sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1003) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
    Full Name []:
    Room Number []:
    Work Phone []:
    Home Phone []:
    Other []:
Is the information correct? [Y/n] Y

Now download and extract hadoop-0.20.203.0rc1 from the Apache archives (this is the default Hadoop version used by Giraph):

su - xxx
cd /usr/local
sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz
sudo tar xzf hadoop-0.20.203.0rc1.tar.gz
sudo mv hadoop-0.20.203.0 hadoop
sudo chown -R hduser:hadoop hadoop

When the installation is complete, switch to the hduser user and add the following lines to that user's $HOME/.bashrc:

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

These lines set the Hadoop- and Java-related environment variables. When finished, edit the $HADOOP_HOME/conf/hadoop-env.sh file and add the following:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

The second line forces Hadoop to use IPv4, even if IPv6 is configured on the machine. Because Hadoop stores temporary files while computing, create a temporary directory for its local FS and HDFS files with the following commands:

su - xxx
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp

Make sure that the /etc/hosts file has the following two lines (add or modify them if they do not exist):

127.0.0.1        localhost
xxx.xxx.xxx.xxx  hdnode01

Although we could use localhost for all connections in this single-node cluster, using the hostname is generally the better choice (for example, you might later add nodes and turn this single-node, pseudo-distributed cluster into a multi-node, distributed cluster).

Now edit the Hadoop configuration files core-site.xml, mapred-site.xml, and hdfs-site.xml in the $HADOOP_HOME/conf directory. Add the following content between <configuration> and </configuration>.

Edit core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://hdnode01:54310</value>
</property>

Edit mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>hdnode01:54311</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>

By default, Hadoop allows only 2 mappers to run at once, but Giraph's code assumes that 4 mappers can run at the same time. For this single-node, pseudo-distributed deployment, we add the latter two properties to mapred-site.xml to meet this requirement; otherwise, some of Giraph's unit tests will fail.

Edit hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

This tells HDFS to store only one replica of each file, since there is only one DataNode. The default value is 3, and if you do not change it you will receive runtime exceptions.

The next step is to set up passwordless SSH for the hduser user so that you do not have to enter a password each time an SSH connection is opened:

su - hduser
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Then, as hduser, connect to hdnode01 with SSH (you must use hdnode01 here, because that is the hostname used in the Hadoop configuration files). The first time you connect you will be asked to confirm the connection; once confirmed, the host's public RSA key is stored in $HOME/.ssh/known_hosts, and subsequent SSH connections will not prompt you again.
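A minimal sketch of that first connection (assuming the hdnode01 entry added to /etc/hosts earlier):

su - hduser
ssh hdnode01
# answer "yes" at the authenticity prompt; the host key is then saved to $HOME/.ssh/known_hosts
exit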

Now edit $HADOOP_HOME/conf/masters:

hdnode01

Similarly, edit $HADOOP_HOME/conf/slaves with the following content:

hdnode01

These edits set up a single-node, pseudo-distributed Hadoop cluster consisting of one master and one slave on the same physical machine. If you want to deploy a multi-node, distributed Hadoop cluster, add the additional data nodes (for example, hdnode02, hdnode03) to the $HADOOP_HOME/conf/slaves file after completing the steps above on each of those nodes.
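For illustration only (hdnode02 and hdnode03 are hypothetical hostnames that would also need /etc/hosts entries and their own Hadoop installations), a three-node $HADOOP_HOME/conf/slaves file might look like this:

hdnode01
hdnode02
hdnode03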

To initialize HDFS, format it by executing the following command:

$HADOOP_HOME/bin/hadoop namenode -format

Then start the HDFS and Map/Reduce daemons by executing the following commands, in this order:

$HADOOP_HOME/bin/start-dfs.sh
$HADOOP_HOME/bin/start-mapred.sh

To make sure all the Java processes are running, execute the following command:

jps

It should output something like the following:

9079 NameNode
9560 JobTracker
9263 DataNode
9453 SecondaryNameNode
16316 Jps
9745 TaskTracker

To stop the daemons, run the $HADOOP_HOME/bin/stop-*.sh scripts in the reverse of the order in which they were started; this is important so that you do not lose data. You have now completed the deployment of a single-node, pseudo-distributed Hadoop cluster.
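A minimal sketch of the shutdown sequence, mirroring the start commands above in reverse order:

$HADOOP_HOME/bin/stop-mapred.sh
$HADOOP_HOME/bin/stop-dfs.sh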

Running a Map/Reduce job

Now that we have a Hadoop cluster running, we can run a Map/Reduce job. We will use the WordCount example, which reads text files and counts how many times each word occurs. The input is a text file, and the output is also a text file, each line of which contains a word and its number of occurrences, separated by a tab. The example is packaged in $HADOOP_HOME/hadoop-examples-0.20.203.0.jar. Let's get started: download a fairly large UTF-8 text file into the /tmp directory, copy it to HDFS, and then verify that it was copied successfully:

cd /tmp/
wget http://www.gutenberg.org/cache/epub/132/pg132.txt
$HADOOP_HOME/bin/hadoop dfs -copyFromLocal /tmp/pg132.txt /user/hduser/input/pg132.txt
$HADOOP_HOME/bin/hadoop dfs -ls /user/hduser/input

After that, you can run the WordCount example. To start a Map/Reduce job, use the $HADOOP_HOME/bin/hadoop jar command:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-0.20.203.0.jar wordcount /user/hduser/input/pg132.txt /user/hduser/output/wordcount

You can monitor the progress of your tasks using the Web UI:

NameNode daemon: http://hdnode01:50070

JobTracker daemon: http://hdnode01:50030

TaskTracker daemon: http://hdnode01:50060

Once the job is complete, you can inspect the output with the following command:

$HADOOP_HOME/bin/hadoop dfs -cat /user/hduser/output/wordcount/p* | less
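A purely hypothetical excerpt of the tab-separated output (placeholder words and counts, not actual results from pg132.txt):

king	17
sword	4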

Deploying Giraph

We will now deploy Giraph. In order to build Giraph from its repository, you first need to install Git and Maven 3 by running the following commands:

su - hdadmin
sudo apt-get install git
sudo apt-get install maven
mvn -version

Make sure you have installed Maven 3 or higher. Giraph uses the Munge plugin, which requires Maven 3, in order to support multiple versions of Hadoop; the web site plugin also requires Maven 3. You can clone Giraph from GitHub:

cd /usr/local/
sudo git clone https://github.com/apache/giraph.git
sudo chown -R hduser:hadoop giraph
su - hduser

Then edit the hduser user's $HOME/.bashrc file and add the following line:

export GIRAPH_HOME=/usr/local/giraph

Save and close the file, then validate, compile, test, and package Giraph into jar files by running the following commands:

source $HOME/.bashrc
cd $GIRAPH_HOME
mvn package -DskipTests

The -DskipTests parameter skips the test phase. The first run will take some time, because Maven has to download a number of files into your local repository, and you may have to run it several times before it succeeds, since remote servers can time out. Once packaging succeeds, you will have the Giraph core jar at $GIRAPH_HOME/giraph-core/target/giraph-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar and the Giraph examples jar at $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar. You have now finished deploying Giraph.
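As a quick check (a sketch, assuming the build succeeded and produced the paths above), you can list the packaged jars:

ls $GIRAPH_HOME/giraph-core/target/*-jar-with-dependencies.jar
ls $GIRAPH_HOME/giraph-examples/target/*-jar-with-dependencies.jar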

Running a Giraph job

Now that Giraph and Hadoop are deployed, we can run our first Giraph job. We will use the SimpleShortestPathsComputation example, which reads an input file describing a graph in one of the supported formats and computes the shortest path from a source vertex to every other vertex. The source vertex defaults to the vertex with id 1. We will use the JsonLongDoubleFloatDoubleVertexInputFormat input format. First, create an example graph at /tmp/tiny_graph.txt with the following content:

[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]

Save and close the file. Each line has the format [source_id, source_value, [[dest_id, edge_value], ...]]; this graph has 5 vertices and 12 (directed) edges. Copy the input file into HDFS:

$HADOOP_HOME/bin/hadoop dfs -copyFromLocal /tmp/tiny_graph.txt /user/hduser/input/tiny_graph.txt
$HADOOP_HOME/bin/hadoop dfs -ls /user/hduser/input

We will use the IdWithValueTextOutputFormat output format, where each output line contains a vertex id and the length of the shortest path to that vertex (the source vertex itself has length 0). You can run the example with the following command:

$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/hduser/input/tiny_graph.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/hduser/output/shortestpaths -w 1

Note that this job is run with a single worker, specified with the -w parameter. To get more information about running Giraph jobs, you can run the following command:

$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -h

You can monitor the progress of your Giraph job from the JobTracker web UI. Once the job is finished, you can view the results with the following command:

$HADOOP_HOME/bin/hadoop dfs -cat /user/hduser/output/shortestpaths/p* | less
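For this tiny graph, assuming the default source vertex (id 1), the shortest-path lengths should look like the following (vertex id and distance, tab-separated; the line order may differ because the output is split across part files):

0	1.0
1	0.0
2	2.0
3	1.0
4	5.0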
