Hadoop Pipes (developing Hadoop programs in C++)

Source: Internet
Author: User

After a morning's effort, I finally got the C++ version of a MapReduce job running in pseudo-distributed mode. The process is described step by step below.

I. Prerequisites
1. Hadoop 1.0.x is installed on Linux (my system is 64-bit CentOS 5.5 with Hadoop 1.0.3; other systems may differ).
2. You understand basic Hadoop concepts.

II. Steps (skip any step your system already satisfies)
1. Edit the three files core-site.xml, hdfs-site.xml, and mapred-site.xml in the $HADOOP_INSTALL/conf directory as follows:

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value><!-- only one copy -->
  </property>
</configuration>

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

2. Configure SSH so that the local machine can be accessed without a password (this is troublesome on CentOS 5) [Difficulty 1]

1) Confirm that the OpenSSH server and client are installed.
The installation steps are not described here because they are not the focus of this article.
2) Check the local sshd configuration file (root permission required):
$ vi /etc/ssh/sshd_config
Find the following lines and remove the leading comment character "#":
RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
3) If you modified the configuration file, restart the sshd service (root permission required):
$ /sbin/service sshd restart
4) Try logging in to the local machine through SSH:
$ ssh localhost
After you press Enter you are prompted for a password, because no key pair has been generated yet.
5) Generate a public/private key pair and authorize the public key:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
6) Test the login again:
$ ssh localhost
Normally the login now succeeds without a password and some login information is displayed. If it fails, continue with the next step.
7) If the login fails, the permissions on authorized_keys are most likely wrong. Restrict them to the owner:
$ chmod 600 ~/.ssh/authorized_keys
8) Test the login with ssh localhost once more (it generally succeeds this time).

3. Format the HDFS file system

$ hadoop namenode -format

4. Start the daemons (DFS and MapReduce; the three configuration files from step 1 must be correct, otherwise errors may occur)

$ start-dfs.sh
$ start-mapred.sh

5. Write the C++ source file (max_temperature.cpp). The source code is as follows:

#include <algorithm>
#include <climits>   // INT_MIN
#include <limits>
#include <stdint.h>
#include <string>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

// Emits (year, air temperature) for every record with a valid reading.
class MaxTemperatureMapper : public HadoopPipes::Mapper {
public:
  MaxTemperatureMapper(HadoopPipes::TaskContext& context) {
  }
  void map(HadoopPipes::MapContext& context) {
    std::string line = context.getInputValue();
    // Fixed-width fields of the weather record.
    std::string year = line.substr(15, 4);
    std::string airTemperature = line.substr(87, 5);
    std::string q = line.substr(92, 1);  // quality code
    // Skip missing readings ("+9999") and readings with a bad quality code.
    if (airTemperature != "+9999" &&
        (q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) {
      context.emit(year, airTemperature);
    }
  }
};

// Emits (year, maximum temperature) for each year.
class MapTemperatureReducer : public HadoopPipes::Reducer {
public:
  MapTemperatureReducer(HadoopPipes::TaskContext& context) {
  }
  void reduce(HadoopPipes::ReduceContext& context) {
    int maxValue = INT_MIN;
    while (context.nextValue()) {
      maxValue = std::max(maxValue, HadoopUtils::toInt(context.getInputValue()));
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper,
                              MapTemperatureReducer>());
}

6. Write the Makefile. The source is as follows (note: this file differs slightly from the one in the reference book "Hadoop: The Definitive Guide", 2nd edition, Chinese version: "-lcrypto" is added and "-m32" is changed to "-m64"). The command line under the target must begin with a tab character.

CC = g++
CPPFLAGS = -m64 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

max_temperature: max_temperature.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes -lcrypto -lhadooputils -lpthread -g -O2 -o $@

7. Install gcc, g++, and the standard C library

1) Add faster software update sources.
First, go to the yum source configuration directory:
$ cd /etc/yum.repos.d

Back up the system's yum source:
$ sudo mv CentOS-Base.repo CentOS-Base.repo.save

Download other, faster yum sources:
$ sudo wget http://centos.ustc.edu.cn/CentOS-Base.repo
$ sudo wget http://mirrors.163.com/.help/CentOS-Base-163.repo
$ sudo wget http://mirrors.sohu.com/help/CentOS-Base-sohu.repo

After changing the yum sources, rebuild the cache so that the change takes effect immediately:
$ sudo yum makecache

2) Install gcc:
$ sudo yum install gcc -y

3) Install g++ (sudo yum install g++ reports that the g++ package cannot be found; the actual package name is gcc-c++):
$ sudo yum install gcc-c++ -y

4) Install the standard C library:
$ sudo yum install glibc-devel -y

8. Install OpenSSL (with openssl-0.9.8l.tar.gz already placed in /usr/local/src). The commands are as follows:

$ cd /usr/local/src
$ sudo tar zxvf openssl-0.9.8l.tar.gz
$ cd openssl-0.9.8l
$ sudo ./config
$ sudo make
$ sudo make install
$ sudo cp libcrypto.a /usr/local/lib
$ sudo cp libssl.a /usr/local/lib

9. Compile the source file, max_temperature.cpp

1) Set the environment variable: export PLATFORM=Linux-amd64-64

2) Run the make command in the directory that contains max_temperature.cpp and the Makefile.

10. Upload files to HDFS

1) Upload the local executable max_temperature to bin/max_temperature on HDFS: hadoop fs -put max_temperature bin/max_temperature

2) Upload the data to HDFS: hadoop fs -put sample.txt sample.txt

The content of sample.txt is as follows (one record per line):

0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999

11. Use the hadoop pipes command to run the job. The command is as follows:

$ hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input sample.txt -output output -program bin/max_temperature

12. View the job's result:

$ hadoop fs -cat output/*

If the result is

1949 111
1950 22

then the preceding 11 steps were all set up correctly. (Time to celebrate!)

 

 

 

 
