PHP script call for Hadoop

Last Update:2018-12-04 Source: Internet

Author: User

Tags hadoop mapreduce hadoop fs

In principle, Hadoop can run jobs written in almost any language.
Link: http://rdc.taobao.com/team/top/tag/hadoop-php-stdin/
Use PHP to write Hadoop MapReduce programs

Posted by Yan jianxiang in September 2011

Hadoop itself is written in Java, so writing MapReduce jobs for Hadoop naturally brings Java to mind.

However, Hadoop ships a contrib module called Hadoop Streaming. It is a small tool that lets any executable that reads standard input (stdin) and writes standard output (stdout) act as a Hadoop mapper or reducer.

Example: hadoop jar hadoop-streaming.jar -input some_input_dir_or_file -output some_output_dir -mapper /bin/cat -reducer /usr/bin/wc

In this example, the cat and wc tools that come with Unix/Linux are used as the mapper and the reducer. Isn't that neat?
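To get a feel for what the example above does, the same flow can be approximated on one machine with an ordinary pipe. This is only a rough sketch: run_local is a hypothetical helper, not part of Hadoop, and it ignores how a real cluster splits input and partitions keys across several reducers.

```shell
#!/bin/sh
# run_local (hypothetical helper): approximate a streaming job locally as
# mapper | sort | reducer. The sort stands in for Hadoop's shuffle phase.
run_local() {
    # $1 = mapper command, $2 = reducer command; input arrives on stdin
    "$1" | sort | "$2"
}

# cat passes every line through unchanged; wc then reports the line, word,
# and byte counts of everything it received.
printf 'hello world\nhello hadoop\n' | run_local cat wc
```

For this two-line input, wc reports 2 lines, 4 words, and 25 bytes; the column spacing varies by platform.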

If you are used to a dynamic language, you can write MapReduce jobs in it, and it is no different from the programming you already do; Hadoop is just the framework that runs your code. Below I demonstrate how to implement a word counter MapReduce job in PHP.

Find the streaming jar

There is no hadoop-streaming.jar in the Hadoop root directory. Because streaming is a contrib module, look for it under contrib. Taking hadoop-0.20.2 as an example, it is here:

$HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar

Write mapper

Create a new wc_mapper.php file and write the following code:

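#!/usr/bin/php
<?php
// Read the input from standard input.
$in = fopen("php://stdin", "r");
$results = array();
while (($line = fgets($in, 4096)) !== false) {
    // Split the line on non-word characters and drop empty pieces.
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $results[] = $word;
    }
}
fclose($in);
// Emit one "word<TAB>1" pair per word.
foreach ($results as $key => $value) {
    print "$value\t1\n";
}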

In short, this code finds the words in each line of input text and outputs them in the following form (the separator is a tab character):
Hello 1
World 1

It is basically no different from the PHP you have written before, right? Two things may look a little unusual:

PHP as an executable program

The first line, #!/usr/bin/php, tells Linux to use /usr/bin/php as the interpreter for the code that follows. Anyone who has written Linux shell scripts will recognize this: the first line of a script usually looks like #!/bin/bash or #!/usr/bin/python.

With this line in place, once the file is saved and marked executable (chmod +x wc_mapper.php), you can run wc_mapper.php directly, just like cat or grep, with the command ./wc_mapper.php

Use stdin to receive input

PHP supports several ways of receiving input. The most familiar is reading parameters passed over the web from the $_GET and $_POST superglobal variables; the second is reading command-line arguments from $_SERVER['argv']. Here, standard input (stdin) is used instead.

It works as follows:

On the Linux console, enter ./wc_mapper.php

wc_mapper.php starts and waits for keyboard input on the console

Type some text on the keyboard

Press Ctrl + D to end the input. wc_mapper.php then runs its actual logic and prints the result.

So where is stdout? print already writes to stdout, which is no different from the web programs and CLI scripts we wrote before.
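Typing at the keyboard is only one way to supply standard input. The sketch below shows two others, using a hypothetical stand-in function named upper in place of ./wc_mapper.php so that it runs without PHP installed. Hadoop streaming hands data to the mapper in much the same way as the pipe case.

```shell
#!/bin/sh
# upper (hypothetical stand-in): any program that reads stdin and writes
# stdout, such as ./wc_mapper.php, could take its place.
upper() {
    tr 'a-z' 'A-Z'
}

# 1. Pipe: the stdout of the previous command becomes this command's stdin.
printf 'hello\n' | upper

# 2. Here-document: inline text becomes stdin, with no Ctrl + D needed.
upper <<'EOF'
world
EOF
```

In both cases the program sees end of input when the data runs out, which is what Ctrl + D signals when typing by hand.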

Write reducer

Create a new wc_reducer.php file and write the following code:

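#!/usr/bin/php
<?php
// Read the sorted mapper output from standard input.
$in = fopen("php://stdin", "r");
$results = array();
while (($line = fgets($in, 4096)) !== false) {
    // Each input line has the form "word<TAB>count".
    list($key, $value) = preg_split("/\t/", trim($line), 2);
    // Initialize the counter the first time a word is seen.
    if (!isset($results[$key])) {
        $results[$key] = 0;
    }
    $results[$key] += $value;
}
fclose($in);
ksort($results);
// Emit one "word<TAB>total" pair per distinct word.
foreach ($results as $key => $value) {
    print "$key\t$value\n";
}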

This code counts how many times each word appears and outputs the totals in the following form (the separator is a tab character):
Hello 2
World 1
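Before submitting anything to a cluster, the map, sort, reduce flow can be tried locally with an ordinary pipe. The sketch below makes several assumptions: wc_map and wc_reduce are hypothetical shell functions that only mimic the output format of wc_mapper.php and wc_reducer.php using tr and awk, and sort plays the role of Hadoop's shuffle phase on a single machine.

```shell
#!/bin/sh
# wc_map (hypothetical stand-in for wc_mapper.php): put each word on its own
# line, skip empty lines, and append a tab and the count 1.
wc_map() {
    tr -cs 'A-Za-z0-9_' '\n' | awk 'NF { print $0 "\t1" }'
}

# wc_reduce (hypothetical stand-in for wc_reducer.php): sum the counts per
# word, then sort the totals by word.
wc_reduce() {
    awk -F '\t' '{ count[$1] += $2 } END { for (k in count) print k "\t" count[k] }' |
        LC_ALL=C sort
}

# mapper | sort | reducer, with sort standing in for the shuffle phase
printf 'Hello World\nHello\n' | wc_map | LC_ALL=C sort | wc_reduce
```

With the real scripts made executable, the equivalent local check would be cat sample.txt | ./wc_mapper.php | sort | ./wc_reducer.php, where sample.txt is any local text file.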

Run with Hadoop

Upload the sample text to be counted

hadoop fs -put *.txt /tmp/input

Run the PHP MapReduce program in streaming mode

hadoop jar hadoop-0.20.2-streaming.jar -input /tmp/input -output /tmp/output -mapper <absolute path to wc_mapper.php> -reducer <absolute path to wc_reducer.php>

Note:

The input and output directories are on HDFS.

The mapper and reducer are paths on the local machine. Be sure to write absolute paths rather than relative paths; otherwise Hadoop reports an error saying that the MapReduce program cannot be found.
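One way to avoid the relative-path mistake is to resolve and check both scripts before building the command. In this sketch, abs_path and check_script are hypothetical helper functions, not Hadoop features; the script names, jar name, and HDFS directories are the ones used in this article.

```shell
#!/bin/sh
# abs_path (hypothetical helper): print the absolute path of a file whose
# directory exists.
abs_path() {
    dir=$(cd "$(dirname "$1")" && pwd) || return 1
    printf '%s/%s\n' "$dir" "$(basename "$1")"
}

# check_script (hypothetical helper): succeed only if the file exists and
# is executable.
check_script() {
    [ -f "$1" ] && [ -x "$1" ]
}

mapper=$(abs_path wc_mapper.php)
reducer=$(abs_path wc_reducer.php)

if check_script "$mapper" && check_script "$reducer"; then
    # Print the command instead of running it, so it can be reviewed first.
    echo "hadoop jar hadoop-0.20.2-streaming.jar -input /tmp/input -output /tmp/output -mapper $mapper -reducer $reducer"
else
    echo "wc_mapper.php or wc_reducer.php is missing or not executable (try chmod +x)" >&2
fi
```

Run from the directory that holds the two scripts, this prints a command line with both paths already expanded.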

View results

hadoop fs -cat /tmp/output/part-00000

This article is an English version of an article originally published in Chinese on aliyun.com.