Writing a Hadoop MapReduce Program in PHP

Source: Internet
Author: User
Tags: hadoop, mapreduce
Hadoop Streaming

Although Hadoop is written in Java, it provides Hadoop Streaming, an API that lets you write map and reduce functions in any language.
The key to Hadoop Streaming is that it uses standard UNIX streams as the interface between your program and Hadoop. Any program that reads data from standard input and writes data to standard output can therefore serve as the map or reduce function of a MapReduce job, whatever language it is written in.
For example:
bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out4
The jar used by Hadoop Streaming here is hadoop-streaming-0.20.203.0.jar. There is no hadoop-streaming.jar in the Hadoop root directory: streaming is a contrib module, so look for the jar under the contrib directory (the same applies to hadoop-0.20.2, for example). The options are:
-input: Specifies the HDFS path of the input files.
-output: Specifies the HDFS path of the output directory.
-mapper: Specifies the program that implements the map function.
-reducer: Specifies the program that implements the reduce function.
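
To make the layout of these four options concrete, here is a minimal sketch that assembles the command line in PHP. The function name build_streaming_command() is hypothetical and not part of Hadoop; the paths are the ones used in the example command above.

```php
<?php
/**
 * Assemble a Hadoop Streaming command line from its four main options.
 * build_streaming_command() is a hypothetical helper, not part of Hadoop.
 */
function build_streaming_command(
    string $jar,
    string $mapper,
    string $reducer,
    string $input,
    string $output
): string {
    $options = [
        '-mapper'  => $mapper,
        '-reducer' => $reducer,
        '-input'   => $input,
        '-output'  => $output,
    ];

    // Each value is shell-quoted so spaces or wildcards survive intact.
    $command = 'bin/hadoop jar ' . escapeshellarg($jar);
    foreach ($options as $flag => $value) {
        $command .= ' ' . $flag . ' ' . escapeshellarg($value);
    }
    return $command;
}

$command = build_streaming_command(
    'contrib/streaming/hadoop-streaming-0.20.203.0.jar',
    '/usr/local/hadoop/mapper.php',
    '/usr/local/hadoop/reducer.php',
    'test/*',
    'out4'
);
echo $command, PHP_EOL;
```

Quoting each value keeps the local shell from expanding test/* against local files before Hadoop sees it.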

Mapper Function

Write the following code to the mapper.php file:

#!/usr/local/php/bin/php
<?php
$word2count = array();

// input comes from STDIN (standard input)
// you can also open it explicitly: $stdin = fopen("php://stdin", "r");
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace and lowercase
    $line = strtolower(trim($line));
    // split the line into words while removing any empty string
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters
    foreach ($words as $word) {
        // initialize the counter the first time a word is seen
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += 1;
    }
}

// write the results to STDOUT (standard output)
// what we output here will be the input for the
// Reduce step, i.e. the input for reducer.php
foreach ($word2count as $word => $count) {
    // tab-delimited
    echo $word, chr(9), $count, PHP_EOL;
}
?>

This code finds the words in each line of input text, counts them, and outputs one lowercased word and its count per line (the separator is a tab character), in this form:

hello 1
world 1
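
The counting logic can also be tried without any keyboard input by wrapping it in a function and feeding it an array of lines. This is only a sketch: count_words() is a hypothetical name, and the two sample lines are made up for illustration.

```php
<?php
/**
 * Count words across the given lines, the way mapper.php does for STDIN.
 * count_words() is a hypothetical name used only for this sketch.
 */
function count_words(array $lines): array
{
    $counts = [];
    foreach ($lines as $line) {
        // Lowercase, then split on runs of non-word characters.
        $words = preg_split('/\W+/', strtolower(trim($line)), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    return $counts;
}

$counts = count_words(["Hello World", "hello Hadoop"]);
foreach ($counts as $word => $count) {
    // Tab-delimited, one record per line.
    echo $word, "\t", $count, PHP_EOL;
}
```

Because the input is lowercased first, "Hello" and "hello" are counted as the same word.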

This is basically no different from the PHP you have written before, right? Two things may look a little unfamiliar:

PHP as an executable program

The first line

#!/usr/local/php/bin/php

tells Linux to use /usr/local/php/bin/php as the interpreter for the code that follows. Anyone who has written Linux shell scripts will recognize this: the first line of a script usually looks like #!/bin/bash or #!/usr/bin/python.

With this line in place, and once the file has been saved and made executable (for example with chmod +x mapper.php), you can run mapper.php directly, just like the cat and grep commands: ./mapper.php

Use stdin to receive input

PHP supports several ways of receiving input. The most familiar is reading parameters passed over the Web from the $_GET and $_POST superglobals. The second is reading parameters passed on the command line from $_SERVER['argv']. Here a third way is used: standard input (stdin).

It is used as follows:

1) Enter ./mapper.php in the Linux console.

2) mapper.php starts running, and the console waits for keyboard input.

3) Type some text on the keyboard.

4) Press Ctrl+D to end the input. mapper.php then finishes its processing and prints the result.
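
The same read loop works on any open stream, not only the keyboard. As a rough sketch, the helper below (read_lines() is a hypothetical name) reads lines with fgets() until end of input; an in-memory stream stands in for STDIN so the example runs without typing anything.

```php
<?php
/**
 * Read all non-blank lines from an open stream handle, trimmed.
 * read_lines() is a hypothetical helper name for this sketch.
 */
function read_lines($handle): array
{
    $lines = [];
    // fgets() returns false at end of input (what Ctrl+D signals on a terminal).
    while (($line = fgets($handle)) !== false) {
        $line = trim($line);
        if ($line !== '') {
            $lines[] = $line;
        }
    }
    return $lines;
}

// In a real CLI script you would pass STDIN here instead.
$fake = fopen('php://memory', 'w+');
fwrite($fake, "Hello World\n\nHello Hadoop\n");
rewind($fake);

$lines = read_lines($fake);
fclose($fake);
print_r($lines);
```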

So where is stdout? echo and print already write to stdout, which is no different from the Web programs and CLI scripts we used to write.
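
Since everything a streaming program writes to stdout is treated as output data, it is usually safer to send debugging messages to stderr instead. A small sketch, assuming a hypothetical helper named emit_record():

```php
<?php
/**
 * Write one tab-separated record to the given stream.
 * emit_record() is a hypothetical helper name for this sketch.
 */
function emit_record($stream, string $key, int $value): void
{
    fwrite($stream, $key . "\t" . $value . PHP_EOL);
}

// Writing to php://stdout ends up in the same place as echo in a CLI script.
$stdout = fopen('php://stdout', 'w');
emit_record($stdout, 'hello', 1);

// Diagnostics go to stderr so they are not mixed into the data records.
$stderr = fopen('php://stderr', 'w');
fwrite($stderr, 'emitted 1 record' . PHP_EOL);
```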

Reducer Function

Create the reducer.php file and write the following code:

#!/usr/local/php/bin/php
<?php
$word2count = array();

// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace
    $line = trim($line);
    // skip lines that are not in "word<TAB>count" form
    if (strpos($line, chr(9)) === false) {
        continue;
    }
    // parse the input we got from mapper.php
    list($word, $count) = explode(chr(9), $line, 2);
    // convert count (currently a string) to int
    $count = intval($count);
    // sum counts, initializing the counter the first time a word is seen
    if ($count > 0) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += $count;
    }
}

// sort the words lexicographically
//
// this step is NOT required, we just do it so that our
// final output will look more like the official Hadoop
// word count examples
ksort($word2count);

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo $word, chr(9), $count, PHP_EOL;
}
?>

This code sums the number of times each word appears and outputs the totals (again tab-separated), in this form:

hello 2
world 1
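
Hadoop Streaming hands the reducer its input sorted by key, so all lines for one word arrive together. An alternative to the array-based reducer above is to rely on that ordering and keep only the current word and its running total in memory. The following is a sketch of that variant, not the code used in this article; reduce_sorted() is a hypothetical name and the sample input is made up.

```php
<?php
/**
 * Sum counts from "word<TAB>count" lines that are already sorted by word.
 * Yields [word, total] pairs one at a time.
 * reduce_sorted() is a hypothetical name used only for this sketch.
 */
function reduce_sorted(iterable $lines): Generator
{
    $current = null;
    $total = 0;
    foreach ($lines as $line) {
        $parts = explode("\t", trim($line), 2);
        if (count($parts) !== 2) {
            continue; // skip malformed lines
        }
        [$word, $count] = $parts;
        if ($word !== $current) {
            // The key changed, so the previous word is complete.
            if ($current !== null) {
                yield [$current, $total];
            }
            $current = $word;
            $total = 0;
        }
        $total += (int) $count;
    }
    // Flush the last word.
    if ($current !== null) {
        yield [$current, $total];
    }
}

$sorted = ["hadoop\t1", "hello\t1", "hello\t1", "world\t1"];
foreach (reduce_sorted($sorted) as [$word, $total]) {
    echo $word, "\t", $total, PHP_EOL;
}
```

This only gives correct totals when the input really is grouped by word, which is why it would not work if fed unsorted lines.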

Run with Hadoop

Upload the file to HDFS:

bin/hadoop dfs -put test.log test

Run the PHP MapReduce program in streaming mode to process the text:

bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out

Note:

1) The input and output directories are on HDFS.

2) The mapper and reducer are paths on the local machine. Be sure to use absolute paths rather than relative paths; otherwise Hadoop reports that it cannot find the MapReduce program.

3) mapper.php and reducer.php must be copied to the same path on all DataNode servers. PHP must also be installed on all of these servers, at the same installation path.

View results

bin/hadoop dfs -cat /tmp/out/part-00000
