Using PHP to write a mapreduce program for Hadoop

Source: Internet
Author: User

Hadoop Streaming

Although Hadoop is written in Java, it provides Hadoop Streaming, an API that lets users write the map and reduce functions in any language.
The key to Hadoop Streaming is that it uses standard UNIX streams as the interface between the program and Hadoop. Therefore, any program that can read data from standard input and write data to standard output can serve as the map or reduce function of a MapReduce program, whatever language it is written in.
Example:

bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out4

The Hadoop Streaming package is hadoop-streaming-0.20.203.0.jar. There is no hadoop-streaming.jar in the Hadoop root directory because streaming is a contrib module, so look for it under the contrib directory (in hadoop-0.20.2, for example, it is also under contrib). The options are:

-input: the path of the input files on HDFS
-output: the path of the output directory on HDFS
-mapper: the program that implements the map function
-reducer: the program that implements the reduce function
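
As a rough sketch, a command of this shape could also be assembled from PHP. The function name build_streaming_command is hypothetical (it is not part of Hadoop or PHP); it uses only the four options described here, and the jar and script paths are the ones from the example command.

```php
<?php
// Hypothetical helper (not part of Hadoop): builds a streaming command
// line from the -mapper, -reducer, -input and -output options.
function build_streaming_command($jar, $mapper, $reducer, $input, $output)
{
    $parts = array(
        'bin/hadoop', 'jar', $jar,
        '-mapper', $mapper,
        '-reducer', $reducer,
        '-input', $input,
        '-output', $output,
    );
    // Quote every argument so that characters such as * are passed on
    // unchanged instead of being expanded by the local shell.
    return implode(' ', array_map('escapeshellarg', $parts));
}

echo build_streaming_command(
    'contrib/streaming/hadoop-streaming-0.20.203.0.jar',
    '/usr/local/hadoop/mapper.php',
    '/usr/local/hadoop/reducer.php',
    'test/*',
    'out4'
), PHP_EOL;
```

The resulting string could then be pasted into a shell or passed to a function such as passthru(), assuming the script is run from the Hadoop installation directory.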

Mapper function

Create the mapper.php file and write the following code:

#!/usr/local/php/bin/php
<?php
$word2count = array();
// Input comes from STDIN (standard input).
// You can also open it explicitly: $stdin = fopen("php://stdin", "r");
while (($line = fgets(STDIN)) !== false) {
    // Remove leading and trailing whitespace and convert to lowercase
    $line = strtolower(trim($line));
    // Split the line into words while removing any empty strings
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // Increase counters (initialize first to avoid undefined-index notices)
    foreach ($words as $word) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += 1;
    }
}
// Write the results to STDOUT (standard output).
// What we output here will be the input for the
// reduce step, i.e. the input for reducer.php
foreach ($word2count as $word => $count) {
    // Tab-delimited
    echo $word, chr(9), $count, PHP_EOL;
}
?>

Roughly speaking, this code finds the words in each line of the input text and prints each word with its count, separated by a tab, in this form:

hello 1
world 1
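
To try the counting logic without Hadoop or keyboard input, the same steps can be wrapped in a function that takes an array of lines. This is only a sketch: the function name count_words_in_lines is made up for the example and does not appear in mapper.php.

```php
<?php
// Same steps as mapper.php, written as a function over an array of
// lines. The function name is hypothetical.
function count_words_in_lines(array $lines)
{
    $word2count = array();
    foreach ($lines as $line) {
        // Lowercase, then split on non-word characters
        $line = strtolower(trim($line));
        $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (!isset($word2count[$word])) {
                $word2count[$word] = 0;
            }
            $word2count[$word] += 1;
        }
    }
    return $word2count;
}

foreach (count_words_in_lines(array("Hello World", "hello")) as $word => $count) {
    echo $word, chr(9), $count, PHP_EOL;
}
// Prints "hello<TAB>2" and then "world<TAB>1".
```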

This is basically no different from the PHP you have written before. Two things may look a little unfamiliar:

PHP as an executable program

The first line,

#!/usr/local/php/bin/php

tells Linux to use /usr/local/php/bin/php as the interpreter for the code that follows. Anyone who has written Linux shell scripts will recognize this notation; it is the first line of every script, for example #!/bin/bash or #!/usr/bin/python.

With this line in place, after saving the file and making it executable (for example with chmod +x mapper.php), you can run mapper.php directly, just like commands such as cat or grep: ./mapper.php

Using stdin to receive input

PHP supports several ways of receiving parameters. The most familiar is reading parameters passed over the web from the $_GET and $_POST superglobals; next is reading command-line arguments from $_SERVER['argv']. Here, standard input (stdin) is used instead.

It works like this:

1) In the Linux console, enter ./mapper.php

2) mapper.php starts, and the console waits for keyboard input from the user

3) The user types text on the keyboard

4) The user presses Ctrl+D to end the input; mapper.php then finishes its processing and prints the results

So where is stdout? Anything written with print (or echo) already goes to stdout, exactly as in the web programs and CLI scripts we have written before.
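
The following sketch shows the same read-from-input, write-to-output pattern with the streams passed in explicitly. The function name emit_line_lengths is hypothetical. In a real streaming script the two arguments would be STDIN and STDOUT; here php://memory streams stand in for them so the example runs without keyboard input. The progress message is written to stderr, which keeps it out of the data written to stdout.

```php
<?php
// Hypothetical helper: reads lines from $in and writes
// "line<TAB>length" records to $out. Returns the number of lines read.
function emit_line_lengths($in, $out)
{
    $n = 0;
    while (($line = fgets($in)) !== false) {
        $line = rtrim($line, "\r\n");
        fwrite($out, $line . chr(9) . strlen($line) . PHP_EOL);
        $n++;
    }
    return $n;
}

// In-memory stand-ins for STDIN and STDOUT.
$in = fopen('php://memory', 'w+');
fwrite($in, "hello world\nhadoop\n");
rewind($in);
$out = fopen('php://memory', 'w+');

$processed = emit_line_lengths($in, $out);

// Diagnostics go to stderr, not into the output data.
file_put_contents('php://stderr', "processed $processed lines" . PHP_EOL);

rewind($out);
echo stream_get_contents($out);
// Prints "hello world<TAB>11" and then "hadoop<TAB>6".
```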

Reducer function

Create the reducer.php file and write the following code:

#!/usr/local/php/bin/php
<?php
$word2count = array();
// Input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // Remove leading and trailing whitespace
    $line = trim($line);
    // Parse the input we got from mapper.php
    list($word, $count) = explode(chr(9), $line);
    // Convert count (currently a string) to int
    $count = intval($count);
    // Sum counts (initialize first to avoid undefined-index notices)
    if ($count > 0) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += $count;
    }
}
// Sort the words lexicographically.
//
// This step is not required; we just do it so that the
// final output looks more like the official Hadoop
// word count examples
ksort($word2count);
// Write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo $word, chr(9), $count, PHP_EOL;
}
?>

This code adds up how many times each word appears and prints the totals in this form:

hello 2
world 1
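
To see how the map and reduce steps fit together, the sketch below imitates a streaming job inside a single PHP process: a map step, a sort by key (standing in for the sorting Hadoop performs between map and reduce), and a reduce step. The function names map_step and reduce_step are made up for this example; the real job runs mapper.php and reducer.php as separate processes.

```php
<?php
// Map step: emits one "word<TAB>1" record per word. This is a
// simplification of mapper.php, which adds up counts before printing.
function map_step(array $lines)
{
    $records = array();
    foreach ($lines as $line) {
        $words = preg_split('/\W/', strtolower(trim($line)), 0, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $records[] = $word . chr(9) . '1';
        }
    }
    return $records;
}

// Reduce step: the same summing logic as reducer.php.
function reduce_step(array $records)
{
    $word2count = array();
    foreach ($records as $record) {
        list($word, $count) = explode(chr(9), trim($record));
        $count = intval($count);
        if ($count > 0) {
            if (!isset($word2count[$word])) {
                $word2count[$word] = 0;
            }
            $word2count[$word] += $count;
        }
    }
    ksort($word2count);
    return $word2count;
}

$records = map_step(array("Hello World", "hello Hadoop"));
sort($records); // stands in for Hadoop's sort between map and reduce
foreach (reduce_step($records) as $word => $count) {
    echo $word, chr(9), $count, PHP_EOL;
}
// Prints hadoop 1, hello 2, world 1 (tab-separated, one per line).
```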

Running with Hadoop

Upload the file to be processed into HDFS:

bin/hadoop dfs -put test.log test

Run the PHP programs to process the text (the PHP MapReduce program is executed through streaming):

bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out

Note:

1) The input and output directories are paths on HDFS.

2) The mapper and reducer are paths on the local machine. Be sure to use absolute paths, not relative paths; otherwise Hadoop reports an error saying it cannot find the MapReduce program.

3) mapper.php and reducer.php must be copied to the same path on all DataNode servers, and all servers must have PHP installed at the same installation path.
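
A small pre-flight check along these lines can catch path problems before a job is submitted. This is only a sketch: the function name check_script_path is hypothetical, it inspects only the machine it runs on, and so it would have to be run on each server separately.

```php
<?php
// Hypothetical pre-flight check: returns a list of problems found for
// one script path on the current machine (empty list means no problems).
function check_script_path($path)
{
    $problems = array();
    if (strlen($path) === 0 || $path[0] !== '/') {
        $problems[] = 'not an absolute path';
    }
    if (!is_file($path)) {
        $problems[] = 'file does not exist';
    } elseif (!is_executable($path)) {
        $problems[] = 'file is not executable';
    }
    return $problems;
}

$scripts = array('/usr/local/hadoop/mapper.php', '/usr/local/hadoop/reducer.php');
foreach ($scripts as $script) {
    $problems = check_script_path($script);
    echo $script, ': ', $problems ? implode(', ', $problems) : 'ok', PHP_EOL;
}
```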

View Results

bin/hadoop dfs -cat /tmp/out/part-00000
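
Each line of the result file has the form word, tab, count, so it is easy to post-process. The sketch below parses such text into a PHP array and sorts it by count; the function name parse_counts is made up, and a sample string stands in for the real file contents.

```php
<?php
// Hypothetical helper: turns "word<TAB>count" lines into an array
// mapping each word to its integer count. Malformed lines are skipped.
function parse_counts($text)
{
    $counts = array();
    foreach (explode("\n", $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        $fields = explode(chr(9), $line);
        if (count($fields) !== 2) {
            continue;
        }
        $counts[$fields[0]] = intval($fields[1]);
    }
    return $counts;
}

// Sample text in place of the real job output.
$sample = "hadoop\t1\nhello\t2\nworld\t1\n";
$counts = parse_counts($sample);
arsort($counts); // highest count first
print_r($counts);
```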
