Hadoop: a distributed computing and processing scheme for massive files


Hadoop is a Java implementation of Google's MapReduce. MapReduce is a simplified distributed programming model that lets programs be automatically distributed and executed on a large cluster of ordinary machines. Just as Java programmers need not worry about memory management, MapReduce's runtime system takes care of distributing the input data, scheduling execution across the cluster, handling machine failures, and managing communication between machines. This model lets programmers use the resources of a large distributed system without any prior experience in concurrent or distributed programming.

I. Introduction

As a Hadoop programmer, all you have to do is:
1. Define a Mapper that processes the input key-value pairs and outputs intermediate results.
2. Define a Reducer (optional) that reduces the intermediate results and outputs the final results.
3. Define an InputFormat and OutputFormat (optional). The InputFormat converts each line of the input file into a Java class for the Mapper function to use; if not defined, the default is String.
4. Define a main function, define a job inside it, and run it.

    Then the rest is handed over to the system.
    1. Basic concepts: Hadoop's HDFS implements Google's GFS file system. The NameNode, which schedules the file system, runs on the master, and a DataNode runs on each machine. Hadoop also implements Google's MapReduce: the JobTracker, the overall scheduler of MapReduce, runs on the master, and a TaskTracker runs on each machine to execute tasks.

    2. The main() function creates a JobConf, defines the Mapper, Reducer, Input/OutputFormat and the input and output directories, and finally submits the job to the JobTracker and waits for it to finish.

    3. The JobTracker creates an InputFormat instance, calls its getSplits() method, splits the files in the input directory into FileSplits as the input of the Mapper tasks, and adds the Mapper tasks to the queue.

    4. Each TaskTracker asks the JobTracker for its next map/reduce task.

     A Mapper task first creates a RecordReader from the InputFormat, loops over the contents of its FileSplit to generate keys and values, passes them to the mapper function, and when finished writes the intermediate results as a SequenceFile.
     A Reducer task fetches the intermediate results it needs from the Jetty server of the TaskTrackers that ran the Mappers (33%), performs the sort/merge (66%), executes the reducer function, and finally writes the results to the output directory according to the OutputFormat.

The TaskTracker reports its status to the JobTracker every 10 seconds, and 10 seconds after completing a task it asks the JobTracker for the next one.

All data processing for the Nutch project is built on Hadoop; see Scalable Computing with Hadoop for details.




II. Writing the code

Let's write a simple distributed grep that simply matches the input file line by line and writes a line to the output file if it matches. Since this just outputs everything that matches, we only write the mapper function, do not write a Reducer function, and do not define an Input/OutputFormat.

package demo.hadoop;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class HadoopGrep {

  public static class RegMapper extends MapReduceBase implements Mapper {

    private Pattern pattern;

    // read the regular expression from the job configuration
    public void configure(JobConf job) {
      pattern = Pattern.compile(job.get("mapred.mapper.regex"));
    }

    // emit the whole line whenever the regular expression matches
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
      String text = ((Text) value).toString();
      Matcher matcher = pattern.matcher(text);
      if (matcher.find()) {
        output.collect(key, value);
      }
    }
  }

  private HadoopGrep() {
  } // singleton

  public static void main(String[] args) throws Exception {

    JobConf grepJob = new JobConf(HadoopGrep.class);
    grepJob.setJobName("grep-search");
    grepJob.set("mapred.mapper.regex", args[2]);

    grepJob.setInputPath(new Path(args[0]));
    grepJob.setOutputPath(new Path(args[1]));
    grepJob.setMapperClass(RegMapper.class);
    grepJob.setReducerClass(IdentityReducer.class);

    JobClient.runJob(grepJob);
  }
}


The configure() function of the RegMapper class receives the search string passed in from the main function; the map() function performs the regular-expression match, with the key being the line number and the value being the content of the line, and puts matching lines into the intermediate results.
The main() function reads the input and output directories and the match string from the command-line arguments, sets the Mapper to the RegMapper class, and, since the reduce step has nothing to do, uses the IdentityReducer class to copy the intermediate results straight to the final output, then runs the job.


The whole program is very simple, with none of the details of distributed programming.




III. Running the Hadoop program

Hadoop's documentation in this area is not comprehensive; the most complete references are the Getting Started with Hadoop guide and the Nutch Hadoop Tutorial. After hitting quite a few snags I finally got it running; the steps are recorded below:

3.1 Local Run mode

There is no distributed computing at all, and no NameNode or DataNode is involved; this mode is suitable for debugging the code at first.
Unzip Hadoop. The conf directory is the configuration directory, and Hadoop's configuration file is hadoop-default.xml. If you want to change the configuration, do not modify that file directly; instead, modify hadoop-site.xml and re-assign the property there.
The default configuration in hadoop-default.xml is already local mode and needs no modification; the only thing in the configuration directory that must be changed is the JAVA_HOME setting in hadoop-env.sh.


Put the compiled HadoopGrep and RegMapper.class into the hadoop/build/classes/demo/hadoop/ directory, find a reasonably large log file and put it into a directory, and then run

hadoop/bin/hadoop demo.hadoop.HadoopGrep <log file directory> <any output directory> <string to grep>

Check the results in the output directory, and check the run log in hadoop/logs/.
Delete the output directory before running again.
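
For example, a concrete run might look like this (the log directory, output directory and search string below are only illustrative; in local mode both directories are ordinary local paths):

hadoop/bin/hadoop demo.hadoop.HadoopGrep /home/username/logs /home/username/grep-out ERROR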

3.2 Single-machine cluster mode

Now let's set up a single-machine cluster. Assume the setup in 3.1 is complete and this machine is named hadoopserver.
Step 1. Modify hadoop-site.xml and add the following:

<property>
  <name>fs.default.name</name>
  <value>hadoopserver:9000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>hadoopserver:9001</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>


From now on, Hadoop runs on its HDFS file system instead of the local file system, and MapReduce's JobTracker changes from a local process into a distributed task system; 9000 and 9001 are simply two free port numbers chosen at random.

Also, if your /tmp directory is not large enough, you may want to modify the hadoop.tmp.dir property.
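
For example, an override in hadoop-site.xml might look like this (the directory path here is only an example):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/username/hadoop-tmp</value>
</property>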


Step 2. Set up passwordless SSH login.

Because Hadoop needs to dispatch tasks over SSH without a password prompt, while not under su, run ssh-keygen -t rsa in your own home directory and press Enter at every prompt to generate the keys, then go into the .ssh directory and run cp id_rsa.pub authorized_keys.
For details see man ssh. At this point, running ssh hadoopserver should log you in without asking for any password.
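
Put together, the commands above look roughly like this (assuming the default key file names):

cd ~
ssh-keygen -t rsa                # press Enter at every prompt
cd ~/.ssh
cp id_rsa.pub authorized_keys
ssh hadoopserver                 # should now log in without a password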

Step 3. Format the NameNode by executing
bin/hadoop namenode -format

Step 4. Start Hadoop.
Execute hadoop/bin/start-all.sh, which starts the NameNode, DataNode, JobTracker and TaskTracker on the local machine.

Step 5. Now put the log file to be searched into HDFS.
Running hadoop/bin/hadoop dfs shows the file operation commands it supports.
Execute hadoop/bin/hadoop dfs -put <log file> in, which puts the log file into the directory /user/<user-name>/in in HDFS.

  Step 6. Now perform the grep operation:
      hadoop/bin/hadoop demo.hadoop.HadoopGrep in out <string to grep>
      Check the run log in hadoop/logs/. Before running again, run hadoop/bin/hadoop dfs -rmr out to delete the out directory.

  Step 7. Run hadoop/bin/stop-all.sh to shut everything down.
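
  Putting steps 5 to 7 together, a complete session on the single-machine cluster might look like this (the log file name, search string and part-00000 output file name are only examples; the actual output file name may differ):

      hadoop/bin/hadoop dfs -put /home/username/access.log in
      hadoop/bin/hadoop demo.hadoop.HadoopGrep in out ERROR
      hadoop/bin/hadoop dfs -cat out/part-00000    # inspect the matched lines
      hadoop/bin/hadoop dfs -rmr out               # clean up before the next run
      hadoop/bin/stop-all.sh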

  3.3 Cluster run mode
  Assume the second machine is named hadoopserver2.
  1. Create the same execution user as on hadoopserver, and extract Hadoop into the same directory.

  2. Modify JAVA_HOME in hadoop-env.sh in the same way, and use the same hadoop-site.xml as in 3.2.

  3. Copy /home/username/.ssh/authorized_keys from hadoopserver to hadoopserver2, so that hadoopserver can log in to hadoopserver2 without a password:
     scp /home/username/.ssh/authorized_keys username@hadoopserver2:/home/username/.ssh/authorized_keys
  4. Modify the hadoop/conf/slaves file on hadoopserver to add the cluster nodes, changing localhost to:
    hadoopserver
    hadoopserver2

  5. Execute hadoop/bin/start-all.sh on hadoopserver.
     This starts the NameNode, DataNode, JobTracker and TaskTracker on hadoopserver,
     and starts the DataNode and TaskTracker on hadoopserver2.
 
  6. Now perform the grep operation:
     hadoop/bin/hadoop demo.hadoop.HadoopGrep in out <string to grep>
     Before running again, run hadoop/bin/hadoop dfs -rmr out to delete the out directory.

  7. Run hadoop/bin/stop-all.sh to shut everything down.
   

IV. Efficiency

Testing shows that Hadoop is not a panacea: its benefit depends on the size and number of files, the complexity of the processing, the number of machines in the cluster, and the bandwidth between them. When none of these is large, Hadoop's advantage is not obvious.
For example, a simple grep written in plain Java without Hadoop takes only 4 seconds to process a 100 MB log file, while Hadoop's local mode takes 14 seconds, a Hadoop single-machine cluster takes 30 seconds, and a two-machine cluster connected by a 10M network port is even slower, embarrassingly slow.
