MapReduce Working Principles
1. MapReduce job running process
Process Analysis:
1. Start a job on the client.
2. Request a job ID from the JobTracker.
3. Copy the resource files required to run the job to HDFS, including the jar file packaged from the MapReduce program, the configuration files, and the input split information computed by the client.
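These client-side steps correspond to the standard Hadoop job-submission API. Below is a minimal sketch of my own using the newer org.apache.hadoop.mapreduce API (the JobTracker mentioned above belongs to MRv1; on YARN the same request goes to the ResourceManager). The identity Mapper and Reducer base classes are used only to keep the sketch self-contained; this is an illustration, not code from the original article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "submit-demo");
        job.setJarByClass(SubmitJob.class);          // names the jar that will be copied to HDFS (step 3)
        job.setMapperClass(Mapper.class);            // identity mapper, for illustration only
        job.setReducerClass(Reducer.class);          // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits are computed from this path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submission asks the cluster for a job ID (step 2) and uploads the jar,
        // configuration, and split information (step 3) before the job runs.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}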
The input data is as follows, separated by \t (the tab separators were lost in extraction, so only fragments of the sample rows are recoverable, e.g. "0-3 years old parenting encyclopedia", "5 V liquid level sensor", "6 months milk powder", "0.03 tons of magnesium furnace"). Here, the left side is the search term and the right side is the category.
In the previous article, I briefly talked about HDFS. In simple terms, HDFS has a big brother called the "namenode" and a group of younger brothers called "datanodes", and together they store a pile of data: the big brother keeps the directory of where data is stored, while the younger brothers do the real storing. The big brother and each younger brother are actually individual computers, and they are interconnected.
MapReduce and Spark are the two cores of the data-processing layer, a link that anyone learning big data must focus on; based on my own experience, I will share this knowledge with everyone. First, look at MapReduce and its two most essential processes…
This section mainly analyzes the principles and processes of MapReduce.
You must know at least the following points about MapReduce:
1.
// (The snippet begins mid-program; conf and the hdfs FileSystem are defined in the
// truncated part above, presumably Configuration conf = new Configuration();
// FileSystem hdfs = FileSystem.get(conf);)
FileSystem local = FileSystem.getLocal(conf);
// Set input directory and output file
Path inputDir = new Path(args[0]);
Path hdfsFile = new Path(args[1]);
try {
    // Get a list of local files
    FileStatus[] inputFiles = local.listStatus(inputDir);
    // Generate HDFS output stream
    FSDataOutputStream out = hdfs.create(hdfsFile);
    for (int i = 0; i < inputFiles.length; i++) {
        System.out.println(inputFiles[i].getPath().getName());
        // Open local input stream
        FSDataInputStream in = local.open(inputFiles[i].getPath());
        byte[] buffer = new byte[256];
        int bytesRead = 0;
        while ((bytesRead = in.read(buffer)) > 0) {
            out.write(buffer, 0, bytesRead);   // copy each local file into the single HDFS file
        }
        in.close();
    }
    out.close();
} catch (IOException e) {
    e.printStackTrace();
}
MapReduce implements a simple word-counting function.
One, get ready: install the Hadoop plugin for Eclipse: download the matching hadoop-eclipse-plugin-2.2.0.jar into eclipse/plugins.
Two, realize: create a new MapReduce project. Map is used for word segmentation, Reduce for counting (a fuller sketch follows below).

package tank.demo;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.h…
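For reference, here is a minimal sketch of what the truncated class above typically looks like. The class and method names (WordCount, TokenizerMapper, IntSumReducer) are mine, not from the original snippet:

package tank.demo;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: split each line into words and emit (word, 1)
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the 1s emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}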
Hadoop MapReduce sorting principle. Hadoop Case 3: a simple problem, sorting data (entry level). "Data sorting" is often the first step of many real tasks, such as student performance appraisal and data indexing. Like the data-deduplication example, it performs initial processing on raw data and lays a foundation for further operations. On to the example. 1. Requirement description: sort the data in…
The command to run a MapReduce jar package is hadoop jar **.jar.
The command to run a jar package with an ordinary main function is java -classpath **.jar.
Because I had never understood the difference between the two commands, I stubbornly used java -classpath **.jar to start MapReduce, until errors appeared today.
java -classpath **.jar makes the jar package run locally, then…
MongoDB was born for big-data environments, to store volumes of data that overwhelm relational databases. With so much data, statistical operations become very important, so how do we compute statistics over data in MongoDB?
MongoDB provides three ways of aggregating data:
(1) simple aggregation functions;
(2) using aggregate for statistics;
(3) using MapReduce for statistics;
Today we first talk about how…
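As a taste of option (3), here is a minimal sketch of mine using the MongoDB Java driver, where the map and reduce functions are JavaScript strings evaluated by the server. It assumes a hypothetical shop.orders collection with category and amount fields, and a driver version in which MongoCollection#mapReduce(String, String) is still available (recent drivers deprecate it in favor of the aggregation pipeline):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MongoMapReduceDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Hypothetical database and collection names
            MongoCollection<Document> orders =
                    client.getDatabase("shop").getCollection("orders");
            // map emits (category, amount); reduce sums the amounts per category
            String map = "function() { emit(this.category, this.amount); }";
            String reduce = "function(key, values) { return Array.sum(values); }";
            for (Document result : orders.mapReduce(map, reduce)) {
                System.out.println(result.toJson());
            }
        }
    }
}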
Abstract: a MapReduce program processes a patent data set.
Keywords: MapReduce program, patent data set
Data Source: Patent reference Data set Cite75_99.txt. (the dataset can be downloaded from the URL http://www.nber.org/patents/)
Problem Description:
Read the patent reference dataset and invert it: for each patent, find the patents that cite it and merge them. The top-5 output rows look like this:
1 3964859
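A minimal sketch of the inversion, assuming each input line is a "citing,cited" pair as in Cite75_99.txt (the class and method names are mine, not the original author's code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertCitations {
    // Map: turn "citing,cited" into (cited, citing)
    public static class InvertMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text cited = new Text();
        private final Text citing = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            if (parts.length == 2) {
                citing.set(parts[0]);
                cited.set(parts[1]);
                context.write(cited, citing);   // reversed: key is the cited patent
            }
        }
    }

    // Reduce: merge all patents that cite the same patent into one comma-separated list
    public static class MergeReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder citers = new StringBuilder();
            for (Text v : values) {
                if (citers.length() > 0) citers.append(',');
                citers.append(v);
            }
            context.write(key, new Text(citers.toString()));
        }
    }
}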
Sorting can be divided into four kinds:
general sort
partial sort
global sort
secondary sort (for example, with two columns of data, when the first column is equal you then sort by the second column)
General sort: this is MapReduce's own built-in sorting. The Text object is not well suited as a sort key, while IntWritable, LongWritable, and other types that implement WritableComparable can be sorted.
Partial sort: the order of keys…
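For the secondary sort mentioned above, one common approach is to pack both columns into a composite key whose compareTo orders by the first column and breaks ties on the second. A minimal sketch of such a key (my own illustration; a complete secondary sort also needs a custom partitioner and grouping comparator so that records sharing a first column reach the same reduce call):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class IntPair implements WritableComparable<IntPair> {
    private int first;   // primary sort column
    private int second;  // tie-breaking column

    public void set(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    @Override
    public int compareTo(IntPair other) {
        int cmp = Integer.compare(first, other.first);                  // sort by the first column
        return cmp != 0 ? cmp : Integer.compare(second, other.second); // then by the second
    }

    @Override
    public int hashCode() {
        return 31 * first + second;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof IntPair
                && ((IntPair) o).first == first
                && ((IntPair) o).second == second;
    }
}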
(Two sample 100-byte records followed here; their keys contain unprintable characters and were garbled in extraction.)
Description: each line is one record consisting of two parts: a 10-character key followed by an 80-character value.
Sort task: order the records by key.
So where does 1 TB of data come from? The answer: it is generated by a program, using a MapReduce…
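The original generator is not shown; as a stand-in, here is a minimal local sketch of mine that produces records in the described format, a random 10-character key followed by an 80-character value (in Hadoop itself this would normally be a map-only job, in the spirit of teragen):

import java.io.FileWriter;
import java.io.IOException;
import java.util.Random;

public class GenRecords {
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    public static void main(String[] args) throws IOException {
        long count = Long.parseLong(args[0]);      // number of records to generate
        Random random = new Random();
        try (FileWriter out = new FileWriter(args[1])) {
            for (long i = 0; i < count; i++) {
                StringBuilder record = new StringBuilder(91);
                for (int k = 0; k < 10; k++) {     // 10-character key
                    record.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
                }
                for (int v = 0; v < 80; v++) {     // 80-character value
                    record.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
                }
                record.append('\n');
                out.write(record.toString());
            }
        }
    }
}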
MapReduce is a pattern borrowed from functional programming languages, and in some scenarios it can greatly simplify code. First, look at what MapReduce is:
MapReduce is a software architecture proposed by Google for parallel computation over large-scale datasets (larger than 1 TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming…
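The borrowed idea is easy to see in any functional-style API. A tiny illustration of "map then reduce" using plain Java streams (my own example, independent of Hadoop):

import java.util.Arrays;
import java.util.List;

public class FunctionalAnalogy {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("map", "and", "reduce");
        // "map": transform each element (word -> its length);
        // "reduce": fold the mapped values into a single result
        int totalLength = words.stream()
                               .mapToInt(String::length)
                               .reduce(0, Integer::sum);
        System.out.println(totalLength);  // prints 12
    }
}

Hadoop applies the same two-phase idea at cluster scale, with a shuffle between the phases to group the mapped values by key.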
Configure a Hadoop MapReduce development environment with Eclipse on Windows. 1. System environment and required files
Windows 8.1 64bit
Eclipse (Version: Luna Release 4.4.0)
hadoop-eclipse-plugin-2.7.0.jar
hadoop.dll, winutils.exe
2. Modify the hdfs-site.xml of the master node, adding the following content:

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

This is designed to remove permission checks, because I configure…
So that any executable program supporting standard I/O (stdin, stdout) can become a Hadoop mapper or reducer. For example:
hadoop jar hadoop-streaming.jar -input SOME_INPUT_DIR_OR_FILE -output SOME_OUTPUT_DIR -mapper /bin/cat -reducer /usr/bin/wc
In this example, the cat and wc tools provided by Unix/Linux are used as the mapper and reducer. Isn't that amazing?
If you are used to some dynamic languages, write MapReduce in them; it is no different…
How to use ArrayWritable in MapReduce
When writing a MapReduce program, the data transmitted between Map and Reduce needed to be an array (ArrayList) type. During debugging and running, the following error occurred:
java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.io.ArrayWritable.
After querying the API documentation on the official website, you can find the following…
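The usual fix, shown in the ArrayWritable Javadoc, is that Hadoop deserializes writables through a no-argument constructor, and plain ArrayWritable has no way to learn its element type that way; subclassing it bakes the element class in. A minimal sketch:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

// Hadoop creates writables reflectively via the no-arg constructor,
// so the subclass must supply the element class itself.
public class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class);
    }

    public TextArrayWritable(Text[] values) {
        super(Text.class, values);
    }
}

Using TextArrayWritable in place of ArrayWritable as the map output value class should make the NoSuchMethodException above go away.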
Analyzing the MapReduce execution process. When MapReduce runs, a Mapper task reads the data files in HDFS, calls its own map method to process the data, and writes its output. A Reducer task receives the data output by the Mapper tasks as its input, calls its own reduce method, and finally writes the result to a file in HDFS. The execution process of a Mapper task: each Mapper task is a Java process…
Recently, while learning HBase, I looked at how to use MapReduce to operate HBase. A few points are worth noting; for details, refer to the official documentation: http://hbase.apache.org/book.html. From my own study, operating HBase with MapReduce can be seen as: the map step is responsible for reading, and reduce is responsible for the write step…
The previous article described HDFS, one of the core components of Hadoop and the foundation of the Hadoop distributed platform. This one covers MapReduce, an algorithm model designed to make the best use of HDFS's distribution to improve operational efficiency. Its two main stages, Map (mapping) and Reduce (reduction), both take key-value pairs as inputs and outputs; all we need to do is apply whatever processing we want to each <key, value> pair. It seems simple but is troublesome, because it is so flexible. First, OK, let's take a look at the two graphs below…