I. Overview of MapReduce
MapReduce (often abbreviated MR) is a distributed computing framework and one of Hadoop's core components. There are other distributed computing frameworks, such as Storm and Spark; none of them simply replaces the others, and the real question is which one fits a given job best.
MapReduce is an offline (batch) computing framework, Storm is a stream computing framework, and Spark is an in-memory computing framework suited to jobs that need results quickly.
II. The design concept of MapReduce
The core idea of distributed computing in MapReduce: move the computation to the data, rather than moving the data to the computation.
III. Working principle of MapReduce
Excerpt from: http://weixiaolu.iteye.com/blog/1474172
Process Analysis
Map Side
1. Each input split is handled by one map task. By default, one HDFS block (64 MB by default) makes one split, though the block size can be configured. Map output is first written to a circular in-memory buffer (100 MB by default, controlled by the io.sort.mb property). When the buffer reaches a spill threshold (80% of the buffer by default, controlled by the io.sort.spill.percent property), a spill file is created in the local file system and the buffered data is written to it.
2. Before writing to disk, the thread first divides the data into partitions, one per reduce task, so that each partition's data goes to exactly one reduce task. This avoids the awkward situation where some reduce tasks are handed large amounts of data while others receive little or none. Partitioning is essentially hashing the keys. The data within each partition is then sorted, and if a combiner has been set, it is run over the sorted output so that as little data as possible is written to disk.
3. By the time the map task writes its last record, there may be many spill files, and these files need to be merged. The merging process repeatedly sorts and combines, with two goals: first, to minimize the amount of data written to disk each time; second, to minimize the amount of data transferred over the network in the following copy phase. The result is a single partitioned, sorted file. To further reduce network traffic, the map output can be compressed by setting mapred.compress.map.output to true.
4. The data in each partition is copied to the corresponding reduce task. One might ask: how does a partition know which reduce task it corresponds to? In fact, each map task stays in touch with its parent TaskTracker, and each TaskTracker sends heartbeats to the JobTracker, so the global state of the whole cluster is held by the JobTracker. A reduce task simply asks the JobTracker for the locations of the corresponding map outputs.
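The partitioning step in item 2 can be sketched as follows. This is a minimal stand-alone illustration, not Hadoop's actual Partitioner class; the class and method names here are invented for the sketch, but the formula mirrors the one used by Hadoop's default HashPartitioner.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch of how map output is split into partitions.
// Formula as in Hadoop's default HashPartitioner:
//   partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
public class PartitionSketch {
    static int partitionFor(String key, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        // One bucket (list of keys) per reduce task.
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) partitions.add(new ArrayList<>());

        for (String key : new String[] {"apple", "banana", "cherry", "apple"}) {
            partitions.get(partitionFor(key, numReduceTasks)).add(key);
        }
        // Every occurrence of the same key lands in the same partition,
        // so a single reduce task sees all values for that key.
        System.out.println(partitions);
    }
}
```

Because the partition index depends only on the key's hash, both occurrences of "apple" above are guaranteed to end up in the same partition.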
This completes the analysis of the map side. So what exactly is shuffle? "Shuffle" literally means to mix cards. Looked at this way, one map produces data, and the hashing step distributes it across different reduce tasks; that is precisely a process of shuffling the data.
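The buffer and compression knobs mentioned above are ordinary job-configuration properties (Hadoop 1.x names, as used in this article). A sketch of setting them, with the default values stated above, in mapred-site.xml:

```xml
<configuration>
  <!-- Size of the in-memory ring buffer for map output, in MB (default 100) -->
  <property>
    <name>io.sort.mb</name>
    <value>100</value>
  </property>
  <!-- Fraction of the buffer at which a spill to disk starts (default 0.80) -->
  <property>
    <name>io.sort.spill.percent</name>
    <value>0.80</value>
  </property>
  <!-- Compress map output to cut network traffic during the copy phase -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
</configuration>
```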
Reduce Side
1. The reduce side receives data from different map tasks, and the data from each map is sorted. If the amount of data received by the reduce side is fairly small, it is kept directly in memory (the buffer size is controlled by the mapred.job.shuffle.input.buffer.percent property, which gives the fraction of heap space used for this purpose). If the data exceeds a certain fraction of that buffer (set by mapred.job.shuffle.merge.percent), it is merged and then spilled to disk.
2. As spill files accumulate, background threads merge them into larger, sorted files to save time in later merges. In fact, on both the map side and the reduce side, MapReduce performs sort and merge operations over and over again; now we finally understand why some people say that sorting is the soul of Hadoop.
3. Merging produces many intermediate files on disk, but MapReduce keeps the amount of data written to disk as small as possible, and the result of the final merge is not written to disk at all; it is fed directly into the reduce function.
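The reduce-side merge described above is essentially a k-way merge of sorted runs, after which all values for each key are handed to the reduce function together. A minimal stand-alone sketch in plain Java (not the Hadoop API; class and method names are invented for illustration), using a word-count-style reduce that sums occurrences per key:

```java
import java.util.*;

// K-way merge of sorted runs (simulating sorted map outputs arriving at
// one reduce task), followed by a count-per-key "reduce" over each key group.
public class MergeSketch {
    static Map<String, Integer> mergeAndReduce(List<List<String>> sortedRuns) {
        // Min-heap of {runIndex, positionInRun}, ordered by the key at that position.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> sortedRuns.get(e[0]).get(e[1])));
        for (int r = 0; r < sortedRuns.size(); r++) {
            if (!sortedRuns.get(r).isEmpty()) heap.add(new int[] {r, 0});
        }
        // LinkedHashMap keeps keys in the merged (sorted) order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            String key = sortedRuns.get(e[0]).get(e[1]);
            counts.merge(key, 1, Integer::sum);   // the "reduce": sum per key
            if (e[1] + 1 < sortedRuns.get(e[0]).size()) {
                heap.add(new int[] {e[0], e[1] + 1});
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
            Arrays.asList("apple", "cherry"),     // sorted output of map 1
            Arrays.asList("apple", "banana"));    // sorted output of map 2
        System.out.println(mergeAndReduce(runs)); // keys come out in sorted order
    }
}
```

Because every run is already sorted, the heap only ever compares the current head of each run, which is what lets the merged stream flow straight into the reduce step without an extra full sort.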
IV. Installation of MapReduce
There are three servers, node1, node2, and node3: node1 is the NameNode, node2 and node3 are DataNodes, node2 is also the SecondaryNameNode, and MapReduce is installed on the node1 server.
1. Close the firewall on node1
2. On node1, configure conf/mapred-site.xml for the MapReduce master server; the slave servers are the DataNode servers
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>node1:9001</value>
  </property>
</configuration>
3. Copy node1's mapred-site.xml to node2 and node3
scp ./mapred-site.xml root@node2:~/hadoop-1.2.1/conf/
scp ./mapred-site.xml root@node3:~/hadoop-1.2.1/conf/
4. Start node1: enter hadoop-1.2.1/bin and run ./hadoop namenode -format, then ./start-all.sh
Results for node1, node2, and node3 (screenshots omitted)