Hadoop Tips and Tricks

Source: Internet
Author: User

(Items 2-6 cover performance optimization; items 7-9 introduce functions.)

1. JobHistory shows job-related information. After starting Hadoop with start-all.sh, you can open port 8088 in a browser to view job information, but port 19888, which serves the job history, cannot be reached.

You only need to start the JobHistory server with the command: mapred historyserver. To stop it, press Ctrl+C.

2. If the input consists of many small files, each file gets its own mapper and resources are wasted. Merging the small files into one large file beforehand, and using that large file as the input, can save a lot of time. Alternatively, CombineFileInputFormat (an abstract class in the Hadoop class library) can pack multiple files into one input unit to improve performance.
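As a rough illustration of why combining helps, the sketch below packs small files into input units of a bounded size and counts how many mappers would result. It is a plain-Java model that does not use any Hadoop classes. The names SmallFilePacker and packIntoSplits and the greedy packing strategy are assumptions made for this example. The real CombineFileInputFormat also takes node and rack locality into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model for illustration only; not part of Hadoop.
class SmallFilePacker {

    // Greedily groups file sizes (in bytes) into input units of at most maxSplitBytes.
    // A file larger than maxSplitBytes ends up in a unit of its own.
    static List<List<Long>> packIntoSplits(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current unit if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] files = new long[100];
        Arrays.fill(files, mb); // 100 small files of 1 MB each
        System.out.println("mappers, one per file: " + files.length);
        System.out.println("mappers, combined:     " + packIntoSplits(files, 64 * mb).size());
    }
}
```

With one mapper per input unit, the number of units returned is the number of map tasks that have to be scheduled.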

3. dfs.block.size (named dfs.blocksize in newer releases) sets the HDFS block size, that is, the size of the pieces into which a file is split. In general, the block size also determines the number of map tasks, because by default each block roughly corresponds to one input split. dfs.replication sets the number of replicas and cannot be 0. Set to 1, the cluster keeps a single copy of the data with no backup. Set to 2, it keeps one backup, so the cluster holds 2 copies of the data. Both properties are set in the hdfs-site.xml configuration file.
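The arithmetic behind these two settings can be sketched as follows. This is a simplified plain-Java model; the class name BlockMath and its methods are made up for this example. It treats one block as one map task, which ignores details such as the slack FileInputFormat allows on the last split.

```java
// Hypothetical helper for illustration only; not part of Hadoop.
class BlockMath {

    // Number of blocks a file occupies: ceil(fileBytes / blockBytes).
    static long blockCount(long fileBytes, long blockBytes) {
        if (blockBytes <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes stored across the cluster for a given replication factor.
    static long rawStorageBytes(long fileBytes, int replication) {
        if (replication < 1) {
            throw new IllegalArgumentException("replication must be at least 1");
        }
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long file = 1024 * mb; // a 1 GB input file
        System.out.println("blocks at 64 MB:  " + blockCount(file, 64 * mb));
        System.out.println("blocks at 128 MB: " + blockCount(file, 128 * mb));
        System.out.println("raw MB stored with replication 2: " + rawStorageBytes(file, 2) / mb);
    }
}
```

Doubling the block size halves the approximate number of map tasks for the same file, while the replication factor multiplies the disk space used.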

4. The output of the map phase is first stored in an in-memory buffer of a fixed size. When the output exceeds a threshold, the map task writes the results to disk, and the map output is later copied to the nodes running the reduce tasks. If the data volume is large, this intermediate data exchange takes a lot of time. You can compress the map output by setting the mapred.compress.map.output property to true (mapreduce.map.output.compress in newer releases), and you can choose the compression format with the mapred.map.output.compression.codec property (mapreduce.map.output.compress.codec in newer releases).
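To get a feel for how much intermediate data compression can save, the sketch below gzips some sample word-count style map output using only the Java standard library. It is an illustration under assumptions: the class MapOutputCompressionDemo and the sample records are made up for this example. Hadoop itself compresses map output through its own codec classes (for example GzipCodec or SnappyCodec) rather than through this code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo for illustration only; not part of Hadoop.
class MapOutputCompressionDemo {

    // Compresses a byte array with gzip and returns the compressed bytes.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Builds made-up "key<TAB>1" records similar to word-count map output.
    static byte[] sampleMapOutput(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("word").append(i % 50).append("\t1\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(10_000);
        byte[] compressed = gzip(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + compressed.length);
    }
}
```

The saving depends heavily on the data. Repetitive keys like these compress well, and the CPU time spent compressing is the price paid for moving fewer bytes between nodes.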

5. The default value of mapred.tasktracker.map.tasks.maximum is 2, and the default value of mapred.tasktracker.reduce.tasks.maximum is also 2. Both can be raised in the mapred-site.xml file to improve overall performance.

6. mapred.child.java.opts configures the amount of memory available to each map or reduce task. The default is 200 MB. My personal view is that on a machine with 8 GB of memory and an 8-core CPU, setting it to 1 GB is enough. Memory consumption during map and reduce is usually not very large, but if the value is too small, a "No memory can be allocated" error may occur.
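One way to sanity-check a heap setting against the slot counts from item 5 is to divide the memory left over for tasks by the number of task slots. The sketch below does that in plain Java. The class ChildHeapBudget, the 2 GB reserved for the operating system and daemons, and the 4 map plus 2 reduce slots are all assumptions chosen for this example, not recommendations from the original text.

```java
// Hypothetical sizing helper for illustration only; not part of Hadoop.
class ChildHeapBudget {

    // Largest heap per task (in MB) such that all slots together fit into
    // the node memory that remains after the reserved amount is set aside.
    static long maxHeapPerChildMb(long nodeMemoryMb, long reservedMb, int mapSlots, int reduceSlots) {
        int slots = mapSlots + reduceSlots;
        if (slots <= 0) {
            throw new IllegalArgumentException("at least one task slot is required");
        }
        long available = nodeMemoryMb - reservedMb;
        return Math.max(0, available / slots);
    }

    // Formats a heap size as a JVM option string, e.g. for mapred.child.java.opts.
    static String childJavaOpts(long heapMb) {
        return "-Xmx" + heapMb + "m";
    }

    public static void main(String[] args) {
        // Assumed node: 8 GB of memory, 2 GB reserved, 4 map slots, 2 reduce slots.
        long heapMb = maxHeapPerChildMb(8192, 2048, 4, 2);
        System.out.println(childJavaOpts(heapMb));
    }
}
```

Under these assumed numbers the budget works out to 1 GB per task, which matches the figure suggested above for an 8 GB machine.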

7. setup function: called only once, after a task starts. Work that would otherwise be repeated in every call to the map or reduce function can be moved into setup: initializing variables used throughout the map or reduce processing, reading shared values from the job information, or monitoring the start of a task. Note that setup is global to a single task, not to the entire job.

8. cleanup function: the counterpart of the setup function. It is called once, before the task is destroyed.

9. run function: if you want fuller control over the map or reduce phase, you can override this function and add your own control logic, just as you would in a method of an ordinary Java class, for example extra processing after the task starts and before it is destroyed.
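The relationship between setup, map, cleanup and run described in items 7-9 can be sketched with a small stand-alone model. This is not the Hadoop Mapper class: TaskSketch and LineCounter are hypothetical names, and the Hadoop Context argument is replaced by a plain Iterable to keep the example self-contained. The shape of run is modeled loosely on how a Hadoop task drives the other three functions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a map task; not the Hadoop Mapper class.
abstract class TaskSketch<I> {
    protected void setup() { }               // called once, before the first record
    protected abstract void map(I record);   // called once per input record
    protected void cleanup() { }             // called once, after the last record

    // Default driver: override this for fuller control over the whole phase.
    public void run(Iterable<I> records) {
        setup();
        try {
            for (I record : records) {
                map(record);
            }
        } finally {
            cleanup();
        }
    }
}

// Counts non-empty lines and records the order in which the hooks are called.
class LineCounter extends TaskSketch<String> {
    final List<String> log = new ArrayList<>();
    private int nonEmpty;

    @Override
    protected void setup() {
        nonEmpty = 0; // per-task initialization, done once rather than per record
        log.add("setup");
    }

    @Override
    protected void map(String line) {
        if (!line.isEmpty()) {
            nonEmpty++;
        }
        log.add("map");
    }

    @Override
    protected void cleanup() {
        log.add("cleanup:" + nonEmpty);
    }

    public static void main(String[] args) {
        LineCounter task = new LineCounter();
        task.run(Arrays.asList("a", "", "b"));
        System.out.println(task.log);
    }
}
```

Each task instance runs this sequence on its own share of the input, which is why state initialized in setup is global to one task and not to the whole job.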

10. To be continued ...
