Hadoop in Practice: Hadoop Job Optimization, Parameter Tuning and Principles (Intermediate)

Source: Internet
Author: User
Part 1: core-site.xml

core-site.xml is Hadoop's core property file. Its parameters control core Hadoop functions and are independent of HDFS and MapReduce.

Parameter list

• fs.default.name
  • Default value: file:///
  • Description: sets the hostname and port of the Hadoop NameNode. The default value means standalone mode. For a pseudo-distributed file system it must be set to hdfs://localhost:9000. In cluster mode, use hdfs://hostname:9000.
• hadoop.tmp.dir
  • Default value: /tmp/hadoop-${user.name}
  • Description: a separate directory is created under /tmp for each username.
• fs.checkpoint.dir
  • Default value: ${hadoop.tmp.dir}/dfs/namesecondary
  • Description: directory where the Secondary NameNode stores its image.
• fs.checkpoint.period
  • Default value: 3600 (seconds)
  • Description: controls the checkpoint interval of the Secondary NameNode. If the time since the last checkpoint exceeds this setting, a checkpoint is triggered and the Secondary NameNode takes a snapshot of the NameNode's fsimage and edit log. If Hadoop is accessed frequently, or you want to shorten the downtime of a NameNode restart, set a smaller value.
• fs.checkpoint.size
  • Default value: 67108864 (bytes)
  • Description: if Hadoop is very busy, the edit log can grow very large in a short time, and fs.checkpoint.period alone cannot always anticipate this. As a safeguard, set this value so that a checkpoint is also triggered when the edit log grows larger than fs.checkpoint.size.
• io.file.buffer.size
  • Default value: 4096
  • Description: buffer size for reading and writing sequence files. A larger buffer reduces the number of I/O operations. For a large Hadoop cluster, a value of 65536 to 131072 is recommended.
• ipc.client.connection.maxidletime
  • Default value: 10000 (ms)
  • Description: maximum idle time for a Hadoop client connection. The default is 10 seconds. If the cluster's network connection is unstable, set this value to 60000 (60 seconds).
• ipc.server.tcpnodelay
  • Default value: false
  • Description: controls Nagle's algorithm on the Hadoop server. Setting it to true disables the algorithm, which reduces latency but increases the number of small packets sent. There is normally no need to change this value on the server side.
• hadoop.security.authorization
  • Default value: false
  • Description: whether to enable authorization checks. When enabled, Hadoop first confirms that the caller has permission before executing any action. Detailed permission settings go in hadoop-policy.xml. For example, to let the fenriswolf account and the mapreduce group submit M/R jobs, set security.job.submission.protocol.acl.
• hadoop.security.authentication
  • Default value: simple
  • Description: simple means no authentication. Hadoop uses system accounts and groups to control permissions. You can also specify kerberos, which is considerably more complex to set up: it requires a Kerberos server and a keytab generated for each account. Before executing any operation, the client must first authenticate with the Kerberos server using the kinit command. Every subsequent operation is then performed under the Kerberos account.
• fs.trash.interval
  • Default value: 0 (minutes)
  • Description: interval at which the trash is cleared. The default of 0 disables the trash feature, so nothing is cleared automatically and any cleanup when removing files has to be done by hand.
• hadoop.native.lib
  • Default value: true
  • Description: by default, Hadoop finds all available native libraries and loads them automatically, for example compression libraries such as gzip and lzo.

Part 2: hdfs-site.xml

Parameter list

• dfs.block.size
  • Default value: 67108864 (bytes)
  • Description: by default each block is 64 MB. If you know that the files being accessed are all large, you can change it to 134217728 (128 MB). A client can also choose its own block size without changing the setting for the entire cluster.
• dfs.safemode.threshold.pct
  • Default value: 0.999f
  • Description: Hadoop enters safe mode at startup, during which no data can be written. It leaves safe mode only when 99.9% of blocks have reached the minimum replica count, dfs.replication.min (default value: 1). When dfs.replication.min is set to a large value, or the number of data nodes is large, this takes a long time.
• dfs.namenode.handler.count
  • Default value: 10
  • Description: number of NameNode server threads. These threads communicate with the DataNodes over RPC. With many DataNodes, RPC timeouts occur easily. The remedy is to improve network speed or raise this value. Note that more threads also means more memory consumed by the NameNode.
• dfs.datanode.handler.count
  • Default value: 3
  • Description: number of server threads on a data node.
• dfs.datanode.max.xcievers
  • Default value: 256
  • Description: maximum number of files a DataNode can process at the same time.
• dfs.datanode.du.reserved
  • Default value: 0
  • Description: the default means data nodes will use the entire disk. Once the disk is full, M/R jobs can no longer write data, and other programs sharing these directories are affected as well. Reserving at least 1073741824 (1 GB) is recommended.

Part 3: mapred-site.xml

Parameter list

• io.sort.mb
  • Default value: 100
  • Description: size (in MB) of the buffer that caches intermediate map results.
• io.sort.record.percent
  • Default value: 0.05
  • Description: fraction of io.sort.mb used to store map output record boundaries. The rest of the buffer stores the data itself.
• io.sort.spill.percent
  • Default value: 0.80
  • Description: buffer fill threshold at which the map starts a spill.
• io.sort.factor
  • Default value: 10
  • Description: maximum number of streams merged at the same time during a merge operation.
• min.num.spills.for.combine
  • Default value: 3
  • Description: minimum number of spills required before the combiner function is run.
• mapred.compress.map.output
  • Default value: false
  • Description: whether to compress intermediate map results.
• mapred.map.output.compression.codec
  • Default value: org.apache.hadoop.io.compress.DefaultCodec
• mapred.reduce.parallel.copies
  • Default value: 5
  • Description: maximum number of threads each reduce uses to download map results in parallel.
• mapred.reduce.copy.backoff
  • Default value: 300
  • Description: maximum waiting time (in seconds) of a reduce download thread.
• mapred.job.shuffle.input.buffer.percent
  • Default value: 0.7
  • Description: percentage of the reduce task heap used to cache shuffle data.
• mapred.job.shuffle.merge.percent
  • Default value: 0.66
  • Description: fill level of the shuffle cache at which a merge operation starts.
• mapred.job.reduce.input.buffer.percent
  • Default value: 0.0
  • Description: percentage of heap used to cache data during the reduce computation phase, after the sort is complete.
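The three map-side buffer parameters in Part 3 (io.sort.mb, io.sort.record.percent, io.sort.spill.percent) interact, and the arithmetic is easier to see when worked through. The following is a minimal sketch in Python, not Hadoop code. The function name sort_buffer_layout is hypothetical, and applying the spill threshold to both the data region and the record region is an assumption about how the threshold is used.

```python
# Worked arithmetic for the map-side sort buffer, using the default
# values from the parameter list. This is an illustration only:
# sort_buffer_layout is a hypothetical helper, not a Hadoop API.

def sort_buffer_layout(io_sort_mb=100, record_percent=0.05, spill_percent=0.80):
    """Split the sort buffer into record-boundary and data regions."""
    total_bytes = io_sort_mb * 1024 * 1024
    # Portion reserved for record boundaries (io.sort.record.percent).
    record_bytes = round(total_bytes * record_percent)
    # The rest of the buffer holds the map output data itself.
    data_bytes = total_bytes - record_bytes
    return {
        "total_bytes": total_bytes,
        "record_bytes": record_bytes,
        "data_bytes": data_bytes,
        # Assumption: a spill starts once either region passes
        # io.sort.spill.percent of its own capacity.
        "record_spill_bytes": round(record_bytes * spill_percent),
        "data_spill_bytes": round(data_bytes * spill_percent),
    }


if __name__ == "__main__":
    for name, value in sort_buffer_layout().items():
        print(f"{name}: {value}")
```

With the defaults, only 5% of the 100 MB buffer tracks record boundaries, so a job that emits many small records may fill the record region and spill well before the data region is full. Calling the function with a larger record_percent shows how the split shifts.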
