JobConf.setNumMapTasks(n) is meaningful: together with the block size it determines the number of map tasks. For details, see the source of FileInputFormat.getSplits. Assume mapred.min.split.size is not set (its default is 1). Each file is then split at a size of min(totalSize / mapNum, blockSize), where totalSize is the total size of all input files and mapNum is the value passed to JobConf.setNumMapTasks(); it is not the case that a file smaller than the block size is never split.
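
As a rough illustration, here is a minimal Java sketch of that calculation (the class and method names are my own, not Hadoop's; only the formula mirrors the getSplits source):

public class OldApiSplitSize {
    // splitSize = max(minSize, min(totalSize / numMaps, blockSize)),
    // the rule used by the old-API FileInputFormat.getSplits.
    static long splitSize(long totalSize, int numMapsHint, long blockSize, long minSize) {
        long goalSize = totalSize / Math.max(numMapsHint, 1);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB block size
        long totalSize = 1024L * 1024 * 1024; // 1 GB of input in total
        // Asking for 100 maps pushes the split size below the block size.
        System.out.println(splitSize(totalSize, 100, blockSize, 1)); // ~10 MB
        // Asking for only 4 maps: the block size caps each split at 64 MB.
        System.out.println(splitSize(totalSize, 4, blockSize, 1));   // 64 MB
    }
}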
 
2. http://hadoop.hadoopor.com/thread-238-1-1.html
I wonder whether you want to increase the number of map/reduce tasks across the whole cluster, or the number of map/reduce tasks that can run in parallel on a single node. For the former, generally only the reduce task count is set explicitly, while the map task count is determined by the number of input splits. For the latter, set the per-node limits in the configuration:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
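
For example, a mapred-site.xml fragment along these lines raises both per-node limits (the value 8 here is just an assumed per-node slot count):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>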
 
In addition, the mapred.jobtracker.taskScheduler.maxRunningTasksPerJob parameter controls the maximum number of tasks that a single job may run in parallel across the cluster.
 
3. My understanding (for details, see the code of FileInputFormat.java):
The number of map tasks depends on splitSize; a file is divided into map tasks according to splitSize. The splitSize calculation (see the FileInputFormat source) is splitSize = Math.max(minSize, Math.min(maxSize, blockSize)), where
minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)), i.e. the larger of the format's minimum split size (for SequenceFile the source gives 2000) and the minimum split size configured for the job (the value of mapred.min.split.size in mapred-default.xml).
maxSize is mapred.max.split.size (it is not in mapred-default.xml; in my tests, overriding it in mapred-site.xml had no effect, and it has to be passed on the command line, e.g.:
hadoop jar /root/mahout-core-0.2.job org.apache.mahout.clustering.lda.LDADriver -Dmapred.max.split.size=900 ...). If it is not configured, the default is the maximum value of the long type. (Using mapred.max.split.size is not recommended, based on my trials.)
blockSize is the value of dfs.block.size in hdfs-default.xml, which can be overridden in hdfs-site.xml; the value must be a multiple of 512. If you want more map tasks, you can set dfs.block.size a little smaller. The formula above guarantees that even if your blockSize is smaller than the format's minimum split size, the format's minimum is ultimately used; if blockSize is larger, blockSize is used as the splitSize.
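
A small Java sketch of that rule for the SequenceFile case (the numbers are illustrative; only the max/min formula follows the FileInputFormat source):

public class NewApiSplitSize {
    // splitSize = max(minSize, min(maxSize, blockSize)), as in the new-API
    // FileInputFormat.computeSplitSize.
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long formatMin = 2000;         // SequenceFile's format minimum split size
        long configuredMin = 1;        // mapred.min.split.size left at its default
        long minSize = Math.max(formatMin, configuredMin);
        long maxSize = Long.MAX_VALUE; // mapred.max.split.size not configured

        // dfs.block.size left at 64 MB: the block size becomes the split size.
        System.out.println(splitSize(minSize, maxSize, 64L * 1024 * 1024));
        // dfs.block.size lowered to 1024 bytes: the 2000-byte format minimum wins.
        System.out.println(splitSize(minSize, maxSize, 1024));
    }
}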
 
Conclusion: (1) If you want more map tasks, you can set dfs.block.size to a smaller value; for SequenceFile, 2048 is recommended... (trial). When running from Eclipse, the dfs.block.size set in Eclipse's MapReduce settings takes effect rather than the configuration files under Hadoop's conf directory; when running the hadoop jar command from a terminal, however, it should be governed by the configuration files under Hadoop's conf directory.
(2) Recommendation: split the input into multiple SequenceFiles (use the parent directory as the input path, and make sure that directory contains only SequenceFiles); the input path can be "./" or the name of the parent directory.
 
Number of reduce tasks:
 
You can set it through job.setNumReduceTasks(n). Multiple reduce tasks produce multiple output files: part-r-00000, part-r-00001, ... part-r-0000n.
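
A minimal driver sketch (class and job names are placeholders; only Job.setNumReduceTasks is the relevant call):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceCountDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "reduce-count-demo"); // 0.20-era constructor
        // With 4 reduce tasks the output directory will contain
        // part-r-00000 through part-r-00003, one file per reducer.
        job.setNumReduceTasks(4);
        // ... set mapper, reducer, input and output paths before submitting ...
    }
}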
 
 
 
 
 
 
- Increasing the number of tasks increases framework overhead, but it improves load balancing and lowers the cost of task failures;
- The number of map tasks is the value of mapred.map.tasks, but you cannot effectively set this parameter directly: the input split size determines how many maps a job gets. The default input split size is 64 MB (the same as the default dfs.block.size). If the input data volume is huge, however, the default 64 MB blocks will produce tens of thousands or even hundreds of thousands of map tasks, network transfer across the cluster becomes very heavy, and, most seriously, it puts a lot of pressure on JobTracker scheduling, queueing, and memory. mapred.min.split.size determines the minimum size of each input split; you can change the number of map tasks by modifying this parameter.
- An appropriate degree of map parallelism is roughly 10-100 maps per node, and it is recommended that each map run for at least one minute.
- The number of reduce tasks is set by mapred.reduce.tasks; the default value is 1.
- A proper number of reduce tasks is 0.95 or 1.75 * (nodes * mapred.tasktracker.reduce.tasks.maximum), where mapred.tasktracker.reduce.tasks.maximum is generally set to the number of CPU cores per node, i.e. the number of slots that can compute simultaneously. With 0.95, all reduce tasks can start as soon as the maps finish. With 1.75, faster nodes finish a first round of reduce tasks and then launch a second round, improving load balancing (see the arithmetic sketch after this list).
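
For concreteness, here is that arithmetic for a hypothetical 10-node cluster with 8 reduce slots per node:

public class ReduceCountRule {
    public static void main(String[] args) {
        int nodes = 10;
        int reduceSlotsPerNode = 8; // mapred.tasktracker.reduce.tasks.maximum
        // 0.95: every reduce task can start as soon as the maps finish.
        System.out.println((int) (0.95 * nodes * reduceSlotsPerNode)); // 76
        // 1.75: fast nodes run a second wave of reduces, improving balance.
        System.out.println((int) (1.75 * nodes * reduceSlotsPerNode)); // 140
    }
}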
 
 
Hive is used to execute related queries.
 
In Hadoop, the default mapred.tasktracker.map.tasks.maximum setting is 2.
 
That is, each TaskTracker runs 2 map tasks at the same time.
 
With the default setting, a query over one user's operation logs for 80 days takes 5 minutes and 45 seconds.
 
Testing showed that setting mapred.tasktracker.map.tasks.maximum to the node's number of CPU cores, or to that number minus 1, is most appropriate.
 
With that setting the query runs most efficiently, taking about 3 minutes and 25 seconds.
 
Our current machines are all 8-core, so the final configuration is as follows:
 
 
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>
 
 
For mapred.map.tasks (the number of map tasks for each job), the Hadoop default is 2.
 
You can set it with set mapred.map.tasks=24; before executing the query in Hive.
 
However, because Hive operates on multiple input files, Hive by default sets the number of map tasks according to the number of input files.
 
Even if you set the number through set, it does not work...
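
For reference, a session along these lines shows the behavior described above (the table name user_logs is a placeholder):

hive> set mapred.map.tasks=24;
hive> select count(*) from user_logs;

Even with the set statement in place, the number of map tasks actually launched still follows the input files/splits rather than the value 24.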