Interview Questions & Answers for Hadoop MapReduce Developers (forward)

Source: Internet
Author: User
Interview Questions & Answers for Hadoop MapReduce Developers (forward). Blog category: Forward, Hadoop interview, Cloudera exam CCD-410. Transferred from Http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html#what-is-speculative-execution-hadoop





The article content is not my original work.











An understanding of Hadoop architecture is required to understand and leverage the power of Hadoop. Here are a few important practical questions that can be asked of a senior, experienced Hadoop developer. This list primarily covers questions related to Hadoop architecture, MapReduce, the Hadoop API and the Hadoop Distributed File System (HDFS).





What is a JobTracker in Hadoop? How many instances of the JobTracker run on a Hadoop cluster?





The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running in any Hadoop cluster. The JobTracker runs in its own JVM process. In a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted. The JobTracker in Hadoop performs the following actions (from the Hadoop Wiki); a minimal job-submission sketch in Java follows the list:





Client applications submit jobs to the JobTracker.


The JobTracker talks to the NameNode to determine the location of the data.


The JobTracker locates TaskTracker nodes with available slots at or near the data.


The JobTracker submits the work to the chosen TaskTracker nodes.


The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.


A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.


When the work is completed, the JobTracker updates its status.


Client applications can poll the JobTracker for information.
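As promised above, here is a minimal sketch of the submission step using the classic org.apache.hadoop.mapred API. The class name, job name and the /input and /output paths are placeholders of my own, not from the original article; a real job would also set its mapper and reducer classes.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitJobExample {
        public static void main(String[] args) throws Exception {
            // The JobConf is the job description the client hands to the JobTracker.
            JobConf conf = new JobConf(SubmitJobExample.class);
            conf.setJobName("submit-example");            // hypothetical job name

            // A real job would also configure its classes here, e.g.
            // conf.setMapperClass(...) and conf.setReducerClass(...).

            FileInputFormat.setInputPaths(conf, new Path("/input"));    // placeholder paths
            FileOutputFormat.setOutputPath(conf, new Path("/output"));

            // Submits the job to the JobTracker and waits for completion; the
            // JobTracker then consults the NameNode and schedules TaskTrackers.
            JobClient.runJob(conf);
        }
    }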


How does the JobTracker schedule a task?





The TaskTrackers send heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.





What is a TaskTracker in Hadoop? How many instances of the TaskTracker run on a Hadoop cluster?





A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. There is only one TaskTracker process running on any Hadoop slave node. The TaskTracker runs in its own JVM process. Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.





What is a Task instance in Hadoop? Where does it run?





Task instances are the actual MapReduce tasks that are run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. Each Task Instance runs in its own JVM process. There can be multiple task instance processes running on a slave node. This is based on the number of slots configured on the TaskTracker. By default a new JVM process is spawned for each task instance.





How many Daemon processes run on a Hadoop system?





Hadoop is comprised of five separate daemons. Each of these daemons runs in its own JVM.
The following 3 daemons run on Master nodes:
NameNode - this daemon stores and maintains the metadata for HDFS.
Secondary NameNode - performs housekeeping functions for the NameNode.
JobTracker - manages MapReduce jobs, distributes individual tasks to machines running the TaskTracker.
The following 2 daemons run on each Slave node:
DataNode - stores actual HDFS data blocks.
TaskTracker - responsible for instantiating and monitoring individual Map and Reduce tasks.





What is the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a slave node?





A single instance of a TaskTracker is run on each slave node. The TaskTracker is run as a separate JVM process.


A single instance of a DataNode daemon is run on each slave node. The DataNode daemon is run as a separate JVM process.


One or more task instances are run on each slave node. Each task instance is run as a separate JVM process. The number of task instances can be controlled by configuration. Typically a high-end machine is configured to run more task instances.
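As a rough illustration of how those slot counts are expressed in classic (MR1) Hadoop: they are ordinary configuration properties, normally set by the administrator in mapred-site.xml rather than in code. The sketch below only demonstrates the property names; the values are arbitrary examples, not recommendations.

    import org.apache.hadoop.conf.Configuration;

    public class SlotConfigExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Maximum number of map and reduce task instances a single TaskTracker
            // will run concurrently (classic MR1 property names); example values only.
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
            conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
            System.out.println("map slots = "
                    + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
        }
    }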


What is the difference between HDFS and NAS?





The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. The following are the differences between HDFS and NAS:





In HDFS, data blocks are distributed across the local drives of all the machines in a cluster, whereas in NAS the data is stored on dedicated hardware.


HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce since the data is stored separately from the computation.


HDFS runs on a cluster of machines and provides redundancy using replication, whereas NAS is provided by a single machine and therefore does not provide data redundancy.


How does the NameNode handle DataNode failures?





The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since blocks will then be under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes and the data never passes through the NameNode.





Does the MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job, can a reducer communicate with another reducer?





Nope, the MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.





Can I set the number of reducers to zero?





Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the number of reducers to zero, no reducers will be executed, and the output of each mapper will be stored to a separate file on HDFS. [This is different from the condition when reducers are set to a number greater than zero, where the mappers' output (intermediate data) is written to the local file system (not HDFS) of each mapper slave node.]
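For concreteness, here is a minimal sketch of a map-only job using the classic JobConf API. Setting the reducer count to zero removes the shuffle/sort phase and makes each mapper write its output straight to HDFS; the job name and paths are placeholders of my own.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOnlyJobExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MapOnlyJobExample.class);
            conf.setJobName("map-only-example");          // hypothetical name

            // Zero reducers: no shuffle/sort phase, mapper output goes directly to HDFS.
            conf.setNumReduceTasks(0);

            FileInputFormat.setInputPaths(conf, new Path("/input"));          // placeholder paths
            FileOutputFormat.setOutputPath(conf, new Path("/output-map-only"));

            JobClient.runJob(conf);
        }
    }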





Where is the mapper output (intermediate key-value data) stored?





The mapper output (intermediate data) is stored on the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory location that can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.





What are combiners? When should I use a combiner in my MapReduce job?





Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mappers. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed; Hadoop may or may not execute a combiner. Also, if required, it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
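A hedged sketch of wiring a reducer in as a combiner with the classic API. The SumReducer class below is my own example, not from the article; reusing it as a combiner is safe only because summing counts is commutative and associative.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class CombinerExample {

        // Sums counts per key; usable both as reducer and as combiner because
        // addition is commutative and associative.
        public static class SumReducer extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void wire(JobConf conf) {
            conf.setReducerClass(SumReducer.class);
            // Same class reused as a combiner to aggregate map output locally.
            // Hadoop may run the combiner zero, one or several times, so the job
            // must not depend on its execution.
            conf.setCombinerClass(SumReducer.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
        }
    }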





What are the Writable & WritableComparable interfaces?





org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method that constructs a new instance, calls readFields(DataInput), and returns the instance.


org.apache.hadoop.io.WritableComparable is a Java interface. Any type that is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. WritableComparable objects can be compared to each other using Comparators.
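To illustrate the contract, here is a minimal sketch of a custom key type implementing WritableComparable. The class name and its single field are my own invention for illustration.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical key type: a single long ID, usable as a MapReduce key.
    public class LongIdKey implements WritableComparable<LongIdKey> {

        private long id;

        public LongIdKey() { }                     // framework needs a no-arg constructor
        public LongIdKey(long id) { this.id = id; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);                     // serialize the fields in order
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();                    // deserialize in the same order
        }

        @Override
        public int compareTo(LongIdKey other) {
            return Long.compare(id, other.id);     // defines the sort order of keys
        }

        // Conventional static read() helper mentioned in the answer above.
        public static LongIdKey read(DataInput in) throws IOException {
            LongIdKey key = new LongIdKey();
            key.readFields(in);
            return key;
        }

        @Override
        public int hashCode() { return Long.hashCode(id); }   // used by the default partitioner

        @Override
        public boolean equals(Object o) {
            return o instanceof LongIdKey && ((LongIdKey) o).id == id;
        }
    }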


What is the Hadoop MapReduce API contract for a key and value class?





The key must implement the org.apache.hadoop.io.WritableComparable interface.


The value must implement the org.apache.hadoop.io.Writable interface.


What are IdentityMapper and IdentityReducer in MapReduce?





org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default value.


org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default value.
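As a small illustration of my own: setting these two classes explicitly is equivalent to what the framework falls back to when the programmer configures neither a mapper nor a reducer.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class IdentityDefaultsExample {
        public static void wire(JobConf conf) {
            // These two calls mirror what the framework effectively uses when the
            // programmer does not set mapper/reducer classes at all.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
        }
    }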


What is the meaning of speculative execution in Hadoop? Why is it important?





Speculative execution is a way of coping with variation in individual machine performance. In large clusters where hundreds or thousands of machines are involved, there may be machines that are not performing as fast as others. This may cause delays in a full job due to only one machine not performing well. To avoid this, speculative execution in Hadoop can run multiple copies of the same map or reduce task on different slave nodes. The results from the first copy to finish are used.
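Speculative execution can be toggled per job; a minimal sketch using the classic JobConf setters (the true/false values below are just example choices, not recommendations).

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculativeExecutionExample {
        public static void wire(JobConf conf) {
            // Allow extra, speculative attempts of slow map tasks on other nodes.
            conf.setMapSpeculativeExecution(true);
            // Disable speculative attempts for reduce tasks (example choice).
            conf.setReduceSpeculativeExecution(false);
        }
    }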





When are the reducers started in a MapReduce job?





In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer-defined reduce method is called only after all the mappers have finished.





If reducers do not start before all mappers finish, then why does the progress on a MapReduce job show something like Map (50%) Reduce (10%)? Why is the reducers' progress percentage displayed when the mappers are not yet finished?





Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account this data transfer, which is done by the reduce process; therefore the reduce progress starts showing up as soon as any intermediate key-value pairs from a mapper are available to be transferred. Though the reducer progress is updated, the programmer-defined reduce method is invoked only after all the mappers have finished.
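In classic (MR1) Hadoop, the point at which reducers begin this copy phase is governed by the fraction of completed maps, configurable via the mapred.reduce.slowstart.completed.maps property; the sketch and value below are an assumption-labelled example, not advice from the article.

    import org.apache.hadoop.mapred.JobConf;

    public class ReduceSlowStartExample {
        public static void wire(JobConf conf) {
            // Start the reducers' copy phase only after 80% of map tasks have
            // completed (MR1 property name; the default is a much smaller fraction).
            conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
        }
    }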





What is HDFS? How is it different from traditional file systems?





HDFS, the Hadoop Distributed File System, is responsible for storing huge amounts of data on the cluster. It is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.





HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.


HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.


HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once, but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.


What is the HDFS block size? How is it different from the traditional file system block size?





In HDFS the data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size. Each block is replicated multiple times; the default is to replicate each block three times. Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. HDFS blocks are much larger than the blocks of a traditional file system, which are typically only a few kilobytes in size.
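The block size can also be chosen per file at creation time; a hedged sketch using the FileSystem.create overload that takes an explicit replication factor and block size (the path and values are examples of my own).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/example/big-file.dat");   // placeholder path

            long blockSize = 128L * 1024 * 1024;   // 128 MB blocks for this file
            short replication = 3;                 // default replication factor
            int bufferSize = 4096;

            try (FSDataOutputStream out =
                     fs.create(file, true, bufferSize, replication, blockSize)) {
                out.writeBytes("example payload\n");
            }

            // The block size actually recorded for the file can be read back:
            System.out.println(fs.getFileStatus(file).getBlockSize());
        }
    }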





What is a NameNode? How many instances of the NameNode run on a Hadoop cluster?





The NameNode is the centerpiece of the HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data in these files itself. There is only one NameNode process running in any Hadoop cluster. The NameNode runs in its own JVM process. In a typical production cluster it runs on a separate machine. The NameNode is a single point of failure for the HDFS cluster. When the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or whenever they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.





What is a DataNode? How many instances of the DataNode run on a Hadoop cluster?





A DataNode stores data in the Hadoop File System (HDFS). There is only one DataNode process running on any Hadoop slave node. The DataNode runs in its own JVM process. On startup, a DataNode connects to the NameNode. DataNode instances can talk to each other; this happens mostly while replicating data.





How does the client communicate with HDFS?





Client communication with HDFS happens using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can then talk directly to a DataNode, once the NameNode has provided the location of the data.
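A minimal sketch of this flow using the org.apache.hadoop.fs.FileSystem client API: the NameNode lookup and the direct DataNode reads happen inside open()/read(), so client code never handles blocks by hand. The path below is a placeholder.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // handle to the configured HDFS

            Path file = new Path("/user/example/data.txt");   // placeholder path

            // open() asks the NameNode for block locations; the returned stream
            // then reads the block data directly from the DataNodes.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                System.out.println(reader.readLine());
            }
        }
    }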





How are HDFS blocks replicated?





HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are a total of 3 copies of a data block on HDFS; 2 copies are stored on DataNodes on the same rack and the 3rd copy is on a different rack.
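The replication factor can also be changed after a file has been written; a minimal sketch using FileSystem.setReplication (the path and factor are examples of my own).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/example/data.txt");   // placeholder path

            // Ask the NameNode to keep 5 replicas of every block of this file;
            // re-replication to reach the new target happens asynchronously.
            boolean accepted = fs.setReplication(file, (short) 5);
            System.out.println("replication change accepted: " + accepted);
        }
    }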








Transferred from Http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html#what-is-speculative-execution-hadoop





The article content is not my original work.