Hadoop Interview Questions (1)


I. Q & A:

1. Briefly describe how to install and configure an open-source Apache version of Hadoop. A description is sufficient; you do not need to list every step, but listing the steps is even better.

1) Install the JDK and configure the environment variables (/etc/profile).

2) Disable the firewall.

3) Configure the hosts file so that Hadoop nodes can be accessed by host name (/etc/hosts).

4) Set up passwordless SSH login.

5) Decompress the Hadoop installation package and configure the environment variables.

6) Modify the configuration files ($HADOOP_HOME/conf):

hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml

7) Format the HDFS file system (hadoop namenode -format).

8) Start Hadoop ($HADOOP_HOME/bin/start-all.sh).

9) Use jps to check that the daemons are running.


2. List the processes that need to be started in a normal Hadoop cluster and explain the role of each. Be as comprehensive as possible.

1) NameNode: the HDFS daemon that records how files are split into data blocks and which DataNodes store each block; its main role is the centralized management of memory and I/O.

2) SecondaryNameNode: an auxiliary daemon that communicates with the NameNode and periodically saves snapshots of the HDFS metadata.

3) DataNode: reads and writes HDFS data blocks to and from the local file system.

4) JobTracker: allocates tasks and monitors all running tasks.

5) TaskTracker: executes individual tasks and reports back to the JobTracker.


3. List all the Hadoop schedulers you know and briefly describe how they work.

The three most common schedulers are the default FIFO scheduler, the Capacity Scheduler, and the Fair Scheduler.

1) Default scheduler: FIFO

The default scheduler in Hadoop; jobs are executed on a first-in, first-out basis.

2) Capacity Scheduler

Jobs that occupy fewer resources and have a higher priority are selected to run first.

3) Fair Scheduler

Jobs in the same queue share all of the queue's resources fairly.


4. In what ways can Hive store its metadata, and what are the characteristics of each?

1) The embedded Derby database: lightweight, but it only supports a single session, so it is not commonly used.

2) A local MySQL database: the most common setup.

3) A remote MySQL database: not commonly used.


5. Briefly describe how Hadoop implements secondary sort.

In Hadoop, keys are sorted by default. What if the values also need to be sorted?

There are two methods for secondary sorting: buffer-and-in-memory-sort and value-to-key conversion.

1) Buffer and in-memory sort

The main idea is to save all of the values corresponding to a key inside the reduce() function and sort them there. The biggest drawback of this method is that it may cause an out-of-memory error when a single key has a very large number of values.
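For illustration only, a minimal sketch of this approach with the MapReduce Reducer API follows; the Text/IntWritable key and value types are an assumption made for the example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Buffer-and-in-memory-sort: buffer every value of a key, then sort inside reduce().
// If one key has a very large number of values, this buffer can exhaust the heap.
public class BufferSortReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        List<Integer> buffer = new ArrayList<>();
        for (IntWritable v : values) {
            buffer.add(v.get());          // copy the value: Hadoop reuses the Writable object
        }
        Collections.sort(buffer);         // in-memory sort of all values for this key
        for (int v : buffer) {
            context.write(key, new IntWritable(v));
        }
    }
}
```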


2) Value-to-key conversion

The main idea is to splice the key and part of the value into a combined key (by implementing the WritableComparable interface, or by calling the setSortComparatorClass() function). The results obtained by reduce are then sorted first by the original key and then by value. Note that you must implement your own Partitioner so that data is partitioned by the original key only. Hadoop explicitly supports secondary sorting: the Job class has a setGroupingComparatorClass() method that can be used to control how records are grouped for each reduce() call.
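Below is a minimal sketch of the value-to-key approach. The composite key TextIntPair and its field layout are assumptions made for the example; the essential pieces are a Partitioner that partitions by the natural key only and a grouping comparator registered with setGroupingComparatorClass().

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortExample {

    // Composite key: the natural key (first) plus the value field to sort by (second).
    public static class TextIntPair implements WritableComparable<TextIntPair> {
        private final Text first = new Text();
        private int second;

        public void set(String f, int s) { first.set(f); second = s; }
        public Text getFirst() { return first; }

        @Override public void write(DataOutput out) throws IOException {
            first.write(out);
            out.writeInt(second);
        }
        @Override public void readFields(DataInput in) throws IOException {
            first.readFields(in);
            second = in.readInt();
        }
        @Override public int compareTo(TextIntPair o) {   // sort by key, then by value
            int cmp = first.compareTo(o.first);
            return cmp != 0 ? cmp : Integer.compare(second, o.second);
        }
        @Override public int hashCode() { return first.hashCode() * 163 + second; }
        @Override public boolean equals(Object o) {
            return o instanceof TextIntPair && compareTo((TextIntPair) o) == 0;
        }
    }

    // Partition by the natural key only, so every record with the same natural key
    // goes to the same reducer regardless of the value part.
    public static class FirstPartitioner extends Partitioner<TextIntPair, Text> {
        @Override public int getPartition(TextIntPair key, Text value, int numPartitions) {
            return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group by the natural key only, so one reduce() call receives all values of that
    // key, already sorted by the composite key (i.e. sorted by value).
    public static class GroupComparator extends WritableComparator {
        protected GroupComparator() { super(TextIntPair.class, true); }
        @Override public int compare(WritableComparable a, WritableComparable b) {
            return ((TextIntPair) a).getFirst().compareTo(((TextIntPair) b).getFirst());
        }
    }

    // Driver wiring (sketch):
    //   job.setMapOutputKeyClass(TextIntPair.class);
    //   job.setPartitionerClass(FirstPartitioner.class);
    //   job.setGroupingComparatorClass(GroupComparator.class);
}
```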

Reference: http://dongxicheng.org/mapreduce/hadoop-join-two-tables/


6. Briefly describe the methods by which Hadoop can implement a join.

1) reduce side join

Reduce side join is the simplest join method. Its main idea is as follows:

In the map stage, the map function reads both files, File1 and File2. To distinguish the key/value pairs coming from the two sources, a tag is attached to each record, for example tag = 0 for records from File1 and tag = 2 for records from File2. In other words, the main task of the map stage is to tag data from the different files.

In the reduce stage, the reduce function receives, for each key, the list of values originating from File1 and File2, and then performs the join (a Cartesian product) between the File1 and File2 records that share that key. That is, the actual join is carried out in the reduce stage.
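A rough sketch of this idea with the MapReduce API is shown below; the tab-separated record layout, the file-name test used to assign the tags, and the Text types are assumptions made for the example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoinExample {

    // Map stage: tag each record with its source file (0 for File1, 2 for File2)
    // and emit (join key, tag + rest of the record).
    public static class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2);   // assumed "<key>\t<payload>" layout
            if (fields.length < 2) return;
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            String tag = fileName.startsWith("File1") ? "0" : "2";   // assumed file naming
            context.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
        }
    }

    // Reduce stage: separate the values by tag, then output the Cartesian product of
    // the File1 and File2 records sharing the key -- the actual join.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> fromFile1 = new ArrayList<>();
            List<String> fromFile2 = new ArrayList<>();
            for (Text v : values) {
                String[] parts = v.toString().split("\t", 2);
                if ("0".equals(parts[0])) fromFile1.add(parts[1]);
                else fromFile2.add(parts[1]);
            }
            for (String a : fromFile1) {
                for (String b : fromFile2) {
                    context.write(key, new Text(a + "\t" + b));
                }
            }
        }
    }
}
```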


2) map side join

Reduce side join exists because all of the fields needed for the join cannot be collected in the map stage: records with the same key may be processed by different map tasks. Reduce side join is quite inefficient, because the shuffle stage has to transfer a large amount of data.

Map side join is an optimization for the following scenario: of the two tables to be joined, one is very large and the other is small enough to fit in memory. We can then give each map task its own copy of the small table (for example, loaded into a hash table) and scan only the large table: for each key/value record in the large table, look up the key in the hash table; if it is found, join the two records and output the result.

To support file replication, Hadoop provides a DistributedCache class, which can be used as follows:

(1) The user calls the static method DistributedCache.addCacheFile() to specify the file to be copied; its parameter is the file's URI (for a file on HDFS it looks like hdfs://namenode:9000/home/XXX/file, where 9000 is the NameNode port you configured). The JobTracker obtains this URI list before the job starts and copies the corresponding files to the local disk of each TaskTracker.

(2) The user calls DistributedCache.getLocalCacheFiles() to obtain the local file paths and then reads the files with the standard file read/write API.
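The following sketch puts the two steps together for a map side join; the tab-separated record format and the helper method name are assumptions made for the example, and the HDFS URI is simply the placeholder used above.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinExample {

    // Driver side: register the small table so it is copied to every TaskTracker.
    public static void addSmallTable(Job job) throws Exception {
        DistributedCache.addCacheFile(new URI("hdfs://namenode:9000/home/XXX/file"),
                job.getConfiguration());
    }

    // Map side: load the small table into a hash table once, then stream the big table.
    public static class MapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> smallTable = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            try (BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] kv = line.split("\t", 2);        // assumed "<key>\t<value>" layout
                    if (kv.length == 2) smallTable.put(kv[0], kv[1]);
                }
            }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] kv = line.toString().split("\t", 2);
            String match = kv.length == 2 ? smallTable.get(kv[0]) : null;
            if (match != null) {                              // key found in the small table: join
                context.write(new Text(kv[0]), new Text(kv[1] + "\t" + match));
            }
        }
    }
}
```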


3) SemiJoin

SemiJoin, also known as a semi-join, is a technique borrowed from distributed databases. The motivation is that in a reduce side join the volume of data transferred between machines is very large and becomes the bottleneck of the join; if the data that does not participate in the join can be filtered out on the map side, a great deal of network I/O can be saved.

The implementation is straightforward: take the smaller table, say File1, extract the keys that participate in the join, and save them to a file, File3. File3 is usually very small and fits in memory. In the map stage, use DistributedCache to copy File3 to every TaskTracker and filter out the File2 records whose keys are not in File3; the reduce stage then works exactly as in reduce side join.
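As a sketch, the map-side filtering step of a semi-join could look like the mapper below; it assumes File3 (one join key per line) has already been added to the DistributedCache as shown earlier, and that File2 records carry the join key as their first tab-separated field.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side filter of a semi-join: drop File2 records whose key is absent from File3 so
// they never enter the shuffle; the surviving records are joined in reduce as usual.
public class SemiJoinFilterMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Set<String> joinKeys = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                joinKeys.add(line.trim());                // File3: one join key per line (assumed)
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", 2);
        if (fields.length == 2 && joinKeys.contains(fields[0])) {
            context.write(new Text(fields[0]), new Text(fields[1]));  // key participates in the join
        }
    }
}
```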
