Hadoop Practical Notes

Source: Internet
Author: User
Tags: hadoop ecosystem, hadoop fs

Part One: Basic knowledge (this covers most of Hadoop's core content; read it patiently and you will definitely gain something. If you see anything differently, feel free to leave a comment or look it up yourself.)

1. Introduction to the Hadoop ecosystem

(1) HBase

NoSQL database, key-value storage

Maximizes memory utilization

(2) HDFS

Introduction: HDFS (Hadoop Distributed File System), a distributed file system

Maximizes disk utilization

Design principles of HDFS:

Files are stored in blocks; the default block size is 64 MB (a file smaller than 64 MB still occupies one block, but the physical storage used is only the actual file size). The block size can also be customized.

Reliability and read throughput are increased by replicating blocks across machines

Each block is replicated to three DataNodes by default.

A single master (the NameNode) coordinates the storage of metadata. Single point of failure? Typically a standby NameNode, or writing the metadata to an NFS mount, is used to avoid it.

The client has no caching mechanism for file data (no data cache)

The NameNode's main function is to provide the name (namespace) lookup service; it runs an embedded Jetty server

The metadata information the NameNode saves includes:

(I) File ownership and permissions

(II) Which blocks make up each file

(III) Which DataNodes each block is stored on (reported by the DataNodes when they start up)

The NameNode's metadata is loaded into memory when it starts

Metadata is persisted in a disk file named fsimage; block location information is not saved in fsimage (the file lives under the NameNode's current directory, e.g. hdfs/namenode/current/fsimage)

DataNode (DN): stores the blocks, and reports its block information to the NN (NameNode) when the DataNode process starts

A DN keeps in contact with the NN by sending a heartbeat every 3 seconds; if the NN receives no heartbeat from a DN for 10 minutes, that DN is considered lost and its blocks are copied onto other DataNodes

Replica placement policy for block:

The usual case is to keep three replicas:

First replica: placed on the DN from which the file is uploaded; if the upload is submitted from outside the cluster, a node is picked at random whose disk is not too full and whose CPU is not too busy

Second replica: placed on a node in a different rack from the first replica

Third replica: placed on another node in the same rack as the second replica (PS: one NameNode can drive around 4000 nodes)

Block size and replication factor are set by the client when the file is uploaded to HDFS; the replication factor can be changed afterwards, but the block size cannot be changed once the file has been uploaded (see the sketch below)
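
For reference, a minimal sketch of how a client could choose these values with the Hadoop Java FileSystem API; the NameNode URI and file path below are placeholders, not values taken from these notes:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; use your cluster's fs.defaultFS in practice.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/usr/hadoop/myfile/data.txt");

        // The client picks the replication factor (3) and block size (64 MB) at create time.
        FSDataOutputStream out =
                fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
        out.writeUTF("hello hdfs");
        out.close();

        // The replication factor can still be changed after the file exists...
        fs.setReplication(file, (short) 2);
        // ...but the block size of an already-uploaded file cannot be changed.

        fs.close();
    }
}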

Data corruption handling (reliability):

When a DN reads a block it computes its checksum; if the computed checksum differs from the value recorded when the block was created, the block is corrupted

The client then reads the block from another DN; the NN marks the block as corrupted and re-replicates it until the file's configured replication factor is reached again

A DN re-verifies the checksums of its block files three weeks after they are created
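
The verification above happens inside the DataNodes, but checksums are also visible from the client side. A small illustrative sketch (placeholder NameNode URI and path) using the standard FileSystem.getFileChecksum() call:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // HDFS derives a file-level checksum from the per-block checksums
        // that the DataNodes store and verify.
        FileChecksum checksum = fs.getFileChecksum(new Path("/usr/hadoop/myfile/data.txt"));
        if (checksum != null) {
            System.out.println(checksum.getAlgorithmName() + ": " + checksum);
        }

        fs.close();
    }
}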

SecondaryNameNode (SNN) is rarely relied on directly in enterprises, but you should understand the principle (focus on the fsimage and edits mechanism):

It imports a local copy of fsimage

It is a cold backup of the NameNode (it cannot fail over automatically); to bring it online you must either:

modify the NameNode address configured on every DN in the cluster

and modify the client-side NameNode address,

or change the SNN's IP to the original NN's IP

Its real job is to merge the edits log into fsimage for the NN, which reduces the NN's startup time

fsimage and edits (a frequently asked interview question):

When the edits file grows very large, the NameNode has to replay every operation in it at startup, which severely slows down HDFS startup. The SecondaryNameNode mechanism merges the edits file into fsimage to solve this. SecondaryNameNode workflow (fsimage and edits log):

1. At a checkpoint the SNN contacts the NameNode and asks it to stop recording operations in the current edits file; new write operations are temporarily recorded in a new file, edits.new

2. The SNN copies fsimage and edits from the NN, downloading both files to its local directory from the NameNode via HTTP GET

3. The SNN merges edits and fsimage: it loads the fsimage downloaded from the NameNode into memory and then replays each operation in the edits file, so that the in-memory fsimage contains the operations from edits. This process is called merging.

4. After the fsimage and edits files have been merged on the SNN, the new fsimage is uploaded to the NameNode; this is done via HTTP POST

5. The NameNode replaces the old fsimage with the new one received from the SNN, and edits.new becomes the regular edits file, so the edits file shrinks. This is the whole merge cycle between the SNN and the NameNode.

Safe Mode:

When the NameNode boots, it first loads the image file (fsimage) into memory and replays the operations in the edit log (edits)

Once the file-system metadata mapping has been successfully built in memory, it creates a new fsimage file (this step does not require the SecondaryNameNode) and an empty edit log

The NameNode then starts listening for RPC and HTTP requests

At this point the NameNode is running in safe mode, i.e. the file system is read-only for clients

The locations of data blocks are not persisted by the NameNode; they are kept on the DataNodes in the form of block lists

Check which state the NameNode is in:

hadoop dfsadmin -safemode get

Enter safe mode (Hadoop is also in safe mode while booting):

hadoop dfsadmin -safemode enter

Exit safe mode:

hadoop dfsadmin -safemode leave

The HDFS read and write process (the essence of Hadoop learning and prerequisite knowledge for development; no excuses, it must be learned ^_^!):

The process of reading a file:

(1) The client opens the file with the FileSystem's open() function

(2) DistributedFileSystem calls the metadata node (NameNode) via RPC to get the file's block information

(3) For each block, the metadata node returns the addresses of the data nodes that store it

(4) DistributedFileSystem returns an FSDataInputStream to the client for reading the data

(5) The client calls the stream's read() function to begin reading the data

(6) DFSInputStream connects to the closest data node that holds the first block of the file

(7) Data is read from that node to the client

(8) When this block has been read, DFSInputStream closes the connection to this data node and then connects to the data node holding the next block of the file

(9) When the client has finished reading, it calls the FSDataInputStream's close() function

(10) If the client hits a communication error with a data node while reading, it tries to connect to the next data node that contains this block

The failed data node is recorded and is not contacted again
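
The ten steps above are what happens underneath a few lines of client code. A minimal read sketch with the Java FileSystem API (the NameNode URI and path are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; use your cluster's fs.defaultFS in practice.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Steps (1)-(4): open() asks the NameNode (via RPC) for the block
        // locations and returns an FSDataInputStream.
        FSDataInputStream in = fs.open(new Path("/usr/hadoop/myfile/data.txt"));

        // Steps (5)-(8): read() pulls the data block by block from the
        // closest DataNodes, switching nodes as each block is finished.
        IOUtils.copyBytes(in, System.out, 4096, false);

        // Step (9): close the stream when done.
        in.close();
        fs.close();
    }
}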

The process of writing a file:

(1) The client calls create() to create a file

(2) DistributedFileSystem calls the metadata node via RPC to create a new file in the file system's namespace

(3) The metadata node first checks that the file does not already exist and that the client has permission to create it, then creates the new file

(4) DistributedFileSystem returns a DFSOutputStream for the client to write data to

(5) The client begins writing data; DFSOutputStream splits the data into blocks (packets) and writes them to the data queue

(6) The data streamer writes each block to the first data node in the pipeline; the first data node forwards the block to the second data node, and the second forwards it to the third

(7) DFSOutputStream keeps an ack queue for the blocks it has sent and waits for the data nodes in the pipeline to confirm that the data has been written

If a data node fails during the write process:

The pipeline is closed, and the blocks still in the ack queue are put back at the front of the data queue

The current block is given a new identity by the metadata node on the data nodes that have already written it, so that when the failed node recovers it will detect that its copy of the block is stale and delete it

The failed data node is removed from the pipeline, and the remainder of the block is written to the other two data nodes in the pipeline

The metadata node is notified that this block does not have enough replicas, so a third replica will be created later

When the client finishes writing the data, it calls the stream's close() function; this writes all remaining data blocks to the data nodes in the pipeline and waits for the ack queue to report success, then finally notifies the metadata node that the write is complete.
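
And the corresponding write side, again as a minimal sketch under the same placeholder assumptions:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Steps (1)-(4): create() asks the NameNode (via RPC) to add the file
        // to the namespace and returns a stream backed by DFSOutputStream.
        FSDataOutputStream out = fs.create(new Path("/usr/hadoop/myfile/output.txt"));

        // Steps (5)-(7): writes are split into packets and pushed through the
        // DataNode pipeline; acknowledgements come back on the ack queue.
        out.writeUTF("hello hdfs write pipeline");

        // close() flushes the remaining packets, waits for the acks, and then
        // notifies the NameNode that the write is complete.
        out.close();
        fs.close();
    }
}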

Common HDFS commands for development (shell-like):

Create a folder:

hadoop fs -mkdir <folder>

e.g. hadoop fs -mkdir /usr/hadoop/myfile

Upload a file:

hadoop fs -put <local file> <HDFS folder>

e.g. hadoop fs -put /wordcount.jar /usr/hadoop/myfile

Delete a file:

hadoop fs -rm <file>

List the files in a folder:

hadoop fs -ls <folder>

View the contents of a file:

hadoop fs -text <file>

Common commands for Hadoop administrators:

hadoop job -list : lists the jobs that are running

hadoop job -kill <job_id> : kills a job

hadoop fsck <path> : checks the health of HDFS blocks, e.g. whether any are corrupted

hadoop dfsadmin -report : reports HDFS status, including DN information

hadoop distcp hdfs://... hdfs://... : parallel copy

(3) MapReduce

(1) A programming model, mainly used for data analysis

(2) Maximizes CPU utilization
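
To make the programming model concrete, here is the classic word-count job as a compact sketch; the class names are illustrative, and it uses the standard org.apache.hadoop.mapreduce API:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a jar, it would typically be launched with something like: hadoop jar wordcount.jar WordCount <input dir> <output dir>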
