First, the basics (this covers most of Hadoop's core content; read it patiently and you are sure to gain something; if you see it differently, leave a comment or look it up yourself)
1. Introduction to the Hadoop ecosystem
(1) HBase
NoSQL database, key-value storage
Maximizes memory utilization
(2) HDFS
Introduction: Hadoop Distributed File System, a distributed filesystem
Maximizes disk utilization
Design principles of HDFS:
Files are stored as blocks; the default block size is 64 MB (a file smaller than 64 MB still occupies one block, but the physical storage used is only the file's actual size); the block size can also be customized
Reliability and read throughput are increased by replicating blocks across machines
Each block is stored on three DataNodes by default
A single master (NameNode) coordinates and stores the metadata. Single point of failure? Usually a standby NameNode or an NFS service is used to avoid it
The client has no caching mechanism for file data (no data caching)
NameNode: its main function is to provide the namespace query service; it runs as a Jetty server
The metadata saved by the NameNode includes:
(I) File ownership and permissions
(II) Which blocks make up each file
(III) Which DataNodes each block is stored on (reported by the DataNodes when they start up)
The NameNode's metadata is loaded into memory when it starts
The metadata is stored on disk in a file named fsimage; block location information is not saved in fsimage (it sits under the NameNode's current directory, e.g. .../namenode/current/fsimage). A file's block locations can be queried through the client API, as in the sketch below.
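A minimal sketch (assuming a Java client with the cluster configuration on the classpath; the path argument is whatever file you want to inspect) that asks the NameNode which blocks make up a file and which DataNodes hold each block:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // reads core-site.xml / hdfs-site.xml
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // The NameNode answers: which blocks make up the file, and which DataNodes hold each block
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```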
DataNode (DN): stores the blocks and reports its block information to the NN (NameNode) when the DataNode starts
The DN keeps in contact with the NN by sending a heartbeat every 3 seconds; if the NN receives no heartbeat from a DN for 10 minutes, that DN is considered lost and its blocks are re-replicated from the remaining copies
Replica placement policy for blocks:
Normally three copies are kept (a small sketch of setting the block size and replication factor follows this list):
First replica: placed on the DN that uploads the file; if the upload comes from outside the cluster, a node whose disk is not too full and whose CPU is not too busy is picked at random
Second replica: placed on a node on a different rack from the first replica
Third replica: placed on another node in the same rack as the second replica (PS: one NameNode can drive about 4000 nodes)
The block size and replication factor are set by the client when the file is uploaded to HDFS; the replication factor can be changed later, but the block size cannot be changed after upload
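A minimal client-side sketch of the point above, assuming a Java client with the cluster configuration on the classpath; the path /usr/hadoop/myfile/demo.txt is only an illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeAndReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/usr/hadoop/myfile/demo.txt"); // illustrative path

        // Block size (64 MB) and replication factor (3) are chosen at upload time
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
        out.write("demo data".getBytes("UTF-8"));
        out.close();

        // The replication factor can still be changed after upload ...
        fs.setReplication(file, (short) 2);
        // ... but the block size of an already written file cannot be changed.
        fs.close();
    }
}
```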
Data corruption handling (reliability):
When a DN reads a block it computes a checksum; if the computed checksum differs from the value recorded when the block was created, the block is corrupted
The client then reads the block from another DN; the NN marks the block as corrupted and re-replicates it until the file's configured replication factor is reached again
A DN also re-validates the checksums of its blocks three weeks after the file is created (a toy illustration of the checksum idea follows)
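A toy illustration of the checksum idea only (plain Java CRC32, not HDFS's internal checksum code): a checksum stored at write time is recomputed at read time, and a mismatch means the block is corrupted:

```java
import java.util.zip.CRC32;

public class ChecksumIdea {
    static long checksumOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "block contents".getBytes();
        long storedAtWriteTime = checksumOf(block); // recorded when the block is created

        block[0] ^= 0x01;                           // simulate on-disk corruption
        long recomputedAtReadTime = checksumOf(block);

        if (storedAtWriteTime != recomputedAtReadTime) {
            System.out.println("block corrupted: read another replica and re-replicate");
        }
    }
}
```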
SecondaryNameNode (SNN), rarely used directly in production (understand the principle; focus on the fsimage and edits mechanism):
It can import the local fsimage
It is a cold backup of the NameNode (it cannot fail over automatically): to switch to it you must
modify the NameNode address on every DN in the cluster,
modify the NameNode address on the client side,
or change the SNN's IP to the original NN's IP
Its real job is to merge the edits log into fsimage for the NN, which reduces the NN's startup time
fsimage and edits (a frequent interview question):
When the edits file becomes very large, the NameNode has to replay every operation in it at startup, which seriously slows down the startup of the whole HDFS. The SecondaryNameNode mechanism solves this by merging the edits file into fsimage. The SecondaryNameNode workflow (fsimage and edits log; a toy sketch of the merge idea follows this list):
1. At a checkpoint time the SNN contacts the NameNode and asks it to stop recording operations in the current edits file and to write new operations temporarily to a new file, edits.new
2. The SNN copies fsimage and edits from the NN, downloading both files to its local directory from the NameNode via HTTP GET
3. The SNN merges edits and fsimage: it loads the fsimage downloaded from the NameNode into memory, then executes every operation in the edits file, so that the fsimage held in memory now contains the operations from edits. This process is called merging
4. After the SNN has merged fsimage and edits, the new fsimage has to be uploaded to the NameNode; this is done via HTTP POST
5. The NameNode replaces the old fsimage with the new fsimage received from the SNN, and edits.new replaces the usual edits file, so the edits file stays small; that is the whole merge interaction between the SNN and the NameNode
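A toy sketch of the merge idea only (an in-memory map standing in for fsimage and a list of operations standing in for edits; not the real NameNode data structures or file formats):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointSketch {
    // fsimage: a snapshot of the namespace; edits: operations logged since that snapshot.
    // The checkpoint replays the log over the snapshot and produces a new snapshot,
    // so the NameNode does not have to replay a huge edits file at startup.
    static Map<String, String> merge(Map<String, String> fsimage, List<String[]> edits) {
        Map<String, String> merged = new HashMap<>(fsimage); // load fsimage into memory
        for (String[] op : edits) {                          // replay each edit
            if ("CREATE".equals(op[0])) {
                merged.put(op[1], "metadata");
            } else if ("DELETE".equals(op[0])) {
                merged.remove(op[1]);
            }
        }
        return merged;                                       // this becomes the new fsimage
    }
}
```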
Safe Mode:
When the NameNode boots, it first loads the image file (fsimage) into memory and replays the operations in the edit log (edits)
Once the filesystem metadata image has been successfully built in memory, it creates a new fsimage file (this step does not need the SecondaryNameNode) and an empty edit log
The NameNode then starts listening for RPC and HTTP requests
At this point the NameNode is running in safe mode, i.e. the filesystem is read-only for clients
Block locations are not maintained persistently by the NameNode; they are kept on the DataNodes in the form of block lists
Check which state the NameNode is in:
hadoop dfsadmin -safemode get
Enter safe mode (Hadoop is in safe mode while booting):
hadoop dfsadmin -safemode enter
Exit safe mode:
hadoop dfsadmin -safemode leave
The HDFS read and write process (the essence of learning Hadoop and prerequisite knowledge for development work; no excuses, it must be learned ^_^!):
(1) The client opens the file with FileSystem's open() function
(2) DistributedFileSystem uses RPC to call the metadata node (NameNode) and gets the file's block information
(3) For each block, the metadata node returns the addresses of the data nodes that store it
(4) DistributedFileSystem returns an FSDataInputStream to the client for reading the data
(5) The client calls the stream's read() function to begin reading
(6) DFSInputStream connects to the closest data node that holds the first block of the file
(7) Data is read from that data node to the client
(8) When the block has been read completely, DFSInputStream closes the connection to that data node and connects to the closest data node holding the next block of the file
(9) When the client has finished reading, it calls FSDataInputStream's close() function
(10) If the client hits a communication error with a data node while reading, it tries the next data node that holds the block
Failed data nodes are recorded and are not contacted again (a minimal client-side read sketch follows)
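A minimal client-side read sketch matching steps (1)-(9), assuming a Java client with the cluster configuration on the classpath; the path is only an illustration:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/usr/hadoop/myfile/wordcount.jar"); // illustrative path

        FSDataInputStream in = fs.open(path);        // steps (1)-(4): open() returns an FSDataInputStream
        byte[] buffer = new byte[4096];
        int bytesRead;
        while ((bytesRead = in.read(buffer)) > 0) {  // steps (5)-(8): blocks are fetched one data node at a time
            System.out.write(buffer, 0, bytesRead);
        }
        in.close();                                  // step (9)
        fs.close();
    }
}
```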
The process of writing a file (a minimal client-side write sketch follows this list):
(1) The client calls create() to create a file
(2) DistributedFileSystem uses RPC to call the metadata node and creates a new file in the filesystem's namespace
(3) The metadata node first checks that the file does not already exist and that the client has permission to create it, then creates the new file
(4) DistributedFileSystem returns a DFSOutputStream for the client to write data to
(5) The client begins writing; DFSOutputStream splits the data into packets and writes them to a data queue
(6) The data streamer writes each block to the first data node in the pipeline; the first data node forwards the data to the second, and the second forwards it to the third
(7) DFSOutputStream keeps an ack queue of the packets it has sent and waits for the data nodes in the pipeline to confirm that the data has been written
If a data node fails during the write:
The pipeline is closed, and the packets still in the ack queue are put back at the front of the data queue
The current block on the data nodes that have already been written is given a new identity by the metadata node, so that when the failed node restarts it detects that its copy of the block is stale
and deletes it
The failed data node is removed from the pipeline, and the rest of the block is written to the other two data nodes in the pipeline
The metadata node notices that the block is under-replicated and arranges for a third replica to be created later
When the client has finished writing, it calls the stream's close() function, which flushes all remaining data to the data nodes in the pipeline, waits for the ack queue to report success, and finally notifies the metadata node that the write is complete.
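A minimal client-side write sketch matching the steps above (assuming a Java client with the cluster configuration on the classpath; path and contents are only illustrations):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/usr/hadoop/myfile/hello.txt"); // illustrative path

        // steps (1)-(4): create() asks the NameNode to add the file and returns an output stream
        FSDataOutputStream out = fs.create(path);
        // steps (5)-(7): data is split into packets and pushed through the DataNode pipeline
        out.write("hello hdfs".getBytes("UTF-8"));
        // close(): flush remaining data, wait for acks, then tell the NameNode the write is complete
        out.close();
        fs.close();
    }
}
```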
Common HDFS commands for development (shell-like):
Create a folder:
hadoop fs -mkdir <folder>
e.g. hadoop fs -mkdir /usr/hadoop/myfile
Upload a file:
hadoop fs -put <local file> <HDFS folder>
e.g. hadoop fs -put /wordcount.jar /usr/hadoop/myfile
Delete a file:
hadoop fs -rm <file>
List the contents of a folder:
hadoop fs -ls <folder>
View the contents of a file:
hadoop fs -text <file>
Common commands for Hadoop administrators:
hadoop job -list            lists the jobs that are running
hadoop job -kill <job_id>   kills a job
hadoop fsck <path>          checks the health of HDFS blocks and looks for damaged blocks
hadoop dfsadmin -report     reports HDFS status, including DN information
hadoop distcp hdfs://<source> hdfs://<destination>   parallel copy between clusters
(3) MapReduce
(1) A programming model, mainly used for data analysis
(2) Maximizes CPU utilization
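The classic word-count example gives a feel for the programming model: the map function emits (word, 1) pairs and the reduce function sums the counts per word. A standard sketch (Mapper and Reducer only; the job driver is omitted):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();            // sum all counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```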
Hadoop Practical Notes