First, what is Hadoop?
Hadoop is a distributed system infrastructure developed by the Apache Foundation. The core of the Hadoop framework consists of two parts: the Hadoop Distributed File System (HDFS) and the distributed computing framework MapReduce. In short, HDFS provides storage for massive amounts of data, and MapReduce provides computation over massive amounts of data.
The name Hadoop is not an abbreviation; it is the name that founder Doug Cutting's child gave to a toy elephant, and the name itself is not descriptive.
Second, the HDFS architecture
A distributed storage system.
1. Features
(1) Master-slave mode: one master node and multiple slave nodes; data blocks are replicated, with three copies by default.
(2) The block is the smallest storage unit, 64 MB by default; a file is split into multiple blocks for storage (see the configuration sketch after this list).
(3) Write-once, read-many access model.
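To make the two settings above concrete, the replication factor and block size are ordinary configuration properties that can be overridden per client or per cluster. The following is a minimal Java sketch; the property names match the Hadoop 1.x era described here and should be treated as assumptions to check against your version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Replication factor: three copies of each block by default.
        conf.setInt("dfs.replication", 3);

        // Block size: 64 MB, matching the default mentioned above
        // (Hadoop 2.x renames the key to "dfs.blocksize" and defaults to 128 MB).
        conf.setLong("dfs.block.size", 64L * 1024 * 1024);

        // Connect to the file system described by this configuration.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}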
2. The function of each node
(1) NameNode, the master node, stores metadata and manages the namespace of the file system.
(2) DataNode, the slave nodes, store the actual data and periodically send the NameNode a list of the blocks they hold.
(3) Secondary NameNode, which helps the NameNode merge the edit log, reducing NameNode startup time.
The NameNode and DataNode are not hard to understand, but the role of the Secondary NameNode is described in various ways; one common explanation is the following:
The fsimage is a snapshot of the entire file system at the time the NameNode starts.
The edit log is the sequence of changes made to the file system after the NameNode starts, and these edits are merged into the fsimage only when the NameNode starts again.
Therefore, if the edit log grows too long, NameNode startup becomes very slow.
The Secondary NameNode periodically fetches the edit log from the NameNode, applies it to its own copy of the fsimage, and copies the updated fsimage back to the NameNode, which can then start directly from this snapshot.
In addition, because of the work it does, the Secondary NameNode can also help with recovery when the NameNode fails.
3. Workflow
Simply put, reading data proceeds as follows:
1) The client sends a request to the master node, which returns the locations where the data is stored;
2) The client then fetches the data from the DataNodes based on that location information.
The real process, however, is more involved.
(1) The client reads data from HDFS (a client-side code sketch follows these steps)
1) The client opens the file by calling the open() method on a DistributedFileSystem object (Hadoop's distributed implementation of FileSystem).
2) DistributedFileSystem makes an RPC call to the NameNode to request the file's block locations.
The NameNode returns the addresses of the DataNodes holding each block of the file (it returns the addresses of all replicas, sorted by distance from the client).
DistributedFileSystem returns an FSDataInputStream object (an input stream over the file) to the client.
The FSDataInputStream wraps a DFSInputStream object, which manages the DataNode and NameNode I/O.
3) The client calls the read() method on this input stream.
4) DFSInputStream connects to the closest DataNode that stores the first block of the file and streams the data from that DataNode back to the client.
When the block has been transferred, DFSInputStream closes the connection to that DataNode.
5) Step 4 is repeated for the next block.
6) When the client has finished reading, it calls close() on the FSDataInputStream.
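From the client's point of view, the whole read path above sits behind the FileSystem API: open() triggers the NameNode RPC, read() pulls data from the nearest DataNodes, and close() ends the streams. A minimal sketch follows; the path /user/demo/input.txt is only an example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On an HDFS cluster the concrete type returned here is DistributedFileSystem.
        FileSystem fs = FileSystem.get(conf);

        // open() issues the RPC to the NameNode for block locations (steps 1-2 above).
        Path path = new Path("/user/demo/input.txt"); // example path
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            // read() streams bytes from the nearest DataNode, block by block (steps 3-5 above).
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } // close() on the stream ends the read (step 6 above).
        fs.close();
    }
}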
(2) The client writes a file to HDFS (a code sketch follows these steps)
The following procedure creates a new file, writes data to it, and then closes it.
1) The client calls the create() method on the DistributedFileSystem object to create the file.
2) DistributedFileSystem makes an RPC call to the NameNode, requesting that a new file be created in the file system's namespace.
The NameNode performs various checks to verify that the file does not already exist and that the client has permission to create it.
If a check fails, the file creation fails and an IOException is returned to the client.
If the checks pass, the NameNode adds a record for the new file.
DistributedFileSystem returns an FSDataOutputStream object to the client.
The FSDataOutputStream wraps a DFSOutputStream object, which manages the DataNode and NameNode I/O.
3) The client writes data.
DFSOutputStream splits the data into packets and appends them to an internal queue called the "data queue".
The DataStreamer consumes the data queue and asks the NameNode to allocate new blocks on a suitable list of DataNodes to store the replicas.
These DataNodes form a pipeline; if the replication factor is 3, there are 3 nodes in the pipeline.
4) The DataStreamer streams the packets to the first DataNode in the pipeline,
which stores the data and forwards it to the second DataNode, and the second forwards it to the third.
DFSOutputStream also maintains a "confirmation queue" (ack queue) and waits for acknowledgements from the DataNodes.
5) When an acknowledgement is received from every DataNode in the pipeline, the corresponding data is removed from the confirmation queue.
6) When the client has finished writing, it calls close() on the FSDataOutputStream.
7) DistributedFileSystem notifies the NameNode that the file write is complete.
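Again, from the client's side the whole write pipeline sits behind create(), write(), and close(). A minimal sketch, assuming the example output path /user/demo/output.txt:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() issues the RPC that asks the NameNode to add the new file (steps 1-2 above).
        Path path = new Path("/user/demo/output.txt"); // example path
        try (FSDataOutputStream out = fs.create(path)) {
            // write() hands data to DFSOutputStream, which packages it and pushes it
            // through the DataNode pipeline (steps 3-5 above).
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } // close() flushes the remaining packets and completes the file (steps 6-7 above).
        fs.close();
    }
}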
4. Advantages:
(1) Fault tolerance: data is stored as multiple replicas; if a DataNode fails, the data is automatically re-replicated.
(2) Shortest-path reads: the NameNode returns replica locations sorted by distance, so the client reads the file stream from the best path.
(3) Scalability: when a new machine is added, the NameNode automatically starts placing data on it.
5. Drawbacks
(1) Single point of failure: if the NameNode fails, the entire HDFS cluster becomes unavailable.
(2) Security: a client can bypass the NameNode and read from or write to DataNodes directly.
Third, MapReduce
A distributed computing framework built on top of the distributed storage system.
1. MapReduce program execution steps (a word-count example follows this list)
(1) Input
(2) Split: the input data is divided into splits
(3) Map: each split is parsed into key/value pairs
(4) Shuffle & sort: key/value pairs are grouped by key, so that all data with the same key forms one group, and each group is sent to one reduce task
(5) Reduce: each group of data is aggregated
(6) Output
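The classic word-count program makes these steps concrete: the map phase parses each input line into <word, 1> pairs, shuffle/sort groups the pairs by word, and the reduce phase sums the counts for each word. The sketch below uses the newer org.apache.hadoop.mapreduce API and is a minimal illustration, not a tuned implementation.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: parse each input line into <word, 1> key/value pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: all values for the same word arrive together; sum them up.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}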
2. MapReduce job execution flow (a driver sketch follows this list)
(1) User launches the job: the user starts the MapReduce program via the hadoop command.
(2) JobClient gets a job ID: the JobClient contacts the JobTracker to obtain a job ID.
(3) JobClient initialization and preparation:
① copy the job's code, configuration, split information, and other resources to HDFS
② partition the input data into splits based on the input path, block size, and configured split size
③ check the output directory
(4) JobClient submits the job: the JobClient submits the job ID and the corresponding resource information to the JobTracker.
(5) JobTracker initializes the job: the JobTracker wraps the submitted information in an object for tracking and adds the job to the job scheduler.
(6) JobTracker gets split information: the JobTracker retrieves the location, boundaries, and other details of each split.
(7) TaskTracker gets tasks: the TaskTracker obtains task information (task ID, data location) from the JobTracker via the heartbeat mechanism.
(8) TaskTracker gets data: the TaskTracker reads the data it needs from HDFS and copies it to its own machine.
(9) TaskTracker runs the task: the TaskTracker starts a child JVM to run the task.
(10) Map or reduce execution: inside the child JVM, the map or reduce task is executed.
When execution finishes, the task notifies the TaskTracker, the TaskTracker notifies the JobTracker, and the JobTracker then notifies the client.
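From the user's side, step (1) amounts to writing a driver class, packaging it into a jar, and launching it with the hadoop jar command; the framework then handles the JobClient/JobTracker/TaskTracker interactions described above. A minimal driver sketch for the word-count classes from the previous section, using the newer org.apache.hadoop.mapreduce Job API (the input and output paths are only examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire up the map and reduce classes from the earlier sketch.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input is divided into splits; the output directory must not already exist.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // example path
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // example path

        // Submit the job and wait for completion (the "notify the client" step above).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}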
Note:
(1) Job: the entire piece of work to be processed is usually called a job; one job contains multiple tasks.
(2) Task: a job is divided into multiple tasks for processing.
(3) Heartbeat mechanism: the TaskTracker reports back to the JobTracker at a fixed interval, reporting its current state and so on.
Fourth, follow-up content
1. Build a Hadoop pseudo-distributed environment
2. Simple MapReduce applications (Java/Python) and related Hadoop commands
3. A brief introduction to the Hadoop ecosystem