Hadoop is a software framework that enables distributed processing of large data sets. It processes data reliably, efficiently, and scalably: because it assumes that compute and storage elements can fail, it keeps multiple copies of working data and redistributes work away from failed nodes, and by processing in parallel it accelerates computation and can handle petabytes of data.
Hadoop is a distributed system infrastructure developed by the Apache Foundation; it consists primarily of HDFS, MapReduce, and HBase.
Because the Hadoop release line had become confusing as the Apache Hadoop versions evolved, as of December 23, 2012 Apache Hadoop was divided into two generations: the first generation is referred to as Hadoop 1.0 and the second as Hadoop 2.0. There are still significant differences between the two versions.
Overall, version 1.0 is composed of the distributed storage system HDFS and the distributed computing framework MapReduce. HDFS consists of one NameNode and multiple DataNodes; MapReduce consists of one JobTracker and multiple TaskTrackers.
Version 2.0 introduces HDFS Federation, which allows multiple NameNodes to manage different directories, providing access isolation and horizontal scaling while addressing the NameNode single point of failure. It also introduces the resource management framework YARN, which can manage and schedule resources for many types of applications.

Basic concepts of HDFS

Data blocks
The default basic storage unit of HDFS (Hadoop Distributed File System) is the 64 MB data block, and each block is stored on at least three DataNodes. Files in HDFS are split into 64 MB chunks; if a file is smaller than a data block, it does not occupy the entire block's storage space.

NameNode and DataNode
The NameNode manages the namespace of the file system; the DataNodes are where the file system's data is actually stored.

Basic HDFS commands

Listing files in HDFS
$ hdfs dfs -ls
$ hdfs dfs -ls in    // list the files in the HDFS directory named in
Uploading files
/usr/local/hadoop$ bin/hadoop dfs -put test1 test
// upload the local file test1 to HDFS and rename it to test
Downloading files to local
/usr/local/hadoop$ bin/hadoop dfs -get in getin
// copy the HDFS file in to the local system and name it getin
Deleting files
/usr/local/hadoop$ bin/hadoop dfs -rmr out
// delete the HDFS directory named out
Viewing files
/usr/local/hadoop$ bin/hadoop dfs -cat in/*
// print the contents of the files under the HDFS directory named in
Creating a directory
/usr/local/hadoop$ bin/hadoop dfs -mkdir /user/hadoop/examples
// HDFS can only create one new directory level at a time
Copying files
/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal
// copy a local file into HDFS (takes a local source and an HDFS destination)
Application scenarios for HDFS
HDFS is well suited to high-throughput, write-once, read-many workloads.
It is not suited to storing large numbers of small files, or to low-latency or multi-user read/write workloads.

HDFS File Read
When HDFS reads a file:
1. The client first creates an RPC connection;
2. It obtains the locations of the file's data blocks from the NameNode;
3. A socket connection is established between the client and a DataNode, choosing the DataNode closest to the client;
4. The file's data blocks are downloaded.

HDFS File Write
1. Create an RPC connection;
2. Create the file's metadata;
3. Allocate blocks for the file;
4. A socket pipeline is established between the client and the DataNodes (covering steps 4, 5, and 6); a data block can be written only after the pipeline has been established successfully;
5. Each DataNode reports to the NameNode when it receives a block;
6. If each block of the file has three replicas, then as soon as one replica has been reported to the NameNode, the client sends a message to the NameNode confirming that the file is complete.

The Java interface to HDFS

Reading data from a Hadoop URL
InputStream in = null;
try {
    // URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) must be
    // called once per JVM so that java.net.URL recognizes the hdfs:// scheme.
    in = new URL("hdfs://host/path").openStream();
    IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
    IOUtils.closeStream(in);
}
Reading data through the FileSystem API
String uri = "hdfs://host/path";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
InputStream in = null;
try {
    in = fs.open(new Path(uri));
    IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
    IOUtils.closeStream(in);
}
Writing data
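Writing works the same way as the FileSystem read example above, using fs.create() instead of fs.open(). The sketch below is a minimal illustration; the class name HdfsWrite and the hdfs://host/path URI are placeholders, and the same code works unchanged against a local path such as file:///tmp/demo.txt for experimentation.

```java
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    // Write the given bytes to a path; works for hdfs:// and file:// URIs.
    public static void write(String uri, byte[] data) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // create() returns an output stream; it overwrites an existing file by default
        OutputStream out = fs.create(new Path(uri));
        try {
            out.write(data);
        } finally {
            out.close();
        }
    }

    public static void main(String[] args) throws Exception {
        write("hdfs://host/path", "hello hdfs".getBytes("UTF-8")); // placeholder URI
    }
}
```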
Listing a directory
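A directory can be listed with fs.listStatus(), which returns one FileStatus per entry. This is a sketch under the same assumptions as above (placeholder class name and URI):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsList {
    // Print the full path of every entry in the given directory.
    public static void list(String uri) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        for (FileStatus status : fs.listStatus(new Path(uri))) {
            System.out.println(status.getPath());
        }
    }

    public static void main(String[] args) throws Exception {
        list("hdfs://host/path"); // placeholder URI
    }
}
```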
Deleting data
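Files and directories are removed with fs.delete(); its boolean second argument enables recursive deletion, which is what the dfs -rmr shell command shown earlier does. Again a minimal sketch with a placeholder class name and URI:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDelete {
    // Delete a file or directory; returns true on success.
    public static boolean delete(String uri) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // The second argument requests recursive deletion of directories
        return fs.delete(new Path(uri), true);
    }

    public static void main(String[] args) throws Exception {
        delete("hdfs://host/path"); // placeholder URI
    }
}
```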