Some basic knowledge of Hadoop

Source: Internet
Author: User
Tags: socket, hdfs, dfs

Hadoop is a software framework for distributed processing of large amounts of data. Hadoop processes data in a reliable, efficient, and scalable way. Because it assumes that compute elements and storage will fail, it maintains multiple copies of working data and ensures that processing can be redistributed away from failed nodes; it works in parallel, which speeds up processing, and it can handle petabytes of data.

Hadoop consists primarily of HDFS, MapReduce, and HBase. It is a distributed system infrastructure developed by the Apache Foundation.

Because Hadoop's versioning had become chaotic as it evolved, as of December 23, 2012 the Apache Hadoop releases were divided into two generations: the first generation is referred to as Hadoop 1.0 and the second generation as Hadoop 2.0.

Differences between the two versions

There are still significant differences between Hadoop 1.0 and Hadoop 2.0.

Overall, version 1.0 consists of the distributed storage system HDFS and the distributed computing framework MapReduce, where HDFS is made up of a NameNode and multiple DataNodes, and MapReduce is made up of a JobTracker and multiple TaskTrackers.

In 2.0, HDFS Federation was introduced, which lets multiple NameNodes manage different directories in order to achieve access isolation and horizontal scaling; the NameNode single point of failure is also addressed (through HDFS high availability). In addition, the resource management framework YARN was introduced, which can perform resource management and scheduling for many kinds of applications.

Basic concepts of HDFS

Data block
The default basic storage unit of HDFS (Hadoop Distributed File System) is a 64 MB data block (128 MB in later versions), and each block is stored on three DataNodes by default. Files in HDFS are split into 64 MB blocks; if a file is smaller than a data block, it does not occupy the entire block's storage space.

Metadata node (NameNode) and data node (DataNode)
The NameNode manages the namespace of the file system; the DataNode is where the data is actually stored.
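To check these values for an existing file, the block size and replication factor can be read through the Java FileSystem API that is covered later in this article. The sketch below is illustrative only: the class name BlockInfo, the NameNode host, and the file path are placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://host/user/hadoop/test";   // placeholder file on HDFS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FileStatus status = fs.getFileStatus(new Path(uri));
        // Block size and replication factor recorded for this file.
        System.out.println("block size:  " + status.getBlockSize());
        System.out.println("replication: " + status.getReplication());
    }
}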

Basic HDFS commands

Listing the files under HDFS
$ hdfs dfs -ls
$ hdfs dfs -ls in
// List the files in the directory named in on HDFS
Uploading files
/usr/local/hadoop$ bin/hadoop dfs -put test1 test
// Upload the test1 file from the Hadoop directory to HDFS and rename it to test
Download files to Local
/usr/local/hadoop$ bin/hadoop dfs -get in getin
// Copy the in file from HDFS to the local file system and name it getin
Deleting files
/usr/local/hadoop$ bin/hadoop dfs -rmr out
// Remove the file or directory named out from HDFS
View Files
/usr/local/hadoop$ bin/hadoop dfs -cat in/*
// View the contents of the files under the in directory on HDFS
Create a Directory
/usr/local/hadoop$ bin/hadoop dfs -mkdir /user/hadoop/examples
// HDFS can only create one directory level at a time
Copying files
/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal <local source> <HDFS destination>
// Copy a file from the local file system to HDFS
Application scenarios for HDFS

HDFS is suitable for high-throughput, write-once, read-many scenarios.
It is not suitable for storing large numbers of small files, or for low-latency, multi-user read and write workloads.

HDFS File Read


When HDFS reads a file:
1. The client first creates an RPC connection;
2. The locations of the file's data blocks are obtained from the NameNode;
3. A socket connection is established between the client and a DataNode, choosing the DataNode closest to the client;
4. The data blocks of the file are downloaded.

HDFS File Write


1. An RPC connection is created;
2. The file metadata is created;
3. Blocks are allocated for the file;
4. Sockets are established between the client and the DataNodes, and between the DataNodes themselves; a data block can be written only after the sockets have been established successfully;
5. Each DataNode reports to the NameNode after it receives a block;
6. If each block of the file has three replicas, then as soon as one replica has been reported to the NameNode, the client sends a message to the NameNode confirming that the file is complete.

The Java Interface in HDFS

Reading data from a Hadoop URL

InputStream in = null;
try {
    // java.net.URL only understands the hdfs:// scheme after
    // URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
    // has been called once in the JVM.
    in = new URL("hdfs://host/path").openStream();
    // ... process in ...
} finally {
    IOUtils.closeStream(in);
}
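Note that URL.setURLStreamHandlerFactory can be called at most once per JVM, so this approach is unusable if another part of the application has already registered a stream handler factory; the FileSystem API shown next avoids that limitation.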
Reading data through the FileSystem API
String uri = "hdfs://host/path";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
InputStream in = null;
try {
    in = fs.open(new Path(uri));
    IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
    IOUtils.closeStream(in);
}
Writing data, viewing a directory, and deleting data
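The same FileSystem API covers these operations as well. Below is a minimal sketch, assuming the hypothetical directory hdfs://host/user/hadoop; the file name out.txt and the sample text are placeholders, and the classes used (Configuration, FileSystem, FSDataOutputStream, FileStatus, Path) come from the org.apache.hadoop packages as in the read example.

String uri = "hdfs://host/user/hadoop";   // placeholder directory
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);

// Write data: create (or overwrite) a file and write a few bytes into it.
FSDataOutputStream out = fs.create(new Path(uri + "/out.txt"));
out.writeBytes("hello hdfs\n");
out.close();

// View a directory: list the entries under a path.
for (FileStatus status : fs.listStatus(new Path(uri))) {
    System.out.println(status.getPath());
}

// Delete data: the boolean argument enables recursive deletion for directories.
fs.delete(new Path(uri + "/out.txt"), false);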
