Hadoop is a software framework that enables distributed processing of large data sets. It processes data reliably, efficiently, and scalably: because it assumes that compute and storage elements can fail, it keeps multiple copies of working data and redistributes work away from failed nodes, and by processing in parallel it accelerates computation and can handle petabytes of data.
Hadoop is a distributed system infrastructure developed by the Apache Foundation; it consists primarily of HDFS, MapReduce, and HBase.
Because the Hadoop release line had become confusing as the Apache Hadoop versions evolved, as of December 23, 2012 Apache Hadoop was divided into two generations: the first generation is referred to as Hadoop 1.0 and the second as Hadoop 2.0. There are still significant differences between the two versions.
Overall, version 1.0 is composed of the distributed storage system HDFS and the distributed computing framework MapReduce. HDFS consists of one NameNode and multiple DataNodes; MapReduce consists of one JobTracker and multiple TaskTrackers.
Version 2.0 introduces HDFS Federation, which allows multiple NameNodes to manage different directories, providing access isolation and horizontal scaling while addressing the NameNode single point of failure. It also introduces the resource management framework YARN, which can manage and schedule resources for many types of applications.

Basic concepts of HDFS

Data blocks
The default basic storage unit of HDFS (Hadoop Distributed File System) is the 64 MB data block, and each block is stored on at least three DataNodes. Files in HDFS are split into 64 MB chunks; if a file is smaller than a data block, it does not occupy the entire block's storage space.

NameNode and DataNode
The NameNode manages the namespace of the file system; the DataNodes are where the file system's data is actually stored.

Basic HDFS commands

Listing files in HDFS
$ hdfs dfs -ls
$ hdfs dfs -ls in    // list the files in the HDFS directory named in
Uploading files
/usr/local/hadoop$ bin/hadoop dfs -put test1 test
// upload the local file test1 to HDFS and rename it to test
Downloading files to local
/usr/local/hadoop$ bin/hadoop dfs -get in getin
// copy the HDFS file in to the local system and name it getin
Deleting files
/usr/local/hadoop$ bin/hadoop dfs -rmr out
// delete the HDFS directory named out
Viewing files
/usr/local/hadoop$ bin/hadoop dfs -cat in/*
// print the contents of the files under the HDFS directory named in
Creating a directory
/usr/local/hadoop$ bin/hadoop dfs -mkdir /user/hadoop/examples
// HDFS can only create one new directory level at a time
Copying files
/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal
// copy a local file into HDFS (takes a local source and an HDFS destination)
Application scenarios for HDFS
HDFS is well suited to high-throughput, write-once, read-many workloads.
It is not suited to storing large numbers of small files, or to low-latency or multi-user read/write workloads.

HDFS File Read
When HDFS reads a file:
1. The client first creates an RPC connection;
2. It obtains the locations of the file's data blocks from the NameNode;
3. A socket connection is established between the client and a DataNode, choosing the DataNode closest to the client;
4. The file's data blocks are downloaded.

HDFS File Write
1. Create an RPC connection;
2. Create the file's metadata;
3. Allocate blocks for the file;
4. A socket pipeline is established between the client and the DataNodes (covering steps 4, 5, and 6); a data block can be written only after the pipeline has been established successfully;
5. Each DataNode reports to the NameNode when it receives a block;
6. If each block of the file has three replicas, then as soon as one replica has been reported to the NameNode, the client sends a message to the NameNode confirming that the file is complete.

The Java interface to HDFS

Reading data from a Hadoop URL
InputStream in = null;
try {
    // URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) must be
    // called once per JVM so that java.net.URL recognizes the hdfs:// scheme.
    in = new URL("hdfs://host/path").openStream();
    IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
    IOUtils.closeStream(in);
}
Reading data through the FileSystem API
String uri = "hdfs://host/path";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
InputStream in = null;
try {
    in = fs.open(new Path(uri));
    IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
    IOUtils.closeStream(in);
}
Writing data
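Writing works the same way as the FileSystem read example above, using fs.create() instead of fs.open(). The sketch below is a minimal illustration; the class name HdfsWrite and the hdfs://host/path URI are placeholders, and the same code works unchanged against a local path such as file:///tmp/demo.txt for experimentation.

```java
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    // Write the given bytes to a path; works for hdfs:// and file:// URIs.
    public static void write(String uri, byte[] data) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // create() returns an output stream; it overwrites an existing file by default
        OutputStream out = fs.create(new Path(uri));
        try {
            out.write(data);
        } finally {
            out.close();
        }
    }

    public static void main(String[] args) throws Exception {
        write("hdfs://host/path", "hello hdfs".getBytes("UTF-8")); // placeholder URI
    }
}
```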
Listing a directory
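A directory can be listed with fs.listStatus(), which returns one FileStatus per entry. This is a sketch under the same assumptions as above (placeholder class name and URI):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsList {
    // Print the full path of every entry in the given directory.
    public static void list(String uri) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        for (FileStatus status : fs.listStatus(new Path(uri))) {
            System.out.println(status.getPath());
        }
    }

    public static void main(String[] args) throws Exception {
        list("hdfs://host/path"); // placeholder URI
    }
}
```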
Deleting data
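Files and directories are removed with fs.delete(); its boolean second argument enables recursive deletion, which is what the dfs -rmr shell command shown earlier does. Again a minimal sketch with a placeholder class name and URI:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDelete {
    // Delete a file or directory; returns true on success.
    public static boolean delete(String uri) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // The second argument requests recursive deletion of directories
        return fs.delete(new Path(uri), true);
    }

    public static void main(String[] args) throws Exception {
        delete("hdfs://host/path"); // placeholder URI
    }
}
```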