HDFS Introduction
Disclaimer: This article contains my personal understanding and notes based on Hadoop: The Definitive Guide. It is for reference only. If you find any mistakes, please point them out so that we can learn and improve together.
To put it bluntly, Hadoop is a cluster framework for big data processing and analysis. Its most important component is HDFS, the Hadoop Distributed File System.
1. The Design of HDFS
HDFS is a system for storing very large files with streaming data access (a write-once, read-many pattern). It does not require high-end hardware; ordinary commodity hardware is sufficient.
HDFS is currently not well suited to applications that require low-latency data access, that handle large numbers of small files, or that need arbitrary file modifications by multiple writers.
2. HDFS Concepts
HDFS storage is block-based; the default block size is 64 MB. HDFS blocks are this large mainly to minimize the cost of seeks: when a block is large, the time to transfer its data from disk dominates the time to seek to its start. If blocks were small, processing big data would involve frequent seeks and the total running time would grow.
An HDFS cluster has two types of nodes: one namenode and multiple datanodes. The namenode acts as the manager and the datanodes act as workers. The namenode maintains the filesystem tree and the metadata for all files and directories, including which blocks make up each file; the datanodes store and retrieve the blocks themselves. Losing the namenode therefore leaves HDFS unusable, so Hadoop provides two mechanisms to guard against this:
One is to back up the files that make up the persistent state of the filesystem metadata: for example, writing them to the local disk and, at the same time, to a remote NFS mount.
The other is to run a secondary namenode, which periodically merges the namespace image with the edit log.
3. The Command-Line Interface
HDFS provides a command-line interface for interaction.
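As a sketch of what this interaction looks like (assuming a running HDFS cluster and that the `hadoop` command is on the PATH; the paths are illustrative), the `hadoop fs` command exposes familiar Unix-style file operations:

```shell
# List the contents of a directory (paths without a scheme default to HDFS)
hadoop fs -ls /user/tom

# Copy a local file into HDFS, then back out again
hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt
hadoop fs -copyToLocal /user/tom/quangle.txt quangle.copy.txt

# Create a directory and inspect the result
hadoop fs -mkdir /user/tom/books
hadoop fs -ls /user/tom
```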
4. Hadoop Filesystems
Hadoop has an abstract notion of a filesystem, of which HDFS is just one concrete implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and it has several concrete implementations.
Hadoop provides interfaces to many filesystems, and it usually uses the URI scheme to determine which filesystem instance to communicate with.
5. The Java Interface
Hadoop is written in Java, so the Java interface is undoubtedly the most important. Below are some concrete uses of the Java API.
(1) Reading data
Reading data using a URL:
One way to make Java programs recognize Hadoop's hdfs URL scheme is to call the setURLStreamHandlerFactory method of URL with an instance of FsUrlStreamHandlerFactory.
Note: this method can only be called once per Java virtual machine, so it is usually executed in a static block. Consequently, if another part of your program (perhaps a third-party component outside your control) has already set a URLStreamHandlerFactory, you will not be able to use this approach to read data from Hadoop.
Code:
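The original code listing appears to have been lost from this copy; the sketch below follows the well-known URLCat example from Hadoop: The Definitive Guide (it assumes hadoop-common is on the classpath and an HDFS cluster is reachable):

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// Displays a file from a Hadoop filesystem on standard output,
// opening the stream through java.net.URL.
public class URLCat {

    static {
        // May only be called once per JVM, hence the static block.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            // Copy the stream to stdout in 4 KB chunks; don't close stdout.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```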
Run:
% hadoop URLCat hdfs://localhost/user/tom/test.txt
Result:
Hello world
Hello world
Hello world
Hello world
Reading data using the FileSystem API:
Read the code directly, paying attention to the comments.
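The listing itself seems to be missing from this copy; the sketch below follows the canonical FileSystemCat example from the book (again assuming hadoop-common on the classpath and a reachable cluster):

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Displays a file from a Hadoop filesystem on standard output,
// using the FileSystem API directly instead of java.net.URL.
public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Look up the FileSystem instance for this URI's scheme (e.g. hdfs://)
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            // open() returns an FSDataInputStream, which also supports seek()
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```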
(2) Writing data
The FileSystem class has a series of methods for creating files:

public FSDataOutputStream create(Path f) throws IOException
Note that create() will create any parent directories of the file that do not already exist; if this is not what you want, you can first call exists() to check whether the parent directory exists.
There is also an overloaded version of create() that takes a Progressable callback, so that the application we write is notified of the progress of data being written to the datanodes:
package org.apache.hadoop.util;

public interface Progressable {
    public void progress();
}
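Putting the two together, a sketch along the lines of the book's FileCopyWithProgress example copies a local file into HDFS and prints a dot each time the callback fires:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

// Copies a local file to a Hadoop filesystem, printing "." whenever
// the Progressable callback fires (i.e. as data reaches the datanodes).
public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });

        // Close both streams when the copy finishes (last argument: true).
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
```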
A file can also be created with the following method:

public FSDataOutputStream append(Path f) throws IOException

This method allows you to append data to the end of an existing file.
(3) Directories
FileSystem provides a method for creating a directory:

public boolean mkdirs(Path f) throws IOException

Like java.io.File.mkdirs(), it creates all necessary parent directories and returns true on success.
(4) Querying the filesystem
The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.
FileSystem's getFileStatus() method provides a way to obtain a FileStatus object for a file or directory.
If you only want to check whether a file exists, you can use the exists(Path f) method mentioned above.
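A small sketch of getFileStatus() in use (assuming hadoop-common on the classpath; the URI is passed on the command line):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints a few of the FileStatus fields for a given path.
public class ShowFileStatus {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        FileStatus stat = fs.getFileStatus(new Path(uri));
        System.out.println("path:        " + stat.getPath());
        System.out.println("length:      " + stat.getLen());
        System.out.println("block size:  " + stat.getBlockSize());
        System.out.println("replication: " + stat.getReplication());
        System.out.println("modified:    " + stat.getModificationTime());
        System.out.println("owner:       " + stat.getOwner());
        System.out.println("permission:  " + stat.getPermission());
    }
}
```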
Wildcards are often used in Hadoop to process batches of files in a single operation. FileSystem therefore provides two globbing methods that support the same wildcards as the Unix bash shell:

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
Supported wildcards (the same set as bash globbing):
* matches zero or more characters
? matches a single character
[ab] matches a single character in the set {a, b}
[^ab] matches a single character not in the set {a, b}
[a-b] matches a single character in the range a to b
{a,b} matches either expression a or expression b
\c escapes the metacharacter c
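A usage sketch of globStatus() (the pattern, e.g. /2007/*/31, is passed on the command line; assumes hadoop-common on the classpath):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists all paths matching a glob pattern such as /2007/*/31.
public class GlobList {
    public static void main(String[] args) throws Exception {
        String pattern = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(pattern), conf);

        FileStatus[] statuses = fs.globStatus(new Path(pattern));
        for (FileStatus status : statuses) {
            System.out.println(status.getPath());
        }
    }
}
```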
(5) Deleting data
The delete() method of FileSystem permanently removes files or directories:

public boolean delete(Path f, boolean recursive) throws IOException

If f is a file or an empty directory, the value of recursive is ignored; a non-empty directory is deleted only when recursive is true (otherwise an IOException is thrown).
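A minimal deletion sketch (the path is passed on the command line; assumes hadoop-common on the classpath):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Recursively deletes the given path; delete() returns true on success.
public class DeletePath {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        // recursive=true: a non-empty directory is removed along with
        // its contents; for files the flag is ignored.
        boolean deleted = fs.delete(new Path(uri), true);
        System.out.println("deleted: " + deleted);
    }
}
```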