Hadoop: The Definitive Guide Study Note 3

HDFS Introduction

Disclaimer: This article contains my personal understanding and notes based on Hadoop: The Definitive Guide, and it is for reference only. If you find any mistakes, please point them out so that we can learn and improve together.

To put it bluntly, Hadoop is a cluster framework for big data processing and analysis. Its most important component is HDFS, the Hadoop Distributed File System.

1.

HDFS is a system for storing very large files with streaming data access (a write-once, read-many-times pattern). It does not require high-end hardware; ordinary commodity hardware on the market is sufficient.

HDFS is currently not well suited to applications that need low-latency data access, that involve large numbers of small files, or that have multiple writers making arbitrary modifications to files.

2.

HDFS storage is block-based, and the default block size is 64 MB. HDFS uses such a large block mainly to reduce addressing (seek) overhead: data transfer rates keep getting faster, so when HDFS processes big data, frequent seeks would otherwise dominate the running time.
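As a hedged illustration, the sketch below prints the block size a client would use for new files (the hdfs://localhost address is a placeholder for your namenode, and this assumes a 64 MB default):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Minimal sketch: print the default block size the client would use.
public class BlockSizePeek {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost"), conf);
        // Reflects the configured block size (64 MB by default here)
        System.out.println("Default block size: " + fs.getDefaultBlockSize() + " bytes");
    }
}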

An HDFS cluster has two kinds of nodes: a namenode and multiple datanodes. The namenode acts as the manager, and the datanodes act as the workers. The namenode maintains the file system tree (its branches and forks) and records on which datanodes all the blocks are stored, while the datanodes store and retrieve the blocks themselves. Losing the namenode therefore paralyzes HDFS, so Hadoop provides two mechanisms to guard against this:

The first is to back up the files that make up the persistent state of the file system metadata, for example writing them to the local disk while also writing to a remote NFS mount.

The second is to run a secondary namenode.

3.

HDFS provides a command-line interface for interacting with the file system.
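For example (assuming a running HDFS; the paths and file names are placeholders):

% hadoop fs -mkdir /user/tom
% hadoop fs -copyFromLocal test.txt /user/tom/test.txt
% hadoop fs -ls /user/tom
% hadoop fs -cat /user/tom/test.txt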

4.

Hadoop has an abstract notion of a file system, of which HDFS is just one concrete implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a file system in Hadoop, and there are several concrete implementations of it.

Hadoop provides many file system implementations (local, HDFS, FTP, S3, and others), and the URI scheme is usually used to determine which file system instance to interact with.
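A hedged sketch of how the URI scheme selects the implementation (the hdfs://localhost address is a placeholder and assumes a running namenode):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // file:// selects LocalFileSystem; hdfs:// selects DistributedFileSystem
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost"), conf);
        System.out.println(local.getClass().getName());
        System.out.println(hdfs.getClass().getName());
    }
}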

5.

Hadoop is implemented in Java, so the Java interface is undoubtedly the most important one. Below are some concrete uses of the Java API.

(1) Data Reading:

Use URL to read data

To make Java recognize Hadoop's hdfs URL scheme, call the setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory.

Note: this method can be called only once per Java virtual machine, so it is usually invoked in a static block. As a consequence, if some other part of the program (perhaps third-party code outside your control) has already set a URLStreamHandlerFactory, you can no longer use this approach to read data from Hadoop.

Code:
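A minimal sketch along the lines of the book's URLCat example, which reads a file and copies it to standard output:

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        // May be called only once per JVM, hence the static block
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false); // false: do not close System.out
        } finally {
            IOUtils.closeStream(in);
        }
    }
}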

Run:

% hadoop URLCat hdfs://localhost/user/tom/test.txt

Result:

Hello world

Hello world

Hello world Hello world

Use FileSystem API to read data

See the code below; pay attention to the comments.
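A sketch in the shape of the book's FileSystemCat example (the URI is taken from the command line):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://localhost/user/tom/test.txt
        Configuration conf = new Configuration();
        // The URI scheme picks the concrete FileSystem implementation
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri)); // returns a seekable FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}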

(2) Data Writing

The FileSystem class has a series of file creation methods.

public FSDataOutputStream create(Path f) throws IOException

When creating a file with create(), you can use exists() to check whether the parent directory already exists; note that create() will create any missing parent directories automatically.

There is also an overload that takes a Progressable, a callback interface through which our application is notified of the progress of the data being written to the datanodes.

package org.apache.hadoop.util;

public interface Progressable {
    public void progress();
}
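A sketch in the spirit of the book's FileCopyWithProgress example, printing a dot for each progress callback while a local file is copied to HDFS (paths come from the command line):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1]; // e.g. hdfs://localhost/user/tom/output.txt

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print("."); // called as data is written to the datanodes
            }
        });
        IOUtils.copyBytes(in, out, 4096, true); // true: close streams when done
    }
}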

You can also use the following methods to create a file:

public FSDataOutputStream append(Path f) throws IOException

This method allows you to append data to the end of an existing file. (Append is optional, and not all Hadoop file systems implement it.)

(3) Directory

FileSystem provides a method for creating directories:

public boolean mkdirs(Path f) throws IOException
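A usage fragment (the directory path is a placeholder; imports and the FileSystem handle are obtained as in the earlier examples):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost"), conf);
// Creates the directory and any missing parents, like mkdir -p
boolean created = fs.mkdirs(new Path("/user/tom/dir1/dir2"));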

(4) Querying the File System

The FileStatus class encapsulates file system metadata for files and directories, including the file length, block size, replication, modification time, ownership, and permission information.

FileSystem's getFileStatus() method returns the FileStatus object for a single file or directory.

If you only need to know whether a file exists, you can use the exists(Path f) method mentioned earlier.
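A fragment showing the metadata FileStatus exposes (the path is a placeholder; fs is obtained as in the earlier examples):

FileStatus stat = fs.getFileStatus(new Path("/user/tom/test.txt"));
System.out.println(stat.getLen());              // file length in bytes
System.out.println(stat.getBlockSize());        // block size
System.out.println(stat.getReplication());      // replication factor
System.out.println(stat.getModificationTime()); // last modified, ms since epoch
System.out.println(stat.getOwner());            // owner
System.out.println(stat.getPermission());       // permissions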

Wildcards are often used in Hadoop to process batches of files in a single operation. Hadoop therefore supports the same set of glob characters as the Unix bash shell, via two FileSystem methods:

public FileStatus[] globStatus(Path pathPattern) throws IOException

public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

Wildcards: * matches zero or more characters, ? matches a single character, [ab] matches a character in the set, [^ab] matches a character not in the set, and {a,b} matches either alternative a or b.
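A glob usage fragment (the pattern is a placeholder; fs is obtained as in the earlier examples, and FileUtil is org.apache.hadoop.fs.FileUtil):

// List everything matching the glob; stat2Paths converts FileStatus[] to Path[]
FileStatus[] statuses = fs.globStatus(new Path("/user/tom/2010/*/*"));
if (statuses != null) {
    for (Path p : FileUtil.stat2Paths(statuses)) {
        System.out.println(p);
    }
}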

(5) Deleting Data

The delete() method in FileSystem permanently removes files and directories:

public boolean delete(Path f, boolean recursive) throws IOException

If f is a file or an empty directory, the value of recursive is ignored; a non-empty directory is removed (together with its contents) only when recursive is true.
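A usage fragment (the path is a placeholder; fs is obtained as in the earlier examples):

// recursive = true is required to delete a non-empty directory
boolean deleted = fs.delete(new Path("/user/tom/old-data"), true);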
