Hadoop: The Definitive Guide, Study Notes (Part 3)


About HDFS

Hadoop is, in plain terms, a cluster platform for storing, processing, and analyzing big data; its most important component is HDFS, the Hadoop Distributed File System.

1.

HDFS is a filesystem that stores very large files with streaming data access (a write-once, read-many-times pattern). It does not require high-end hardware; ordinary commodity hardware meets its requirements.

Applications for which HDFS is currently not a good fit: low-latency data access, large numbers of small files, and multiple writers making arbitrary modifications to files.

2.

HDFS stores data in blocks, with a default block size of 64 MB. The blocks are made this large mainly to reduce the relative cost of seeks: data transfer rates keep increasing, and for the large datasets HDFS handles, frequent seeking would otherwise dominate the total running time.

An HDFS cluster has two types of nodes: a namenode and multiple datanodes. The namenode acts as the manager and the datanodes as workers. The namenode maintains the filesystem tree and its metadata, and it knows which datanodes hold the blocks of every file. Losing the namenode therefore leaves HDFS paralyzed, so Hadoop offers two mechanisms to guard against this:

One is to replicate the files that make up the persistent state of the filesystem metadata, for example writing them to the local disk and to a remote NFS mount at the same time.

The other is to run a secondary namenode.

3.

HDFS provides a command-line interface for interaction.
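For example, a file can be copied from the local filesystem into HDFS, listed, and copied back out (the file name and paths here are only illustrative):

% hadoop fs -copyFromLocal test.txt hdfs://localhost/user/tom/test.txt
% hadoop fs -ls /user/tom
% hadoop fs -copyToLocal /user/tom/test.txt test.copy.txt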

4.

Hadoop has an abstract notion of a filesystem, of which HDFS is just one concrete implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and it has several concrete implementations.



Hadoop provides interfaces to many filesystems (the local filesystem, HDFS, and others), and the URI scheme is usually what determines which filesystem implementation is used for interaction.

5.

Hadoop is written in Java, so the Java interface is without doubt the most important one. Below are some concrete uses of the Java API.

(1) Data read:

Reading data using a URL

Java can be made to recognize Hadoop's URL schemes by calling the setURLStreamHandlerFactory method on java.net.URL with an instance of FsUrlStreamHandlerFactory.

Note: this method can be called at most once per JVM, so it is usually executed in a static block. This means that if some other part of your program (perhaps a third-party component outside your control) has already set a URLStreamHandlerFactory, you can no longer use this approach to read data from Hadoop.

Code:
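The original listing did not survive extraction, so here is a minimal sketch of the book's URLCat program, assuming the standard FsUrlStreamHandlerFactory and IOUtils helpers:

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// Prints the contents of a Hadoop filesystem URL to standard output.
public class URLCat {

    static {
        // May be called at most once per JVM, hence the static block.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}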


Run:

% hadoop URLCat hdfs://localhost/user/tom/test.txt

Results:

Hello World, Hello World

Hello World

Hello World, Hello World

Reading data using the FileSystem API

The code is self-explanatory; just follow the comments.
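The listing itself was lost in extraction; below is a minimal sketch of the book's FileSystemCat example, which uses FileSystem.get() to obtain the filesystem instance matching the URI's scheme:

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Prints the contents of a file to standard output via the FileSystem API.
public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Choose the FileSystem implementation that matches the URI scheme.
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            // open() returns an FSDataInputStream, which also supports seek().
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}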

(2) Data write

The FileSystem class has a series of methods for creating a file; the simplest is:

public FSDataOutputStream create(Path f) throws IOException

Note that create() creates any parent directories of the file that do not already exist; if this is not desired, you can call exists() first to check whether the parent directory exists.

There is also an overloaded create() that takes a Progressable callback interface, so that the application we write is told the progress of the data being written to the datanodes:

package org.apache.hadoop.util;

public interface Progressable {
    public void progress();
}
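To illustrate, here is a minimal sketch modeled on the book's FileCopyWithProgress example: it copies a local file to a Hadoop filesystem and prints a dot each time the progress callback fires (the two command-line arguments are a local source path and a Hadoop destination URI):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        // progress() is called each time a packet of data has been
        // written to the datanode pipeline.
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });

        // The final 'true' closes both streams when the copy finishes.
        IOUtils.copyBytes(in, out, 4096, true);
    }
}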

As an alternative to creating a new file, you can use the following method to write to an existing one:

public FSDataOutputStream append(Path f) throws IOException

This method allows data to be appended at the end of an existing file. (The append operation is optional and not implemented by all Hadoop filesystems.)
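A minimal usage sketch, assuming the target file already exists on a filesystem that implements append (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        // Uses the default filesystem configured in core-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical path; the file must already exist.
        FSDataOutputStream out = fs.append(new Path("/user/tom/log.txt"));
        out.write("one more line\n".getBytes("UTF-8"));
        out.close();
    }
}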

(3) Directories

FileSystem provides a method for creating a directory:

public boolean mkdirs(Path f) throws IOException
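Like java.io.File.mkdirs(), this creates any missing parent directories as well and returns true on success. A minimal sketch (the directory path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkdirsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Creates /user/tom/new and /user/tom/new/dir in one call if needed.
        boolean created = fs.mkdirs(new Path("/user/tom/new/dir"));
        System.out.println("created: " + created);
    }
}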

(4) Querying the file system

The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.

FileSystem's getFileStatus() provides a way to get a FileStatus object for a single file or directory.

If you only want to check whether a file exists, you can use the exists(Path f) method mentioned earlier.
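Putting these together, a minimal sketch that checks for a file and prints some of its FileStatus fields (the URI is hypothetical):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowFileStatus {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://localhost/user/tom/test.txt"; // hypothetical path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path path = new Path(uri);
        if (fs.exists(path)) {
            FileStatus stat = fs.getFileStatus(path);
            System.out.println("length:      " + stat.getLen());
            System.out.println("block size:  " + stat.getBlockSize());
            System.out.println("replication: " + stat.getReplication());
            System.out.println("modified:    " + stat.getModificationTime());
            System.out.println("owner:       " + stat.getOwner());
            System.out.println("permissions: " + stat.getPermission());
        }
    }
}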

When querying a batch of files, it is often convenient to use wildcards rather than enumerating every file, so Hadoop provides operations that expand wildcard characters (globbing).

Hadoop supports the same set of wildcard characters as the Unix bash shell, through two FileSystem methods:

public FileStatus[] globStatus(Path pathPattern) throws IOException

public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
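A minimal usage sketch of the first form (the directory layout is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Match every file two levels below /2008, e.g. /2008/01/31.
        FileStatus[] statuses = fs.globStatus(new Path("/2008/*/*"));
        for (FileStatus status : statuses) {
            System.out.println(status.getPath());
        }
    }
}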

Wildcard characters (the same set as the bash shell):

*        matches zero or more characters
?        matches a single character
[ab]     matches a single character in the set {a, b}
[^ab]    matches a single character not in the set {a, b}
[a-b]    matches a single character in the range a to b
[^a-b]   matches a single character not in the range a to b
{a,b}    matches either expression a or b
\c       escapes metacharacter c


(5) Deleting data

The delete() method on FileSystem permanently removes files or directories:

public boolean delete(Path f, boolean recursive) throws IOException

If f is a file or an empty directory, the value of recursive is ignored; a non-empty directory is deleted, together with its contents, only when recursive is true (otherwise an IOException is thrown).
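A minimal usage sketch (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // recursive = true: remove the directory and everything beneath it.
        boolean deleted = fs.delete(new Path("/user/tom/olddir"), true);
        System.out.println("deleted: " + deleted);
    }
}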
