"Finishing Learning HDFs" Hadoop Distributed File system a distributed filesystem

Source: Internet
Author: User
Tags: cassandra, posix, mapr, isilon

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on common, commodity hardware. It has much in common with existing distributed file systems, but the differences from them are also significant. HDFS is highly fault-tolerant and is designed to be deployed on inexpensive machines. It provides high-throughput access to application data and is well suited to applications with very large datasets. To enable streaming access to file system data, HDFS relaxes a subset of the POSIX requirements.

The Hadoop Distributed File System, HDFS for short[1], was initially created as infrastructure for Nutch, an open source Apache search engine project. HDFS later became part of the Apache Hadoop core project, and Hadoop itself began as a subproject of Lucene.

Hardware failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of servers, each storing a fragment of the file system's data. Given that there is a very large number of components and that each component has a non-trivial probability of failure, some part of HDFS is always non-functional. Fault detection and quick, automatic recovery are therefore a central design goal of HDFS.
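To make the recovery machinery concrete, here is a minimal Java sketch using the standard org.apache.hadoop.fs.FileSystem client API. It raises the replication factor of one file so the cluster can lose more DataNodes before any block becomes unavailable; the NameNode detects missing replicas through DataNode heartbeats and re-replicates them in the background. The NameNode address hdfs://namenode:9000 and the file path are placeholder values, not anything from the original article.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; in a real deployment this usually
        // comes from core-site.xml (fs.defaultFS).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Ask for 5 replicas instead of the common default of 3. The
        // NameNode schedules the extra copies asynchronously, and it will
        // also restore them if a DataNode holding one of them dies.
        boolean accepted = fs.setReplication(new Path("/data/important.log"), (short) 5);
        System.out.println("replication change accepted: " + accepted);

        fs.close();
    }
}
```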

Streaming data access

Applications that run on HDFS need streaming access to their datasets; they are not general-purpose applications that run on ordinary file systems. HDFS is designed for batch processing rather than interactive use. The emphasis is on the throughput of data access rather than its response time. Many of the hard POSIX requirements are unnecessary for HDFS applications, and trading away a small set of key POSIX semantics yields better data throughput.

Applications that run on HDFS have very large datasets; a typical HDFS file is gigabytes to terabytes in size. HDFS is therefore tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.
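A minimal sketch of this streaming, throughput-oriented access pattern, again with the Java FileSystem API (the NameNode URI and path are placeholders): the file is opened once and read sequentially from beginning to end, which is the pattern HDFS is optimized for.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Open once, read front to back: sequential scans maximize
        // throughput, which matters more here than per-read latency.
        FSDataInputStream in = fs.open(new Path("/data/huge-dataset.tsv"));
        try {
            IOUtils.copyBytes(in, System.out, 8192, false); // fixed-size buffer copy
        } finally {
            IOUtils.closeStream(in);
        }
        fs.close();
    }
}
```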

Simple consistency model

Most HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, does not need to be modified. This assumption simplifies data consistency issues and makes high-throughput data access possible. A MapReduce application or a web crawler fits this model perfectly.
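In client code the write-once-read-many pattern looks like the sketch below (paths and the NameNode URI are placeholders; passing overwrite=false to create() makes the "write once" intent explicit):

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path part = new Path("/crawl/part-00000");

        // Create, write, close -- after close() the file is complete and,
        // under this model, is never modified again.
        FSDataOutputStream out = fs.create(part, false); // false: refuse to overwrite
        out.write("http://example.org/\n".getBytes(StandardCharsets.UTF_8));
        out.close();

        // From now on the file is only read (possibly many times, by many readers).
        IOUtils.copyBytes(fs.open(part), System.out, 4096, true);
        fs.close();
    }
}
```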

Moving computation is cheaper than moving data

A computation is most efficient when it executes near the data it operates on, especially when the dataset is very large. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data closer to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is stored.
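One such interface is the block-location query: a client (or a scheduler such as MapReduce's) can ask the NameNode which DataNodes hold each block of a file, and then run the computation on or near those hosts. A sketch, with placeholder URI and path:

```java
import java.net.URI;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        FileStatus status = fs.getFileStatus(new Path("/data/huge-dataset.tsv"));
        // One BlockLocation per block; getHosts() names the DataNodes
        // holding replicas, which is exactly what a locality-aware
        // scheduler needs in order to move the computation to the data.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```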

Portability between heterogeneous software and hardware platforms

HDFS is designed to be easily portable from one platform to another, which encourages broader adoption of HDFS as the platform of choice for applications with large datasets.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates client access to files, together with a number of DataNodes, usually one per machine, each managing the storage attached to the node it runs on. HDFS exposes a file system namespace and allows user data to be stored as files.

Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients, and they also perform block creation, deletion, and replication on instruction from the NameNode.
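This division of labor is visible in the client API: namespace operations such as rename and directory listing are metadata calls answered by the NameNode, while the blocks themselves stay put on the DataNodes. A sketch with placeholder paths and NameNode URI:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // rename() is a pure metadata operation on the NameNode; no block
        // data is copied between DataNodes.
        fs.rename(new Path("/staging/part-00000"), new Path("/published/part-00000"));

        // listStatus() is likewise served from the NameNode's namespace.
        for (FileStatus s : fs.listStatus(new Path("/published"))) {
            System.out.printf("%s len=%d replication=%d blockSize=%d%n",
                    s.getPath(), s.getLen(), s.getReplication(), s.getBlockSize());
        }
        fs.close();
    }
}
```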

HDFS (Hadoop Distributed File System) is the core subproject of the Hadoop project and the foundation of data storage and management in distributed computing. Frankly, HDFS is a good distributed file system with many strengths, but it also has drawbacks: it is not suited to low-latency data access, it cannot efficiently store large numbers of small files, and it does not support multiple writers or arbitrary modification of files.

Under the Apache Software Foundation, HDFS has continually looked for ways to improve its performance and usability, but frankly it may be best suited to pilot projects, unconventional projects, and less demanding environments. Some Hadoop users, however, place high demands on performance, availability, and enterprise-class features, and are wary of the direct-attached storage (DAS) architecture, especially since older versions of Hadoop lack a highly available master node. For those users, the following eight products are good alternatives to HDFS.

1. Cassandra (DataStax)

Cassandra is not a full file system but an open source NoSQL key-value store. It nevertheless gives web applications that rely on fast data access one more alternative to HDFS. In short, DataStax blends Hadoop into Cassandra, so that web applications can quickly access data through Hadoop, and Hadoop can quickly access the data that flows into Cassandra.


2. Ceph

Ceph is an open source, multi-purpose distributed storage system. Because of its high-performance parallel file system characteristics, some even regard it as a successor to HDFS in Hadoop environments, and researchers have been exploring that possibility since 2010.


3. Cleversafe: Distributed Storage Network

Cleversafe announced this week that it will combine Hadoop's parallel programming technology with its own dispersed storage network. The principle is to distribute all of the metadata across the cluster, relying neither on a single master node nor on replication; Cleversafe says this makes it faster, more stable, and more scalable than HDFS.


4. GPFS (IBM)

IBM has long sold its parallel file system to high-performance computing users, including operators of the world's fastest supercomputers. In 2010 it launched GPFS-SNC (Shared Nothing Cluster), a Hadoop-oriented version of GPFS, and announced that GPFS-SNC is much faster than Hadoop's HDFS because it runs at the kernel level rather than on top of the operating system the way HDFS does.


5. Isilon (EMC)

EMC had already been shipping its own Hadoop distribution for a year when, in January 2012, it turned Isilon's OneFS file system into a new enterprise-class HDFS solution. Because Isilon can speak the NFS, CIFS, and HDFS protocols, a single Isilon NAS system can ingest, process, and analyze data.


6. Lustre

HPC storage provider Xyratex claimed in a 2011 report that Lustre-based clusters are faster and cheaper than HDFS-based clusters.


7. MapR File System

The MapR file system has a solid reputation in the industry. MapR claims that its file system is 2 to 5 times faster than HDFS (in some cases as much as 20 times), and it also offers mirroring, snapshots, and similar features that enterprise users like.


8. NetApp Hadoop Open Solution

NetApp has revamped the physical architecture under Hadoop, placing HDFS on disk arrays so that Hadoop runs faster, more stably, and more securely.




"Finishing Learning HDFs" Hadoop Distributed File system a distributed filesystem

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.