Hadoop Distributed File System (HDFS)


    • Hadoop history

Hadoop's embryonic beginnings trace back to 2002 and Apache Nutch. Nutch is an open-source search engine implemented in Java; it provides all the tools needed to run your own search engine, including full-text search and a web crawler.

Then, in 2003, Google published a technical paper on the Google File System (GFS). GFS is the proprietary file system Google designed to store its massive amounts of search data.

In 2004, Nutch founder Doug Cutting, building on Google's GFS paper, implemented a distributed file storage system named NDFS.

Also in 2004, Google published another technical paper, this one on MapReduce. MapReduce is a programming model for parallel analysis of large datasets (larger than 1 TB).

In 2005, Doug Cutting likewise implemented MapReduce in the Nutch search engine, again based on Google's paper.

In 2006, Yahoo hired Doug Cutting, who upgraded NDFS and MapReduce and renamed the result Hadoop. Yahoo set up a dedicated team for Doug Cutting to focus specifically on developing Hadoop.

It has to be said that Google and Yahoo have contributed to Hadoop.

    • Hadoop Core

The core of Hadoop is HDFS and MapReduce. Both are foundational layers rather than concrete, high-level applications; Hadoop's many well-known sub-projects, such as HBase and Hive, are built on top of HDFS and MapReduce. To understand Hadoop, you first have to understand what HDFS and MapReduce are.

    • HDFS

HDFS (Hadoop Distributed File System) is a highly fault-tolerant system designed to be deployed on inexpensive machines. It provides high-throughput data access for applications with very large datasets.

The design features of HDFS are:

1. Big data files: HDFS is well suited to storing terabyte-scale files or large collections of big data files; if your files are only a few gigabytes or smaller, it is far less interesting.

2. Block storage: HDFS splits a complete large file into evenly sized blocks and stores them on different machines, which means a file can be read by fetching different blocks from multiple hosts at the same time; reading from many hosts is far more efficient than reading from a single host (see the block-location sketch after this list).

3. Streaming data access, write once and read many times: unlike a traditional file system, HDFS does not support modifying file contents in place. A file is expected to be written once and never changed; the only way to change it is to append content at the end.

4. Cheap hardware: HDFS runs on ordinary PCs, which lets a company stand up a big data cluster with just a few dozen inexpensive machines.

5. Hardware failure: HDFS assumes that any machine may fail. To keep a host failure from making its blocks unreadable, it replicates each file block to several other hosts; if one host goes down, a copy of the block can quickly be fetched from another.
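
To make the block and replication ideas concrete, here is a minimal sketch (not from the original article) that uses Hadoop's FileSystem API to list where the blocks of a file live. The path /data/sample.log is a hypothetical example, and the code assumes an HDFS cluster reachable through the default configuration (core-site.xml / hdfs-site.xml) on the classpath.

Java code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; replace with a real file on your cluster.
        Path file = new Path("/data/sample.log");
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation describes one block of the file and the DataNodes
        // holding its replicas, which is why a client can read different blocks
        // from different hosts in parallel and survive the loss of one host.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}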

Key elements of HDFS:

Block: a file is split into blocks, typically 64 MB each.

NameNode: holds the entire file system's directory information, file metadata, and block information. It runs on a single dedicated host, so if that host has a problem, the NameNode fails with it. Starting with Hadoop 2.x, an active-standby mode is supported: if the primary NameNode fails, a standby host takes over running the NameNode.

DataNode: distributed across inexpensive machines, the DataNodes store the actual block files. (A minimal read/write sketch follows this list.)
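
As a rough illustration of how the NameNode and DataNodes divide the work, here is a minimal sketch of writing and then reading a small file through the FileSystem API. The NameNode address hdfs://namenode:8020 and the path /demo/hello.txt are placeholders, not values from the original article.

Java code:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; use your cluster's fs.defaultFS value.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/hello.txt");

        // Writing: the NameNode records the file's metadata and block list,
        // while the bytes themselves are streamed out to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Reading: the client asks the NameNode where the blocks live,
        // then pulls the data directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}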

    • MapReduce

In layman's terms, MapReduce is a programming model for extracting and analyzing elements from massive source data and finally returning a result set. Storing the files in a distributed way across disks is the first step; extracting and analyzing the content we need from that massive data is what MapReduce does.

Take finding the maximum value in a large amount of data as an example: a bank has hundreds of millions of depositors, and it wants to find out how large the largest deposit is. In the traditional way, we would do this:

Java code:
long[] moneys = ...;
long max = 0L;
for (int i = 0; i < moneys.length; i++) {
    if (moneys[i] > max) {
        max = moneys[i];
    }
}

If the array is small, this implementation works fine, but it runs into trouble when facing massive amounts of data.

MapReduce would do it like this: the numbers are already stored distributed across different blocks; each map task takes a few blocks and computes the largest value within them, then the maximum values from all the map tasks are fed into a reduce operation, which takes the overall maximum and returns it to the user.


The basic principle of MapReduce is to split a large data analysis job into small pieces, analyze each piece separately, and then aggregate and analyze the extracted results to finally get the content we want. Of course, how the data is split for analysis and how the reduce step is carried out can be quite complex; Hadoop already provides the data analysis framework, so we only need to write the simple logic of our requirements to get the data we want.
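
Below is a minimal MapReduce sketch of the bank example above, assuming the deposit amounts are stored one number per line in text files on HDFS. The class names (MaxMoney, MaxMapper, MaxReducer) and the input/output paths passed on the command line are illustrative, not from the original article: each map task emits only the largest value in its split, and a single reduce step keeps the largest of those.

Java code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxMoney {

    // Map: each mapper sees one split (block) of the input and, in cleanup,
    // emits only the largest value it has seen, under a single shared key.
    public static class MaxMapper
            extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
        private long max = Long.MIN_VALUE;

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            String line = value.toString().trim();
            if (!line.isEmpty()) {
                max = Math.max(max, Long.parseLong(line));
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), new LongWritable(max));
        }
    }

    // Reduce: a single reducer receives the per-split maxima and keeps the largest.
    public static class MaxReducer
            extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
        @Override
        protected void reduce(NullWritable key, Iterable<LongWritable> values,
                              Context context) throws IOException, InterruptedException {
            long max = Long.MIN_VALUE;
            for (LongWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(NullWritable.get(), new LongWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max money");
        job.setJarByClass(MaxMoney.class);
        job.setMapperClass(MaxMapper.class);
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}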

    • Summary

In general, Hadoop is well suited to big data storage and big data analytics applications. It fits clusters ranging from several thousand to tens of thousands of servers and supports petabytes of storage capacity.

Typical Hadoop applications include search, log processing, recommendation systems, data analysis, video and image analysis, data archiving, and more.

The original article this brief introduction is based on: http://blessht.iteye.com/blog/2095675

To be continued ~
