Open-source implementation of running applications in Linux Hadoop

Source: Internet
Author: User

Linux Hadoop is quite common, So I studied Linux Hadoop and shared it with you here. I hope it will be useful to you. What is Linux Hadoop.

Linux Hadoop is a framework used to run applications on cheap hardware devices in large clusters. Linux Hadoop provides a set of stable and reliable interfaces and data motion transparent to applications. Google's MapReduce algorithm is implemented in Linux Hadoop, which divides applications into many small units of work. Each unit can be executed on any cluster node or repeatedly.

In addition, Linux Hadoop provides a distributed file system for storing data on various computing nodes and provides high throughput for data read/write. Because the map/reduce and distributed file systems are applied to make the Linux Hadoop framework highly fault tolerant, it will automatically process failed nodes. The Linux Hadoop framework has been tested in clusters with 600 nodes.

Google's data center uses a cheap Linux PC to form a cluster and runs various applications on it. Even new users of distributed development can quickly use Google's infrastructure. There are three core components:

1. GFSGoogle File System ).

A Distributed File System that hides details such as lower-layer load balancing and redundant replication, and provides a unified file system API interface for upper-layer programs. Google has made special optimizations based on its own needs, including: Access to ultra-large files, the proportion of read operations far exceeds the write operation, and the PC is prone to failures, resulting in node failure.

GFS divides files into 64 MB blocks, which are distributed on machines in the cluster and stored in the Linux File System. At the same time, each file must have at least three copies of redundancy. The Center is a Master node that searches for file Blocks Based on the file index. For details, see the GFS paper published by Google engineers.

2. MapReduce.

Google found that most distributed operations can be abstracted as MapReduce operations. Map splits Input into Key/Value pairs in the middle, and Reduce combines Key/Value into final Output. The two functions are provided to the system by programmers. The underlying facilities distribute Map and Reduce operations on the cluster and store the results on GFS.

3. BigTable.

A large distributed database is not a relational database. Like its name, it is a huge table used to store structured data.

Open-source implementation

This distributed framework is very creative and highly scalable, making Google highly competitive in terms of system throughput. Therefore, the Apache Foundation uses Java to implement an open-source version that supports Linux platforms such as Fedora. Currently, Linux Hadoop is supported by Yahoo. Some Yahoo employees have been working on projects for a long time, and Yahoo is also preparing to use Linux Hadoop to replace the original FreeBSD-based system.

Linux Hadoop implements the HDFS file system and MapRecue. The current version is 0.16. Not yet mature, but it can be run on 2000 nodes. You only need to inherit MapReduceBase, provide two classes for Map and Reduce respectively, and register a Job to automatically run the distributed operation.

HDFS divides nodes into NameNode and DataNode. NameNode is unique. The program communicates with it and then accesses the file from DataNode. These operations are transparent and are no different from common file system APIs. MapReduce is primarily a JobTracker node, which allocates work and communicates with user programs.

The project is still in progress and has not reached version 1.0. The gap between the project and the Google system is also very large, but the progress is very fast and worth noting. In addition, this is the initial implementation of Cloud Computing and a bridge to the future.

Project homepage: http: // Linux Hadoop.apache.org a distributed system infrastructure developed by the Apache Foundation. You can develop distributed programs without understanding the details of the distributed underlying layer. Make full use of the power of clusters for high-speed computing and storage.

Shenzhen Universiade-Shenzhen 2011 Summer Universiade

  1. The Beta3 version of the Linux Mono Project has been released to learn more RPM here
  2. Linux CVS check whether xinetd is installed in the system
  3. Linux configuration files and user management related system files
  4. Full parsing of Linux File Types
  5. Introduction to Linux swap partition and experiment scenarios and processes

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.