Cloud Computing's Secret Weapon: Rapid Deployment of Hadoop Clusters


Cloud computing has recently been seen as a major trend in the IT industry. It can be roughly defined as scalable computing resources delivered as a service from outside your own environment and paid for by usage. You can access resources in the cloud over the Internet without having to worry about computing power, bandwidth, storage, security, or reliability.

From an enterprise perspective, ever-growing volumes of data are becoming hard to store in standard relational databases or even data warehouses. This points to problems that have existed in practice for many years. For example: how do you query a table with a billion rows? How do you run a query across all the logs on all the servers in a data center? An even harder problem is that much of the data to be processed is unstructured or semi-structured, which makes it still more difficult to query.

The "Cloud computing" field has become a major battleground for the future "duel" of many multinational it giants. Aware that "cloud computing" would be a landmark change in the IT landscape, nearly all heavyweight multinational it giants began to take root in "cloud computing" from different fields and perspectives, with the main players of Amazon, Google, IBM, Mircosoft, VMware, Cisoco, Intel, AMD, Oracle, SAP, HP, Dell, Citrix, Redhat, Novell, Yahoo, etc. Silicon Valley now has about 150 companies involved in "cloud computing", and new business models are emerging.

The market for cloud computing is enormous. By IDC's most optimistic estimate, the global cloud computing field will generate $800 billion in new business revenue over the next three years. Clearly, the reason the global IT giants are competing in this field is the astronomical future size of the market and the bright prospects it implies. Since 2011, the major IT companies have been locked in a scramble for future dominance of the cloud computing market.

Introduction to Hadoop

Apache Hadoop is a software framework for processing large amounts of data in a distributed way. It first appeared in 2006, backed by companies such as Google, Yahoo!, and IBM, and it can be viewed as fitting the PaaS model.

At its core are an implementation of MapReduce and HDFS (the Hadoop Distributed File System), which derive from Google's MapReduce paper and the Google File System (GFS) paper, respectively.

MapReduce is a programming model introduced by Google that supports distributed computation over large datasets on a cluster of computers (nodes). It consists of two phases: map and reduce.

During the map phase, the master node takes the input, splits it into smaller subtasks, and distributes them to worker nodes.

The worker nodes process these small tasks and return their results to the master node.

Then, during the reduce phase, the master node combines the results of all the subtasks into the final output, the answer to the original task.

The advantage of MapReduce is that both the map and the reduce operations can be processed in a distributed fashion. Because each map operation is independent, all maps can run in parallel, which cuts the total computation time.
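To make the two phases concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API; the class name and input/output paths are illustrative. The mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in this node's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the partial counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, the job would be submitted with something like `hadoop jar wordcount.jar WordCount /input /output`; because every input split is mapped independently, the map tasks run in parallel across the cluster exactly as described above.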

To an external client, HDFS looks like a traditional hierarchical file system: you can create, delete, move, and rename files, and so on. Internally, however, HDFS is built on a specific set of nodes, a consequence of its design. These include the NameNode (of which there is exactly one), which serves metadata within HDFS, and the DataNodes, which provide the storage blocks. Because there is only one NameNode, it is a single point of failure, a known weakness of HDFS.
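As a sketch of this client-side view, the snippet below drives those familiar file operations through Hadoop's Java FileSystem API; the NameNode URI and the paths are assumptions made for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode URI

    FileSystem fs = FileSystem.get(conf);

    fs.mkdirs(new Path("/demo"));                     // create a directory
    fs.create(new Path("/demo/report.txt")).close();  // create an empty file
    fs.rename(new Path("/demo/report.txt"),           // rename (move) it
              new Path("/demo/report-old.txt"));
    fs.delete(new Path("/demo/report-old.txt"), false); // delete the file
    fs.close();
  }
}
```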

Files stored in HDFS are split into blocks, which are then replicated across multiple machines (DataNodes). This is very different from a traditional RAID architecture. The block size (64 MB by default in early Hadoop releases, 128 MB later) and the replication factor are chosen by the client when the file is created. The NameNode coordinates all file operations, and all communication within HDFS runs over the standard TCP/IP protocol.
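Because block size and replication factor are per-file choices made at creation time, a client can pass both explicitly to FileSystem.create(), as in the sketch below; the cluster address and the chosen numbers are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode URI
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/big.dat");
    // create(path, overwrite, bufferSize, replication, blockSize):
    // request 3 replicas and a 64 MB block size for this file.
    FSDataOutputStream out =
        fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
    out.writeUTF("sample payload");
    out.close();

    // Read the settings back from the file's metadata.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication = " + status.getReplication());
    System.out.println("block size  = " + status.getBlockSize());
    fs.close();
  }
}
```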

The NameNode is a piece of software typically run on a dedicated machine in an HDFS instance. It manages the file system namespace, controls access by external clients, and decides how files map to replicated blocks on the DataNodes. With the common replication factor of 3, the default placement policy puts the first replica on the writer's own node (or a random node), the second replica on a node in a different rack, and the third on a different node in the same rack as the second. Understanding the cluster's rack topology matters here.
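That placement policy only works if Hadoop knows which rack each node lives in. One standard way to supply this, sketched below with an illustrative script path, is to configure a topology script in core-site.xml that maps a DataNode's address to a rack ID such as /rack1:

```xml
<!-- core-site.xml: enable rack awareness via a user-supplied script. -->
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>
```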

Actual I/O does not flow through the NameNode; only the metadata describing the mapping of DataNodes to file blocks passes through it. When an external client asks to create a file, the NameNode responds with the block ID and the IP address of the DataNode that will hold the first copy of the block. The NameNode also informs the other DataNodes that will receive replicas of the block.

The NameNode stores all information about the file system namespace in a file called FsImage. This file, together with a record of all transactions (the EditLog), is kept on the NameNode's local file system. The FsImage and EditLog files should themselves be kept in redundant copies to protect against file corruption or loss of the NameNode system.
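One standard way to keep those redundant copies is to point dfs.namenode.name.dir in hdfs-site.xml at several directories, for example a local disk plus an NFS mount, so that the NameNode writes the FsImage and EditLog to each of them; the paths below are illustrative.

```xml
<!-- hdfs-site.xml: persist the namespace to two directories so a single
     disk failure does not lose the FsImage and EditLog. -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>
```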
