Hadoop: Sorting out Various Concepts

Source: Internet
Author: User
Keywords: hadoop, hadoop architecture, hadoop big data
Hadoop
Hadoop implements a distributed file system, the Hadoop Distributed File System, referred to as HDFS.

HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is well suited to applications with large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed in a streaming fashion.

The core design of the Hadoop framework consists of HDFS and MapReduce: HDFS provides storage for massive amounts of data, and MapReduce provides computation over that data.
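To make the streaming-access idea concrete, here is a minimal sketch in Java using the standard Hadoop FileSystem API that reads a text file from HDFS line by line as a stream. The namenode address hdfs://namenode:9000 and the path /data/input.txt are illustrative placeholders, not values from this article.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: stream a text file stored on HDFS, line by line.
public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/input.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // process each line as it streams in
            }
        }
    }
}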

What problems does Hadoop solve?

Massive data needs timely analysis and processing

Massive data needs in-depth analysis and mining

Data needs to be stored for a long time

Problems with massive data storage:

Disk I/O, rather than CPU, is the bottleneck

Network bandwidth is a scarce resource

Hardware failure has become a major factor affecting stability

Hadoop related technologies
HBase
NoSQL database with key-value storage
Maximizes the use of memory
HDFS
Hadoop Distributed File System
Maximizes the use of disk
MapReduce
A programming model, mainly used for data analysis (see the word-count sketch after this list)
Maximizes the use of CPU
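As a sketch of the MapReduce programming model mentioned above, the classic word-count example below shows the two phases: the map phase emits a (word, 1) pair for every word, and the reduce phase sums the counts for each word. Class names such as TokenMapper and SumReducer are illustrative, and job configuration and submission are omitted.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word count: map emits (word, 1), reduce sums the counts per word.
public class WordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get(); // add up the counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }
}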
Centralized system
A centralized system can be summarized in one sentence: one host with multiple terminals. The terminals have no data processing capability and are only responsible for data input and output; all computation and storage are performed on the host. Most current banking systems are centralized systems of this kind, and they are also found in large enterprises, scientific research institutions, the military, and government. Centralized systems were mainly popular in the last century.

The biggest feature of a centralized system is that its deployment structure is very simple. The bottom layer generally uses expensive mainframes purchased from vendors such as IBM and HP, so there is no need to consider how to deploy a service across multiple nodes or how the nodes coordinate with each other. However, because everything runs on a single machine, this approach tends to produce systems that are large and complex, difficult to maintain, poorly scalable, and vulnerable to single points of failure (when that single point fails, the entire system or network is paralyzed).

Distributed system
A group of independent computers collectively provides a service to the outside world, but to the users of the system it appears as a single computer. "Distributed" means that many ordinary computers (as opposed to expensive mainframes) can be combined into a cluster that provides the service. The more computers there are, the more CPU, memory, and storage resources are available, and the greater the volume of concurrent access that can be handled.

A standard distributed system should have the following main characteristics:

Distribution
The computers in a distributed system can be arbitrarily distributed in space. There is no master-slave relationship among them; that is, no single host controls the entire system.

Transparency
System resources are shared by all computers. Users of each computer can use not only the resources of their local machine but also the resources of other computers in the distributed system (including CPUs, files, printers, and so on).

Identity
Several computers in the system can cooperate with each other to complete a common task, or a program can be distributed across several computers and run in parallel.

Communication
Any two computers in the system can exchange information with each other through communication.

Distributed data and storage
Large-scale websites often need to process massive amounts of data, and a single computer usually cannot provide enough storage space for all of that data, so the data must be stored in a distributed fashion across many machines.

Distributed Computing
As computing technology develops, some applications require enormous computing power to complete; with centralized computing they would take a very long time. Distributed computing breaks the application into many small parts, which are distributed to multiple computers for processing, and the partial results are then combined. This reduces the overall computation time and greatly improves efficiency.
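The split-process-combine idea can be sketched on a single machine with Java's fork/join framework. This is only a local analogy for the distributed case (a real distributed framework such as MapReduce sends the parts to different machines), and the array size and threshold below are arbitrary values chosen for illustration.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Local analogy of distributed computing: split a big sum into sub-ranges,
// compute them in parallel, then combine the partial results.
public class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000;
    private final long[] data;
    private final int from, to;

    ParallelSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i]; // small enough: compute directly
            return sum;
        }
        int mid = (from + to) / 2;
        ParallelSum left = new ParallelSum(data, from, mid);
        ParallelSum right = new ParallelSum(data, mid, to);
        left.fork();                          // process one half asynchronously
        return right.compute() + left.join(); // combine the partial results
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        long total = ForkJoinPool.commonPool().invoke(new ParallelSum(data, 0, data.length));
        System.out.println("sum = " + total);
    }
}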
