First Understanding of the Concept of Hadoop

Hadoop is the backbone of big data, so a few questions naturally come up: what is Hadoop, what can it do, what are its advantages, and how does it carry out operations on massive data? These questions are probably on your mind right now. Don't worry; let's get to know Hadoop, the magical little elephant, step by step!

Origin

Since the birth of the computer in 1946, and now in the era of artificial intelligence and big data in 2020, data has kept growing. Ten years ago the growth may not have been particularly obvious, but in recent years the amount of data, what we call massive data, has become hard to even measure. During this year's epidemic in particular, big data made an outstanding contribution to epidemic prevention and control in China. As the news and the Internet have made clear, the technologies of this new era, artificial intelligence and big data, benefit every human being!

From basic file storage to today's data warehouses, this progress has kept inspiring generation after generation of IT talent. In May 2011, the concept of "big data" was formally proposed.

Several characteristics of big data:
1. Huge volume of data
2. Many types of data
3. Fast processing speed (the "1-second rule": results must come back within seconds, or the data loses its value)
4. Low value density: for example, the surveillance cameras in a classroom are on every day, but the recordings only become valuable once something noteworthy is actually captured.

Google’s "troika" has changed traditional perceptions

Google's three papers, on GFS, MapReduce, and BigTable, struck the spark that laid a solid foundation for big data technology; their significance was epoch-making.

The GFS idea

A distributed file system has two basic components: the client and the server. With a single server, the hard disk capacity and the safety of the data are clearly insufficient, and GFS was designed to solve exactly this problem.
A management node is added to manage the hosts that store the data. A host that stores data is called a data node. An uploaded file is split into blocks of a fixed size, and what a data node stores are these data blocks rather than independent files. The default redundancy of each data block is 3.

When uploading a file, the client first connects to the management node, which generates the block information for the file: file name, size, upload time, the location of each data block, and so on. This information is the file's metadata and is stored on the management node. After obtaining the metadata, the client uploads the data blocks one by one: each block is first uploaded to one data node, and then, coordinated by the management node, it is copied horizontally to other nodes (hosts) until the redundancy requirement is met.
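From the client's point of view, this whole flow is hidden behind the standard HDFS Java API. The following is a minimal sketch, not code from the original article; the NameNode address hdfs://namenode:9000 and the path /data/hello.txt are made-up examples.

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "3");   // default block redundancy is 3
    // Hypothetical management node (NameNode) address; usually read from core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

    // create() registers the file with the NameNode (metadata first),
    // then the stream writes the data blocks, which are replicated for us.
    try (FSDataOutputStream out = fs.create(new Path("/data/hello.txt"))) {
      out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
    }
    fs.close();
  }
}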

Data block

The data block is the smallest storage unit in HDFS. The default block size is 128 MB.

Metadata

fsimage holds the entire file system namespace (including the mapping of blocks to files and the file system attributes). It can be viewed with:

hdfs oiv -i <fsimage file> -o <output file> -p XML

edits records every change made to the file system metadata. It can be viewed with:

hdfs oev -i <edits file> -o <output file>

NameNode startup process:
1. Load the fsimage
2. Load the edits log
3. Save a checkpoint
4. Wait for the DataNodes to report their block information

After a DataNode starts, it:
1. Scans its local block information
2. Reports that information to the NameNode
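To make the relationship between the fsimage and the edits log concrete, here is a toy sketch in plain Java (not Hadoop code): at startup the namespace is rebuilt by loading the last snapshot (the fsimage) and replaying the changes recorded since then (the edits), and the merged result is what a checkpoint writes out as a new fsimage. All names and entries are made up.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NamespaceStartupDemo {
  public static void main(String[] args) {
    // fsimage: the last saved snapshot of the namespace (file -> its block list).
    Map<String, String> fsimage = new HashMap<>();
    fsimage.put("/data/a.txt", "blk_1,blk_2");

    // edits: every metadata change made since that snapshot was written.
    List<String[]> edits = new ArrayList<>();
    edits.add(new String[] {"ADD", "/data/b.txt", "blk_3"});
    edits.add(new String[] {"DELETE", "/data/a.txt", ""});

    // NameNode startup: load the fsimage, then replay the edits to reach the current state.
    Map<String, String> namespace = new HashMap<>(fsimage);
    for (String[] edit : edits) {
      if ("ADD".equals(edit[0])) {
        namespace.put(edit[1], edit[2]);
      } else {
        namespace.remove(edit[1]);
      }
    }

    // Checkpoint: this merged state is what gets saved as the new fsimage.
    System.out.println("Current namespace: " + namespace);
  }
}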

Heartbeat mechanism
The GFS Master periodically exchanges heartbeat messages with every server to confirm that it is still alive, which maximizes the reliability and availability of the data. HDFS uses the same idea: each DataNode sends periodic heartbeats to the NameNode.
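The mechanism is easy to picture with a toy sketch in plain Java (an illustration of the idea only, not GFS or Hadoop code): each node reports in on a fixed interval, and the master treats any node whose last report is too old as dead. The node name, interval, and timeout are made up, although HDFS's own default heartbeat interval is also 3 seconds.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatDemo {
  // Last heartbeat timestamp per node, as seen by the master.
  private static final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
  private static final long TIMEOUT_MS = 10_000;

  public static void main(String[] args) {
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);

    // A data node sends "I am alive" every 3 seconds.
    pool.scheduleAtFixedRate(
        () -> lastSeen.put("node-1", System.currentTimeMillis()),
        0, 3, TimeUnit.SECONDS);

    // The master periodically checks which nodes have fallen silent.
    pool.scheduleAtFixedRate(() -> lastSeen.forEach((node, ts) -> {
      boolean alive = System.currentTimeMillis() - ts < TIMEOUT_MS;
      System.out.println(node + (alive ? " is alive" : " is considered dead"));
    }), 5, 5, TimeUnit.SECONDS);
  }
}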

The MapReduce idea
Its core is "divide and conquer". Take page ranking as an example: with hundreds of millions of web pages, the ranking can no longer be computed as one giant matrix. So what do we do? The matrix is split into blocks, each block is computed separately, and the partial results are combined, step by step, into the final summary. This idea transcends its time: whether in computing or in everyday study and life, "split the task, then merge the results" is the most practical approach.
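As a concrete illustration of the map-then-reduce pattern, here is the classic WordCount job written against the standard Hadoop MapReduce Java API. It is a textbook example rather than code from the original article; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map ("split the task"): each input line is tokenized and (word, 1) pairs are emitted.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce ("merge the results"): the counts for each word are summed.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job is packaged into a jar and submitted with hadoop jar; each mapper works on one block of the input, and the framework shuffles the intermediate pairs to the reducers for the final summary.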

The BigTable idea
The basic idea of BigTable is to store all of the data in one big table. This is good for retrieving massive amounts of data: in the big data era it can significantly improve query efficiency, but it is less convenient for adding, modifying, and deleting data.
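In the Hadoop ecosystem this idea lives on in HBase, the open-source counterpart of BigTable. The sketch below uses the standard HBase Java client API to show the "one big table addressed by row key, column family, and qualifier" model; the table name webpages and the column names are made up for illustration, and the connection settings are assumed to come from hbase-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BigTableStyleDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("webpages"))) {

      // Write: everything lives in one table, addressed by (row key, column family, qualifier).
      Put put = new Put(Bytes.toBytes("com.example/index.html"));
      put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"),
          Bytes.toBytes("<html>...</html>"));
      table.put(put);

      // Read: a point lookup by row key is cheap, which is what makes retrieval efficient.
      Result result = table.get(new Get(Bytes.toBytes("com.example/index.html")));
      byte[] html = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
      System.out.println(Bytes.toString(html));
    }
  }
}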

HDFS
HDFS is the core sub-project of the Hadoop project and the foundation of storage management in distributed computing.
HDFS, short for Hadoop Distributed File System, is one implementation of the Hadoop abstract file system. The abstract file system can also be backed by the local file system, Amazon S3, and so on, and can even be accessed over the web protocol (WebHDFS). HDFS files are distributed across the machines of a cluster, and replicas are kept for fault tolerance and reliability. Clients write and read files directly against the machines of the cluster, so there is no single point of performance pressure.

Rack awareness and the replica placement strategy: replica 1 is stored on, say, rack 1. For safety, replica 2 is stored on a different rack, here rack 2. Replica 3 is stored on the same rack as replica 2 (on a different host), for efficiency: if replica 2 is damaged, the "nearest first" principle lets it be recovered from another host on the same rack.
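Where the replicas of a file's blocks actually ended up can be inspected from the client. Here is a minimal sketch with the standard FileSystem API; the NameNode address and the path /data/example.txt are made up.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());

    Path file = new Path("/data/example.txt");
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      // Each block lists the hosts (and their rack paths) holding one of its replicas.
      System.out.println("offset " + block.getOffset()
          + "  hosts: " + String.join(",", block.getHosts())
          + "  racks: " + String.join(",", block.getTopologyPaths()));
    }
    fs.close();
  }
}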

Features of Hadoop

1. High reliability
2. High scalability
3. High efficiency
4. High fault tolerance

Hadoop ecosystem

The core components of Hadoop are HDFS and MapReduce. As processing needs have varied, one component after another has appeared to enrich the Hadoop ecosystem. The current structure of the ecosystem is roughly as shown in the figure:
(Figure: the layered structure of the Hadoop ecosystem)
By service object and level, it can be divided into the data source layer, data transport layer, data storage layer, resource management layer, data computation layer, task scheduling layer, and business model layer.


Note that Hadoop is not suitable for real-time query workloads.

Hadoop installation and environment setup

A full installation walkthrough is beyond the scope of this overview; the rest of the article assumes a working installation.

HDFS (Distributed File System)

HDFS is the foundation of the entire Hadoop system and is responsible for data storage and management. It is highly fault tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, which makes it suitable for applications with very large data sets.

Client: splits files into blocks; when accessing HDFS, it first interacts with the NameNode to obtain the location information of the target file, then interacts with the DataNodes to read and write the data (a read example follows this list of components).

DataNode: the slave node; it stores the actual data and reports status information to the NameNode. By default, each file block is stored on three different DataNodes to provide reliability and fault tolerance.

Secondary NameNode: assists the NameNode for reliability by periodically merging the fsimage and the edits log and pushing the result back to the NameNode; it can help restore the NameNode in an emergency, but it is not a hot standby for the NameNode.
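A read, as described for the Client above, looks like this with the HDFS Java API: open() consults the NameNode for the block locations, and the returned stream then pulls the bytes from the DataNodes. This is a minimal sketch; the NameNode address and file path are made up.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());

    // open() asks the NameNode where the blocks live; reading streams them from the DataNodes.
    try (FSDataInputStream in = fs.open(new Path("/data/hello.txt"));
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}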

After installation, we first start Hadoop:

start-all.sh

Once it has finished starting, check the running Java processes (for example with jps); the daemons for the roles described above should all appear.

Types of applications that HDFS is not suitable for

1) Low-latency data access
HDFS is not suitable for applications that need responses in the millisecond range. It is designed for high-throughput data transfer at the expense of latency; for low-latency access, HBase is a better fit.

2) A large number of small files
File metadata (the directory structure, the list of blocks for each file, the block-to-node mapping) is kept in the NameNode's memory, so the number of files the whole file system can hold is limited by the NameNode's memory size.
As a rule of thumb, each file, directory, or file block takes about 150 bytes of metadata memory. With 1 million files, each occupying one block, that is one file object plus one block object per file, i.e. 2 × 1,000,000 × 150 bytes, or about 300 MB of memory. Scaling to billions of files is therefore hard to support on existing commodity machines.

3) Multiple writers and arbitrary file modification
HDFS writes data in an append-only fashion: modifying a file at an arbitrary offset is not supported, and neither are multiple concurrent writers.
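Appending to the end of a file is the only supported change. Here is a sketch of what is and is not possible, assuming a cluster where append is enabled and using a made-up path:

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppend {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());

    // Supported: add new bytes to the end of an existing file.
    try (FSDataOutputStream out = fs.append(new Path("/data/hello.txt"))) {
      out.write("one more line\n".getBytes(StandardCharsets.UTF_8));
    }

    // Not supported: seeking into the middle of the file and overwriting bytes in place.
    fs.close();
  }
}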
