Hadoop in Practice, Part 1 ~ Hadoop Overview

Preface

I still have reverence for technology.

Hadoop Overview

Hadoop is an open-source distributed computing platform for processing massive data, based on the MapReduce model, and is primarily an offline analysis tool. It is written in Java and built on HDFS, the open-source counterpart of the distributed file system first described by Google. If you are interested, you can start with the three Google papers: GFS, MapReduce, and BigTable. I will not go into detail on them here, because there is plenty of material about them on the Internet.
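To make the MapReduce model concrete, here is a minimal word-count sketch against the classic Hadoop Java API (my own illustration, not code from the article; it assumes a Hadoop 2.x-style Job.getInstance, and the input and output HDFS paths come from the command line):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in this split of the input.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts shuffled to us for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // local pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The framework handles the rest: splitting the input across the cluster, shuffling the (word, 1) pairs to reducers by key, and writing the results back to HDFS.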

The Hadoop project structure is as follows:

[Figure: Hadoop project structure]

HDFS and MapReduce are the two most important components of Hadoop. Let's start with HDFS:

HDFS has the following advantages:
1) Support for very large files. A Hadoop file system can comfortably store data at the TB and PB scale.
2) Fault detection and quick, automatic recovery. HDFS clusters are built from large numbers of general-purpose, inexpensive machines, where hardware failure is the norm rather than the exception; a large HDFS deployment may consist of hundreds or even thousands of servers storing data files, and the more servers there are, the higher the overall failure rate. Fault detection and automatic recovery are therefore a core design goal of HDFS.
3) Streaming data access. HDFS is designed for applications that process large volumes of data in batch, not for interactive use; it accesses data in a streaming fashion and emphasizes high aggregate throughput over low access latency. HDFS is built around the idea that the most efficient data-processing pattern is write-once, read-many-times: once a file has been written, updating parts of its content is not supported. A dataset stored in HDFS serves as the object of Hadoop's analyses; after the dataset is generated, various analyses run over it for a long time, and each analysis touches most, or even all, of the data. The latency of reading the whole dataset therefore matters more than the latency of reading the first record. (Streaming reads minimize disk seek overhead: you seek once and then keep reading sequentially, and since each file is split into blocks of 64 MB by default, long sequential reads dominate. The physics of hard disks mean that improvements in seek time cannot keep up with improvements in transfer rate, so streaming reads suit the hardware, and large files suit streaming reads. The counterpart of streaming data access is random data access, which demands low latency when locating, querying, or modifying data; traditional relational databases are a much better fit for that. A minimal read sketch follows this list.)

4) A simplified coherency model. Most HDFS files are written once and read many times. Once a file has been created, written, and closed, it does not need to be modified again; this simple coherency model is what makes high-throughput data access possible.
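As a small illustration of the streaming, write-once/read-many access pattern, here is a hedged sketch that reads one HDFS file sequentially through the standard FileSystem client (it assumes an HDFS reachable via the fs.defaultFS in core-site.xml; the file path is hypothetical):

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);      // HDFS client

        // Open once, then read straight through to the end:
        // one seek followed by long sequential reads.
        try (InputStream in = fs.open(new Path("/data/logs/2012-01-01.log"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```

The client seeks once when the stream opens and then reads block after block, which is exactly the access pattern HDFS is optimized for.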
Looked at from another angle, these same design goals, while advantages in some scenarios, become limitations in others, mainly the following:
1) It is not suited to low-latency data access. HDFS is designed for large-scale data and optimized for high throughput; it deliberately trades latency for throughput, so you cannot expect fast, interactive reads from HDFS.
If you need low-latency or real-time data access on top of Hadoop, HBase is a good option; note, however, that HBase is a column-oriented NoSQL database.
2) It cannot efficiently store large numbers of small files. In HDFS, the NameNode (master) manages the file system metadata and answers client requests with file locations and so on, so the limit on the number of files is set by the NameNode, specifically by its memory size. In addition, many small files mean more disk seeks per data access, plus more file open and close overhead, which greatly reduces data throughput; this runs counter to the HDFS design goals and puts extra pressure on the NameNode.
3) It does not support concurrent writers or arbitrary file modification. Only one user at a time may write a given HDFS file, and writes can only append data at the end of it; reads proceed sequentially from the beginning of the file. At present, HDFS does not support multiple users concurrently writing to, randomly accessing, or modifying the same file. A hedged append sketch follows below.
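To make the single-writer, append-only model concrete, here is a hedged sketch (append support depends on your Hadoop version and configuration, e.g. dfs.support.append in older releases; the path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnly {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Only one writer may hold the lease on a file at a time,
        // and new bytes can only go at the end -- no in-place updates.
        try (FSDataOutputStream out = fs.append(new Path("/data/logs/events.log"))) {
            out.write("one more log record\n".getBytes("UTF-8"));
        }
        // Rewriting a byte in the middle of the file is simply not an HDFS
        // operation; the alternative is to rewrite the whole file.
    }
}
```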

The HDFS architecture is as follows:

[Figure: HDFS architecture]

Hive Data Management

Okay, now for the main thread: Hive, the real target of this preliminary study. As the project-structure figure above shows, Hive is the data warehouse infrastructure on top of Hadoop. What does that mouthful actually mean? Put bluntly:

1. It does not store data. The real data is stored on HDFS.

2. It provides an SQL-like language (HiveQL) as the external interface for storing, querying, and analyzing that data. In short, it is far more comfortable than querying Hadoop data directly; see the sketch right after this list.
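As a sketch of how much more comfortable that is, here is a HiveQL query issued from Java over JDBC (assuming a HiveServer2 endpoint on the conventional port 10000 and the org.apache.hive.jdbc.HiveDriver driver; the host, credentials, and web_logs table are made up for illustration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Plain SQL-like syntax; under the hood Hive compiles this
            // into MapReduce jobs over files stored in HDFS.
            ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits "
                  + "FROM web_logs GROUP BY url ORDER BY hits DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```

Compare that with writing the equivalent grouping and sorting as hand-rolled MapReduce jobs, and the appeal is obvious.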

Next, we will look at Hive data management from the following aspects:

1. Metadata Storage

Hive stores its metadata in an RDBMS, which is why guides on the Internet usually have you install MySQL or another RDBMS when installing Hive. If the term is unfamiliar, you can think of metadata as the basic information about the data (table names, columns, partitions, storage locations), not the actual data itself. A hedged configuration sketch follows below.
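A minimal sketch of that wiring in hive-site.xml, assuming a MySQL metastore on a host called mysql-host (the javax.jdo.option.* property names are the standard ones; the values here are placeholders):

```xml
<configuration>
  <!-- Where Hive keeps table, column, and partition metadata -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mysql-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```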

2. Data Storage

Hive has no special data storage format of its own and does not index the data; all of its data is stored in HDFS. Because Hadoop is a batch-processing system, tasks have high latency, and job submission and scheduling themselves take time, so even when Hive operates on a very small dataset there is still noticeable latency during execution. Hive's performance therefore cannot be compared with that of a traditional database such as Oracle. In addition, Hive provides no data sorting or query caching and no online transaction processing; in other words, it is not meant to serve users interactively but is suited to offline processing. Real-time queries and record-level updates are likewise not provided. Hive is best suited to batch jobs over large, immutable datasets (such as web logs); its greatest value lies in scalability, extensibility, good compression, and loosely constrained input formats. The sketch below shows where that data actually lives.
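To see that Hive's data really is just files in HDFS, you can list the warehouse directory with the ordinary HDFS client. A sketch, assuming the default warehouse location /user/hive/warehouse (configurable via hive.metastore.warehouse.dir):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WarehouseLs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Each Hive table is a directory, each partition a subdirectory,
        // and the rows themselves are ordinary files (plain text by default).
        for (FileStatus s : fs.listStatus(new Path("/user/hive/warehouse"))) {
            System.out.println((s.isDirectory() ? "[table dir] " : "[file] ") + s.getPath());
        }
    }
}
```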

Hive Architecture

Thrift, shown in the Hive architecture, is a cross-language RPC communication framework similar to ICE; Hive's server exposes its interface to clients through it.

After this introduction, I believe you now have a clear picture of the scenarios Hadoop and Hive fit. Whether you actually use them still depends on your specific business scenario ~

In the next article, we will cover Hadoop cluster setup and Hive examples.

 
