Learn big data in one step: Hadoop ecosystems and scenarios

Hadoop overview

Whether business drives the development of technology or technology drives the development of business is a topic that will provoke controversy at any time.

With the rapid development of the Internet and the IoT, we have entered the era of big data. IDC predicts that by 2020 the world will hold 44 ZB of data. Traditional storage and technology architectures can no longer meet this demand. The book Big Data Era, published in 2013, defined the 5V characteristics of big data: Volume (massive), Velocity (high speed), Variety (diverse), Value (low value density), and Veracity (authenticity).

Looking back roughly a decade to 2003, Google published "The Google File System" that year, describing a GFS cluster composed of many nodes of two main kinds: a single master and many chunkservers. In 2004 Google published another paper introducing MapReduce. In February 2006, Doug Cutting and others applied the GFS and MapReduce ideas to the Nutch project, and that work evolved into the Hadoop project.

Doug Cutting once said that he really enjoyed the feeling of his programs being used by tens of thousands of people, and he clearly achieved that.

In January 2008, Hadoop became a top-level Apache open source project.

The advent of Hadoop solved massive data storage and processing in the Internet era; it is a framework that supports both distributed computing and distributed storage. If a Hadoop cluster is abstracted as a single machine, its hardware resources (CPU, memory, etc.) can in theory be extended without limit.

Hadoop extends its application scenarios through its various components, covering offline analytics, real-time processing, and more.

Introduction to Hadoop-related components

This article is based mainly on Hadoop 2.7; unless otherwise noted, that version is assumed throughout.

HDFS

HDFS, the Hadoop Distributed File System, is a distributed file system designed to run on general-purpose (commodity) hardware. It has much in common with existing distributed file systems, such as the typical master/slave architecture (not covered further here); however, HDFS is a highly fault-tolerant system and is suitable for deployment on inexpensive machines.

There are basically two points worth making about HDFS:

The default number of replicas in HDFS is 3, and a question worth asking is why 3 rather than 2 or 4.

Rack awareness.

A deep understanding of these two points explains why Hadoop is so fault tolerant, and high fault tolerance is the foundation that allows Hadoop to run on commodity hardware.
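To make the replication point concrete, here is a minimal sketch using the HDFS Java client (Hadoop 2.x API); the NameNode address and the file path are placeholders, and the file is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode address; point fs.defaultFS at your own cluster.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        // Default replication for files created by this client
        // (normally set cluster-wide in hdfs-site.xml).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");  // assumed to exist

        // Replication can also be changed per file after it is written.
        fs.setReplication(file, (short) 3);
        System.out.println("Replication of " + file + ": "
                + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}

In practice dfs.replication is usually set once in hdfs-site.xml; the per-file call is mainly useful for adjusting files that were written with a different factor.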

YARN

YARN, Yet Another Resource Negotiator, is another Hadoop sub-project following Common, HDFS, and MapReduce. YARN appeared because Hadoop 1.x had several problems:

Poor scalability. The JobTracker combined the two functions of resource management and job control.

Poor reliability. In the master/slave architecture, the master is a single point of failure.

Low resource utilization. Map slots (the unit of resource allocation in 1.x) were separate from reduce slots and could not be shared between the two.

No support for multiple computation frameworks. The MapReduce framework is a disk-based offline computation model, while new applications need frameworks such as in-memory computing, stream computing, and iterative computing.

YARN splits the responsibilities of the original JobTracker into:

A global ResourceManager (RM).

An ApplicationMaster (AM) for each application.

With YARN dedicated to resource management, the JobTracker can focus on job control; YARN takes over the TaskScheduler's resource management responsibilities, and this loosely coupled architecture gives the overall Hadoop framework its flexibility.

Hive

Hive is a data warehouse infrastructure built on top of Hadoop. It uses simple SQL-like statements (HQL) to query and analyze data stored in HDFS, converting those statements into MapReduce programs that process the data; a short usage sketch follows the comparison below.

The main differences between Hive and a traditional relational database are the following:

Storage location: Hive data is stored in HDFS or HBase, while the latter typically stores data on raw devices or a local file system.

Updates: Hive does not support updates; data is typically written once and read many times.

Latency: Hive executes SQL with relatively high latency, because every HQL statement has to be parsed into MapReduce jobs.

Data size: Hive typically works at the TB level, while the latter handles relatively small volumes.

Extensibility: Hive supports UDF/UDAF/UDTF, while the latter is comparatively limited.
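As mentioned above, Hive exposes data in HDFS through HQL. Below is a minimal sketch that runs a query against HiveServer2 over JDBC; the host, credentials, and the weather table are assumptions made purely for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 host/port and table name.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver2-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // A simple aggregation; behind the scenes Hive compiles this HQL
            // into one or more MapReduce jobs over files stored in HDFS.
            ResultSet rs = stmt.executeQuery(
                    "SELECT year, MAX(temperature) FROM weather GROUP BY year");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
            }
        }
    }
}

When this statement runs, Hive turns the GROUP BY into MapReduce jobs, which is exactly why its latency is higher than that of a traditional relational database.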

HBase

HBase, the Hadoop Database, is a highly reliable, high-performance, column-oriented, scalable distributed storage system. Its underlying file system is HDFS, and it uses ZooKeeper to manage communication between the cluster's HMaster and each RegionServer, monitor the status of each RegionServer, store the entry address of each region, and so on.

HBase stores data in key-value form (analogous to a Map in Java). Since it is a database, it has tables, and an HBase table has roughly the following characteristics:

Big: a table can have billions of rows and millions of columns (with too many columns, inserts slow down).

Column-oriented: storage and permission control are organized by column (family), and column (families) can be retrieved independently.

Sparse: columns that are empty (null) take up no storage space, so tables can be designed to be very sparse.

The data in each cell can have multiple versions; by default the version number is assigned automatically and is the timestamp at which the cell was inserted.

Data in HBase is stored as bytes and has no type (because the system must adapt to many kinds of data formats and sources and cannot define a schema in advance).
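A minimal sketch of this key-value model using the HBase Java client (1.x-style API); the ZooKeeper quorum, table name, and column family are placeholders, and the table is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper quorum for the HBase cluster.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("weather"))) {

            // Write one cell: row key "2017", column family "d", qualifier "temp".
            Put put = new Put(Bytes.toBytes("2017"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("35"));
            table.put(put);

            // Read it back; the stored value is just bytes, the client decides the type.
            Result result = table.get(new Get(Bytes.toBytes("2017")));
            String value = Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp")));
            System.out.println("temp for 2017 = " + value);
        }
    }
}

Note that both keys and values are plain byte arrays, which matches the "no type" point above.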

Spark

Spark is a distributed computing engine developed at UC Berkeley to solve the problem of analyzing massive data quickly. Spark first loads data into the cluster and then scans it rapidly using memory-based management, minimizing global I/O through iterative algorithms; this is similar to Hadoop's idea of bringing the "computation" to the "data".
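A minimal sketch of this in-memory reuse idea with Spark's Java API; it runs in local mode and the input path is a placeholder:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheDemo {
    public static void main(String[] args) {
        // Local mode for illustration only; the input path is a placeholder.
        SparkConf conf = new SparkConf().setAppName("cache-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Load once, cache in memory, then reuse across several actions
            // without re-reading from disk -- the iterative pattern described above.
            JavaRDD<String> lines =
                    sc.textFile("hdfs://namenode-host:8020/data/input.txt").cache();

            long total = lines.count();
            long nonEmpty = lines.filter(line -> !line.trim().isEmpty()).count();
            System.out.println(total + " lines, " + nonEmpty + " non-empty");
        }
    }
}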

Other Tools

Phoenix

Phoenix provides a SQL interface on top of HBase; once it is installed, you can manipulate the HBase database with SQL statements.

Sqoop

Sqoop's main role is to make it easy to migrate data between different relational databases and Hadoop; it supports a variety of databases such as PostgreSQL and MySQL.

Hadoop cluster hardware and topology planning

There is no optimal plan here, only a balance between budget, data size, and application scenario.

Hardware configuration

RAID

The first question is whether RAID is needed. Before answering it, let's first look at what RAID0 and RAID1 are.

RAID0 improves storage performance by spreading contiguous data across multiple disks, so that data requests can be executed by several disks in parallel, each handling its own portion of the request. This parallelism makes full use of the bus bandwidth and significantly improves overall disk access performance. (Source: Baidu Encyclopedia)

What is the impact of combining RAID0 with Hadoop?

Advantages:

Improved I/O.

Faster reads and writes.

Elimination of read/write hot spots on a single disk.

However, in a Hadoop system, when one disk in a RAID0 array has a problem (or its reads and writes slow down), the entire array must be reformatted and the data restored to the DataNode, and that cycle grows longer as the data grows.

Second, the bottleneck of RAID0 is the slowest disk in the array, and replacing the slowest disk again means reformatting the entire array and restoring the data.

RAID1 achieves data redundancy through disk mirroring, so data is backed up on pairs of independent disks. When the original data is busy, data can be read directly from the mirrored copy, so RAID1 can improve read performance. RAID1 has the highest unit cost in a disk array, but it provides high data security and availability. When a disk fails, the system automatically switches reads and writes to the mirrored disk without needing to reorganize the failed data. (Source: Baidu Encyclopedia)

The essence of RAID1 is to increase data redundancy, but Hadoop already keeps 3 replicas by default, so with RAID1 the effective number of copies becomes 6, which increases the system's demand for hardware resources.

Therefore RAID is not recommended for a Hadoop system; JBOD is recommended instead. When a disk has a problem, simply unmount it and replace the disk (in many cases the whole machine is simply replaced).

Cluster size and resources

This section estimates cluster size mainly from the data volume, without considering CPU and memory configuration.

In general, we calculate the number of machines needed from the disk requirements.

First we need to survey the system's existing data volume and its expected growth.

For example, if the system currently holds 8 TB of data and the default replication factor is 3, the required storage is 8 TB × 3 / 80% ≈ 30 TB.

With 6 TB of storage per machine, 5 data nodes are needed.

Adding the master nodes, and ignoring HA, that comes to about 6 machines.
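The same sizing arithmetic written out as a small sketch; the 6 TB of usable disk per data node is the assumption used above:

public class ClusterSizing {
    public static void main(String[] args) {
        double rawDataTb = 8.0;        // current data volume
        int replication = 3;           // HDFS default replication factor
        double diskUtilization = 0.8;  // keep ~20% headroom on each disk
        double perNodeTb = 6.0;        // usable disk per data node (assumption)

        double requiredTb = rawDataTb * replication / diskUtilization;   // 30 TB
        int dataNodes = (int) Math.ceil(requiredTb / perNodeTb);         // 5 data nodes
        System.out.println(requiredTb + " TB needed, " + dataNodes + " data nodes");
    }
}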

Software configuration

Whether an HA deployment is needed depends on the business requirements; because real-world scenarios are complex and variable, the following layouts are for reference only.

1. Non-HA scenario

A common approach is to place all management roles on one machine and start several ZooKeeper services (an odd number) on the data nodes.

Management node: NameNode + ResourceManager + HMaster

Data node: SecondaryNameNode

Data node: DataNode + RegionServer + ZooKeeper

2. HA scenario

In an HA scenario, the primary and standby nodes must be placed on different machines. In practice, to save machines, the primary and standby roles of different components are often crossed; for example, machine A hosts the primary NameNode and the standby HMaster, while machine B hosts the standby NameNode and the primary HMaster.

Management node: NameNode (primary) + HMaster (standby)

Management node: NameNode (standby) + HMaster (primary)

Management node: ResourceManager

Data node: DataNode + RegionServer + ZooKeeper

Design goals and scenarios for Hadoop

From the overview above, we can already see Hadoop's original design goal. Hadoop is synonymous with big data on many occasions. It is mainly used to process semi-structured and unstructured data (for example, with MapReduce).

Its essence is to give semi-structured or unstructured data a structure through MapReduce programs and then process it further.

Second, because Hadoop is a distributed architecture aimed at large-scale data processing, relatively small amounts of data do not show Hadoop's advantages. For example, processing gigabytes of data can be relatively fast with a traditional relational database.

Based on the above, Hadoop's application scenarios are as follows:

Offline log processing, including ETL processes, which is essentially a Hadoop-based data warehouse.

Massively parallel computing.

Hadoop architecture analysis

Hadoop consists of two main parts:

The distributed file system HDFS, mainly used for large-scale data storage.

The distributed computing framework MapReduce, mainly used to compute over data stored in HDFS.

HDFS consists mainly of a NameNode (master) and DataNodes (slaves). The former manages the namespace: basic file-system-like operations on HDFS directories, files, and blocks, such as create, modify, delete, and list. The latter store the actual data blocks and maintain a regular heartbeat with the NameNode.

The MapReduce 2.0 computation framework is essentially carried out by YARN, whose key idea is separation of concerns: YARN is dedicated to resource management, while the JobTracker's role can focus on job control. YARN takes over the TaskScheduler's resource management function, and this loosely coupled architecture gives the overall Hadoop framework its flexibility.

MapReduce working principle and case description

MapReduce is the essence of Hadoop: a programming model for data processing. As the name suggests, MapReduce has two phases, map and reduce. The idea is divide and conquer: map extracts and transforms the data, and reduce summarizes it. It is important to note that map tasks store their output on local disk rather than in HDFS.

When we execute MapReduce, the relationship between a map task and its input data can be broadly divided into three categories:

Data Local

Rack Local

Cross-Rack

From the above it can be seen that a large amount of data movement during a MapReduce job would be disastrous for execution efficiency.

MapReduce Data Flow

In terms of data flow, MapReduce jobs can be broadly divided into the following categories:

Single reduce

Multi-Reduce

No reduce

However, regardless of the map-reduce relationship, the execution process of MapReduce is as follows:

Regardless of the logic executed in the map method, each map task eventually writes its output to disk. If there is no reduce phase, the output goes directly to HDFS. If there is a reduce phase, the output of each map method is first buffered in memory before being written to disk. Each map task has a circular memory buffer that holds the map output, 100 MB by default; whenever the buffer is nearly full, a separate thread spills the buffered data to disk as a spill file. When the whole map task finishes, all of the spill files it produced are merged into a single partitioned and sorted output file, which then waits for the reduce tasks to pull the data.

This process is, in fact, MapReduce's famous shuffle.
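The buffer and spill behavior described above is controlled by a few job-level properties. The sketch below uses the Hadoop 2.x property names; the non-default buffer size is purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Size of the in-memory ring buffer for map output (default 100 MB in Hadoop 2.x).
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Fraction of the buffer that triggers a spill to disk (default 0.80).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        // How many spill files/streams are merged at once (default 10).
        conf.setInt("mapreduce.task.io.sort.factor", 10);

        Job job = Job.getInstance(conf, "shuffle-tuning-demo");
        // ... set mapper/reducer classes and input/output paths as usual before submitting.
    }
}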

MapReduce Practical case

Raw Data

The original data file is a plain text file; each record line contains a year and a temperature reading for a day in that year.

Map

In the map process, each line of the record is given a key. The key is generally the byte offset of the line within the file, for example 0 for the first line and 106 for the second. Within each record, the year and the temperature are the fields of interest (shown in bold in the original example).

Shuffle

This step takes the records obtained above and assembles them into key-value pairs of the form {year, temperature}.

Sort

The values belonging to the same key from the previous step are combined into a list, giving {year, List<temperature>}, which is sent to the reduce side.

Reduce

The reduce side processes the list, obtains the maximum value, and then writes the result to HDFS.
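Putting the whole case together, here is a minimal MapReduce job for the maximum-temperature example, assuming each input line is formatted as year<TAB>temperature (the exact record format of the original data file is not shown in the article):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Map: input key is the byte offset of the line, value is the line itself.
    // Emit (year, temperature) for every valid record.
    public static class TempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length == 2) {
                context.write(new Text(fields[0]),
                        new IntWritable(Integer.parseInt(fields[1].trim())));
            }
        }
    }

    // Reduce: for each year, the framework hands us the list of temperatures; keep the maximum.
    public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable t : temps) {
                max = Math.max(max, t.get());
            }
            context.write(year, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(TempMapper.class);
        job.setCombinerClass(MaxReducer.class);   // safe because max is associative
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run it with hadoop jar, passing the input and output paths as arguments; using the reducer as a combiner is safe here because taking a maximum is both associative and commutative.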
