Introduction to Hadoop / Hive

Hive is a Hadoop-based data warehouse tool that maps structured data files onto database tables and provides SQL-like query capability by converting SQL statements into MapReduce jobs. Its advantage is a low learning cost: simple MapReduce statistics can be produced quickly through SQL-like statements, without developing a dedicated MapReduce application, which makes it very well suited to statistical analysis in a data warehouse.
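To make this concrete, here is a small, hypothetical HiveQL sketch; the table name, column names, and HDFS path (page_views, user_id, url, view_date, /data/page_views.tsv) are made up for illustration. A simple aggregation like this is compiled by Hive into a MapReduce job, so no dedicated MapReduce program has to be written.

```sql
-- Define a table over tab-separated text files (names and layout are illustrative).
CREATE TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load a structured file that already sits in HDFS into the table.
LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

-- Count views per user: Hive compiles this into map tasks (emit user_id) and
-- reduce tasks (count per key) and runs them on the cluster.
SELECT user_id, COUNT(*) AS views
FROM page_views
GROUP BY user_id;
```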

Hadoop is a storage and computing framework that consists mainly of two parts:

1. Storage (the Hadoop Distributed File System, HDFS)

2. Computation (the MapReduce computing framework)

1. The Hadoop Distributed File System

This is a file system implementation, similar in spirit to NTFS, ext3, or ext4, but built at a higher level. Files stored on HDFS are divided into blocks (64 MB each by default, larger than a typical file system block size, and adjustable) that are distributed across multiple machines, and each block is kept as several redundant replicas (three by default) to improve the file system's fault tolerance (a configuration sketch for these two settings appears at the end of this section). This storage model works well with the MapReduce computing model, which is described later. A concrete HDFS implementation consists mainly of the following parts:

First, the name node (NameNode): its job is to store the metadata of the entire file system, which makes it a very important role. The metadata is loaded into memory at cluster startup, and metadata changes are also written to a file system image on disk (while an edit log of metadata changes is maintained). The NameNode is currently still a single point of failure. Because HDFS stores a file by splitting it into logical blocks, the mapping from each file to its blocks lives on the NameNode; losing that metadata effectively corrupts the entire cluster, because the data can no longer be located. Of course, we can take measures to back up the NameNode metadata (the file system image file): for example, the NameNode directory can be set to both a local directory and an NFS directory, so that every metadata change is written to both locations as a redundant backup, and the write to the two redundant directories is atomic. That way, if the NameNode in use goes down, the backup copy on NFS can be used to recover the file system.
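A minimal sketch of that redundancy, assuming the Hadoop 1.x property name dfs.name.dir (renamed dfs.namenode.name.dir in later releases); the local and NFS paths are illustrative only:

```xml
<!-- hdfs-site.xml: every directory in this comma-separated list receives its own
     copy of the file system image and edit log, so losing the local disk still
     leaves a usable copy on the NFS mount. Paths are made up for illustration. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/name,/mnt/nfs/hadoop/name</value>
</property>
```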

Second, the secondary name node (SecondaryNameNode): its role is to periodically merge the edit log into the namespace image, preventing the edit log from growing too large. However, the state of the SecondaryNameNode lags behind that of the primary NameNode, so if the primary NameNode goes down, some metadata loss is still possible.
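How often the SecondaryNameNode performs this merge (a "checkpoint") is configurable. A sketch using the Hadoop 1.x property names (later releases renamed them to dfs.namenode.checkpoint.period and related settings); the values shown are the commonly cited defaults:

```xml
<!-- core-site.xml (Hadoop 1.x): checkpoint every hour, or earlier if the edit
     log grows beyond 64 MB. Values here are defaults, shown for illustration. -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>
</property>
```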

Third, the data node (DataNode): this is where HDFS actually stores data, and there are generally many such machines. Besides providing storage, DataNodes also periodically send the list of blocks they hold to the NameNode, so the NameNode does not have to persist which DataNodes store each block of each file; that mapping is reconstructed from the DataNodes' reports after the system starts up.
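The block size and replication factor mentioned at the start of this section are ordinary configuration settings. A rough sketch of how they might appear in hdfs-site.xml, using the older Hadoop 1.x property names (dfs.block.size became dfs.blocksize, and the default block size grew to 128 MB, in later releases):

```xml
<!-- hdfs-site.xml: 64 MB blocks and three replicas per block, matching the
     defaults described above. Both values can be tuned per cluster. -->
<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```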

2. The MapReduce computing framework

This is a distributed computing model. Its core idea is that a task is decomposed into small sub-tasks that are computed in parallel on different nodes, and the partial results are then merged to obtain the final result. The model itself is very simple: in general you only need to implement two interfaces (map and reduce, as in the sketch after the component descriptions below); the crux of the problem lies in translating a real-world problem into a MapReduce job. The Hadoop MapReduce part consists mainly of the following components:

First, the job tracking node (JobTracker): responsible for job scheduling (different scheduling policies can be configured) and status tracking. Its role is somewhat similar to that of the NameNode in HDFS; the JobTracker is also a single point of failure, which may improve in future releases.

Second, the task tracking node (TaskTracker): responsible for executing the actual tasks. It reports its state to the JobTracker through a "heartbeat", and the JobTracker assigns tasks to it based on the status it reports. A TaskTracker launches a new JVM to run each task, although JVM instances can be reused.
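As noted above, writing a MapReduce job mostly comes down to implementing the map and reduce interfaces. Below is a minimal word-count sketch against the classic org.apache.hadoop.mapreduce Java API; the class names, job name, and input/output paths are illustrative, not taken from the original article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: for every line of input, emit (word, 1) for each word it contains.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts collected for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count"); // Hadoop 1.x style; newer releases use Job.getInstance(conf, ...)
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The JobTracker schedules the resulting map and reduce tasks across the TaskTrackers, each of which runs its tasks in separate JVMs as described above.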

The above introduces the two most important parts of Hadoop; the reason for Hadoop's popularity is that it is well suited to storing and computing over big data. A Hadoop cluster can be thought of as a "warehouse" that both stores and processes "data".

Hive is a "data warehouse" application built on Hadoop clusters

Hive is a data warehouse application, developed by Facebook, that is built on top of a Hadoop cluster. It provides HQL, a language with SQL-like syntax, as the data access interface, which flattens Hadoop's learning curve and makes it accessible to ordinary analysts. As for why Facebook used Hadoop and Hive to build its data warehouse, its engineers have shared some of their experience. The story went roughly as follows:

1. Facebook's data warehouse was originally built on MySQL, but as the amount of data grew, some queries took hours or even days to complete.

2. When the amount of data approached 1 TB, the mysqld daemon went down, and they decided to migrate the data warehouse to Oracle. The migration, of course, came at a considerable cost, such as handling a different SQL dialect and modifying previously working scripts.

3. Oracle had no problem handling a few terabytes of data, but once they began collecting user clickstream data (about 400 GB per day), Oracle started to struggle, and they again had to look for a new data warehouse solution.

4. Internal developers spent a few weeks building a parallel log-processing system called Cheetah, which could just barely process a day's worth of clickstream data within 24 hours.

5. Cheetah, however, had many shortcomings. They later discovered the Hadoop project and began processing the same log data with both Cheetah and Hadoop for comparison. Hadoop's advantage in handling large-scale data led them to migrate all of the workflows from Cheetah to Hadoop, and a great deal of valuable analysis was then built on top of Hadoop.

6. Later, to make Hadoop usable by most people in the organization, Hive was developed. Hive provides a SQL-like query interface, which is very convenient. Some other tools were developed at the same time.

7. The cluster now stores 2.5 PB of data and grows by about 15 TB per day; more than 3,000 jobs are submitted per day, processing roughly 55 TB of data...

Nowadays, many large Internet companies are researching and adopting Hadoop, partly for cost reasons. The value of data is attracting more and more attention, and that attention in turn reflects the tremendous value of Hadoop.

Reproduced from: http://zhangwei20086.blog.163.com/blog/static/230557182012619111956724/
