Hive Learning Roadmap


The Hadoop family articles mainly introduce Hadoop family products. Common projects include Hadoop, Hive, Pig, HBase, Sqoop, Mahout, ZooKeeper, Avro, Ambari, and Chukwa; newer projects include YARN, HCatalog, Oozie, Cassandra, Hama, Whirr, Flume, Bigtop, Crunch, and Hue.

Since 2011, China has entered the age of big data. The family of software represented by Hadoop dominates big data processing, and open-source communities, vendors, and data software of every kind are converging on it. Hadoop has grown from a niche high-tech field into a de facto standard for big data development. On top of the original Hadoop technology, a whole family of Hadoop products has emerged, and the concept of "Big Data" keeps driving innovation and technological advances.

As developers in the IT field, we must keep up with the pace, seize the opportunity, and get started with Hadoop!

About

  • Conan: programmer of Java, R, PHP, and JavaScript
  • Weibo: @conan_z
  • Blog: http://blog.fens.me
  • Email: [email protected]

Please indicate the source when reprinting: http://blog.fens.me/hadoop-hive-roadmap/

Preface

Hive is the data warehouse product of the Hadoop family. Hive's biggest feature is that it provides SQL-like syntax and encapsulates the underlying MapReduce process, so that business staff with an SQL background can also use Hadoop directly for big data operations. This removes the bottleneck that data analysts used to face in big data analysis.

Let's build a Hive environment to help non-developers better understand big data.

Directory

  1. Hive Introduction
  2. Hive Learning Roadmap
  3. My Experience Using Hive
  4. Hive Use Cases

1. Hive Introduction

Hive originated at Facebook. It makes it possible to run SQL queries on Hadoop, so that non-programmers can use it conveniently. Hive is a Hadoop-based data warehouse tool that maps structured data files onto database tables and provides full SQL query functionality; it converts SQL statements into MapReduce jobs for execution.
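As a minimal sketch of that idea (the table, column, and file names here are hypothetical, not from the original article), a typical HQL session might look like this:

```sql
-- Hypothetical example: SQL-like statements that Hive compiles into MapReduce jobs.
CREATE TABLE page_views (
  user_id STRING,
  url     STRING,
  ts      BIGINT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Map a structured data file onto the table (path is illustrative).
LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

-- This aggregation is translated into a MapReduce job behind the scenes.
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url;
```

An analyst who knows SQL can run queries like the last one without ever writing a mapper or reducer by hand.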

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a series of tools for data extraction, transformation, and loading (ETL), and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language called HQL, which allows users familiar with SQL to query the data. The language also lets developers familiar with MapReduce plug in custom mappers and reducers to handle complex analysis tasks that the built-in ones cannot complete.
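Hive's TRANSFORM clause is the standard hook for such custom mappers and reducers. A hedged sketch, assuming a hypothetical streaming script and column names:

```sql
-- Hypothetical: register a user-supplied streaming script as a custom mapper.
ADD FILE /scripts/parse_log.py;

-- Each input row is piped to the script on stdin; its stdout rows
-- are read back as the declared output columns.
SELECT TRANSFORM (raw_line)
  USING 'python parse_log.py'
  AS (user_id, url, ts)
FROM raw_logs;
```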

For details about how to install and use Hive, refer to: Hive Installation and Usage Guide.

2. Hive Learning Roadmap

The figure above lays out the Hive knowledge points; I hope it helps others understand Hive better.

What follows is my own experience. There is no shortcut for anyone; once you settle down to it, it is not that difficult.

3. My Experience Using Hive

I had two considerations for using Hive:

  • 1. Help data analysts without development experience process big data
  • 2. Build a standardized MapReduce development process

1) Help data analysts without development experience process big data

This fully matches Hive's design philosophy, which has been emphasized over and over, so there is no need to say more.

2) Build a standardized MapReduce development process

This is the direction we need to work on.

First, Hive itself already encapsulates the MapReduce process behind SQL-like syntax; this is the first step of MapReduce standardization.

When building business features or tools, we add our own logic encapsulation for specific scenarios; this is a second layer of encapsulation on top of Hive. In this second layer, we should hide Hive's details as much as possible, so that the interface stays narrow and deliberately inflexible, shrinking the HQL syntax surface even further. It only needs to meet our system's requirements and expose dedicated interfaces.

When using the secondary-encapsulated interface, users do not need to know what Hive or Hadoop is; they only need to know how to write efficient SQL queries (SQL-92 standard) to fulfill the business requirements.

After completing this secondary encapsulation of Hive, we can build a standardized MapReduce development process.
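One lightweight way to sketch such a dedicated interface inside Hive itself is a view: callers query the view and never see the underlying tables, partitions, or join logic. All names below are hypothetical:

```sql
-- Hypothetical second-layer interface: business code sees only this view,
-- never the underlying table layout or the full HQL that produces it.
CREATE VIEW daily_active_users AS
SELECT dt, COUNT(DISTINCT user_id) AS active_users
FROM page_views
GROUP BY dt;

-- Callers issue only simple, restricted queries against the view.
SELECT * FROM daily_active_users WHERE dt = '2013-07-01';
```

If Hive is later replaced by another engine, only the view definitions (or their equivalents in the new system) change; queries written against the interface do not.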

With this approach, we can unify how all internal applications depend on Hive. As the team's skills grow, we can strip out Hive and replace it with a better underlying solution: as long as the encapsulated interface stays unchanged, the replacement is invisible to the business, and Hive has been swapped out cleanly.

This process is necessary and meaningful. When I was planning a Hadoop analysis tool, using Hive as the Hadoop access interface proved the most effective choice.

3) Hive O&M: Hive is built on Hadoop; in short, it is a set of Hadoop access interfaces. Hive itself does not contain much, so during operations and maintenance we should pay attention to the following issues.

  • 1. Use a separate database to store the metadata
  • 2. Define reasonable table partitions and keys
  • 3. Set a reasonable bucket data volume
  • 4. Compress tables
  • 5. Define usage specifications for external tables
  • 6. Reasonably control the number of mappers and reducers
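Most of the points above (except the metastore database, which is configured in hive-site.xml rather than HQL) can be sketched roughly as follows; all table names and values are illustrative assumptions, not prescriptions:

```sql
-- 2/3: a table with partitions and a bucketing key (counts are illustrative).
CREATE TABLE logs (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- 4: enable compression of intermediate and final output.
SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;

-- 5: an external table over data that Hive reads but does not own;
-- dropping it removes only the metadata, not the files.
CREATE EXTERNAL TABLE raw_logs (line STRING)
LOCATION '/data/raw_logs';

-- 6: influence mapper/reducer counts.
SET mapred.max.split.size=256000000;               -- larger splits -> fewer mappers
SET hive.exec.reducers.bytes.per.reducer=1000000000; -- bytes handled per reducer
```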

4. Hive Use Cases

Cases that have been written up as articles:

  • Hive installation and usage
  • Testing a 10 GB data import into Hive
  • "R: a sharp sword for NoSQL" series article on Hive
  • Using RHive to extract reverse-repurchase information from historical data

Related article: Hadoop Family Product Learning Roadmap

