Using Hive to build a data warehouse: A money-saving data Warehouse

Last Update:2017-02-27 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

3 guys came to a business. The first (data Warehouse) was burly: He brought history and experience, and he could say that most of the words were true. But, in many ways, it is bloated and wasteful in other ways, and people are tired of the costs of the various outcomes. Apache Hadoop entered the same building, claiming to take over the entire market. He preaches big data, speed, data, variety, and a bunch of V-words that don't make much sense outside of marketing programs. He casually spoke of analysis, prediction, and so on. And he has a low price. So people started to stop and listen.

Apache Hive is wandering outside the house, and he has no intention of arguing with anyone else. He wants to work with Hadoop, but unlike Hadoop, he doesn't want to leave the data warehouse on the curb. Hive has data warehousing capabilities, but there are some limitations on business intelligence (BI) and analytics. It has the potential of the database, but it also has limitations in relational database management systems (RDBMS) and Structured Query Language (SQL). It is more open and honest. It is closely related to the Data warehouse and the RDBMS. But it never stood up and claimed it wasn't as simple as it seemed at first. Hadoop interrupted the conversation by claiming it was a data warehouse in the Hadoop field. Hadoop seems to give up the best marketing PR rep, and after a simple conversation, it turns out that Hive and Hadoop are saving the world. The description is fascinating and interesting. But is it true? is somewhat similar.

Data Warehouse

Building a real Data warehouse can be a huge project. There are many different devices, methods and theories. What is the greatest common value? What are the facts and what are the topics related to these facts? And how do you mix, match, merge, and integrate systems that may have existed for decades and that were implemented only a few months ago? This is still before big data and Hadoop. By adding unstructured, data, NoSQL, and Hadoop to your portfolio, you'll soon get a huge data integration project.

The simplest way to describe a data warehouse is to recognize that it can be summed up in the form of star patterns, facts, and dimensions. How do you create these elements and decide on your hands-by staging the database, dynamically extracting, converting, loading processes, or integrating secondary indexes. Of course, you can build a data warehouse that contains star patterns, facts, and dimensions, but it's not easy to use Hive as the core technology. Outside of the Hadoop world, this can be a bigger challenge. Hive is not so much a legitimate data warehouse as a tool for integration, transformation, and quick lookup. The pattern may look like a data warehouse, but applicability suggests that it is not an RDBMS. So why use it?

What is a star pattern?

Imagine a star-with a center and multiple "arms" pointing in different directions. The center is the source or fact table of power. All of the arms are pointing to different dimensions. Many data warehouses have a fact table and a number of dimensions.

The fact table contains any data that you can weighting or calculate. In this example, you have baseball statistics, such as base runs, home runs, batting rates, and so on. You can calculate, increment, subtract, or multiply these columns.

Dimensions are more based on themes. In this example, you have athlete information dimensions, time and date dimensions, and so on. Columns in multiple dimensions are usually not computed or weighted.

In this example, the key that connects a dimension table to a fact table is playerID.

In simple terms, sometimes you need to use the tools that are in front of you.

Anyone who has worked on IT for some time may tell you that the right tool for a job is not always available. Or, the right tools are available, but to cut costs will hinder the use of the tool. Sometimes business politics plays a major role. For whatever reason, most of us are forced to build, design, and develop in many cases by using tools that may not be best suited to their work.

In many of the projects I have been involved in, I have had to use Hive as a database, as a data warehouse, and as a slow-changing system. It's challenging, but occasionally annoying. Sometimes, you have to shake your head and wonder why. But at the end of the day, you still need to let it work. If you need to build and use a data warehouse in Hive, and you need to use slowly changing dimensions and updates, and reconcile old data, you must do so. The focus is not always to provide the best tools, but rather to create the tools that work best for you.

Hive

Because of Hive's class SQL capabilities and class database capabilities, it opens large data Hadoop ecosystems to non programmers. It is often described as a data warehouse infrastructure built on Hadoop. This is a partial-true representation (because you can convert the source data to star mode), but it is more about design than technology when creating fact tables and dimension tables.

Still, Hive is not really a data warehouse. It's not even really a database. You can build and design a data warehouse using Hive, or you can use Hive to build and design database tables, but there are a number of limitations that need to provide a number of solutions and pose some challenges.

For example, indexes have some limitations in Hive. How can we overcome this problem? You can use the Org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler function to create an index in hive. Hive and slowly changing dimensions are not always possible implementations. However, if you build a staging table and use a certain number of connections (and you plan to add a new table, dump the old table, and keep only the latest and updated tables for comparison), you might implement them.

An external reporting or analysis system connected to the Hive is a huge problem. Even for JDBC connections, it is limited to connecting to the default database. People are looking for more improved metadata, and tools such as Apache Hcatalog are helping to connect various services to the Hive Meta store. In the future, if used properly, this could be a significant value-added area.

Therefore, although Hive is not a reliable data warehouse or database, there are some ways to use Hive as a data warehouse or database. Just need to do some work and use some solutions to make Hive into such a system. Why do you have to go through the process again? Because you have to use the tools at hand and make them work.

Example: Building a data warehouse for baseball information

Infosphere biginsights Quick Start Edition

Infosphere biginsights Quick Start Edition is a free downloadable version of Infosphere biginsights (IBM's Hadoop based product). With the Quick Start Edition, you can try to use the features developed by IBM to improve the value of open source Hadoop, such as Big SQL, text analysis, and bigsheets. Guided learning can make your experience as smooth as possible, including step-by-step, self-paced tutorials and videos to help you get started with Hadoop for you. Without time or data limitations, you can schedule your own time and experiment on a large amount of data. Watch videos, learn tutorials (PDF) and download biginsights Quick Start Edition.

The following baseball data example shows how to design and build a data warehouse using baseball data from the Sean Lahman Web site in Hive. I like to challenge the reverse normalization of the Data Warehouse (denormalizing) and build a data warehouse from that data. In using Hive to build a library for data, I used VMware Fusion to create an IBM infosphere biginsights virtual Machine (VM) on my Macbook. This is a simple test, so my VM has only 1 GB RAM and GB solid state hard disk storage space. The operating system is a Linux CentOS 6.4 64-bit release.

To begin using this example, download the IBM infosphere biginsights Basic Edition. You need to have an IBM Universal ID or register to get an ID before you can download Infosphere biginsights Basic Edition.

Import data

First download a CSV file that contains statistics for baseball and baseball players (see downloads). Create a directory within Linux, and then run:

$ Sudo Mkdir/user/baseball.

sudo wget http://seanlahman.com/files/database/lahman2012-csv.zip

The example contains 4 primary tables, each with a unique column (Master, batting, pitching, and Fielding) and several secondary tables.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More