Using Hive as an ETL or ELT tool

Last Update:2015-11-27 Source: Internet

Author: User


Overview of ETL and ELT tools for data processing

Data integration and data management technologies have existed for a long time. Tools for extracting, transforming, and loading (ETL) data have changed traditional databases and data warehouses. In-memory transformation tools now make both ETL and extract, load, and transform (ELT) processing faster. For big data, is it possible to use built-in Hadoop tools, instead of traditional ETL tools, to extract, load, and transform data?

Most ETL packages require their own servers, processes, databases, and licenses. Experts must install, configure, and develop in each particular tool, and those skills are not always transferable. Microsoft® SQL Server® Integration Services or IBM InfoSphere® DataStage® experts may not know how to use Informatica or Pentaho. To avoid the learning curve involved in using new tools, consider using tools from the Hadoop ecosystem instead. Apache Hive and Apache Pig (both part of the Hadoop ecosystem) are leaders in extracting, loading, and transforming data in various forms. Unlike many traditional ETL tools that specialize in structured data, Hive and Pig were created to load and transform unstructured, structured, or semi-structured data in the Hadoop Distributed File System (HDFS).

Hive builds on traditional database and data warehouse concepts. It treats data as if it had a SQL-based, schema-based structure. In Hive, you can load data into HDFS, or load data directly into a Hive table. Pig, by contrast, is much closer to a standard ETL scripting language. In Pig, you may have a schema in mind, but you are more concerned with using complex features to transform and integrate data in HDFS than with simply putting the data into a particular table or database. Because both Pig and Hive run on MapReduce, they may not be fast for workloads that are not batch-oriented. Some open source tools try to address this limitation, but the problem persists.

The advantages and limitations of traditional ETL and ELT concepts

Choosing ETL or ELT

Traditional ETL is well known throughout the industry. You extract data from data sources A, B, and C. You design and develop different ways to integrate, denormalize, and transform that data through a workflow and data integration process. Finally, you load the integrated data into the data warehouse or database and try to automate the process.

Conversely, because of the introduction of Hadoop technology and the fact that hardware and storage have become much cheaper, another concept (ELT) is gaining popularity. Data is still extracted from data sources A, B, and C, but instead of being transformed, the raw data is loaded into the database or HDFS. Typically, the load process does not require a schema, and the data can remain unprocessed in the repository (effectively archived) for a long time. When the data is needed, someone can define a schema, transform the data, and determine how to parse it. This person can even load the new, transformed data onto another platform, such as Apache HBase.

The advantage of using ELT is that the raw data can remain in storage for a long time, and others can use the same data in the way they want, rather than being bound to the denormalization approach and the design of a system that was built five years ago.

Over the years, ETL technology and tools have remained almost completely unchanged, especially in the area of data warehousing. The tools have improved, but the method remains largely the same. You extract data from a variety of sources, run a set of scripts or ETL workflows to transform that data, and then load it into a star-schema or semi-normalized data warehouse or master data management system.

Most data professionals are familiar with ETL. Issues such as change management, slowly changing dimensions, insertions, updates, and release announcements were resolved many years ago or have known solutions. Because data in the data warehouse is not always reliable, people fall back on Microsoft Excel® spreadsheets to store data. ETL has well-established concepts, methodologies, and strategies, although many different approaches still exist.

The key limitation of the ETL concept is that early in the process, someone must determine what data is important, what data needs to be updated, what data needs to be put aside, and who is permitted to access the data. A data warehouse or master data management system stores only data that someone considered important. The original raw data is not stored and cannot be retrieved. Data marts and transformed data become the only available data, even though they are only a subset, and the person who created and designed them may no longer work at the company and may have had different ideas about which data matter.

Given these limitations, people start to find workarounds, such as storing data in a local database. Departments build their own silos and data marts, and suddenly, master data becomes an interesting concept, not a reality. Data is not integrated. Sales, marketing, and finance teams have different data. Numbers and dashboards are unreliable and untrustworthy. Obviously, ETL cannot accommodate big data.

Using Hive as an alternative to traditional ELT tools

Apache Hive data warehouse software helps you query and manage large datasets that reside in distributed storage. For ETL, Hive is a powerful tool, and it serves as both a data warehouse and a database for Hadoop. However, it is slow compared to a traditional database. It does not provide all of the SQL features, and does not even provide the same database features as a traditional database. But it supports SQL, and it does work like a database, allowing more people (even those who aren't programmers) to work with Hadoop. It provides a way to transform unstructured and semi-structured data into usable, schema-based data. Want to build a master data management system? You can take advantage of Hive. Want to build a data warehouse? You can also take advantage of Hive, but you need to learn some tricks to make Hive a powerful ETL tool.

Compared to Apache Pig and MapReduce, Hive makes it easier for traditional RDBMS developers or others who understand SQL to access and transform data in Hadoop. Pig is not easy to understand, and for those who do not have a background in software development, the learning curve is steep. MapReduce is a technology that Java™, C++, and Python programmers can learn relatively quickly. However, without a foundation in a language such as Java, it is almost impossible to learn MapReduce. So, if you know SQL, Hive is easier to learn and use.

Example: How to implement ELT using Hive

I extracted comma-separated value (CSV) files from the World Bank website. The site has a wealth of sample and real data on economics, finance, poverty, and so on. In this example, I downloaded the World - Finance, Inequality, and Poverty 1958-1998 data set. These data are intended for research and educational purposes, and they are a collection of samples from 52 developing and developed countries. They include indicators such as private credit, inflation, gross domestic product (GDP), GDP growth, and income ratio, which provide the potential for insight.

In this example, I use Hive to extract data from a website, load it, and transform it. The goal of this use case is to denormalize data from several sources (in this case, four CSV files) and then perform a few simple summaries of a column. After you read this article and step through the example, you will know how to use Hive to implement ELT. To perform traditional ETL instead, simply reverse the process: first use Hive to transform and summarize the data, and then load it.

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a free, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based product. With the Quick Start Edition, you can try out some of the features of core Hadoop (HDFS, MapReduce) and other services in the ecosystem, such as Pig, Hive, and Apache ZooKeeper.

Guided learning, including step-by-step, self-paced tutorials and videos, helps make getting started with Hadoop as smooth as possible. With no time or data restrictions, you can experiment on large amounts of data on your own schedule.

To perform these steps, you need to obtain IBM InfoSphere® BigInsights™ Quick Start Edition. (You need to have or register an IBM ID before you can download InfoSphere BigInsights Quick Start Edition.)

There are two versions available for download. The first is the Quick Start Edition native software; use it to run the software on your own machine. The other is the Quick Start Edition VMware image (VMware Fusion for Mac users); you need VMware Player to work with this virtual image. I am using the VMware Player version on a CentOS 6.4 Linux® desktop.

Import data

First, download a .zip file containing four CSV files from the World Bank website. See the Downloads section.

Then, start InfoSphere BigInsights. (In the VMware Player version, just click the icon to start InfoSphere BigInsights, and you'll start and run Hadoop right away.) Click the shell folder in InfoSphere BigInsights and click Terminal to start building your directory for this use case. Listing 1 shows the command line.

Listing 1. Create a folder structure for the use case

$ sudo mkdir worldbank
$ cd worldbank
$ sudo mkdir data
$ cd data
$ wget http://microdata.worldbank.org/index.php/catalog/1788/download/28424/wld_1998_fip_v01_m_csv.zip
$ ls
$ cd wld_1998_fip_v01_m
$ ls
$ unzip wld_1998_fip_v01_m_csv.zip
$ ls

If everything worked, you should have four CSV files. Now, load the data into Hadoop. InfoSphere BigInsights offers several loading methods. The Distributed File Copy application makes loading easier, but in this example, use the following Hadoop shell commands:

$ hadoop fs -ls
$ hadoop fs -mkdir

Now that you have created the worldbank directory in HDFS (/user/biadmin/worldbank in the commands that follow), you can add the CSV files to Hadoop using the following code:

$ hadoop fs -copyFromLocal /home/biadmin/worldbank/data/wld_1998_fip_v01_m/*.csv /user/biadmin/worldbank
$ hadoop fs -ls /user/biadmin/worldbank

The four CSV files are now in HDFS. The next step is to transform those files with Hive.

Designing the Hive schema and database

First, you need to understand how the data is laid out in these CSV files. In the Linux terminal window, enter the following command:

$ head -2 finance_inequality_and_the_poor_data_6005.csv

The head -2 command lets you see the first two lines of the CSV file (the header plus the first data row). You can see 17 columns, most of which are in numeric or decimal format, which makes things easier. The only exception is countrycode, which is a string. Next, look at the other files. In Linux, you can compare files in different ways (with diff, for example), but for now, just use the following command to view column names and data:

$ head -2 finance_inequality_and_the_poor_data_8005.csv

The second file has additional columns: besides countrycode, there are country and year columns. However, only 13 columns are visible, and when you compare this file with the *_6005.csv file, some other columns seem to be missing. As you work through this process, you will see that the four CSV files all differ, which makes it more challenging to normalize and summarize the data.
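The header comparison described above can also be scripted. The following is a minimal shell sketch, not part of the original walkthrough: the function names and the two small sample files are hypothetical stand-ins for the World Bank CSVs, and the sketch assumes plain comma-separated headers with no quoted commas.

```shell
#!/bin/sh
# Print the lower-cased column names of a CSV header, one per line, sorted.
# Assumption: the header is plain comma-separated text with no quoted commas.
header_columns() {
    head -n 1 "$1" | tr -d '\r' | tr ',' '\n' | tr '[:upper:]' '[:lower:]' | sort
}

# Count the columns in the header row.
count_columns() {
    header_columns "$1" | wc -l | tr -d ' '
}

# Print the columns that appear in the first file's header but not the second's.
only_in_first() {
    first_cols="$(mktemp)"
    second_cols="$(mktemp)"
    header_columns "$1" > "$first_cols"
    header_columns "$2" > "$second_cols"
    comm -23 "$first_cols" "$second_cols"
    rm -f "$first_cols" "$second_cols"
}

# Hypothetical sample files standing in for two of the downloaded CSVs.
workdir="$(mktemp -d)"
printf 'countrycode,year,span\nARG,1960,5\n' > "$workdir/sample_a.csv"
printf 'countrycode,country,year\nARG,Argentina,1960\n' > "$workdir/sample_b.csv"

count_columns "$workdir/sample_a.csv"
only_in_first "$workdir/sample_a.csv" "$workdir/sample_b.csv"
only_in_first "$workdir/sample_b.csv" "$workdir/sample_a.csv"
```

Running only_in_first in both directions shows which columns each file would leave empty in a consolidated table.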

Determine how your database is modeled. As you can see, you will have four core tables (6005, 8005, data_panel, and data_poverty). After you create and load these tables, you need to create a denormalized master table. In this example, the table is called Finance_Inequality. This table consolidates all of the data from the four tables. (Obviously, some columns will be left blank.)

After creating and loading the master table (Finance_Inequality), the final step is to summarize a column and create a Finance_Inequality_Transformed table. This is a simple use case, but it is not hard to imagine a production-level system that uses Hive to build complex summary and transformation tables.

Look at the columns you must create for the 6005 table:

countrycode
year
loginitialgini
growthingini
span
loginitialgdppercapita
growthgdppercapita
privcreavg
logprivatecredit
inflation
logtrade
gr_ltrade
gr_school
logschooling
logcommercialcentralbank
loginitiallowestincomeshare
growthinlowestincomeshare

This list makes it easy to understand what needs to be done and what to create. Of the 17 columns in the 6005 table, all except countrycode are integer or decimal. Next, start building the database.

Click InfoSphere BigInsights Shell, and then click Hive Shell.

I usually create a text file for each table, write the SQL and data manipulation language (DML) code in the file, save the file, and then paste the code into the Hive shell. This approach is much easier than typing SQL and DML code line by line in the command shell.

Listing 2 provides the SQL code to create the database and the first table, tbl_6005.

Note: In the InfoSphere BigInsights version I used, DECIMAL did not work, so I had to use DOUBLE.

Listing 2. SQL code to create the database and the 6005 table

SHOW DATABASES;
CREATE DATABASE worldbank_finance_inequality;
USE worldbank_finance_inequality;
CREATE TABLE IF NOT EXISTS worldbank_finance_inequality.tbl_6005 (
  countrycode STRING,
  year INT,
  loginitialgini DOUBLE,
  growthingini DOUBLE,
  span INT,
  loginitialgdppercapita DOUBLE,
  growthgdppercapita DOUBLE,
  privcreavg DOUBLE,
  logprivatecredit DOUBLE,
  inflation DOUBLE,
  logtrade DOUBLE,
  gr_ltrade DOUBLE,
  gr_school DOUBLE,
  logschooling DOUBLE,
  logcommercialcentralbank DOUBLE,
  loginitiallowestincomeshare DOUBLE,
  growthinlowestincomeshare DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Now create the tables tbl_8005, tbl_data_panel, and tbl_data_poverty in a similar way.
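Because the remaining tables follow the same shape, the CREATE TABLE text can be drafted from each file's header row. The sketch below is a hypothetical helper, not the author's method: make_ddl, the sample file, and the type rule (countrycode and country as STRING, year as INT, everything else as DOUBLE, mirroring the choices in Listing 2) are all assumptions, and the generated statement should be reviewed before use.

```shell
#!/bin/sh
# Build a Hive CREATE TABLE statement from the header row of a CSV file.
# Assumptions (hypothetical helper, not from the original article):
#   - countrycode and country are STRING, year is INT, all else DOUBLE;
#   - the header is plain comma-separated text with no quoted commas.
make_ddl() {
    csv_file="$1"
    table_name="$2"
    column_defs="$(head -n 1 "$csv_file" | tr -d '\r' |
        tr '[:upper:]' '[:lower:]' | tr ',' '\n' |
        awk '{
            t = "DOUBLE"
            if ($0 == "countrycode" || $0 == "country") t = "STRING"
            else if ($0 == "year") t = "INT"
            printf "%s  %s %s", (NR > 1 ? ",\n" : ""), $0, t
        }')"
    printf "CREATE TABLE IF NOT EXISTS %s (\n%s\n)\nROW FORMAT DELIMITED FIELDS TERMINATED BY ',';\n" \
        "$table_name" "$column_defs"
}

# Hypothetical four-column sample standing in for the 8005 file.
workdir="$(mktemp -d)"
printf 'CountryCode,Country,Year,Inflation\nARG,Argentina,1960,0.25\n' > "$workdir/sample_8005.csv"

# Save the statement to a text file, one file per table.
make_ddl "$workdir/sample_8005.csv" worldbank_finance_inequality.tbl_8005 > "$workdir/tbl_8005.hql"
cat "$workdir/tbl_8005.hql"
```

The saved file can be pasted into the Hive shell; the Hive command-line client also accepts a script file through its -f option.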

After you have created four tables, you are ready to populate them with the following commands:

LOAD DATA INPATH '/user/biadmin/worldbank/finance_inequality_and_the_poor_data_6005.csv' OVERWRITE INTO TABLE worldbank_finance_inequality.tbl_6005;
SELECT * FROM worldbank_finance_inequality.tbl_6005 LIMIT 10;
SELECT COUNT(*) FROM worldbank_finance_inequality.tbl_6005;

After you populate the table and run some simple SELECT statements, the data should look correct. You can delete the header rows from the files before loading them into HDFS or a Hive table; otherwise, each header is loaded as a data row.
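Removing the header row is a one-line operation on the local file before it is copied to HDFS. A minimal sketch, assuming the header occupies exactly the first line; strip_header and the sample file are hypothetical names used only for illustration:

```shell
#!/bin/sh
# Copy a CSV file without its first line (the header row).
strip_header() {
    tail -n +2 "$1" > "$2"
}

# Hypothetical sample standing in for one of the World Bank files.
workdir="$(mktemp -d)"
printf 'countrycode,year,inflation\nARG,1960,0.25\nBRA,1960,0.31\n' > "$workdir/sample.csv"

strip_header "$workdir/sample.csv" "$workdir/sample_noheader.csv"
cat "$workdir/sample_noheader.csv"
```

Newer Hive releases also provide a skip.header.line.count table property for this purpose; whether the Hive version bundled with the InfoSphere BigInsights edition used here supports it is an open question, so the file-level approach is the safer assumption.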

Note: When you create a table, remember to include the ROW FORMAT DELIMITED FIELDS TERMINATED BY clause, with the delimiter enclosed in single quotation marks ('). If this critical clause is omitted, your data will not be loaded correctly.
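Before loading, it can help to confirm that a file really is split consistently by the delimiter you plan to declare. The following sketch is an illustration only: field_counts and both sample files are hypothetical, and the check assumes no quoted fields with embedded commas.

```shell
#!/bin/sh
# Print the distinct field counts found in a comma-delimited file.
# A single number means every row has the same number of fields.
field_counts() {
    awk -F ',' '{ print NF }' "$1" | sort -n | uniq
}

workdir="$(mktemp -d)"
# Consistent file: every row has three fields.
printf 'ARG,1960,0.25\nBRA,1960,0.31\n' > "$workdir/good.csv"
# Inconsistent file: the second row is missing a field.
printf 'ARG,1960,0.25\nBRA,1960\n' > "$workdir/bad.csv"

field_counts "$workdir/good.csv"
field_counts "$workdir/bad.csv"
```

More than one number in the output points to rows that would be misaligned against the table's columns.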

Building a master finance_inequality table to consolidate data

After building the four core tables, the next step is to build the master table and consolidate the data from all four tables. This task may be somewhat complex, depending on your source data. For this use case, a left outer join and a right outer join are sufficient to create the master table. Listing 3 shows the relevant code.

Listing 3. Code to build the master finance_inequality table

CREATE TABLE worldbank_finance_inequality.master_tbl_finance_inequality AS
SELECT
  a.countrycode, b.country, a.year, c.timeperiod,
  a.loginitialgini, a.growthingini, a.span,
  a.loginitialgdppercapita, a.growthgdppercapita,
  a.privcreavg, a.logprivatecredit, a.inflation,
  a.logtrade, a.gr_ltrade, a.gr_school, a.logschooling,
  a.logcommercialcentralbank,
  a.loginitiallowestincomeshare, a.growthinlowestincomeshare,
  d.logagedependency, d.loginitialheadcount, d.loginitialpovertygap,
  d.growthinheadcount, d.growthinpovertygap,
  d.growthinmeanincome, d.populationgrowth
FROM worldbank_finance_inequality.tbl_6005 a
LEFT OUTER JOIN worldbank_finance_inequality.tbl_8005 b ON a.countrycode = b.countrycode
RIGHT OUTER JOIN worldbank_finance_inequality.tbl_data_panel c ON a.countrycode = c.countrycode
LEFT OUTER JOIN worldbank_finance_inequality.tbl_data_poverty d ON a.countrycode = d.countrycode;

After you run this code, there will be 315 rows or 314 rows, depending on whether you discarded the header row. With this master table, you can combine, summarize, delete, or do whatever else you want with the data. Obviously, with more in-depth research, you will find data quality issues or duplicate data, but for this use case, you have completed ELT using Hive and built a master table through the transformation effort.
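One source of duplicate rows is worth checking early: Listing 3 joins on countrycode alone, so if a core table holds more than one row per country (for example, one per year), every matching pair of rows is combined and the result grows. A small sketch for spotting repeated keys in a local CSV follows; duplicate_keys and the sample file are hypothetical, and the sketch assumes countrycode is the first field of a file with no header row.

```shell
#!/bin/sh
# List each countrycode that occurs more than once in a headerless CSV,
# followed by its row count. Assumes countrycode is the first field.
duplicate_keys() {
    cut -d ',' -f 1 "$1" | sort | uniq -c | awk '$1 > 1 { print $2, $1 }'
}

# Hypothetical sample: ARG appears in two rows, BRA in one.
workdir="$(mktemp -d)"
printf 'ARG,1960,0.25\nARG,1965,0.40\nBRA,1960,0.31\n' > "$workdir/sample.csv"

duplicate_keys "$workdir/sample.csv"
```

If keys repeat, adding a second join condition (such as year, where both tables have it) is one option to consider; whether that suits this data set is not something the article settles.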

Summary table

You have a number of options for summarizing or analyzing data in a denormalized master table. You can combine columns, drop columns, add, subtract, multiply, or divide several columns, and generate your own derived columns in a new table. You need a strategy. Perhaps you want to find the correlation between population growth and the gap between rich and poor. You can run complex Hive SQL queries to get these results, but if you want to make your own or your users' work easier, consider creating a summary table.

The process of creating a summary table is similar to creating the master table. In Hive, you can use constructs such as IF, CASE ... WHEN, LIKE, and so on. Hive has its limitations compared to a traditional RDBMS, but with the right ideas and preparation, you can create summary queries or more complex derived tables.
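As an illustration of what such a summary table could look like, the sketch below writes a CREATE TABLE ... AS SELECT statement to a script file, following the one-text-file-per-table habit described earlier. The choice of aggregates (per-country averages of populationgrowth and growthinpovertygap), the avg_ column names, and the file name are assumptions made for this example; the table name follows the Finance_Inequality_Transformed table mentioned earlier. The script only writes the file and does not contact Hive.

```shell
#!/bin/sh
# Write a hypothetical summary-table statement to a HiveQL script file.
# The aggregates and column aliases are illustrative assumptions.
workdir="$(mktemp -d)"
cat > "$workdir/finance_inequality_transformed.hql" <<'EOF'
CREATE TABLE worldbank_finance_inequality.finance_inequality_transformed AS
SELECT
  countrycode,
  AVG(populationgrowth) AS avg_populationgrowth,
  AVG(growthinpovertygap) AS avg_growthinpovertygap
FROM worldbank_finance_inequality.master_tbl_finance_inequality
GROUP BY countrycode;
EOF

cat "$workdir/finance_inequality_transformed.hql"
```

Users can then query the smaller table directly instead of repeating the aggregation in every query.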

Run a query

Use the code in Listing 4 to run a query.

Listing 4. Run a query in the Hive shell

USE worldbank_finance_inequality;
SELECT countrycode, country,
  CASE WHEN year > 1990 THEN year ELSE 0 END AS theyear,
  populationgrowth, growthinpovertygap
FROM master_tbl_finance_inequality;

Conclusion

Clearly, choosing between ELT and ETL requires careful thought. For many data warehouse, master data management, and other database projects, this decision may take more than 70% of the planned time. Effective data analysis requires the right data. Without the right data, you can't get an accurate analysis.

In this article, our use case shows that it is easier to dump the data into HDFS first and consider the schema afterward. The Hive database was created only after the data had entered HDFS, and the sample data came from CSV files on the web. But the data can come from any source and can take any format. In the example, we then identified the layout, delimiters, variables, and other factors for the files, set up the database for that data, and ran a query.

Hive certainly has its limitations, but if you are in the Hadoop ecosystem and already understand SQL, you can use this wonderful tool to start building databases, tables, workflows, transformations, and data integration. This is a relatively simple use case, and more complex processes can be built in Hive and Hadoop.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
