How to overcome data migration problems when moving a data warehouse to the cloud?


Cloud computing and data warehousing are a natural pairing. Cloud storage can be scaled on demand, and the cloud can bring a large number of servers to bear on a specific task. A typical data warehouse, by contrast, is a local analysis platform constrained by its compute and storage resources, and by how well its designers anticipated the integration of new data sources. If some of the challenges of data migration can be overcome, these problems can be solved by moving the data warehouse and its analysis tools from dedicated servers in the datacenter to cloud-based file systems and databases.

Cloud data management typically involves loading and maintaining files in a distributed file system, such as the Hadoop Distributed File System (HDFS), and then processing the data with tools such as MapReduce. For data warehouses and other analytic workloads, database tools such as Hive provide SQL-like functionality on top of the distributed file system.
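As a minimal sketch of that workflow using the standard Hadoop and Hive command-line tools (the file, directory, and table names here are illustrative placeholders):

    # Copy a local export file into the distributed file system
    hdfs dfs -mkdir -p /warehouse/staging/sales
    hdfs dfs -put sales_2023.csv /warehouse/staging/sales/

    # Once a Hive table is defined over that location, query it with SQL-like syntax
    hive -e "SELECT region, SUM(amount) FROM sales GROUP BY region;"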

Traditional relational database management systems and cloud-based non-relational databases can be described in similar terms, but when data moves between the two, their different operating models can cause problems. The extract, transform, and load (ETL) process can create even more challenges.

Data migration tools assist the move to the cloud

Extracting data from a database is easy; extracting large volumes of data from a database efficiently is a challenge. If a data warehouse is running into performance or storage problems as data volumes grow, it may be time to consider cloud resources. The following tools can assist in loading data from relational databases into cloud file systems and databases.

Purpose-built tools such as Sqoop (SQL-to-Hadoop) generate code to extract data from a relational database and copy it into HDFS or Hive. Sqoop uses JDBC drivers to work with many types of relational databases, but pulling large amounts of data through JDBC carries a performance cost.
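A basic Sqoop table import over JDBC might look like the following sketch; the connection string, credentials, and table name are placeholders:

    # Pull one table from a relational database into HDFS over JDBC
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /warehouse/staging/orders \
      --num-mappers 4   # parallel extract tasks; JDBC remains the bottleneck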

To migrate to the cloud, you may need to transform data as you extract it from the relational database. If all of the data you are working with comes from a single database, you can transform it in the source database. If you are merging data from two separate systems, it is more efficient to transform the data after extraction; either way, do it before loading the data into its final repository. The Cascading data processing API can assist with this task.
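For the single-database case mentioned above, one way to push the transformation into the source database is Sqoop's free-form query import, sketched here with hypothetical table and column names. Sqoop requires the literal $CONDITIONS token and a --split-by column when parallelizing a query import:

    # Let the source database perform the join before extraction
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username etl_user -P \
      --query 'SELECT o.id, o.amount, c.region
               FROM orders o JOIN customers c ON o.cust_id = c.id
               WHERE $CONDITIONS' \
      --split-by o.id \
      --target-dir /warehouse/staging/orders_by_region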

Cascading provides capabilities that run on top of Hadoop, such as workflow processing and scheduling. It works on a pipes-and-filters model: data flows through a pipe from a source to a destination, with filters applied along the way. Other operations, such as grouping, can also be applied to the data stream. Cascading is implemented in Java and translates these operations into MapReduce jobs.
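Because Cascading is a Java library, its pipe-and-filter model is best sketched in Java. The following assumes the Cascading 2.x API; the HDFS paths, class name, and regular expression are hypothetical:

    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class CleanOrdersFlow {
        public static void main(String[] args) {
            // Source and sink taps over HDFS paths (placeholders)
            Tap source = new Hfs(new TextLine(new Fields("line")), "/warehouse/staging/orders");
            Tap sink = new Hfs(new TextLine(), "/warehouse/clean/orders");

            // A pipe applies a filter to each record flowing from source to sink
            Pipe pipe = new Pipe("clean");
            pipe = new Each(pipe, new Fields("line"), new RegexFilter("^\\d+,.*"));

            // The connector plans the pipe assembly and runs it as MapReduce jobs
            new HadoopFlowConnector().connect(source, pipe, sink).complete();
        }
    }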

If you are working with MySQL, Sqoop can use the mysqldump utility to bypass JDBC and extract data more efficiently. Sqoop can also generate Java classes that can be used to manipulate the loaded data and import it directly into Hive. HIHO (Hadoop Input and Output) extracts data from relational tables and provides some basic transformation services, such as deduplication and merging of input streams.
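A sketch of both Sqoop options combined, with connection details and table names again as placeholders:

    # --direct shells out to mysqldump instead of streaming rows through JDBC;
    # --hive-import creates and loads the corresponding Hive table
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username etl_user -P \
      --table orders \
      --direct \
      --hive-import \
      --hive-table orders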

When a file needs only minimal transformation before being loaded into the HDFS file system or a Hive data warehouse, you can load it directly. Once the target table and partition specification have been determined, Hive provides a command to load the data. Pig is a high-level language for data analysis programs, especially when compared with coding MapReduce in Java. It provides the basic statistical functions you would find in a relational database (such as MIN, MAX, and COUNT) as well as math and string processing. Pig can read both structured and unstructured text files, including compressed ones.
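As an illustration, here is a direct Hive load into a partitioned table, followed by a small Pig script computing the kind of aggregates mentioned above; all table, partition, and path names are placeholders, and the Hive table is assumed to be partitioned by year:

    # Load a staged file straight into a Hive table partition (no transformation step)
    hive -e "LOAD DATA INPATH '/warehouse/staging/sales_2023.csv' INTO TABLE sales PARTITION (year = 2023);"

    # A small Pig script with relational-style aggregates (COUNT, MIN, MAX)
    cat > region_stats.pig <<'EOF'
    sales = LOAD '/warehouse/staging/sales' USING PigStorage(',')
            AS (region:chararray, amount:double);
    by_region = GROUP sales BY region;
    stats = FOREACH by_region GENERATE group, COUNT(sales), MIN(sales.amount), MAX(sales.amount);
    DUMP stats;
    EOF
    pig -x mapreduce region_stats.pig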

Cloud computing resources complement a data warehouse infrastructure. To get the most from moving a data warehouse to the cloud, however, it is important to structure the data properly and use the right data analysis tools.


Original content from TechTarget China. Original link: http://www.searchcloudcomputing.com.cn/showcontent_58751.htm

(Responsible editor: Lu Guang)
