How to overcome the data migration problem for cloud data warehouses?

Cloud computing and data warehousing are a natural pair. Cloud storage can scale on demand, and the cloud can bring a large number of servers to bear on a specific task. A conventional data warehouse, by contrast, runs on local analysis tools that are limited by their compute and storage resources, and by how well the designers anticipated the integration of new data sources. If we can overcome some of the challenges of data migration, these problems can be solved by moving the data warehouse and its analysis tools from dedicated servers in the data center to cloud-based file systems and databases.

Cloud data management typically involves loading and maintaining files in a distributed file system, such as the Hadoop Distributed File System (HDFS), and then processing the data with tools such as MapReduce. For data warehouses and other analytic workloads, database tools such as Hive provide SQL-like functionality on top of the distributed file system.
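As an illustration, landing an extracted file in HDFS is a simple shell operation (the paths below are hypothetical); the Hive and Pig examples later in this article show the SQL-like layer that sits on top of it.

    # Hypothetical paths: stage a locally extracted file in HDFS, where
    # Hive, Pig, or MapReduce jobs can then process it.
    hadoop fs -mkdir -p /warehouse/staging
    hadoop fs -put orders.csv /warehouse/staging/orders.csv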

Although a traditional relational database management system and a cloud-based non-relational database can be described in similar terms, their different operating models can cause problems when data is transferred between the two. Extract, transform, and load (ETL) processes can create even more challenges.

Data migration tools assist in moving to the cloud

Extracting data from a database is easy; extracting large volumes of data from a database efficiently is the challenge. If the data warehouse is running into performance or storage problems as data volume grows, it may be time to consider using cloud resources. The following tools can assist in loading data from relational databases into cloud file systems and databases.

Specialized tools such as Sqoop (SQL-to-Hadoop) generate code to extract data from a relational database and copy it into HDFS or Hive. Sqoop uses JDBC drivers to work with many types of relational databases, but pulling large volumes of data through JDBC comes with a performance cost.
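As a minimal sketch, a Sqoop import from a relational database into HDFS might look like the following; the connection string, credentials, table, and target directory are all hypothetical.

    # Hypothetical MySQL source: Sqoop pulls the table through JDBC and
    # writes it to HDFS using four parallel map tasks.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /warehouse/orders \
      --num-mappers 4

Raising --num-mappers spreads the JDBC reads across more parallel tasks, which is the main lever for offsetting the JDBC performance cost noted above.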

To migrate to the cloud, you may need to transform data as you extract it from the relational database. If all the data you are working with comes from a single database, you can transform it in the source database. If data is being merged from two separate systems, it is more efficient to transform it after extraction. Either way, you should do this before loading the data into the final data repository. The Cascading data-processing API can assist with this task.

Cascading provides capabilities that run on top of Hadoop, such as workflow processing and scheduling. For example, it works with pipes and filters: data flows through a pipe from a source to a destination, with filters applied along the way. Other operations, such as grouping, can also be applied to the data stream. Cascading is implemented in Java, and its transformation API is executed as MapReduce jobs.
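A minimal Cascading flow, assuming Cascading 2.x and hypothetical HDFS paths, might pipe lines from a source tap through a filter to a sink tap, with Cascading planning the work as MapReduce jobs:

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import java.util.Properties;

    public class FilterFlow {
        public static void main(String[] args) {
            // Source and sink taps over HDFS text files (paths are illustrative).
            Tap source = new Hfs(new TextLine(), "/warehouse/staging/orders");
            Tap sink   = new Hfs(new TextLine(), "/warehouse/orders-filtered");

            // A pipe with a regex filter applied to each line flowing through it.
            Pipe pipe = new Each(new Pipe("orders"), new RegexFilter("^ORDER.*"));

            FlowDef flowDef = FlowDef.flowDef()
                .addSource(pipe, source)
                .addTailSink(pipe, sink);

            // Cascading plans the flow and runs it as one or more MapReduce jobs.
            new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
        }
    }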

If you are working with MySQL, Sqoop can use MySQL's dump facility to bypass JDBC and extract data more efficiently. Sqoop can also generate Java classes that can be used to manipulate the loaded data and import it directly into Hive. Hiho (Hadoop Input and Output) extracts data from relational tables and provides some basic transformation services, such as deduplication and merging of input streams.
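For example, with a MySQL source the same Sqoop import can bypass JDBC and land the data directly in Hive; the connection details and table name are again illustrative.

    # Hypothetical MySQL source: --direct uses MySQL's dump facility
    # instead of JDBC, and --hive-import creates and loads a Hive table.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --direct \
      --hive-import --hive-table orders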

When a file requires only minimal transformation before being loaded into the HDFS file system or the Hive data warehouse, you can load the file directly. Once you have defined the target table and the partition specification, Hive provides a command to load the data. Pig is a high-level language for data analysis programs, especially when compared with writing MapReduce code in Java. It provides the basic statistical functions you would find in a relational database (such as MIN, MAX, and COUNT) along with math and string processing. Pig handles both structured and unstructured text files, including compressed ones.
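As a sketch, assuming a hypothetical comma-delimited orders file and a date-partitioned table, the Hive load looks like this:

    -- Hypothetical table: LOAD DATA moves the file into the table's
    -- warehouse directory without transforming its contents.
    CREATE TABLE orders (id BIGINT, amount DOUBLE, region STRING)
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    LOAD DATA INPATH '/warehouse/staging/orders.csv'
    INTO TABLE orders PARTITION (dt = '2013-01-01');

A short Pig Latin script over the same hypothetical file shows the kind of relational-style aggregation described above:

    -- Hypothetical input: compute MIN, MAX, and COUNT of order amounts
    -- per region.
    orders = LOAD '/warehouse/staging/orders.csv' USING PigStorage(',')
             AS (id:long, amount:double, region:chararray);
    by_region = GROUP orders BY region;
    stats = FOREACH by_region GENERATE group AS region,
            MIN(orders.amount), MAX(orders.amount), COUNT(orders);
    DUMP stats;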

Cloud computing resources complement the data warehouse infrastructure. However, to maximize the benefits of moving a data warehouse to the cloud, it is important to structure the data properly and to use the right data analysis tools.
