Model Selection and Construction of Cloud Data Warehouse

Source: Internet
Author: User
Keywords: data warehouse, data warehouse definition, cloud data warehouse

As one of the most traditional data applications, the data warehouse plays an important role in the enterprise. Building and correctly configuring it is essential for data analysis.

1. Data warehouse construction

There are many ways to build a data warehouse (DW), and enterprises can choose among them according to their own needs.
1.1 Construction plan
1) Commercial solution
The commercial solution is the most traditional option and has been the mainstream approach for the past 20 to 30 years. The enterprise purchases a data warehouse product, delivered as an integrated package of software and hardware. There are many typical products; most come from internationally known large vendors, and some from domestic vendors.
2) Self built + open source
This is a common solution among Internet companies: the enterprise builds its own infrastructure and deploys open source software on it. The enterprise has full independence and control over the whole solution, but the approach places high technical demands on its own staff. A typical product is Greenplum.
3) Cloud + open source
This is a variation of the previous solution: the IaaS layer is provided by a cloud vendor, and everything else is still self built. When the enterprise's business already runs in the cloud, this scheme is often adopted for better data integration and easier data migration.
4) DW cloud
The enterprise directly chooses a data warehouse cloud service instead of building its own. This scheme is the focus of the sections below.
1.2 Scheme comparison
The four schemes above are compared in terms of cost, operation and maintenance, delivery, scalability, and performance.
Cost: covers the initial purchase and later operation, including the equivalent cost of staff time.
Operation and maintenance complexity: how complex operation and maintenance is for the enterprise's own technical staff.
Delivery speed: the overall delivery speed of the scheme, including the purchase and construction of infrastructure.
Scalability: covers both capacity expansion and performance expansion of the warehouse.
Performance: the overall performance of the warehouse.
1.3 Key comparison: cost and performance
As can be seen from the above figure:
Scheme 1 and scheme 2 are relatively fixed in cost and performance. In scheme 1, the cost is high but the performance is outstanding; in scheme 2 (self built), both are medium.
In schemes 3 and 4, cost and performance move together, and both span a wide range. Scheme 3 depends mainly on the capability of the infrastructure provided by the cloud vendor. Scheme 4 depends on the capability of the cloud vendor's data warehouse service. This places higher demands on the choice of cloud vendor product, as explained below.
2. Cloud data warehouse
2.1 Advantages of cloud solutions
Based on the above description, a data warehouse cloud service has many advantages, including:
Better price performance (for both the initial purchase and later operation)
Faster delivery (within minutes)
Better elasticity (scaling up or down, for compute or storage)
Lower O&M complexity (no need for dedicated specialists)
Simpler data integration (if the data is already on the same cloud)
Richer data ecosystem (depending on the cloud vendor's products)
2.2 Key factors of a data warehouse
A data warehouse differs from a transactional database. It is built to facilitate the analysis of massive data rather than to process transactions. As a result, a data warehouse is often several orders of magnitude larger than its corresponding transactional database, and some key features of transactional databases (such as ACID guarantees and response time) may matter less. Instead, a data warehouse has its own requirements, which can also serve as selection criteria when moving to the cloud.
1) Multiple data integration methods
Getting data into the warehouse and formatting it correctly is often one of the biggest challenges a data warehouse faces. Traditionally, data warehouses rely on batch extract, transform, load (ETL) jobs. ETL jobs are still important, but a warehouse should also be able to ingest data from streams and even run queries directly on data that is not stored in the warehouse.
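A batch ETL job of the kind described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV layout, the `fact_orders` table, and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a transactional system (assumed columns).
RAW_CSV = """order_id,amount,currency
1001, 19.90 ,usd
1002,5.00,USD
1003,,usd
"""

def extract(text):
    """Extract: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount, normalize types and casing."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), float(amount), row["currency"].strip().upper())
        )
    return cleaned

def load(conn, records):
    """Load: write the cleaned records into the (hypothetical) fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", records)
    conn.commit()

# In-memory SQLite stands in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
row_count = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]
```

The incomplete third row is dropped during the transform step, so only two rows reach the table. A streaming pipeline would apply the same transform to records as they arrive instead of to a whole batch.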
2) Support for multiple query types
In addition to typical batch queries, today's data warehouses also need to support ad hoc queries. MapReduce, from the traditional Hadoop big data stack, is not well suited to such queries. Many data warehouses turn to massively parallel processing (MPP) databases, which distribute the data across multiple servers and execute queries on them in parallel. In-memory parallel processing technologies such as Spark can also serve such queries.
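The MPP idea of splitting the data and running the query on every piece in parallel can be sketched as a scatter-gather aggregation. This is a toy model, not how any particular MPP product is implemented: threads stand in for servers, and the function names and sample rows are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_workers):
    """Scatter: assign each row to a worker by hashing its key."""
    shards = [[] for _ in range(n_workers)]
    for key, value in rows:
        shards[hash(key) % n_workers].append((key, value))
    return shards

def local_sum(shard):
    """Each worker aggregates only its own shard."""
    totals = {}
    for key, value in shard:
        totals[key] = totals.get(key, 0) + value
    return totals

def parallel_group_sum(rows, n_workers=4):
    """Gather: merge the partial results from all workers."""
    shards = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(local_sum, shards))
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged

# Equivalent in spirit to: SELECT region, SUM(amount) ... GROUP BY region
sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
result = parallel_group_sum(sales)
```

Hash partitioning sends all rows with the same key to the same shard, so each worker can finish its groups without talking to the others, which is what lets this style of query scale with the number of servers.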
3) Standard data access
Which languages does the data warehouse support for queries? Standard SQL is clearly the most user-friendly option and significantly lowers the barrier for users. Support for high-level languages such as Python and R gives users more ways to access data. Some products rely on SQL dialects, however, and this requires careful evaluation, because the cost of migration is not small.
4) Elastic resources
A data warehouse is designed to handle massive data, but its scale may vary greatly, and the demand for computing resources changes with the business. A cloud-based data warehouse therefore needs highly elastic resources, which is also a great advantage over the traditional self-built approach. The resources here include not only compute but also data storage. It is also worth checking whether compute and storage can be provisioned separately rather than being tightly coupled.
5) Low operation and maintenance cost
A data warehouse is a complex system, spanning physical resources, operating systems, and warehouse software at the bottom, up to data objects and access statements at the top. A data warehouse on the cloud needs to provide simple, flexible, automated, and even intelligent operation and maintenance capabilities, which makes it convenient for customers to use and reduces their overall operation and maintenance costs.
6) Flexible use
A data warehouse is a resource-intensive application, so cloud vendors need to consider how to reduce the user's cost. Examples include support for pausing and resuming the warehouse and for scaling compute and storage independently.
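The effect of pause and resume on cost can be illustrated with simple arithmetic. The sketch below assumes a billing model in which compute is charged per active hour and storage is charged for the whole month; the function name and all prices are made-up placeholders, not any vendor's actual rates.

```python
def monthly_cost(compute_rate_per_hour, active_hours, storage_tb, storage_rate_per_tb):
    """Compute is billed only while running; storage is billed all month."""
    return compute_rate_per_hour * active_hours + storage_tb * storage_rate_per_tb

HOURS_IN_MONTH = 30 * 24  # 720 hours

# Hypothetical prices: 2.0 per compute hour, 20.0 per TB of storage per month.
always_on = monthly_cost(2.0, HOURS_IN_MONTH, 10, 20.0)

# Same warehouse, paused outside 8 working hours on 22 working days.
paused_off_hours = monthly_cost(2.0, 22 * 8, 10, 20.0)

savings = always_on - paused_off_hours
```

With these placeholder numbers the always-on warehouse costs 1640.0 per month and the paused one 552.0. The storage charge is identical in both cases, which is why independent scaling of compute and storage matters: only the compute term shrinks when the warehouse is paused.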

2.3 Moving to the cloud: how to choose?
Using a data warehouse cloud service has many advantages, but should you move to the cloud? The answer depends on the following factors, weighed against the needs of the enterprise.
1) Is there enough technology accumulation?
A data warehouse has a high technical threshold. Even with open source software, building the necessary expertise takes time and experimentation, unless you directly use an external commercial product.
2) Is the cloud already in use?
If you are already a cloud customer, integrating data within the cloud will be easier. Otherwise, loading data across clouds or from on-premises systems is a big project.
3) Are availability requirements high?
Enterprises differ greatly in this respect. If the enterprise attaches great importance to availability, cloud vendors and commercial products undoubtedly have an advantage.
4) Is the data large?
One of the core difficulties of a data warehouse is the scale of data it must support. If the enterprise's data is very large, the self-built approach faces great challenges.
5) Is there a strong demand for expansion?
If the enterprise is in a period of rapid growth, its data scale and usage complexity change considerably. Compared with the other three schemes, the cloud scheme undoubtedly has an advantage.
6) Do usage patterns vary dramatically?
If the way the enterprise uses its data is highly uncertain, the data warehouse must be flexible, with capacity that can be expanded and reduced on demand; the same applies even to query capacity. The other three schemes have difficulty adapting to rapid change.
7) Is short-term cost pressure high?
If the enterprise needs a data warehouse but building one itself or purchasing a commercial product requires too large a one-time investment in the short term, the cloud scheme can also be considered.
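The seven questions above can be kept as a simple checklist. The sketch below only records the answers and lists which factors point toward a cloud data warehouse; the class, field names, and example answers are hypothetical, and it deliberately does not compute a verdict, since the weighting depends on each enterprise.

```python
from dataclasses import dataclass, fields

@dataclass
class Assessment:
    """Answers to the seven questions, phrased so True favors the cloud."""
    lacks_in_house_expertise: bool
    already_on_cloud: bool
    needs_high_availability: bool
    has_large_data_volume: bool
    expects_rapid_growth: bool
    has_volatile_usage: bool
    has_short_term_cost_pressure: bool

def factors_favoring_cloud(assessment):
    """Return the names of the factors answered True."""
    return [f.name for f in fields(assessment) if getattr(assessment, f.name)]

# Hypothetical enterprise: new to warehousing, already on a cloud, growing fast.
example = Assessment(
    lacks_in_house_expertise=True,
    already_on_cloud=True,
    needs_high_availability=False,
    has_large_data_volume=False,
    expects_rapid_growth=True,
    has_volatile_usage=True,
    has_short_term_cost_pressure=False,
)
favoring = factors_favoring_cloud(example)
```

For this example, four of the seven factors favor the cloud scheme; whether that is decisive remains a judgment for the enterprise.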
