Data warehouse architecture

Source: Internet
Author: User

Introduction
Data warehouse architecture is a branch of IT architecture. As data plays an increasingly central role in enterprises, data warehouse architecture becomes increasingly important. It can seem complicated because of the wide range of technologies to choose from, but a stable set of ideas lies behind it. Recognizing this is a key point in designing a data warehouse architecture: there is change within stability, and stability within change.
In general, data warehouse architecture is divided into two major parts: the hardware architecture and the software architecture. These can be further divided into closed and open architectures. Teradata is the representative vendor of the closed hardware architecture: its hardware is proprietary, and its product must run on that dedicated hardware. Oracle represents the open hardware architecture and can run on a variety of hardware. However, the boundary between open and closed architectures is gradually blurring: Oracle began to bundle dedicated HP hardware to promote its DW solution, and Teradata began to offer its DW product on hardware running a SUSE-based OS. The advantage of closed hardware is that it works out of the box, has been rigorously tested by the manufacturer, and offers a high level of security. Open hardware requires the enterprise to have strong technical capabilities: a team with comprehensive knowledge of hardware, storage, and operating systems that can assemble these into a base platform able to run DW software, and that can quickly locate and resolve the root cause when a problem occurs.
The software architecture of a data warehouse is more diverse. Database software, ETL software, presentation software, and data mining software each come with many options. Selecting this software is part of the architecture design, but the core of the design is the set of ideas that integrates these components. Guided by a DW architecture design philosophy, the software itself can be chosen flexibly.
The main difference in the physical architecture of the software is between row storage and column storage, a topic many vendors once promoted heavily. The two approaches can be used flexibly according to need. Most database software uses row-based storage. Column-based storage compresses the values of a single column efficiently, and when a query selects only a few columns the I/O requirement is very low and the query is fast. However, the compression efficiency of row-store databases is also improving rapidly, most requirements involve selecting whole rows for inspection, and row storage makes it easier to split a table by record for parallel processing.
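The trade-off described above can be sketched in a few lines of Python. This is a toy in-memory model, not how any particular database lays out its pages; the sample table and all function names are made up for illustration. It shows that a single column compresses well with run-length encoding, that reading one column touches far fewer cells in a column layout, and that a row layout splits naturally by record.

```python
from itertools import groupby

# Hypothetical sample table: (order_id, region, amount)
ROWS = [
    (1, "east", 10.0),
    (2, "east", 12.5),
    (3, "east", 7.0),
    (4, "west", 3.0),
    (5, "west", 8.5),
]
COLUMNS = ("order_id", "region", "amount")


def to_column_store(rows, columns):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def run_length_encode(values):
    """Compress a single column as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def sum_amount_row_store(rows):
    """Row layout: every full row is touched to read one field."""
    cells_read = 0
    total = 0.0
    for row in rows:
        cells_read += len(row)
        total += row[2]
    return total, cells_read


def sum_amount_column_store(store):
    """Column layout: only the 'amount' column is touched."""
    column = store["amount"]
    return sum(column), len(column)


def split_rows(rows, workers):
    """Row layout: split the table by record across parallel workers."""
    return [rows[i::workers] for i in range(workers)]
```

Summing `amount` reads 15 cells in the row layout and 5 in the column layout here, while `split_rows` shows why record-wise parallelization is straightforward for row stores.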
Yahoo Data Warehouse
In its infrastructure, the Yahoo data warehouse consists of Hadoop clusters and Oracle clusters. The Hadoop clusters are the computing platform that performs all ETL data processing. The Oracle clusters serve only as a query environment.
Data is loaded from the source systems into the ODS layer of the data warehouse through the Data Highway. Data in the ODS layer remains the same as in the source system. The EDW layer has no strict logical subdivision; it may involve multiple stages of ETL processing and multiple layers of data storage. Data in this layer is modeled mainly with the dimensional modeling method, based on application requirements, and is stored in a column-based structure. After processing, the data is synchronized to the Oracle cluster for querying.
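The flow from source to ODS to a dimensional model can be illustrated with a minimal Python sketch. It is only a conceptual illustration under stated assumptions: the record fields (`page`, `views`) and all function names are hypothetical, and the real pipeline runs as ETL jobs on Hadoop rather than as in-memory Python.

```python
def load_ods(source_rows):
    """ODS keeps the source records unchanged (a plain copy)."""
    return [dict(row) for row in source_rows]


def build_dimension(ods_rows, attribute):
    """Assign a surrogate key to each distinct attribute value."""
    dimension = {}
    for row in ods_rows:
        # setdefault only inserts when the value has not been seen yet
        dimension.setdefault(row[attribute], len(dimension) + 1)
    return dimension


def build_fact(ods_rows, dimension, attribute, measure):
    """Replace the attribute with its surrogate key and keep the measure."""
    return [
        {f"{attribute}_key": dimension[row[attribute]], measure: row[measure]}
        for row in ods_rows
    ]
```

The ODS step copies records as-is, and the EDW step derives a dimension table and a fact table from them, which is the essence of dimensional modeling.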
Yahoo uses Oracle as the query environment. They make heavy use of time-based RANGE partitions with HASH subpartitions to improve query response times (similar to Greenplum). The data is compressed, and Oracle customized some improvements to its compression and read methods for them to achieve better read I/O and compression. The MSTR reporting tool connects to Oracle to serve most report queries. When the most detailed data is needed, the tool connects to the Hadoop cluster instead and creates temporary tables to answer the query. Yahoo's warehouse also has a powerful metadata management system: technical metadata is written directly into the metadata repository by parsing SQL, with mappings captured at the field level. In addition, their PMs maintain the latest business metadata (business rules, indicator definitions) in the same metadata repository.
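To show why range partitions with hash subpartitions help query response, here is a small Python sketch of the routing idea. It is an illustration only: the monthly granularity, the bucket count, the `user_id` key, and the use of CRC32 are all assumptions, and Oracle uses its own internal hash function and storage layout.

```python
import zlib
from collections import defaultdict
from datetime import date

SUBPARTITIONS = 4  # hypothetical number of hash subpartitions


def range_partition(event_date):
    """Monthly range partition key, e.g. '2012-03'."""
    return f"{event_date.year:04d}-{event_date.month:02d}"


def hash_subpartition(user_id, buckets=SUBPARTITIONS):
    """Stable bucket number; CRC32 stands in for the database's own hash."""
    return zlib.crc32(user_id.encode("utf-8")) % buckets


def load(rows):
    """Route each (date, user_id, clicks) row to its (month, bucket) slot."""
    table = defaultdict(list)
    for event_date, user_id, clicks in rows:
        key = (range_partition(event_date), hash_subpartition(user_id))
        table[key].append((event_date, user_id, clicks))
    return table


def query_clicks(table, month, user_id):
    """Partition pruning: scan exactly one (month, bucket) slot."""
    key = (month, hash_subpartition(user_id))
    return sum(c for _, u, c in table.get(key, []) if u == user_id)
```

A query that filters on both the month and the hashed key only has to scan one slot instead of the whole table, which is the effect partition pruning aims for.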
