One.
Data warehouse architecture is a branch of IT architecture, and it becomes increasingly important as data moves to the core of the enterprise. Because of the wide range of technology choices, data warehouse architecture looks complex, but behind it lies a relatively stable set of ideas. That is also a key point of data warehouse architecture design: stability contains change, and change contains stability.
In general, data warehouse architecture is divided into two parts: hardware architecture and software architecture. Both can be further divided into closed and open. The representative vendor of closed hardware architecture is Teradata, whose software must run on its own proprietary hardware. The representative of open hardware architecture is Oracle, which can run on a variety of hardware. The boundary between open and closed is gradually blurring, however: Oracle is also packaging HP hardware to promote its DW solution, and Teradata is starting to offer its DW products on hardware running a SUSE-based OS. The benefit of closed hardware is that it works out of the box and has passed the manufacturer's strict testing, so supportability is relatively high. Open hardware requires strong technical capability: a team with integrated knowledge of hardware, storage, and operating systems that can assemble a platform able to run the DW software and, when a problem appears, quickly locate its cause and fix it.
The choice of software architecture for the data warehouse is richer. Database software, ETL software, presentation software, and data mining software each offer many options.
Choosing this software is part of architecture design, but the more important part is the set of ideas that ties the software together. Under a coherent set of DW architecture design ideas, the individual software products can be chosen very flexibly.
Two.
What is the starting point of the Data Warehouse architecture design? What issues need to be addressed?
Architecture here is much like the architecture of a building. A well-designed building resists earthquakes and other natural disasters, and a frame structure lets the interior be rebuilt later. Data warehouse architecture solves similar problems. In fact, many data warehouses do not talk about architecture at the beginning; they start as a small workshop, where there is no need to raise things to the level of architecture. But if you want to build something that can support the business for 5-10 years, the quality of the architecture matters.
A good architecture is really accumulated experience. Its basic job is to understand the tasks the data warehouse must perform and to let those tasks be implemented efficiently and at low cost. A simple example: a data warehouse has many interdependent modules for synchronizing and summarizing data. If some of those modules fail, how should that be handled? With a poorly designed architecture, maintenance staff end up constantly hunting for the problem, cleaning up partial results, and rescheduling jobs by hand, and the scene becomes chaotic. A good architecture is first of all modular. Each module can automatically clean up after itself and restart from the point of failure, so that when an ordinary error occurs the system can recover on its own, while the handling process is recorded for later analysis. Such an architecture greatly improves maintenance efficiency and reduces the workload of maintenance staff. It also gives the whole DW system the ability to withstand anomalies.
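The idea of modules with automatic clean-up and breakpoint restart can be sketched in a few lines. This is only an illustration under stated assumptions, not a description of any particular scheduler: the names `Step` and `run_pipeline`, and the use of a JSON file as the checkpoint, are hypothetical choices made for the example.

```python
import json
import os


class Step:
    """One pipeline module: a name, a work function, and a clean-up function.

    All names here are hypothetical and exist only for this sketch.
    """

    def __init__(self, name, run, cleanup):
        self.name = name
        self.run = run
        self.cleanup = cleanup


def run_pipeline(steps, checkpoint_path, log):
    """Run steps in order, resuming after the last step recorded as done."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)

    for step in steps:
        if step.name in done:
            # Finished in an earlier run: skip it (breakpoint restart).
            log.append(("skipped", step.name))
            continue
        # Remove partial output that an earlier failed attempt may have left.
        step.cleanup()
        try:
            step.run()
        except Exception as exc:
            # Record the failure for later analysis and stop here;
            # dependent steps must not run on incomplete data.
            log.append(("failed", step.name, str(exc)))
            return False
        done.append(step.name)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
        log.append(("done", step.name))
    return True
```

Calling `run_pipeline` a second time with the same checkpoint file skips the steps that already succeeded and retries from the failed one, so an operator does not have to clean up or reschedule by hand. The `log` list stands in for whatever record the system keeps of how each problem was handled.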
Three.
In data warehouse architecture design, the starting point of a good design often lies in the defects of the current system. How to deal with those defects is one of the keys to the sustainable development of the architecture. The industry offers many commercial and open source ETL tools. Which evaluation points should be used to identify the right tools for the enterprise?
1. Cost. Cost is always a core concern of enterprises, and even more so in today's economic winter.
2. Efficiency. The ability to handle massive amounts of data efficiently is a fundamental requirement; anyone who works with data warehouses knows that data volume is a constant topic.
3. Linear expansion. Support for linear scaling is particularly important in a system planned to run for years, because it makes annual budgeting easy.
4. Collaboration. The tool should solve the problem of multi-person collaborative development.
5. Scheduling. It should be easy to see the overall schedule at a glance and to manage the various data flows from a high level.
6. Compatibility. The tool should be compatible with a variety of heterogeneous data.
7. An accurate monitoring system.
8. An efficient development framework.
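One simple way to apply evaluation points like these is a weighted scoring matrix. The sketch below is only an illustration: the function names are hypothetical, and the weights and scores are values each enterprise would have to supply from its own evaluation, not figures from this article.

```python
def weighted_score(scores, weights):
    """Return the weighted average of one tool's scores.

    `scores` and `weights` map a criterion name (for example "cost")
    to a number. Every weighted criterion must have a score.
    """
    missing = [c for c in weights if c not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    total_weight = sum(weights.values())
    return sum(scores[c] * weights[c] for c in weights) / total_weight


def rank_tools(candidates, weights):
    """Return tool names ordered from highest to lowest weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )
```

For example, an enterprise under budget pressure might give "cost" a larger weight than "dev_framework". The ranking then reflects that priority without requiring any change to the list of criteria itself.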
Four.
The physical architecture of the data warehouse includes the hardware physical architecture and the software physical architecture. The hardware physical architecture comes in two forms, centralized and distributed, and both are used within enterprises.
The centralized hardware physical architecture tends to use very powerful minicomputers or mainframes with very high-end mass storage. Management is simple, and performance can meet enterprise demand without much extra effort.
The distributed hardware physical architecture is currently very popular. It is characterized by computing clusters built from low-cost, lower-end machines. Depending on the technology, a shared-nothing architecture can use local hard disks, while a shared-everything architecture tends to use centralized storage. A distributed cluster places higher demands on the network, but it scales better, and with good software it can achieve linear expansion.
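The shared-nothing idea can be sketched as hash partitioning: each row is assigned to exactly one node by hashing a key, each node works only on its own local rows, and the partial results are combined at the end. This is a minimal in-memory illustration, not how any specific product distributes data; the function names are hypothetical and plain Python lists stand in for nodes.

```python
import zlib


def node_for_key(key, node_count):
    """Map a key to a node index using a stable hash (CRC32)."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def partition_rows(rows, key_field, node_count):
    """Distribute rows across nodes; each row lands on exactly one node."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[node_for_key(row[key_field], node_count)].append(row)
    return nodes


def distributed_sum(nodes, value_field):
    """Each node sums its own rows locally; only partial sums are combined."""
    partials = [sum(row[value_field] for row in node) for node in nodes]
    return sum(partials)
```

Because each node touches only its own rows, adding nodes spreads the same data over more machines. Whether the resulting speed-up is actually linear depends on how evenly the keys hash and on the network, which is why the cluster's network requirements are higher.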
The main distinction in the software physical architecture is between row storage and column storage. This is also a topic many vendors talk about; depending on the requirements, either approach can be adopted flexibly.
Most DB software uses row storage. Column storage is characterized by efficient compression of the values within a single column, and when a query selects only a few columns the IO required is small and the query runs very fast. However, the compression efficiency of row-store databases is also improving rapidly, and most queries still select whole rows of data to look at. Row storage also makes it easier to parallelize work by splitting a table by record.
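The difference between the two layouts, and why a single column compresses well, can be shown with a small sketch. Run-length encoding is used here only as one simple example of column compression; it is an assumption for illustration, not a claim about what any particular database uses, and the function names are hypothetical.

```python
def to_columns(rows):
    """Pivot row storage (a list of dicts) into column storage (a dict of lists)."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


def rle_encode(values):
    """Run-length encode a column: consecutive repeats become (value, count)."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original column."""
    values = []
    for value, count in runs:
        values.extend([value] * count)
    return values
```

A query that needs only one column reads just that column's list in the columnar layout, whereas the row layout must read every row in full. In the other direction, fetching one complete record is a single lookup in the row layout but requires touching every column in the columnar one.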
Design of Data Warehouse architecture