The project wrapped up years ago, and looking back there are plenty of small issues worth revisiting. My notes are a bit messy, so I will just talk through them as they come to mind.
First, every industry's requirements are different and hard to unify. In general, the key areas are the following.
1. Time Window
A common classification divides ODS into four categories:
Class I ODS: data latency relative to the application systems of 1-2 seconds; real-time or near-real-time
Class II ODS: 2-4 hours of data latency relative to the application systems
Class III ODS: 12-24 hours of data latency relative to the application systems
Class IV ODS: part of the decision-analysis data in the data warehouse is fed back into the ODS
The stricter the real-time requirement, the more CPU it takes and the more the software costs, and the technology selection differs accordingly.
If it is determined that data must be synchronized in real time, you are building a Class I ODS, which usually calls for EAI, message queues, and similar messaging mechanisms. If near-real-time is good enough, you can use some advanced database features, such as extracting changes from Oracle's redo log; quite a few vendors support this now. The bottom line is database triggers, where the workload is very large: mostly boring, repetitive code with low reusability.
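To make the trigger bottom line concrete, here is a minimal sketch using SQLite in place of a production database; the accounts table, changelog table, and all column names are invented for the example. Each change is copied into a changelog that a downstream sync job drains within seconds, and you can see why writing one such trigger per table per operation gets tedious fast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the source application database

# Hypothetical application table plus a changelog the ODS sync job polls.
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE changelog (
    seq     INTEGER PRIMARY KEY AUTOINCREMENT,
    op      TEXT,
    row_id  INTEGER,
    balance REAL
);

-- One trigger per table per operation: this is the boring, repetitive code.
CREATE TRIGGER accounts_ins AFTER INSERT ON accounts
BEGIN
    INSERT INTO changelog (op, row_id, balance)
    VALUES ('I', NEW.id, NEW.balance);
END;
CREATE TRIGGER accounts_upd AFTER UPDATE ON accounts
BEGIN
    INSERT INTO changelog (op, row_id, balance)
    VALUES ('U', NEW.id, NEW.balance);
END;
""")

conn.execute("INSERT INTO accounts VALUES (1, 100.0)")
conn.execute("UPDATE accounts SET balance = 250.0 WHERE id = 1")

# The ODS side drains the changelog and applies each change promptly.
for seq, op, row_id, balance in conn.execute("SELECT * FROM changelog"):
    print(seq, op, row_id, balance)  # replace with an apply-to-ODS call
```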
Class II ODS used to be more common; bank transfers, for example, would post to the receiving account a few hours after the transaction. Building one is now quite rare; my guess is that most have been replaced by higher-performance Class III ODS builds.
Class III ODS is very common: the oft-mentioned ETL, that is, batch data processing, is exactly what this kind of system requires. There are many vendors here too, but they should be measured on ease of use, performance, and how well they integrate with your existing databases.
This is the framework we use; the software comes from Oracle and IBM, again basically the big vendors.
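For flavor, here is a toy sketch of one daily batch pass written in plain Python rather than any vendor tool; the extract file layout, table name, and columns are assumptions made up for the example.

```python
import csv
import sqlite3

def run_daily_batch(extract_path: str, conn: sqlite3.Connection) -> int:
    """Load one day's extract file into the ODS: extract, transform, load."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS ods_orders (
            order_id INTEGER PRIMARY KEY,
            amount   REAL,
            status   TEXT
        )
    """)
    loaded = 0
    with open(extract_path, newline="") as f:
        for row in csv.DictReader(f):
            # Transform: normalize types and values from the raw extract.
            conn.execute(
                "INSERT OR REPLACE INTO ods_orders VALUES (?, ?, ?)",
                (int(row["order_id"]), float(row["amount"]), row["status"].upper()),
            )
            loaded += 1
    conn.commit()
    return loaded
```

A real tool adds scheduling, restartability, and monitoring on top of this core loop, which is where the ease-of-use and performance differences between vendors show up.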
Class IV ODS generally holds data aggregated from the ODS data. Friends who do data analysis deal with this kind of system, using tools such as SAS, SPSS, or R.
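As a rough illustration of what flows back in a Class IV ODS, here is a minimal pandas sketch that rolls detail facts up into a decision-level summary; the branch and amount columns are invented for the example.

```python
import pandas as pd

# Hypothetical detail-level ODS fact data (one row per transaction).
detail = pd.DataFrame({
    "branch": ["north", "north", "south", "south"],
    "amount": [120.0, 80.0, 200.0, 50.0],
})

# Decision-analysis aggregate that a Class IV ODS would feed back
# alongside the detail data for operational reporting.
summary = (
    detail.groupby("branch", as_index=False)
          .agg(total_amount=("amount", "sum"), txn_count=("amount", "count"))
)
print(summary)
```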
2. Data Volume Level
Any data becomes difficult once the volume climbs. We have tested data throughput at the gigabyte level, and a traditional database can just barely cope; beyond that level, whether in ETL or data analysis, you will struggle.
At that point a big data architecture is needed, though not a wholesale switch: a combination of big data plus traditional databases, which we are currently testing. Much of the architecture has to change, and more fatally the ETL becomes more complex; many traditional ETL tools have not kept up. The division of labor is sketched below.
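One way to read "big data plus traditional database" is: keep the raw detail on the big data side, aggregate there, and land only the compact summaries in the relational ODS. Here is a minimal PySpark sketch of that split; the HDFS path, JDBC URL, table names, and credentials are all placeholders, and this is one reading of the combination rather than a definitive design.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hybrid-ods").getOrCreate()

# Raw detail data stays on the big data side (path is hypothetical).
detail = spark.read.parquet("hdfs:///ods/raw/orders")

# Aggregate down to a size a traditional database handles comfortably.
daily = (detail.groupBy("order_date")
               .agg(F.sum("amount").alias("total_amount"),
                    F.count("*").alias("txn_count")))

# Only the compact summary lands in the relational ODS (connection hypothetical).
(daily.write
      .mode("overwrite")
      .jdbc("jdbc:oracle:thin:@//dbhost:1521/ODS", "daily_orders",
            properties={"user": "ods", "password": "secret"}))
```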
If the data volume goes above the PB level, all the earlier architecture has to be thrown out in favor of a pure big data architecture, which is not something an ordinary company can pull off. I will leave that aside for now.
3. Data Attribute Validation
This accounts for a large share of our work. In ODS modeling (similar to BI modeling):
Dimensional data and fact-table data (log data) are the key guarantees that we do not drift away from the business (see the validation sketch after this list).
Data sources (JMS, DATABASE, FILE, EAI) each involve different processing technologies.
Data processing (statistical vs. non-statistical) is the key factor affecting ETL performance.
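To show the kind of attribute validation meant here, below is a small sketch that checks incoming fact (log) rows against known dimension keys and required attributes; the field names and sample data are made up for illustration.

```python
def validate_fact_rows(fact_rows, dim_keys, required_fields):
    """Flag fact (log) rows whose attributes drift from the business model.

    fact_rows:       iterable of dicts, one per incoming fact record
    dim_keys:        set of valid dimension keys (e.g. known product ids)
    required_fields: attributes every record must carry
    """
    bad = []
    for row in fact_rows:
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            bad.append((row, f"missing fields: {missing}"))
        elif row["product_id"] not in dim_keys:
            bad.append((row, "unknown dimension key"))
    return bad

# Hypothetical usage: two facts, one referencing a dimension key we do not know.
dims = {"P1", "P2"}
facts = [{"product_id": "P1", "qty": 3}, {"product_id": "P9", "qty": 1}]
print(validate_fact_rows(facts, dims, ["product_id", "qty"]))
```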
Reposted from: http://www.cnblogs.com/jerryxing/archive/2013/02/20/2918130.html
Original title: OLAP -- ODS Project Summary -- Key Points in BI