1. What is ETL?
ETL is the abbreviation of Extraction-Transformation-Loading, that is, data extraction, transformation and loading. ETL takes data from distributed, heterogeneous sources such as relational tables and flat data files, moves it to a temporary staging layer for cleaning, transformation and integration, and finally loads it into the data warehouse or data mart, where it becomes the basis for online analytical processing and data mining. ETL is the most important part of a BI project: it usually consumes about one third of the project's time, and its design bears directly on the success or failure of the project. ETL is also a long-term process; only by continuously identifying and solving problems can it run more efficiently and supply the project with accurate data.
As the core and soul of BI/DW (Business Intelligence / Data Warehousing), ETL integrates data and improves its value according to unified rules. It is the process that moves data from the sources into the target data warehouse and an essential step in building one. If the warehouse model design is the blueprint of a building and the data are the bricks, then ETL is the construction work. The hardest parts of a project are requirements analysis and model design, while the design and implementation of ETL rules carry the largest workload, typically 60%-80% of the whole project, a figure widely reported in practice at home and abroad.
ETL is the process of data extraction (Extract), cleaning, transformation (Transform) and loading (Load). It is an important part of building a data warehouse: the required data is extracted from the sources, cleaned, and finally loaded into the warehouse according to the predefined warehouse model.
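To make the extract / transform / load steps concrete, here is a minimal Python sketch. The file name, column names and the SQLite target are illustrative assumptions, not part of any particular tool:

```python
# Minimal ETL sketch: extract rows from a (hypothetical) CSV source, clean and
# transform them, and load them into a SQLite table standing in for the warehouse.
import csv
import sqlite3

def extract(path):
    """Extract: read raw records from the source file."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean values and map them onto the warehouse model."""
    cleaned = []
    for row in rows:
        amount = row.get("amount", "").strip()
        if not amount:                      # drop incomplete records (cleaning)
            continue
        cleaned.append({
            "order_id": row["order_id"].strip(),
            "order_date": row["order_date"][:10],   # normalize to YYYY-MM-DD
            "amount": round(float(amount), 2),
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: write the cleaned records into the target table."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS fact_orders
                    (order_id TEXT, order_date TEXT, amount REAL)""")
    conn.executemany(
        "INSERT INTO fact_orders VALUES (:order_id, :order_date, :amount)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))   # orders.csv is a hypothetical source
```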
Technically, ETL mainly involves correlation, transformation, incremental loading, scheduling and monitoring. Warehouse data does not have to stay in real-time sync with the online transaction processing systems, so ETL can run on a schedule; however, the run time, ordering and success of the individual ETL jobs are critical to the validity of the information in the data warehouse.
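Below is a small sketch of the incremental, scheduled side of ETL: a "watermark" timestamp records the last successful run, and only rows changed since then are extracted. The source_orders table and its columns are assumptions; in practice a scheduler (cron or an ETL tool) would invoke such a job periodically.

```python
# Timed, incremental extraction: only rows changed since the last run are pulled,
# so the warehouse need not stay in real-time sync with the OLTP system.
import sqlite3
from datetime import datetime, timezone

def get_watermark(conn, job):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etl_watermark (job TEXT PRIMARY KEY, last_run TEXT)")
    row = conn.execute(
        "SELECT last_run FROM etl_watermark WHERE job = ?", (job,)).fetchone()
    return row[0] if row else "1970-01-01T00:00:00"

def incremental_extract(conn, job="orders_job"):
    since = get_watermark(conn, job)
    rows = conn.execute(
        "SELECT order_id, order_date, amount FROM source_orders WHERE updated_at > ?",
        (since,),
    ).fetchall()
    # advance the watermark so the next scheduled run only sees newer changes
    conn.execute(
        "INSERT INTO etl_watermark(job, last_run) VALUES (?, ?) "
        "ON CONFLICT(job) DO UPDATE SET last_run = excluded.last_run",
        (job, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return rows
```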
ETL tools include: OWB (Oracle Warehouse Builder), ODI (Oracle Data Integrator), Informatica PowerCenter, Trinity, AiCloudETL, DataStage, Repository Explorer, BeeLoad, Kettle and DataSpider.
Typical representatives of ETL tools are Informatica, DataStage, OWB, Microsoft DTS, BeeLoad and Kettle.
2. What is a data warehouse?
(1) Definition:
A data warehouse is a structured data environment that serves as the data source for decision support systems (DSS) and online analytical applications. Data warehousing studies and solves the problem of obtaining information from databases. A data warehouse is characterized by being subject-oriented, integrated, stable (non-volatile) and time-variant.
The data warehouse concept was proposed in 1990 by Bill Inmon, the father of the data warehouse. Its main purpose is to take the large volumes of data accumulated by an organization's online transaction processing (OLTP) systems and, through the storage architecture peculiar to data warehouse theory, analyze and organize them systematically, so that analytical methods such as online analytical processing (OLAP) and data mining can be applied. This in turn supports the construction of decision support systems (DSS) and executive information systems (EIS), helping decision makers quickly and effectively extract valuable information from massive data, make decisions, respond rapidly to changes in the external environment, and build business intelligence (BI).
In 1991, Bill Inmon gave, in his book "Building the Data Warehouse", the definition that is now widely accepted: a data warehouse is a subject-oriented, integrated, relatively stable (non-volatile), time-variant collection of data used to support management decision making.
(2) Features of the Data Warehouse
1. The data warehouse is subject-oriented. Operational databases organize data around transaction-processing tasks, whereas data in the warehouse is organized by subject domain. A subject is a focal point that users care about when making decisions with the warehouse, and one subject is usually related to several operational information systems.
2. The data warehouse is integrated. Warehouse data comes from scattered operational sources; the required data is extracted from the original data, then processed and integrated before entering the warehouse.
The data in the warehouse is obtained by extracting and cleaning the originally scattered database data and then processing, summarizing and consolidating it; inconsistencies among the sources must be eliminated so that the warehouse holds consistent, enterprise-wide information (a small sketch after this list illustrates this kind of unification).
3. The data warehouse is not updatable in the usual sense. It exists mainly to provide data for decision analysis, so the operations involved are mostly queries; once a piece of data enters the warehouse it is generally retained for a long time. The warehouse therefore sees a large number of query operations but very few modifications or deletions, and usually only needs periodic loading and refreshing.
4. The data warehouse varies with time. Traditional relational database systems are better suited to formatted data and routine business processing; warehouse data, by contrast, usually contains historical information. The system records the enterprise's information from some point in time (for example, when the warehouse was first put into use) through its later stages, and from this information the enterprise's development and future trends can be analyzed and predicted quantitatively. The stable historical data itself is kept in a read-only fashion and does not change over time.
5. Summarized. Operational data is mapped into a format usable for decision making.
6. Large capacity. Time-series data collections are usually very large.
7. Non-normalized. Warehouse data can be, and often is, redundant.
8. Metadata. The warehouse stores data that describes the data.
9. Data sources. The data comes from both internal and external, non-integrated operational systems.
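As promised above, here is a small sketch of the kind of unification done during integration: different source systems encode the same attribute differently, and the warehouse keeps a single coding. The mappings and field names are made up for illustration.

```python
# Eliminating inconsistencies when integrating several sources: the same
# attribute (here, gender) is encoded differently in each system, and amounts
# may use different units; the warehouse stores one unified representation.
GENDER_MAP = {"m": "M", "male": "M", "1": "M",
              "f": "F", "female": "F", "0": "F"}

def unify_record(record, source):
    unified = dict(record)
    unified["gender"] = GENDER_MAP.get(
        str(record.get("gender", "")).strip().lower(), "U")
    if source == "legacy_pos":           # this source reports amounts in cents
        unified["amount"] = record["amount"] / 100.0
    unified["source_system"] = source    # keep lineage for later auditing
    return unified

print(unify_record({"gender": "male", "amount": 1299}, "legacy_pos"))
# -> {'gender': 'M', 'amount': 12.99, 'source_system': 'legacy_pos'}
```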
A data warehouse is not simply a "large database" built because a database already exists; it is built to mine the data further and support decision making. The warehouse is designed around front-end querying and analysis, so redundancy is high and storage requirements are large. To serve front-end applications well, data warehouses usually have the following characteristics:
1. Sufficiently high efficiency. Warehouse analyses are typically run by day, week, month, quarter or year, with daily cycles demanding the highest efficiency: customers expect to see the previous day's analysis within 24 or even 12 hours (a period-aggregation sketch follows this list). Because some enterprises produce very large daily data volumes, a poorly designed warehouse often runs into problems and delivers data 1-3 days late, which is clearly unacceptable.
2. Data quality. The warehouse supplies all kinds of information, which must be accurate; but the warehouse pipeline usually has several steps, including cleansing, loading, querying and presentation, and complex architectures add even more layers. Dirty data in the sources or careless code can then distort the data, and customers who see wrong information may make wrong decisions, causing losses rather than benefits.
3. Extensibility. Large warehouse systems are architected in a complex way partly so that they can expand over the next 3-5 years without having to be rebuilt soon, and can run stably. This shows mainly in sound data modeling: a warehouse design with more intermediate layers gives massive data flows enough buffering, so the system does not bog down as data volumes grow.
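The period-based summaries mentioned in point 1 can be sketched as follows, using pandas (an assumption; any aggregation engine would do) on made-up data.

```python
# Day / week / month summaries over a tiny, made-up fact table.
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2015-01-01", "2015-01-01", "2015-01-08", "2015-02-03"]),
    "amount": [120.0, 80.0, 200.0, 50.0],
}).set_index("order_date")

daily = orders.resample("D")["amount"].sum()    # daily totals
weekly = orders.resample("W")["amount"].sum()   # weekly totals
monthly = orders.resample("MS")["amount"].sum() # monthly totals (month-start buckets)
print(monthly)
```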
As can be seen from the above, data warehouse technology can reawaken the data an enterprise has accumulated over many years, not only helping the enterprise manage this massive data but also tapping its potential value, which is why it has become one of the highlights of telecom operation and maintenance systems.
Precisely because of this, broadly speaking, a decision support system based on a data warehouse consists of three parts: data warehouse technology, online analytical processing technology and data mining technology, of which data warehouse technology is the core of the system. Later articles in this series will, centered on data warehouse technology, introduce the main techniques and processing steps of modern data warehouses and discuss how they can be used in telecom operation and maintenance systems to support operations.
Subject-oriented
The data organization of an operational database is oriented toward transaction-processing tasks, whereas data in the warehouse is organized by the subject domains spanning the business systems. A subject is an abstract concept corresponding, at a higher level, to the applications of traditional databases; it is an abstraction that synthesizes, classifies and analyzes the data in the enterprise's information systems. Each subject corresponds to a macro analysis area. The warehouse excludes data that is useless for decision making and provides a concise view of a specific subject.
(3) Purpose of the Data Warehouse
In the environment of information technology and data intelligence, the data warehouse offers many cost-effective computing resources in hardware and software, Internet and intranet solutions, and databases; it can keep a great deal of data available for analysis and allows multiple data access technologies to be used.
Open systems technology makes analyzing large amounts of data more practical, and hardware solutions are more mature. The main technologies used in data warehouse applications are as follows:
Parallel
Computing hardware environments, operating system environments, database management systems, and all related database operations, query tools and techniques, and applications can benefit from the latest advances in parallelism.
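A toy illustration of that idea: partial sums are computed in several worker processes and then combined. This is only a sketch of the parallelism principle, not of any particular warehouse engine.

```python
# Parallel aggregation sketch: split the work across worker processes.
from multiprocessing import Pool

def partial_sum(chunk):
    return sum(chunk)

if __name__ == "__main__":
    amounts = list(range(1_000_000))
    chunks = [amounts[i::4] for i in range(4)]   # split work across 4 workers
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)   # same result as sum(amounts), computed in parallel
```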
Partition
Partitioning makes it easier to support large tables and indexes, while also improving data management and query performance.
Data compression
Data compression reduces the cost of disk systems typically needed to store large amounts of data in a data warehouse environment, and new data compression techniques have eliminated the negative impact of compressed data on query performance. [1]
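The partitioning and compression ideas above can be sketched together: the fact data is written out one partition per month with columnar compression. pandas and pyarrow are assumptions here; real warehouses apply the same ideas to partitioned, compressed database tables.

```python
# Write fact data partitioned by month with compressed columnar storage.
import pandas as pd

df = pd.DataFrame({
    "order_id": ["a1", "a2", "b1"],
    "order_month": ["2015-01", "2015-01", "2015-02"],
    "amount": [120.0, 80.0, 200.0],
})

# One directory per month: queries restricted to a month touch only that
# partition, and snappy compression shrinks the stored files.
df.to_parquet(
    "warehouse/fact_orders",
    partition_cols=["order_month"],
    compression="snappy",
    engine="pyarrow",
)
```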
3. What is a data mart (Data Mart)?
A data mart is a subset of data separated from the data warehouse for a specific application purpose or scope; it can also be called departmental data or subject-area data (Subject Area). When implementing a data warehouse, one can often start with the data mart of a single department and later combine several data marts into a complete warehouse. It is important that, when different data marts are implemented, field definitions with the same meaning be kept compatible, so that building the full warehouse later does not cause great trouble.
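A sketch of carving a departmental mart out of the warehouse: the mart is just a subject-area subset, and its field definitions are kept identical to the warehouse's so that marts built later remain compatible. The database and table names are assumptions (they reuse the fact_orders table from the earlier ETL sketch).

```python
# Build a departmental data mart as a subset of the warehouse table.
import sqlite3

def build_sales_mart(warehouse_db="warehouse.db", mart_db="sales_mart.db"):
    wh = sqlite3.connect(warehouse_db)
    mart = sqlite3.connect(mart_db)
    # Same column names and types as the warehouse table, so marts built
    # from the same warehouse stay compatible with one another.
    mart.execute("""CREATE TABLE IF NOT EXISTS fact_orders
                    (order_id TEXT, order_date TEXT, amount REAL)""")
    rows = wh.execute(
        "SELECT order_id, order_date, amount FROM fact_orders "
        "WHERE order_date >= '2015-01-01'"        # the mart's application scope
    ).fetchall()
    mart.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    mart.commit()
    wh.close()
    mart.close()
```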
In Gartner's well-known report on data mart products, the agile business intelligence products in the leading quadrant are QlikView, Tableau and Spotview, all in-memory data mart products, which have challenged the traditional business intelligence giants in the big data era. Domestic BI products started later; well-known agile BI products include Power BI, Yonghong Technology Z-Suite and SmartBI, among which Yonghong Technology's Z-Data Mart is an in-memory data mart product. There are also domestic system integrators of data mart products.
4. What is data mining (DM, Data Mining)?
Data mining is the process of extracting potentially useful information and knowledge that is hidden in large volumes of incomplete, noisy, fuzzy and random real-world data and that people do not know in advance.
What can data mining be used for?
Data mining can do the following seven different kinds of things (analysis methods); a small clustering sketch follows the list:
· Classification
· Estimation
· Prediction
· Affinity grouping or association rules
· Clustering
· Description and visualization
· Complex data type mining (text, Web, graphics and images, video, audio, etc.)
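As an example of one of these methods, here is a tiny clustering sketch using scikit-learn's KMeans on made-up customer data (the feature choice and segment interpretation are assumptions).

```python
# Cluster customers into two segments by spend and visit frequency.
import numpy as np
from sklearn.cluster import KMeans

# columns: [monthly spend, visits per month] - fabricated sample data
customers = np.array([
    [20, 1], [25, 2], [22, 1],        # low-spend, infrequent
    [200, 10], [220, 12], [210, 9],   # high-spend, frequent
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(model.labels_)   # e.g. [0 0 0 1 1 1]: two customer segments
```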