The big data age: integrating big data and data warehouses


Integration Strategy

Data integration refers to combining data from different systems so that business users can study industry and customer behavior. In the early days of data integration, data was limited to transactional systems and their applications. Business decision-making was guided by decision platforms, and these limited datasets provided the foundation for building them.

Data volumes and data types have grown dramatically over the past 30 years. Data warehousing technology has been built up from scratch, along with the infrastructure and technology needed to meet analysis and storage requirements. All of this has transformed the prospects for data integration.

Traditional data integration technology focuses on ETL, ELT, CDC, and EAI architectures and their related programming models. In a big data environment, however, these technologies must be adapted to requirements such as scale and complexity, including the data formats to be processed. Achieving big data processing takes two steps. The first is implementing a data-driven architecture, which includes the analysis and design of data processing. The second is implementing the physical architecture, which we introduce in the following sections.

Data-driven integration

In building the next-generation data warehouse, all enterprise data is first sorted by data type, taking into account the nature of the data itself and its processing requirements. Data processing applies business rules built into the process logic and integrated into a series of programmed processes, using enterprise metadata, MDM, and semantic technologies (such as word segmentation).

Figure 1 shows the processing flow for each data type as it enters the system. The model first divides the data by format and structure, and then applies rule-based processing at various levels using ETL, ELT, CDC, or text processing techniques. Next, let's analyze the data integration architecture and its benefits.

Figure 1
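
To make the flow in Figure 1 concrete, here is a minimal Python sketch of rule-based routing by data format. The structure labels, dataset attributes, and dispatch rule are illustrative assumptions, not part of the original model.

```python
# Minimal sketch: route an incoming dataset to a processing technique
# based on its format/structure. Labels and attributes are
# illustrative assumptions, not a specific product's API.

def route(dataset: dict) -> str:
    """Pick a processing technique for a dataset by its structure."""
    if dataset["structure"] == "structured":
        # Relational/transactional data: batch ETL, or in-database ELT.
        return "ELT" if dataset.get("transform_in_db") else "ETL"
    if dataset["structure"] == "change_stream":
        # Incremental changes captured from source systems.
        return "CDC"
    # Unstructured or semi-structured content (text, logs, documents).
    return "text_processing"

datasets = [
    {"name": "orders", "structure": "structured", "transform_in_db": True},
    {"name": "oltp_changes", "structure": "change_stream"},
    {"name": "call_notes", "structure": "unstructured"},
]

for ds in datasets:
    print(ds["name"], "->", route(ds))
```

In a real pipeline each returned label would correspond to a distinct toolchain; the point is only that routing decisions are driven by the data's classification, not hand-wired per source.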

Data classification

As shown in Figure 1, the data can be roughly divided into the following categories (a minimal code sketch of this taxonomy follows the list):

Transactional data. Typical OLTP data.

Web application data. Data generated by web applications the organization develops, including clickstream data, web sales data, customer relationship data, and call center call data.

EDW data. Existing data from the organization's current data warehouse, which may include the various data warehouses and data marts that store and process data for business users.

Analytical data. Data derived from the analytical systems the organization currently deploys, today based mainly on EDW or transactional data.

Unstructured data. This large category includes:

Text: Documents, notes, notebooks, and contacts

Images: photos, charts, and graphs

Video: corporate and customer videos related to the organization

Social media: Facebook, Twitter, Instagram, LinkedIn, forums, YouTube and community sites

Audio: call center recordings and broadcasts

Sensor data: data from the various devices related to the business. Energy companies, for example, generate smart meter data, while logistics and distribution providers (UPS and FedEx) generate truck and vehicle sensor data.

Weather data: modern business-to-business and business-to-consumer companies use weather data to analyze its impact on their business; it has become an important element of predictive analytics.

Scientific data: used in medical, pharmaceutical, insurance, and financial services, fields that require complex data computing capabilities, including modeling and model generation.

Stock market data: many organizations use it to process financial data, forecast market trends, assess financial risk, and perform actuarial calculations.

Semi-structured data. This includes e-mail, presentations, mathematical models, graphics, and geographic data.
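
As a rough illustration, the taxonomy above can be expressed in code. The following Python sketch is a toy under stated assumptions: the category names mirror the list, but the format-to-category mapping is invented for demonstration.

```python
# Sketch of the data taxonomy above as an enum, with a naive
# format-based classifier. The mapping is an illustrative assumption.
from enum import Enum

class DataCategory(Enum):
    TRANSACTIONAL = "transactional"      # OLTP data
    WEB = "web"                          # clickstream, web sales, call center
    EDW = "edw"                          # existing warehouses and data marts
    ANALYTICAL = "analytical"            # output of analytic systems
    UNSTRUCTURED = "unstructured"        # text, images, video, audio, sensors
    SEMI_STRUCTURED = "semi_structured"  # e-mail, presentations, geo data

FORMAT_TO_CATEGORY = {
    "oltp_table": DataCategory.TRANSACTIONAL,
    "clickstream_log": DataCategory.WEB,
    "star_schema": DataCategory.EDW,
    "model_scores": DataCategory.ANALYTICAL,
    "free_text": DataCategory.UNSTRUCTURED,
    "email": DataCategory.SEMI_STRUCTURED,
}

def classify(data_format: str) -> DataCategory:
    # Default to UNSTRUCTURED for anything the rules don't recognize.
    return FORMAT_TO_CATEGORY.get(data_format, DataCategory.UNSTRUCTURED)

print(classify("clickstream_log"))  # DataCategory.WEB
```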

Schema

After identifying and organizing the different data types, you can clearly identify the characteristics of each dataset, including its data type, associated metadata, the key data elements that identify master data, its complexity, and the business users who own and manage it.
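
One way to record these characteristics per dataset is a simple profile structure. The sketch below is hypothetical; the field names are assumptions drawn directly from the characteristics listed above.

```python
# Sketch: one way to record the characteristics identified above for
# each dataset. Field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class DataProfile:
    name: str                                          # dataset identifier
    data_type: str                                     # e.g. "transactional"
    metadata: dict = field(default_factory=dict)       # associated metadata
    key_elements: list = field(default_factory=list)   # identify master data
    complexity: str = "low"                            # rough complexity rating
    owner: str = ""                                    # owning business team

profile = DataProfile(
    name="orders",
    data_type="transactional",
    metadata={"source": "OLTP", "refresh": "hourly"},
    key_elements=["customer_id", "order_id"],
    complexity="medium",
    owner="sales-ops",
)
print(profile)
```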

Workload

The biggest requirement for processing big data is workload management, as described in the previous section.

Figure 2

With the data architecture and classification in place, we can allocate infrastructure capable of meeting the workload requirements of each type of data.

We can roughly divide workloads into four categories based on data volume and data latency (Figure 2), and then assign data to physical infrastructure tiers by category. This approach lets the various parts of the data warehouse scale dynamically and take advantage of both current and future infrastructure. The key point to watch is keeping the processing logic flexible enough to run on different physical infrastructure components; because data is categorized by processing urgency, the same data may fall into different workloads.
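
A minimal sketch of this four-way split might look as follows. The volume and latency thresholds and the category labels are invented for illustration; the text does not specify exact values.

```python
# Sketch: classify a workload into one of four categories by data
# volume and latency. Thresholds and labels are illustrative
# assumptions, not values from the original text.

HIGH_VOLUME_GB = 1000   # assumed cutoff for "high volume"
LOW_LATENCY_SEC = 60    # assumed cutoff for "low latency"

def workload_category(volume_gb: float, latency_sec: float) -> str:
    high_volume = volume_gb >= HIGH_VOLUME_GB
    low_latency = latency_sec <= LOW_LATENCY_SEC
    if high_volume and low_latency:
        return "high-volume / low-latency"   # hardest to serve
    if high_volume:
        return "high-volume / high-latency"  # batch-friendly
    if low_latency:
        return "low-volume / low-latency"    # interactive queries
    return "low-volume / high-latency"       # background processing

print(workload_category(5000, 10))   # high-volume / low-latency
print(workload_category(50, 3600))   # low-volume / high-latency
```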

The workload schema further determines the conditions for mixed workload management, in which data from different workloads is processed together.

For example, an environment typically handles only one type of data and its workload. If high-volume, low-latency data and low-volume, high-latency data are processed together, the environment comes under several kinds of pressure at once. Concurrent or high-frequency user queries and data loads further increase the complexity of data processing, which can quickly spiral out of control and degrade overall performance. If one infrastructure must process both big data and traditional data on top of these complexities, the problem is even worse.
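
One common way to keep such mixed workloads from degrading each other is to isolate them, for example by dispatching each job to a queue for its workload class rather than a single shared queue. The sketch below assumes hypothetical job attributes and queue keys.

```python
# Sketch: keep mixed workloads from interfering by dispatching each
# job to a queue for its workload class instead of one shared queue.
# Job attributes and queue keys are illustrative assumptions.
from collections import defaultdict

queues = defaultdict(list)

def submit(job: dict) -> None:
    """Route a job to the queue for its workload class."""
    key = (job["volume"], job["latency"])  # e.g. ("high", "low")
    queues[key].append(job["name"])

submit({"name": "sensor_stream_load", "volume": "high", "latency": "low"})
submit({"name": "monthly_edw_rollup", "volume": "low", "latency": "high"})
submit({"name": "dashboard_query", "volume": "low", "latency": "low"})

for key, jobs in queues.items():
    print(key, jobs)  # each class can get its own capacity and tuning
```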

The goal of dividing workloads is to understand the complexity of data processing and to reduce the risk in infrastructure design for the next-generation data warehouse.

