Can It Compete with Oracle Exadata?

Source: Internet
Author: User


Data warehouse all-in-one

What is an all-in-one machine?

There is no universally accepted definition of an appliance (all-in-one machine), but it should have the following features. It is a device designed for a specific application field, centrally optimized for that purpose, and it provides a complete solution in that field at a low maintenance cost. For end users, an all-in-one machine should be quick and easy to install and should meet their needs through standard interfaces and very simple operations. The all-in-one machine is a black box: users tell it what they want to do, and it quickly returns the results or answers. The iPod is a good example of an all-in-one machine, one that simplified and thoroughly revolutionized the digital entertainment field.

Netezza: a real all-in-one machine in the data warehouse field

We are proud that Netezza's products are genuine all-in-one machines designed specifically for data warehouses. Many manufacturers in the data warehouse field have launched their own "all-in-one" products. Some provide only software, and users must integrate the software and hardware themselves. Others combine software and hardware but are not specifically designed and optimized for data warehouses; they require a long and complex manual optimization process, and their subsequent maintenance costs are also very high. Netezza is a real all-in-one machine because it solves these problems. It tightly combines software and hardware, seamlessly integrating the database management system (DBMS), servers, and storage devices, so excellent performance can be achieved without complex configuration and optimization. "Netezza" is a word in a certain Indian dialect that means "result" in English. The name perfectly reflects the character of the Netezza all-in-one machine. Do you need results? You only need to ask the question.

Simplicity

A big difference between Netezza and a traditional data warehouse is its simplicity, which shows in every aspect.

· Simplicity of installation and deployment: from the outside, the Netezza all-in-one machine is a big box. Plug the box in and configure its service IP address, and it is ready to serve requests. A traditional data warehouse often requires a great deal of effort in physical planning and design, including storage planning, network configuration, and software installation.

· Simplicity of management and maintenance: it may sound incredible, but it is true. Netezza requires almost none of the tasks performed by the DBA of a traditional data warehouse:

o No indexes

o No performance tuning

o No storage management: no dbspace/tablespace planning and configuration, no redo/physical log planning and configuration, no table page/block/extent planning and configuration, no temporary tablespace allocation and monitoring, no RAID level selection, no logical volume planning and creation

o No need to configure operating system kernel parameters or maintain the recommended operating system patch level

o A simple data partitioning policy: hash or random

The benefits of simplicity are enormous. It saves expensive DBA management and maintenance costs, and the resources saved can be invested in work that creates more commercial value than tedious DBA tasks. Listing 1 and Listing 2 give a very simple example: creating a database. The Netezza statement is clearly much simpler. Of course, the statement for another data warehouse could be written as simply as the Netezza one, but a database created that way would perform much worse than the optimized database created in Listing 1. The advantage of Netezza is that a simple statement (and therefore less management and maintenance) is enough to create a database that performs well. In some real data warehouse migration projects, thousands of lines of table creation statements (including partitions and indexes) from other data warehouses were converted to just a dozen or so lines on Netezza, with better performance as well. For reasons of length, examples of the table creation statements are not listed here.


Listing 1. Statement for creating a database in a traditional data warehouse

CREATE DATABASE TEST
LOGFILE 'E:\OraData\TEST\LOG1TEST.ORA' SIZE 2M,
        'E:\OraData\TEST\LOG2TEST.ORA' SIZE 2M,
        'E:\OraData\TEST\LOG3TEST.ORA' SIZE 2M,
        'E:\OraData\TEST\LOG4TEST.ORA' SIZE 2M,
        'E:\OraData\TEST\LOG5TEST.ORA' SIZE 2M
EXTENT MANAGEMENT LOCAL
MAXDATAFILES 100
DATAFILE 'E:\OraData\TEST\SYS1TEST.ORA' SIZE 50M
DEFAULT TEMPORARY TABLESPACE temp TEMPFILE 'E:\OraData\TEST\TEMP.ORA' SIZE 50M
UNDO TABLESPACE undo DATAFILE 'E:\OraData\TEST\UNDO.ORA' SIZE 50M
NOARCHIVELOG
CHARACTER SET WE8ISO8859P1;


 
Listing 2. Statement for creating a database on Netezza

CREATE DATABASE TEST;


 

Netezza All-in-One Architecture

As described above, users can treat the Netezza all-in-one machine as a black box. How does this black box deliver high performance while remaining simple? To answer that, we need to open the box and look at the unique architecture of the Netezza all-in-one machine.


It has four key components: the SMP hosts, the S-Blades, the disk storage cabinets, and the network fabric.

Netezza 1000

Netezza 1000 is a representative model of the Netezza all-in-one machine. Before Netezza was acquired by IBM, the model name was Netezza TwinFin.

· The SMP hosts are two high-performance Linux servers, one active and one standby. All requests from BI applications are submitted through the active SMP host. The SMP host compiles each request into optimized executable code and distributes that code to the S-Blades for execution. Finally, it collects and merges the results returned by the S-Blades and returns them to the user.

· The S-Blades are the intelligent processing nodes, the place where the Netezza magic happens. Each S-Blade is an independent server that contains a standard blade server and a database accelerator card unique to Netezza. The blade server and the database accelerator card are joined using IBM's sidecar technology so that they form a single unit both logically and physically. Each S-Blade in the Netezza 1000 includes two 4-core CPUs, four 2-core FPGA engines, and 16 GB of memory.

· The disk storage cabinets contain high-density, high-performance disks. Each disk holds one data slice of a table, and the slices of a table on all disks together form the complete table. Each disk also holds a mirror copy of the data on another disk. The disk cabinets are connected to the S-Blades through high-speed channels (3 Gb/s SAS).

· The network fabric is not labeled in the product picture, because the main network cabling is at the back of the cabinet. All components of the Netezza 1000 all-in-one machine are connected through a high-speed network. There are two types of network: the IP network and the SAS storage network. The IP network carries data between the SMP hosts and the S-Blades and between different S-Blades. The protocols on the IP network have been deeply customized and optimized for the Netezza environment and support simultaneous transfer of large data volumes between thousands of nodes. The SAS network connects the S-Blades to the disk storage cabinets, giving the S-Blades high-speed access to data on disk.


Netezza AMPP Architecture

Netezza's AMPP (Asymmetric Massively Parallel Processing) architecture is a two-tier structure designed to process large data volumes for many users. It combines an SMP front end with a shared-nothing MPP back end. The front end is a high-performance SMP Linux host whose main function is to provide service to the outside world through standard interfaces (SQL, ODBC, JDBC, and OLE DB). The SMP host compiles the query requests sent by applications, generates optimized executable code fragments known as snippets, and distributes these snippets to all S-Blades for parallel execution. When all S-Blades have finished, the SMP host merges their results and returns the final result to the application. The back end consists of a large number of S-Blades, and the main data processing takes place there. The S-Blades are independent of one another: each S-Blade owns its own disks and data slices, so they do not affect each other during parallel processing. The advantage of this structure is that performance can be improved linearly by adding S-Blades and the disks they use. The number of S-Blades in a Netezza 1000 can be expanded to 120. In the face of massive data volumes, this divide-and-conquer approach is immediately effective. The structure also provides great flexibility: changing the ratio of disks, S-Blades, and memory produces different Netezza models. For example, a machine with more disks and fewer S-Blades has lower query performance but greater data capacity, and can be used to store historical data.
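The two-tier flow described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions, not Netezza code: the names `SLICES`, `run_snippet`, and `host_query` are hypothetical, threads stand in for S-Blades, and the "snippet" is a hard-coded filter and partial sum.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the data slice owned by one S-Blade.
# Rows are (c1, c2, c3) tuples.
SLICES = [
    [(1, 999, 10), (2, 5, 20)],
    [(3, 999, 30), (4, 999, 40)],
    [(5, 7, 50)],
]

def run_snippet(data_slice):
    """Back end: one worker scans only its own slice and returns a partial sum."""
    return sum(c3 for _c1, c2, c3 in data_slice if c2 == 999)

def host_query(slices):
    """Front end: scatter the same snippet to every worker, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(run_snippet, slices))
    return sum(partials)

total = host_query(SLICES)
print(total)
```

Because no worker ever touches another worker's slice, adding a slice together with a worker adds capacity without adding contention, which is the intuition behind the linear scaling claim.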


FPGA and data stream processing

The flexible AMPP architecture is one important factor in the high performance of the Netezza all-in-one machine. The other decisive factor is the database accelerator card and the data stream processing concept that Netezza introduced. Both live in the S-Blades and greatly enhance the machine's data processing capability. Let us now take a closer look at the S-Blade, the place where the magic happens, to see what makes it unique. An S-Blade includes one blade server (eight CPU cores) and one database accelerator card (eight FPGA cores). Under normal circumstances, one S-Blade in a Netezza 1000 manages eight data slices. One CPU core, one FPGA core, and one data slice together form a logical processing unit called a Snippet Processor. Each Snippet Processor processes one data slice independently, so when a query runs, the eight logical processing units in an S-Blade process eight data slices concurrently.


The CPU and the FPGA have a clear division of labor. They handle different stages of a task and form a pipeline, which greatly improves performance. The rest of this section describes how the parts of a logical processing unit (1 CPU core + 1 FPGA core + 1 data slice) divide the work and cooperate. The SMP host compiles the query into executable code fragments and distributes them to the S-Blades for execution. Each fragment actually contains two parts: parameters used to configure the FPGA, and a program to be executed by the CPU. Once the FPGA has been configured, it starts executing according to those parameters, and a data stream is created. The following SQL statement illustrates how data is processed as a stream.

SQL: select c1, c2, sum(c3) from t1 where c2 = 999 group by c1, c2

Assume that table t1 contains 10 columns, c1 through c10.

1. The FPGA reads the data blocks of table t1 from the data slice into memory. Only the part of t1 stored on this slice is read here; the rest of the table is stored on other data slices.

2. All data on disk is compressed, which effectively reduces disk I/O. The FPGA decompresses the data.

3. The FPGA projects the data, keeping only the columns needed by the operation. This reduces the data size and makes data transfer and processing more efficient; the effect is especially significant for wide tables. The SQL statement uses only columns c1, c2, and c3, so only these three columns are kept and the remaining columns are discarded.

4. The FPGA filters the data, keeping only the rows the user should receive. Two layers of filtering are applied. The first is the query restriction: c2 = 999 in the SQL statement takes effect at this stage. The second removes data that has not been committed.

5. The CPU performs aggregation, joins, and summation on the filtered data and returns the result of the logical processing unit. sum(c3) in the SQL statement is executed at this stage.

6. The SMP host receives the results of all logical processing units, computes the final result, and returns it to the user.

The whole process resembles an assembly line in a factory: the result emerges as the data flows through the line. With hundreds of such assembly lines working in the factory at the same time, the result naturally arrives quickly.
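The numbered steps can be sketched as a pipeline of Python generators. This is a hedged illustration only: zlib and JSON stand in for the on-disk format, the function names `make_block`, `fpga_stage`, and `cpu_stage` are hypothetical, and grouping by c1 and c2 is an assumption made here so that the aggregate is well defined.

```python
import json
import zlib

COLUMNS = [f"c{i}" for i in range(1, 11)]  # table t1 has columns c1..c10

def make_block(rows):
    """Stand-in for a compressed on-disk block (zlib + JSON are assumptions)."""
    return zlib.compress(json.dumps(rows).encode())

def fpga_stage(blocks, keep, predicate):
    """Decompress, project, and filter rows as they stream off the disk."""
    for block in blocks:
        for row in json.loads(zlib.decompress(block)):        # step 2: decompress
            record = dict(zip(COLUMNS, row))
            projected = {name: record[name] for name in keep}  # step 3: project
            if predicate(projected):                            # step 4: filter
                yield projected

def cpu_stage(rows):
    """Step 5: group by (c1, c2) and sum c3 over the rows that survived."""
    totals = {}
    for r in rows:
        key = (r["c1"], r["c2"])
        totals[key] = totals.get(key, 0) + r["c3"]
    return totals

rows = [
    [1, 999, 10, 0, 0, 0, 0, 0, 0, 0],
    [1, 999, 5, 0, 0, 0, 0, 0, 0, 0],
    [2, 123, 7, 0, 0, 0, 0, 0, 0, 0],
    [3, 999, 2, 0, 0, 0, 0, 0, 0, 0],
]
blocks = [make_block(rows[:2]), make_block(rows[2:])]
result = cpu_stage(fpga_stage(blocks, ("c1", "c2", "c3"), lambda r: r["c2"] == 999))
print(result)
```

The point of the generator form is that no stage waits for the previous one to finish the whole table; each row is narrowed and filtered before the aggregation stage ever sees it.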

Key Technical Features

Partitioned database and partition key

Netezza is a partitioned database: the data in a table is distributed across all data slices. The partition key determines which data slice a record is stored on. There are two ways to define it. One is to specify one or more columns as the partition key when the table is defined; the other is to distribute records randomly (round-robin). Listing 3 shows the basic syntax. If columns are specified as the partition key, Netezza applies a hash algorithm to their values to calculate the partition a record belongs to. If no partition key is specified, the first column is automatically used as the partition key.


Listing 3. Specifying the partition key

CREATE TABLE table_name (
    column_name1 data_type1,
    column_name2 data_type2,
    column_name3 data_type3
)
[DISTRIBUTE ON (column_name1, column_name2, ...)] |
[DISTRIBUTE ON RANDOM]


 

The choice of partition key has a crucial impact on performance, and it is one of the few things in Netezza that can be managed and tuned. Data should be distributed across all data slices as evenly as possible. The following are some basic principles:

· Select a column with a large number of unique values as the partition key. The more unique values a column has, the more evenly distributed the data. Do not use a column of the bool type as the partition key. This will cause all data to be distributed only on two data slices.

· When joining tables, choose the columns in the join condition as the partition key of both tables. The join then takes place locally within each data slice, and no data slice has to broadcast its data to the other partitions.
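The first principle can be checked with a small Python sketch. It is a simulation under assumptions, not Netezza's actual hash function: CRC32 is used as a stand-in hash, and `NUM_SLICES`, `slice_for`, and `rows_per_slice` are hypothetical names.

```python
import zlib
from collections import Counter

NUM_SLICES = 8  # hypothetical number of data slices

def slice_for(key):
    """Map a partition-key value to a data slice (CRC32 is a stand-in hash)."""
    return zlib.crc32(repr(key).encode()) % NUM_SLICES

def rows_per_slice(keys):
    """Count how many rows land on each slice for the given key values."""
    return Counter(slice_for(k) for k in keys)

unique_ids = rows_per_slice(range(10_000))                 # high-cardinality key
flags = rows_per_slice(i % 2 == 0 for i in range(10_000))  # boolean key

print("slices used by unique ids:", len(unique_ids))
print("slices used by boolean flag:", len(flags))
```

A boolean key can reach at most two slices however many slices exist, so most of the machine sits idle. The second principle follows from the same function: two tables hashed on the same join column send equal key values to the same slice, so matching rows are already co-located.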

Compression

Unlike in a traditional data warehouse, all user data in the Netezza machine is stored compressed; compression is not an option that must be chosen and configured. This also reflects the simplicity of Netezza. Performance bottlenecks in data warehouses often occur at the disks. Storing data compressed reduces the disk I/O pressure, and the FPGA engine is responsible for decompressing the data into readable content. Netezza compression is completely transparent to users, supports all data types, and requires no tuning or management. The compression algorithm splits records into separate column streams, compresses each column stream independently, and still maintains the row structure in storage. This patented algorithm is stated to deliver a compression ratio of 4 to 32 times, greatly reducing the disk I/O pressure.
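The idea of compressing each column stream independently can be illustrated with a short Python sketch. The actual Netezza algorithm is patented and not described in this article, so this is only an analogy: zlib is a stand-in compressor, and `compress_block` and `decompress_block` are hypothetical names.

```python
import zlib

def compress_block(rows):
    """Split rows into per-column streams and compress each stream independently."""
    columns = list(zip(*rows))  # row-major input -> one tuple per column
    return [zlib.compress("\n".join(map(str, col)).encode()) for col in columns]

def decompress_block(streams, types):
    """Rebuild the original rows from the compressed column streams."""
    columns = [
        [cast(v) for v in zlib.decompress(s).decode().split("\n")]
        for s, cast in zip(streams, types)
    ]
    return [tuple(r) for r in zip(*columns)]

# Hypothetical sample data: an id, a repeated date string, and a repeated status.
rows = [(i, "2011-01-01", "OPEN") for i in range(1000)]
streams = compress_block(rows)
restored = decompress_block(streams, (int, str, str))

raw_size = sum(len(str(v)) for r in rows for v in r)
packed_size = sum(len(s) for s in streams)
print(raw_size, packed_size, restored == rows)
```

Values within one column tend to resemble each other far more than values within one row, which is why compressing column by column usually pays off. The ratio printed here depends entirely on the sample data and says nothing about real Netezza ratios.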

Zone Maps

Zone Maps is a technology unique to Netezza. Before a data block is read from disk, it lets the system determine whether the block can contain data required by the query, and skip the block if it cannot. This greatly reduces disk I/O and greatly improves query performance. A traditional data warehouse has to read a data block first and only then decide whether its data is needed. What are Zone Maps? In Netezza, data is stored on disk in data blocks. The data block is the smallest unit of read/write operations, and each data block is 3 MB in size. Zone Maps store the maximum and minimum values of each column in each data block of each table. By default, Zone Maps statistics are generated for integer, date, and time columns. For example, suppose the TradeTable table uses three data blocks to store its data, and Zone Maps record the maximum and minimum values of the two columns Data and Cust_ID for each data block. When the statement "select * from TradeTable where data = 02" is executed, the database system checks the Zone Maps first. If the Zone Maps show that only the second data block can contain records that satisfy the condition, the first and third data blocks are not read at all.
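The TradeTable example can be sketched in Python as follows. This is a minimal model under assumptions: the row values are invented, only one column is zone-mapped, and `Block`, `build_block`, and `scan_equal` are hypothetical names rather than Netezza internals.

```python
from dataclasses import dataclass

@dataclass
class Block:
    rows: list   # rows stored in this block, as (data, cust_id) tuples
    lo: int = 0  # minimum value of the zone-mapped column in this block
    hi: int = 0  # maximum value of the zone-mapped column in this block

def build_block(rows):
    """Store the rows and record the min/max of the first column."""
    values = [r[0] for r in rows]
    return Block(rows=rows, lo=min(values), hi=max(values))

def scan_equal(blocks, wanted):
    """Return matching rows and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for b in blocks:
        if wanted < b.lo or wanted > b.hi:
            continue  # the zone map proves the block cannot match: skip the read
        blocks_read += 1
        hits.extend(r for r in b.rows if r[0] == wanted)
    return hits, blocks_read

# Hypothetical contents of the three data blocks of TradeTable.
trade_table = [
    build_block([(1, 100), (1, 101)]),
    build_block([(2, 102), (2, 103)]),
    build_block([(3, 104), (3, 105)]),
]
hits, blocks_read = scan_equal(trade_table, 2)
print(hits, blocks_read)
```

Zone maps work best when the data is physically ordered on the filtered column, as in this sample. If every block spanned the full value range, no block could be skipped.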

 

Zone Maps are automatically maintained without user intervention. Zone Maps will be automatically updated under the following conditions:

· When you run the GENERATE STATISTICS command to collect statistics

· When loading data with nzload

· When inserting and updating data

· When running GROOM TABLE to reorganize data blocks

Workload management

Most of today's data warehouses serve many users running mixed workloads at the same time. Workload management is an indispensable function in such a system. Netezza provides a simple and flexible workload management solution that includes the following policies:

· Guaranteed resource allocation (GRA): defines resource groups and assigns a resource range to each group, which controls the share of system resources available to different users.

· Prioritized query execution (PQE): each query can be given a priority, from low to high. When system resources are insufficient, high-priority queries are executed first. PQE and GRA can be combined: queries within the same resource group can also be distinguished by assigning them different priorities.

· Short query bias (SQB): ensures that short queries are not held up by queries over large data volumes. Even when system resources are very tight, short queries return results quickly without waiting for the large queries to finish. This is achieved through a certain amount of resources reserved by the system. Netezza decides whether a query is short from the generated execution plan; by default, a query estimated to take less than 2 seconds is considered short.
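How PQE and SQB interact can be approximated with a small Python sketch. It is a deliberate simplification and an assumption throughout: the real system reserves resources for short queries rather than simply reordering a queue, and `schedule`, `PRIORITY_RANK`, and the query names are hypothetical.

```python
import heapq
from itertools import count

SHORT_QUERY_SECONDS = 2.0  # default short-query threshold stated in the text
PRIORITY_RANK = {"high": 0, "low": 1}  # smaller rank is dispatched first

def schedule(queries):
    """Yield query names in dispatch order.

    queries: iterable of (name, priority, estimated_seconds) tuples.
    """
    heap, arrival = [], count()
    for name, priority, estimate in queries:
        is_long = estimate >= SHORT_QUERY_SECONDS  # short queries sort first
        heapq.heappush(heap, (is_long, PRIORITY_RANK[priority], next(arrival), name))
    while heap:
        yield heapq.heappop(heap)[-1]

order = list(schedule([
    ("nightly_report", "low", 900.0),
    ("dashboard_lookup", "low", 0.5),
    ("fraud_scan", "high", 120.0),
]))
print(order)
```

The short lookup runs first even though it has low priority and arrived second; among the long queries, the high-priority one goes ahead of the low-priority one.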

High Availability

All components of the Netezza all-in-one machine have redundant backup, and there is no single point of failure. Its high availability is mainly reflected in the following three levels.

· SMP hosts: the all-in-one machine includes two SMP hosts that work as an active/standby pair. The Linux HA software running on the hosts ensures that if the active host goes down, all services are transferred to the standby host. The two hosts do not use shared disks to hold the key data they share; instead, they use DRBD to synchronize that data.

· S-Blades: each blade chassis contains 6 S-Blades. When one S-Blade fails, Netezza automatically reassigns the disks it managed to the other S-Blades in the same chassis. Users running read-only queries do not notice the switchover.

· Disks: each disk stores the data of one data slice and a mirror copy of another data slice. For example, Disk 1 holds the data of data slice 1 and the mirror of data slice 2, while Disk 2 holds the data of data slice 2 and the mirror of data slice 1. When Disk 1 fails, the S-Blade automatically accesses the mirrored data on Disk 2, and the system continues to operate normally. In addition, the disk storage cabinet contains some spare disks that do not belong to any data slice. When a disk fails, a spare disk is used to regenerate its data. For example, if Disk 1 fails, spare disk X copies the mirror of data slice 1 and the data of data slice 2 from Disk 2 in order to regenerate the failed Disk 1. This takes some time, and during the process the S-Blade continues to use Disk 2 to access the data of data slice 1. Once the process is complete, the S-Blade accesses the data of data slice 1 through the new disk. The entire process is transparent to users.
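The disk-level failover and rebuild sequence can be modeled with a few lines of Python. This is a toy model under assumptions, tracking only which disk serves which slice; `DiskArray` and its methods are hypothetical names, and the copy itself is reduced to a dictionary update.

```python
class DiskArray:
    """Each disk holds one primary data slice and the mirror of another slice."""

    def __init__(self, layout):
        # layout maps disk name -> (primary_slice, mirrored_slice)
        self.layout = dict(layout)
        self.failed = set()

    def fail(self, disk):
        """Mark a disk as failed so it is no longer used for reads."""
        self.failed.add(disk)

    def disk_for(self, data_slice):
        """Pick a healthy disk for the slice: primary copy first, then mirror."""
        for want_primary in (True, False):
            for disk, (primary, mirror) in self.layout.items():
                if disk in self.failed:
                    continue
                if (primary if want_primary else mirror) == data_slice:
                    return disk
        raise RuntimeError(f"no healthy copy of slice {data_slice}")

    def rebuild(self, failed_disk, spare):
        """Regenerate the failed disk's contents on a spare disk."""
        self.layout[spare] = self.layout.pop(failed_disk)
        self.failed.discard(failed_disk)

array = DiskArray({"disk1": (1, 2), "disk2": (2, 1)})
before = array.disk_for(1)   # served by the primary copy
array.fail("disk1")
during = array.disk_for(1)   # served by the mirror while the rebuild runs
array.rebuild("disk1", "spare_x")
after = array.disk_for(1)    # served by the regenerated copy on the spare
print(before, during, after)
```

Callers only ever ask `disk_for` which disk to read, so the switch to the mirror and later to the spare requires no change on their side, which mirrors the transparency described above.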

Conclusion

All-in-one machine technology is the future direction of data warehouse development, and major manufacturers have increased their investment in it. After reading this article, you should have an understanding of Netezza's all-in-one technology and of why it looks so simple yet delivers extraordinary performance. I hope you will have the opportunity to experience the surprises it can bring.

Author jaminwm
