Experts interpret Microsoft Parallel Data Warehouse

Source: Internet
Author: User
Tags hash management studio sql server management sql server management studio

Microsoft's Parallel Data Warehouse (Parallel data Warehouse, abbreviated PDW) was released last year with SQL Server 2008 R2, designed to compete with Oracle Exadata and Teradata. PDW a true sense of the ability to mix workloads, users can extend data from multiple physical servers with a familiar SQL Server database engine.

Parallel Data Warehouse is not a software system product, after purchase you cannot simply install it on the hardware. The most basic configuration of PDW is a two-rack, one of which is a management server, as a management node, a control node, a Lzone node, and a backup node. Another rack will hold the so-called compute node to actually store the data, which we call a data rack, which is a 10-chip server (specific analysis) and proprietary SAN storage devices. Extending the PDW is simple, adding more data racks.

Control nodes provide a range of functions, most importantly, they provide the ability to query parallel data warehouses. When you connect to the control node, instead of using the familiar SQL Server Management Studio (SSMS), you use a third-party tool--nexus Chameleon, which is contained within the PDW installation. Since SSMS cannot be directly connected to the PDW control node, users need to familiarize themselves with the tool before using PDW.

Running on the control node is a special version of the SQL Server database, which can send Transact-SQL queries to the PDW compute node for execution. In fact, data from a data warehouse is not stored on this instance, and instead it holds metadata for database settings and provides staging space as a staging database.

In addition to processing data query requests, the control node will also process your data loading process. Data loading can be done in several different ways, the simplest of which is to use the business Intelligence Development Studio (BIDS) tool and the latest SQL Server Integration Services (SSIS) Target connection objects, The latter is a product specifically directed at PDW; another way is to load data into so-called head docking areas (landing zone) and then use a new tool called Dwloader to bulk load data to PDW. Dwloader works very much like a bcp tool, and it can quickly load data into a target table.

Many people think that Microsoft's parallel data Warehouse is similar to SQL Server or Oracle's partitioned tables, although the concept of the two sounds like it, but in fact there is a big difference. A partitioned table is one in which you say that a particular area of data is divided into separate partitions, this is enough to query only one partition at a time, and the principle in PDW is diametrically opposed, and you are pushing the current data to every compute node in the system so that when you query the number According to the actual time is a number of compute nodes to respond to data requests.

Therefore, when using all compute nodes, you can increase the number of CPUs and increase the total number of I/O generated at a time. Such a design adds a lot of computational resources and can greatly reduce response time. A few days before it takes a few minutes to use PDW, a report that the traditional SQL Server instance cannot run, PDW can be done quickly.

But the parallel data warehouse is not omnipotent, it also has its own shortcomings. For example, running insert inserts in PDW is slower than traditional SQL Server, and each row needs to be instrumented to determine which compute nodes are stored because the data is inserted into the PDW. When creating a table inside PDW, you need to make sure that the data needs to be replicated to all storage nodes, and that a dimension table can be set this way.

The second table setting option will be used to publish data between compute nodes, depending on each column. When the data is loaded, the declared column is hashed, and the hash value is compared to the same internal table to determine which control node stores the data. Because the data is loaded into the system through the control node, the hash process must be passed. The row data is then transferred from the control node to the compute node for storage.

Another drawback of PDW is the aggregation function. According to the written query, all aggregated data needs to be transferred from the compute node to the control node and stored in the staging database, and then aggregated when the transfer completes. If the volume of data is very large, this process will be time-consuming, because the transfer of data requires network connectivity to complete. To some extent, users can avoid this kind of situation, so we need to make a special design of the table, and re-examine our query statements. If you plan to do an aggregation log, be sure to thoroughly test all queries.

The infrastructure of the parallel data warehouse is highly redundant, and almost all of the components within the PDW are backed up within a few. Two management nodes as the host controller of the parallel Data Warehouse, the system administrator can allow the administrators to configure the other servers appropriately (PDW has its own dynamic directory and can isolate the machine from any other Group Policy). The control node acts as an active/passive Windows cluster, and the compute node can also be configured as a separate cluster, so if the physical server goes down, PDW will not lose all CPU computing resources.

Microsoft's Parallel Data Warehouse is a very powerful system in which companies can maximize their value according to their own needs. But remember that PDW is not a substitute for an instance of SQL Server, which works differently. Before purchasing a parallel Data warehouse, an enterprise must understand the content mentioned in this article to avoid unnecessary trouble caused by improper operation.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.