Windows Azure Platform Family of articles Catalog
In the previous article, I introduced the basics of the MPP architecture.
In this article, I will introduce the architecture of Azure SQL Data Warehouse (SQL DW).
1. SQL DW consists of a head node and work nodes, also known as the Control node and the Compute nodes
SQL DW scales out across multiple work nodes to support petabyte-scale relational data.
The application sends T-SQL commands to the head node. The MPP engine on the head node optimizes the query for parallel processing and then sends sub-queries to the work nodes to run in parallel.
The work nodes store the data they process in Azure Storage and execute the queries in parallel.
The Data Movement Service (DMS) is an internal SQL DW service that moves data between nodes as needed, so that queries can run in parallel and return accurate results.
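To see how the MPP engine plans a query across the work nodes, you can prefix a query with the EXPLAIN statement, which returns the distributed plan (including any DMS data-movement steps) instead of running the query. A minimal sketch, assuming hypothetical dbo.FactSales and dbo.DimCustomer tables:

-- Return the distributed (MPP) query plan instead of executing the query.
-- dbo.FactSales and dbo.DimCustomer are hypothetical tables used for illustration.
EXPLAIN
SELECT c.Region, SUM(f.Amount) AS TotalAmount
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS c ON f.CustomerId = c.CustomerId
GROUP BY c.Region;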
2. SQL DW separates compute and storage
The user's data is stored in Azure Storage, not on the local disks of the work nodes.
SQL DW maintains only a logical mapping between the work nodes and the user's data; the data itself does not move.
That is, when the work nodes are scaled from 6 to 10, SQL DW simply re-establishes the logical mapping between the 10 work nodes and the data in Azure Storage.
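Because compute and storage are separate, changing the number of work nodes is just a matter of changing the warehouse's service objective (DWU); the data stays in Azure Storage. A minimal sketch, assuming a warehouse named MyDW:

-- Scale the warehouse by changing its service objective; SQL DW then re-maps
-- the existing data in Azure Storage to the new set of work nodes.
-- MyDW and DW1000 are example values only.
ALTER DATABASE MyDW
MODIFY (SERVICE_OBJECTIVE = 'DW1000');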
3. Azure Storage
Under the covers, SQL DW uses Azure Premium Storage, which is SSD-based storage.
Users can also choose the distribution mode for database tables on this storage. The table distribution modes supported by SQL DW are:
(1) Round Robin
(2) Hash
(3) Replicated
4. Head Node
The head node is the brain of SQL DW. It is the front end that handles application connections and interactions.
The MPP engine on the head node optimizes and coordinates parallel queries.
When a user submits a T-SQL query, the head node transforms it into parallel queries that run on all of the work nodes.
5. Work Node
The work nodes are where the actual computation happens. The number of work nodes ranges from 1 to 60.
Each work node has a node ID that is visible in system views. You can find the Compute node IDs by looking for the node_id column in the system views whose names begin with sys.pdw_nodes.
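For example, the nodes can be listed with the sys.dm_pdw_nodes dynamic management view; a minimal sketch, assuming that view is available in the warehouse:

-- List the nodes of the warehouse; rows with type = 'COMPUTE' are the work nodes.
SELECT pdw_node_id, type, name
FROM sys.dm_pdw_nodes
ORDER BY pdw_node_id;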
6. Data distributions
A distribution is the basic unit of storage and processing for the parallel queries that run on distributed data.
When SQL DW runs a query, the work is divided into 60 smaller queries that execute in parallel, and each smaller query runs on one data distribution.
By default, the user's data is divided into 60 parts (60 distributions).
When there is 1 work node, that 1 node processes all 60 distributions.
When there are 2 work nodes, the 2 nodes process the 60 distributions, 30 distributions per node.
When there are 3 work nodes, the 3 nodes process the 60 distributions, 20 distributions per node.
......
When there are 60 work nodes, the 60 nodes process the 60 distributions, 1 distribution per node, which is the highest degree of parallelism.
For example, suppose we have only 1 database, the database has only 1 table, and the table has 60 million rows of data.
With 1 work node, that node processes all 60 million rows, which is the lowest degree of parallelism.
With 2 work nodes, the 2 nodes process the 60 million rows, 30 million rows per node.
With 3 work nodes, the 3 nodes process the 60 million rows, 20 million rows per node.
......
With 60 work nodes, the 60 nodes process the 60 million rows, 1 million rows per node, which is the highest degree of parallelism.
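To see how a table's rows are actually spread across the 60 distributions, DBCC PDW_SHOWSPACEUSED returns one row per distribution. A minimal sketch, assuming a hypothetical dbo.FactSales table:

-- Show row count and space used per distribution for a single table.
-- Heavily skewed row counts suggest a poor choice of distribution column.
DBCC PDW_SHOWSPACEUSED("dbo.FactSales");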
As you can see, choosing the distribution mode for SQL DW tables is very important.
As described above, SQL DW supports three distribution modes: Round Robin, Hash, and Replicated.
7. Round Robin
Round-robin tables are the simplest distribution mode and are typically used for staging or temporary tables.
The data in a round-robin table is distributed evenly across the distributions, without any further optimization. When writing to a round-robin table, SQL DW assigns rows to distributions at random rather than by a distribution column.
Unlike a hash-distributed table, rows with equal values are not guaranteed to be assigned to the same distribution.
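A minimal sketch of creating a round-robin table (the table and column names are hypothetical):

-- Staging table distributed round-robin: rows are spread evenly across the
-- 60 distributions without using a distribution column.
CREATE TABLE dbo.StageSales
(
    SaleId     INT           NOT NULL,
    CustomerId INT           NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);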
8. Hash
Hash-distributed tables provide the highest performance for joins and aggregations on large tables.
When loading a hash-distributed table, SQL DW uses a hash function to assign each row to one distribution.
In the table definition, one column is designated as the distribution column; the hash function uses its value to decide which distribution each row is stored in, so rows with the same value always go to the same distribution.
(Figure: how a hash-distributed table assigns rows to distributions)
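A minimal sketch of creating a hash-distributed table (the table and column names are hypothetical); the column named in HASH() is the distribution column, so all rows with the same CustomerId land in the same distribution:

-- Large fact table distributed by hashing CustomerId; joins and aggregations
-- on CustomerId can then run on each work node without data movement.
CREATE TABLE dbo.FactSales
(
    SaleId     INT           NOT NULL,
    CustomerId INT           NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);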
9. Replicated
Replicated tables provide the fastest query performance for small tables.
A replicated table caches a full copy of the table on every work node. Therefore, joins and aggregations involving a replicated table do not need to move data between the work nodes.
Replicated tables are best suited to small tables, and work best when the table is smaller than 2 GB.
(Figure: replicated table cached on every work node)
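A minimal sketch of creating a replicated dimension table (the table and column names are hypothetical); a full copy of the table is cached on every work node:

-- Small dimension table replicated to every work node, so joins against it
-- require no data movement.
CREATE TABLE dbo.DimProduct
(
    ProductId   INT           NOT NULL,
    ProductName NVARCHAR(100) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);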