Take the CC-NUMA Express
By Xu Yaochang
Reprinted from: Network World
Editor's note: NUMA, non-uniform memory access architecture, sounds like a complex and unfamiliar name. It was born in the late 1980s as a research project at Stanford University, and only reached the market as a commercial server in the 1990s. Today, NUMA systems are capable of running some of the world's largest UNIX database applications and are widely accepted as a mainstream technology for e-commerce, offering processing capability, massive I/O scalability, high availability, and flexible workload and resource management, all without changing the SMP programming model. In this issue's technical feature, we interpret the characteristics of NUMA from several angles, including bandwidth, architecture, and comparisons with SMP and clustering, with a view to giving the reader a deeper understanding of the technology.

The Importance of Computer Bandwidth

Each signal occupies a certain frequency range, and we call that frequency range its bandwidth. For a signal to pass through a channel with as little distortion as possible, the bandwidth of the channel should be as wide as possible. Shannon's theorem tells us that the ultimate data rate of signal transmission is proportional to the bandwidth of the channel. It is well known that different channels have different bandwidths. For digital signal transmission, the number of bits transmitted per second is usually used as the unit; for example, coaxial cable offers data transmission rates of 20Mbps, while optical fiber offers data transmission rates of up to thousands of Mbps.
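For reference, Shannon's theorem can be written out explicitly (this is the standard form of the formula, quoted here for illustration rather than taken from the original article): C = B x log2(1 + S/N), where C is the maximum data rate in bits per second, B is the channel bandwidth in hertz, and S/N is the signal-to-noise ratio. For a fixed signal-to-noise ratio, the achievable rate grows in direct proportion to the bandwidth B, which is the proportionality referred to above.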
The bandwidth of a computer system refers to the number of operations it can perform in unit time, while channel or memory bandwidth refers to their data transfer rates. The data transfer rates between the various parts of a computer should be kept in balance.
The server is the center of network operation, management, and services; it is one of the key pieces of equipment in a network system and occupies a core position in it. A server should not only offer fast processing, large capacity, good fault tolerance, scalability, and so on; to ensure an adequate data transfer rate it should also have the necessary bandwidth.

Memory Bandwidth and Balance

According to statistics, the speed of microprocessor CPUs has increased by about 80% a year, while memory access speed has improved by only about 7% a year, so the ratio of CPU to memory performance has grown geometrically. The concept of machine balance has been defined in many studies. For a particular processor it is defined as the ratio of floating-point operations per CPU cycle to memory accesses per CPU cycle: balance = (floating-point operations per CPU cycle) / (memory accesses per CPU cycle). Because this definition does not take into account the true cost of memory access on most systems, the results it gives are not reasonable.
To overcome the deficiencies of the above definition, balance is redefined using the number of memory accesses that can be sustained over a long period on long, non-cache-resident vector operands, so that: balance = (peak floating-point operations per CPU cycle) / (sustained memory accesses per CPU cycle).
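As a concrete illustration (the numbers here are hypothetical, not measurements from the article): suppose a processor can issue a peak of 2 floating-point operations per cycle but, measured with long out-of-cache vector operands, can sustain only 0.25 memory accesses per cycle. Its balance is then 2 / 0.25 = 8, meaning that a program must perform about 8 floating-point operations for every memory access to keep the floating-point units busy; the lower the balance value, the better the memory system is matched to the processor's arithmetic capability.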
Based on the above definition, testing of machines with the current architectures gives the following results. Single processor: balance fair to good, performance low to medium.
Shared memory: balance and scalability poor, performance fair.
Vector machine: balance good, scalability medium, performance good.
Distributed memory: balance fair, scalability good, performance good.

Computer Architectures

Single-Processor Architecture
In a computer with a hierarchical storage system, the key factor determining the sustained memory bandwidth available to a CPU is the cache miss latency. The storage systems of today's cache-based machines have changed significantly, and the ratio of memory-access latency to transfer time has changed greatly with them: on a 20MHz machine in 1990 the latency and the transfer time were roughly equal, whereas on a 100MHz machine in 1995 the latency accounted for the majority of the access time.
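A rough calculation shows why the latency now dominates (the figures are hypothetical and chosen only to match the trend described above). Suppose each cache miss fetches a 32-byte line. On the 1990-class 20MHz machine, if latency and transfer each take 10 cycles of 50ns, a miss costs 1000ns and the sustained bandwidth is 32 bytes / 1000ns = 32MB/s, with half the time spent waiting. On the 1995-class 100MHz machine, if the latency is still about 500ns but the transfer takes only 100ns, a miss costs 600ns and the bandwidth is roughly 53MB/s; although the clock is five times faster, more than 80% of each access is now spent waiting rather than transferring.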
Shared-Memory Architectures
Vector machines (other than distributed shared-memory machines) belong to the shared-memory architectures. This greatly simplifies the cache coherence problem and the latency it introduces. However, vector machines are more expensive than superscalar machines with shared or hierarchical storage.
Cache-based and vector shared-memory computers have fixed memory bandwidth limits, which means that their machine balance values rise as processors are added, so there is a limit on the number of processors. Typically, shared-memory systems are non-blocking between processors, allowing multiple CPUs to be active concurrently, which can compensate for the large latency caused by the wait time. When multiple processors are used, cache-based machines are constrained by latency, bandwidth limits, and the bus/network/crossbar switch controller; in a vector computer the limit is mainly bandwidth rather than latency.
Symmetric Multiprocessing (SMP) Shared-Memory Systems
Symmetric multiprocessing (SMP) nodes contain two or more identical processors with no master/slave relationship; each processor has equal access to the node's computing resources. The interconnection between the processors and the memory in a node must use an interconnect solution that can maintain consistency. Consistency means that at any time the processors can hold or share only a single, unique value for each datum in memory.
An SMP shared-memory system connects multiple processors to a centralized memory. In an SMP environment, all processors access the same physical system memory over the bus, which means that an SMP system runs only one copy of the operating system, and applications written for a single-processor system can run unchanged on it. SMP systems are therefore sometimes described as having uniform memory access: for every processor, the time required to access any address in memory is the same.
The drawback of the SMP architecture is limited scalability: once the memory interface is saturated, adding processors no longer yields higher performance. The number of processors in an SMP system generally tops out at about 32.
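A simple example shows the saturation effect (the numbers are hypothetical): if the shared memory interface can deliver 1GB/s and each processor generates about 100MB/s of memory traffic when it is kept busy, then the bus saturates at roughly 10 processors; beyond that point each additional processor merely dilutes the bandwidth available to the others instead of adding throughput.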
The New Distributed-Memory CC-NUMA Architecture
Recently, some vendors have begun to introduce new systems that connect SMP nodes together, making them easier to scale than bus-based SMP. The nodes are interconnected through a fibre channel, and their access latencies can differ: memory "near" a CPU is typically faster to reach than "far" memory, which is what the "N" (non-uniform) in NUMA refers to. UMA denotes uniform memory access, in which every CPU takes essentially the same time to reach memory; NUMA means non-uniform memory access. NUMA keeps the SMP system's single operating-system copy, simple application programming model, and ease of management, while effectively expanding the scale of the system. As for CC, it stands for cache coherent: when the contents of a memory location are overwritten by one CPU, the system can quickly notify the other CPUs (via dedicated ASIC chips and the fibre channel). Practice shows that moderate non-uniformity works well; remote and local memory access times remain in proportion, so programmers can adopt a message-passing style similar to that used over a network. The memory distributed around each CPU is physically separate but logically unified, so large applications can run without parallel programming and parallel compilation.

How CC-NUMA Works

To understand how CC-NUMA works, start with the traditional symmetric multiprocessing (SMP) structure. SMP is a group of processors that communicate with one another and with a shared memory pool through a transmission mechanism called the interconnect bus.
CC-NUMA is similar to SMP in that it handles multiple connected processors, each of which can access a common memory pool. The structure divides the processors into several nodes; within each node the processors are interconnected, communicate with one another, and communicate with the node's local memory, which relieves the bus congestion found in SMP. For example, a 64-processor server can be divided into two large nodes, each with 32 processors and its own memory pool. A processor can also access the memory pools of all the other nodes, although the access time varies with the distance to the node. CC-NUMA is thus more scalable than SMP and, since it runs just one operating system, relatively easy to manage.
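To make the "near versus far" memory idea concrete, the following is a minimal sketch in C that uses the Linux libnuma interface (an assumption made purely for illustration; the servers discussed in this article predate that library). It places one buffer on node 0 and one on the highest-numbered node; whichever node the process happens to be running on will see one of the buffers as local and the other as remote, and timing accesses to the two would expose the latency difference described above.

    /* Minimal CC-NUMA illustration using the Linux libnuma API.
     * Build with: cc numa_demo.c -lnuma */
    #include <stdio.h>
    #include <string.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        size_t size = 64UL * 1024 * 1024;   /* 64MB test buffers */
        int last_node = numa_max_node();    /* highest node number */

        /* Place one buffer on node 0 and one on the highest-numbered node. */
        char *a = numa_alloc_onnode(size, 0);
        char *b = numa_alloc_onnode(size, last_node);
        if (!a || !b) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }

        /* Touching the buffers forces the pages onto their nodes; timing
         * these two loops (e.g. with clock_gettime) would reveal the
         * local/remote access-time ratio the article describes. */
        memset(a, 1, size);
        memset(b, 1, size);

        numa_free(a, size);
        numa_free(b, size);
        return 0;
    }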
Clusters are different. Clustering uses a loosely coupled approach in which the machines communicate with one another, so internal exchanges take longer and cost more.
Moreover, managing several machines as one system inevitably increases the difficulty of administration. A CC-NUMA computer is different: no matter how many processors it contains, to the user it is simply one computer.
In short, CC-NUMA overcomes some of the drawbacks of SMP and clustering, and it plays a role in areas where they cannot perform.
Typical CC-NUMA systems include the Convex Exemplar, the HP V2600 NUMA series, IBM's NUMA-Q, and the SGI Origin series. The Convex and IBM systems use a "ring" interconnect structure with 32-bit Intel chips, in which each SMP node plugs into the ring and message requests and replies circulate around the network.

Products

HP V2600 NUMA Server
The V2600 runs in the HP-UX 11 operating environment and can support 15,000 applications, including the major databases and ERP applications. The V-Series server architecture is based on HP's Scalable Computing Architecture (SCA), which is built on highly scalable nodes, cache coherence, and non-uniform memory access technology.
The V2600 operating environment is based on the 64-bit HP-UX UNIX operating system, which provides a complete 64-bit environment, including a 64-bit system kernel and address space, 64-bit file and file-system sizes, and 64-bit file data types.
The V-Class systems have the following characteristics. HP HyperPlane: a crossbar transmission system with 61.2GB/s of crossbar bandwidth. The crossbar switch provides smooth access from the CPUs and I/O channels to the memory system, and using HyperPlane avoids the performance degradation suffered by system buses that must handle both memory and I/O traffic. Each node contains up to 32 processors, 8 memory boards, and 7 I/O channels supporting 28 PCI I/O controller interfaces; a full system includes 2 to 128 64-bit PA-8600 processors, up to 128GB of SDRAM, up to 7.68GB/s of I/O channel throughput, and up to 112 industry-standard PCI I/O controller interfaces.
To expand further, HP interconnects up to 4 of its V-Series server systems through a memory-based interconnect to provide true CC-NUMA support. This combines the advantages of SMP and distributed-memory architectures, including the scalable programming model and features of SMP and a distributed memory subsystem. The SCA interconnect is a multilevel memory subsystem: the first level consists of traditional SMP memory, and the second level is created by attaching a dedicated interconnect to the first-level memory, providing multiple two-way rings for high bandwidth and fault resilience. SCA HyperLink is implemented as a series of one-way links (point-to-point connections between nodes), allowing accesses to cross the rings much like crossbar memory requests and allowing multiple requests to be outstanding at any given time, which reduces latency. Because of HP's strength in SMP, each node is large but the number of nodes is small, so the structure is simpler, transmission efficiency is improved, and the probability of failure is reduced. To speed up accesses, the nodes are fully interconnected through memory and a portion of local memory serves as a cache for remote memory; when a local CPU makes an access it looks in this local cache first, which greatly shortens the access time.
IBM NUMA-Q Architecture (formerly Sequent)
The IBM NUMA-Q system uses a CC-NUMA, or cache-coherent NUMA, scheme. In a CC-NUMA system the memory units sit adjacent to the processors of each multiprocessor SMP building block to maximize overall system speed. These building blocks are connected together through an interconnect to form a single system image.
Hardware cache coherence means that software is not required to keep multiple copies of data up to date or to move data between multiple operating systems and applications. Everything is managed at the hardware level, just as in any SMP node where a single instance of the operating system uses multiple processors.
The NUMA-Q system uses an Intel-based 4-processor, or "quad", building block, which also includes memory and 7 I/O slots. Currently, IBM NUMA-Q supports up to 16 building blocks, or 64 processors, connected by a hardware-based, cache-coherent, highly scalable interconnect; this forms a single NUMA system in much the same way that processor boards are added to the backplane of a traditional large-bus SMP system. The NUMA-Q architecture itself allows up to 64 building blocks, or 256 processors, in a single system.
NUMA-Q can provide UNIX systems with an integrated, MVS-style, multipath, switched Fibre Channel SAN (storage area network). This capability is a key enabling technology for e-business and customer relationship management (CRM) applications that require support for large, high-performance transaction and data-warehousing environments. A Fibre Channel SAN allows a large back-end UNIX database machine and hundreds of front-end UNIX or NT application servers to use a common switched fabric to share data-center-class disk storage and tape libraries efficiently.
NUMA-Q routes I/O directly to its attached storage devices through the switched Fibre Channel SAN fabric rather than through the interconnect that handles memory accesses. On NUMA-Q systems this eliminates the resource contention that limits the throughput of large SMP systems as processors are added. Also, because multipath I/O is supported at the operating-system level, NUMA-Q provides the only SAN with inherent fault tolerance.
In a 24x7 e-business environment, the ability to manage resources and to bring them online and offline without disrupting the operation of the system is an important advantage. NUMA-Q will implement this capability in its next generation of switch-based systems running UNIX and Windows NT. This advanced system design provides a combination of performance, scalability, availability, and manageability for online, business-critical UNIX and Windows NT systems that other architectures cannot match.
SGI DSM CC-NUMA
SGI's DSM (distributed shared memory) CC-NUMA system is quite different. It uses 64-bit RISC chips, crossbar switches, and CrayLink fibre-channel interconnects arranged in a "fat" hypercube structure, which delivers high bandwidth and low latency. Because SGI's DSM CC-NUMA uses a modular architecture, distributed memory, dedicated router chips, and distributed I/O ports, information enters and leaves through a dedicated chip interface connected to an intelligent crossbar switch, which can attach PCI, VME, SCSI, Ethernet, ATM, FDDI, and other I/O devices, so it can provide very high data transfer rates on the network. With the support of SGI's 64-bit operating system, IRIX, the bandwidth and memory performance of the DSM CC-NUMA system scale in proportion to the number of CPUs; at present SGI's CC-NUMA machines scale to 512 CPUs with bandwidth that grows linearly with the number of CPUs. They can fairly be called "bandwidth machines".
SGI's bandwidth machine, the Origin server, offers high-speed memory bandwidth of up to 26GB/s, high-speed I/O bandwidth of up to 102.4GB/s, maximum memory of up to 256GB, and online Fibre Channel disk capacity of up to 400TB. Whether it is serving a data warehouse, storage and retrieval, product data management, or Web services for thousands of customers, the Origin server is up to the job.

Concepts

Large-Scale Parallel Processing (MPP)
Massively parallel processing (MPP, or "shared nothing") nodes are traditionally composed of a single CPU, a small amount of memory, some I/O, an interconnect to the other nodes, and an instance of the operating system on each node. The interconnect between nodes (and the operating-system instances that reside on each node) does not require hardware coherence, because each node has its own operating system and its own unique physical memory address space. Consistency is therefore implemented in software, through "message passing".
MPP performance tuning involves partitioning the data to minimize the amount that must be transferred between nodes. Applications whose data has a natural partitioning, such as video-on-demand, can run well on large MPP systems.
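The following is a minimal sketch in C of this software message-passing style, using MPI purely as an illustrative library (an assumption; the article does not name a specific message-passing interface). Each rank stands in for an MPP node that owns its own partition of the data, and the only traffic between nodes is the small partial result.

    /* Minimal MPP-style message-passing sketch.
     * Build and run with: mpicc mpp_demo.c -o mpp_demo && mpirun -np 2 ./mpp_demo */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, partial = 0, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each node works only on its own partition of the data
         * (here, a trivial sum over a disjoint range of integers). */
        for (int i = rank * 100; i < (rank + 1) * 100; i++)
            partial += i;

        /* The only data exchanged between nodes is the small partial result;
         * minimizing such transfers is the point of partitioning the data. */
        MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %d\n", total);

        MPI_Finalize();
        return 0;
    }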
COMA
COMA stands for "cache-only memory architecture". It is a competitor to the CC-NUMA architecture with the same goals but a different implementation. Like CC-NUMA, COMA distributes memory components among the nodes and keeps the whole system coherent through a leading-edge interconnect; a COMA node, however, has no ordinary memory, and each building block is configured only with a large cache. The interconnect must still maintain coherence, and one copy of the operating system runs across the building blocks, but specific data has no fixed "local" home. COMA hardware can compensate for inappropriate operating-system algorithms for memory allocation and process scheduling, but it requires modifications to the operating system's virtual memory subsystem and, in addition to the cache-coherence interconnect board, special custom memory boards.
Cluster
A cluster (or clustered system) consists of two or more nodes, each running its own copy of the operating system and its own copy of the application, with the nodes sharing a pool of other common resources. By contrast, the nodes of an MPP system do not share storage resources; this is the main difference between a clustered SMP system and a traditional MPP system. It is important to note that, in a cluster, a lock must be acquired before any part of the shared repository (database) is updated in order to maintain consistency within the database. It is because of this requirement that clusters are more difficult to manage and to scale than a single SMP node.
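As a loose illustration of this "lock before update" discipline, the sketch below uses a POSIX advisory record lock on a file that is assumed to live on shared storage (the path is hypothetical). Real clusters use a distributed lock manager rather than file locks, so this shows only the protocol, not an actual cluster implementation.

    /* Lock-then-update sketch using POSIX advisory locks (fcntl). */
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void)
    {
        int fd = open("/shared/db/record.dat", O_RDWR);  /* hypothetical path */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct flock lk = {0};
        lk.l_type = F_WRLCK;     /* exclusive lock for the update */
        lk.l_whence = SEEK_SET;
        lk.l_start = 0;
        lk.l_len = 0;            /* 0 means lock the whole file */

        /* Block until no other process holds the lock, then update. */
        if (fcntl(fd, F_SETLKW, &lk) == -1) {
            perror("fcntl");
            return 1;
        }

        /* ... update the shared record here ... */

        lk.l_type = F_UNLCK;     /* release so others may proceed */
        fcntl(fd, F_SETLK, &lk);
        close(fd);
        return 0;
    }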
Getting more performance out of a clustered system is harder than scaling within a node. The main obstacle is the cost of communication outside the single-node environment: transfers into and out of a node must endure the long delays of software-maintained consistency. Applications with large amounts of interprocess communication work better within an SMP node because its communication is fast. By reducing the interprocess communication that must cross nodes, applications can scale more efficiently on both clusters and MPP systems.
Reflective Memory Cluster
A reflective memory cluster (RMC) is a clustered system with a memory replication, or dump, mechanism between nodes and an interconnect for the lock-information traffic. The dumps are performed using software consistency techniques. A reflective memory cluster provides faster messaging for applications and lets each node obtain the same memory page without going through disk; in an RMC system, fetching data from another node's memory is a hundredfold faster than reading it back from disk. Obviously, as long as the nodes need to share data, applications can take advantage of the shared data to improve performance.
Reflective memory clusters are also faster than traditional network-based messaging because, once a connection is established, messages can be sent by the application node without involving the operating system.
NUMA
The NUMA category comprises several different architectures that, broadly speaking, can all be considered to have non-uniform memory access latency: RMC, MPP, CC-NUMA, and COMA, though the differences between them are considerable. RMC and MPP have multiple nodes, and their "NUMA" part is the software-maintained consistency between nodes. For CC-NUMA and COMA, consistency is maintained in hardware, and both the hardware consistency and the "NUMA" component lie within a single node.