Current high-performance computing (HPC) systems are designed to execute floating-point-intensive workloads efficiently [1]. HPC systems are used primarily for scientific simulations, which are characterized by high computational density, high locality, and regular, partitionable data structures. These application requirements push processor designs toward faster SIMD (single instruction, multiple data) units and deeper cache hierarchies that reduce access latency.
At the system level, memory and interconnect bandwidth are growing much more slowly than peak computing performance, but regularity and locality mitigate the impact of this problem. At the same time, emerging processor architectures push application developers toward implementations that exploit the features of those architectures.
However, applications from emerging fields such as bioinformatics, community detection, complex networks, semantic databases, knowledge discovery, natural language processing, pattern recognition, and social network analysis are irregular. They typically use pointer-based data structures such as unbalanced trees, unstructured grids, and graphs. Although most of these data structures expose abundant parallelism, they have poor spatial and temporal locality. Partitioning them effectively is a major challenge. In addition, they often change dynamically during execution, for example when a node is added to or removed from a graph.
Complex cache hierarchies are ineffective for irregular applications of this kind. System performance is determined mainly by off-chip bandwidth, both to local memory and across the network to data on other nodes. In this situation, a single thread of control often cannot provide enough concurrency to use all of the available bandwidth. Multithreaded architectures therefore usually try to tolerate memory access latency rather than reduce it: they switch among many threads, continually generate memory references, and maximize bandwidth utilization.
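The concurrency requirement described above can be made concrete with Little's law (requests in flight = request rate × latency). The following is a minimal sketch: the bandwidth, latency, and request size are illustrative assumptions, not measurements of any system discussed here, and the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division (avoids floating-point rounding)."""
    return -(-a // b)


def requests_in_flight(bandwidth_bytes_per_s, latency_ns, request_bytes):
    """Little's law: outstanding requests needed to saturate the bandwidth."""
    # requests per second * latency in seconds, kept in integer arithmetic
    return ceil_div(bandwidth_bytes_per_s * latency_ns,
                    request_bytes * 1_000_000_000)


def threads_needed(in_flight, outstanding_per_thread):
    """Threads required if each thread sustains a fixed number of requests."""
    return ceil_div(in_flight, outstanding_per_thread)


# Assumed values: 10 GB/s of bandwidth, 100 ns latency, 8-byte references.
in_flight = requests_in_flight(10_000_000_000, 100, 8)
print(in_flight)                     # 125 requests must be in flight
print(threads_needed(in_flight, 1))  # 125 threads with one reference each
print(threads_needed(in_flight, 8))  # 16 threads with eight references each
```

Under these assumed numbers, a single thread that waits for each reference to complete would use less than 1% of the bandwidth, which is why many concurrent threads are needed.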
The Cray XMT is a multinode supercomputer designed specifically for developing and executing irregular applications [2]. Its architecture rests on three pillars: a global address space, fine-grained synchronization, and multithreading.
The XMT is a distributed shared memory (DSM) system whose globally shared address space is distributed evenly across the memories of its nodes at a fine granularity. Each node integrates a custom Threadstorm processor that switches among many hardware threads on a cycle-by-cycle basis. This approach lets the system tolerate both the latency of accessing local node memory and the network latency of accessing remote node memory.
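To illustrate what a fine-grained, even distribution of a shared address space means, here is a minimal sketch that maps addresses to nodes with simple block-cyclic interleaving. The text does not specify the hardware mapping; the interleaving scheme, the 64-byte granularity, the node count, and the name `home_node` are all assumptions made purely for illustration.

```python
from collections import Counter

BLOCK_BYTES = 64  # assumed distribution granularity
NUM_NODES = 4     # assumed machine size


def home_node(address, num_nodes=NUM_NODES, block_bytes=BLOCK_BYTES):
    """Return the node that owns a global address (block-cyclic mapping)."""
    return (address // block_bytes) % num_nodes


# A contiguous array of 1024 eight-byte elements in the global address space.
element_bytes = 8
placement = Counter(home_node(i * element_bytes) for i in range(1024))

print(home_node(0), home_node(64), home_node(128))  # 0 1 2
print(sorted(placement.items()))  # [(0, 256), (1, 256), (2, 256), (3, 256)]
```

With such a mapping, even a contiguous array is spread over every node, so the programmer does not place data explicitly, but most references become remote and must be hidden by multithreading.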
Unlike current HPC systems, the XMT provides a system-wide programming model that simplifies the implementation of applications with large memory footprints without requiring locality optimization. Although modern HPC systems integrate multithreaded architectures such as graphics processing units (GPUs), these are better suited to regular applications. So far, they have not been designed to tolerate the latency of accessing memory on other nodes. In many cases, they cannot even tolerate the latency of accessing the memory of other processors in the same node. In addition, coordinating memory accesses and using bandwidth optimally requires data partitioning and significant programming effort.
The CASS-MT project at Pacific Northwest National Laboratory is currently studying large-scale multithreaded architectures for irregular applications. We present a taxonomy of multithreaded architectures and discuss how they relate to the Cray XMT. We then propose improvements that evolve these architectures in order to assess a possible future XMT design, one that integrates multiple cores in each node together with a next-generation network interconnect. Finally, we show how integrating a hardware mechanism for remote reference aggregation optimizes network utilization.
A multithreaded processor concurrently executes instructions from different threads of control within the same pipeline. There are two basic types: processors that issue instructions from only a single thread in a given cycle, and processors that issue instructions from multiple threads in the same cycle.
Many advanced out-of-order superscalar processors, such as the IBM POWER6 and POWER7 or the recent Intel Nehalem and Sandy Bridge architectures, support simultaneous multithreading (SMT). SMT keeps multiple threads active in each core: the processor identifies independent instructions and issues them simultaneously to the core's multiple execution units, keeping processor resources highly utilized.
A multithreaded processor that issues instructions from a single thread in each clock cycle is called a temporal multithreaded processor. It alternates between threads to keep the pipeline (usually in-order) full and avoid stalls. Temporal multithreading can be coarse-grained (block multithreading) or fine-grained (instruction/cycle interleaved).
Block multithreading switches from one thread to another only when an instruction causes a long-latency stall, for example a cache miss that requires an access to off-chip memory. Intel's Montecito uses block multithreading.
Interleaved multithreading switches threads on a cycle-by-cycle basis. The Threadstorm processors in the Cray XMT, like their predecessors in the Tera MTA and Cray MTA-2, are interleaved multithreaded processors. The SPARC cores in Sun's UltraSPARC T1 and T2 and in the SPARC T3 also use interleaved multithreading.
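The effect of cycle-by-cycle interleaving on pipeline utilization can be sketched with a toy simulation. This is a deliberately simplified model, not a description of Threadstorm or any SPARC core: it assumes that every instruction is a memory reference with a fixed latency, that a thread cannot issue again until its previous reference completes, and that at most one instruction issues per cycle. The function name and all parameter values are hypothetical.

```python
def interleaved_utilization(num_threads, memory_latency, cycles):
    """Fraction of cycles in which some ready thread issues an instruction."""
    ready_at = [0] * num_threads  # first cycle at which each thread may issue
    next_thread = 0               # round-robin starting point
    issued = 0
    for cycle in range(cycles):
        # Scan the threads in round-robin order and issue from the first
        # one that is ready; if none is ready, the pipeline idles.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                issued += 1
                ready_at[t] = cycle + memory_latency  # wait for memory
                next_thread = (t + 1) % num_threads
                break
    return issued / cycles


# Assumed memory latency of 100 cycles, simulated for 1000 cycles.
for threads in (1, 10, 100, 128):
    print(threads, interleaved_utilization(threads, 100, 1000))
# 1 0.01
# 10 0.1
# 100 1.0
# 128 1.0
```

In this model, utilization grows linearly with the number of threads until the thread count matches the memory latency in cycles, after which the pipeline stays full without any cache.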
The UltraSPARC T1 contains 8 cores, each with 4 threads. The T2 also has 8 cores, but doubles the number of execution units and the number of threads per core to 8, which allows a core to issue instructions from more than one thread in the same clock cycle. The SPARC T3 doubles the number of cores relative to the UltraSPARC T2. These cores issue instructions from a different thread in each cycle. When a long-latency event occurs, they remove the thread that caused it from the scheduling list until the event completes [3].
GPUs likewise schedule blocks of threads (warps or wavefronts) and switch among them efficiently on their SIMD execution units to tolerate long-latency memory operations [4]. GPUs have hundreds of floating-point units and very high memory bandwidth, and they are optimized for accessing frequently used data in on-chip memory. However, current GPUs are designed as accelerators with their own private memory and are better suited to regular workloads.
Typically, cache-based processors incur many cache misses because of the unpredictable memory accesses of irregular applications. In general, temporal multithreaded architectures are better suited to these applications: while the memory subsystem loads or stores data, they switch to other ready threads to tolerate the long memory latency, so they do not need caches to reduce access latency.
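The difference in cache behavior between regular and irregular access patterns can be illustrated with a small model. The sketch below simulates a fully associative LRU cache and compares a sequential sweep with uniformly random accesses over the same array. The cache geometry (64 lines of 8 words), the array size, and the name `hit_rate` are assumptions chosen for illustration; real caches and real irregular applications behave in more complex ways.

```python
import random
from collections import OrderedDict


def hit_rate(addresses, num_lines=64, words_per_line=8):
    """Hit rate of a fully associative LRU cache for a word-address trace."""
    cache = OrderedDict()  # cache line number -> True, in LRU order
    hits = 0
    for addr in addresses:
        line = addr // words_per_line
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used line
    return hits / len(addresses)


n = 1 << 16                                        # array of 65,536 words
regular = list(range(n))                           # sequential sweep
rng = random.Random(0)                             # fixed seed, repeatable
irregular = [rng.randrange(n) for _ in range(n)]   # pointer-chasing stand-in

print(hit_rate(regular))    # 0.875: one miss per 8-word line
print(hit_rate(irregular))  # close to zero: the working set far exceeds the cache
```

The sequential sweep misses only once per cache line, while the random trace almost never finds its data in the cache, so nearly every access pays the full memory latency.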