Some network-picked HPC materials
Source from: https://computing.llnl.gov
Factors that determine a large-scale program's performance:
* Application-related factors:
    * Algorithms
    * Dataset size
    * Memory usage pattern
    * Use of I/O
    * Communication patterns
    * Task granularity
    * Load balancing
    * Amdahl's Law

* Hardware factors:
    * Processor architecture
    * Memory hierarchy
    * I/O configuration
    * Network

* Software factors:
    * OS
    * Compiler
    * Preprocessor
    * Communication protocols
    * Libraries
Performance analysis:
Timers, profilers, system statistics, memory tools
Learn something about hardware architecture:
Intel® Xeon® 5500/5600
4-core/6-core
2.4/2.8 GHz
Cache
L1 Data 32 KB, private
L1 Instruction 32 KB, private
L2 256 KB, private
L3 8 MB/12 MB, shared
CPU-memory bandwidth: 32 GB/s
Intel Xeon E5-2670
8-core, 2.6 GHz
Cache
L1 Data 32 KB, private
L1 Instruction 32 KB, private
L2 256 KB, private
L3 20 MB, shared
CPU-memory bandwidth: 51.2 GB/s
AMD processors
2.2 GHz
Cache
L1 Data 64 KB (2-way)
L1 Instruction 64 KB (2-way)
L2 512 KB, private
L3 2 MB, shared
Direct Connect Architecture
CPU-memory bandwidth 10.7 GB/s per socket (Socket F)
Socket-to-socket bandwidth 8 GB/s (2-way)
4x InfiniBand interconnect
* SDR 1.25 GB/s
* DDR 2.5 GB/s
* QDR 5 GB/s
Learn something about NUMA
-Physical: each node has several (2-4) sockets, and each socket has several (4-8) CPU cores. Cores on the same socket share the L3 cache; socket-to-socket communication goes through the CPU-memory bus and is roughly 2x-5x slower.
-Design considerations: CPU affinity (numactl --cpunodebind), local memory policy, and other compiler/runtime options (e.g. mpirun --bind-to-socket --bynode).
Finally, and most importantly: a good algorithm.