Demand for fast parallel computing is widespread in today's computer applications. Broadly, the requirements fall into three types:

- Compute-intensive applications, such as large-scale scientific and engineering calculations and numerical simulations;
- Data-intensive applications, such as digital libraries, data warehouses, data mining, and computational visualization;
- Network-intensive applications, such as collaborative work, remote control, and telemedicine diagnostics.
There are three main types of parallel programming models: multithreaded programming models for shared memory, message-passing programming models for distributed memory, and hybrid programming models that combine the two.
In a computer system, the processor always tries the fastest level of storage first, in the order L1 cache -> L2 cache -> local node memory -> remote node memory/disk. At every level, storage capacity is inversely related to access speed: the faster the level, the smaller it is.
In parallel computing, the design of the parallel algorithm is the key factor in performance. Some problems parallelize naturally, for example when the data set to be processed can be cleanly decoupled, while others need complex formula derivation and transformation before they fit parallel computing. The design should also avoid possible bottlenecks in the computation. Task partitioning must take load balancing fully into account, especially dynamic load balancing. The idea of "peering" is one of the keys to maintaining both load balance and scalability: at design time, avoid master/slave and client/server patterns as much as possible.
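To make the decoupling and load-balancing points concrete, here is a minimal Python sketch. All names (`split_into_chunks`, `sum_of_squares`, `parallel_sum_of_squares`) are hypothetical and chosen only for illustration. It assumes the data set splits into chunks that do not depend on each other, and it uses a thread pool so that any idle worker picks up the next pending chunk, which is a simple form of dynamic load balancing. In CPython, threads will not speed up CPU-bound work like this; the sketch only shows the decomposition pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Decompose the data set into independent chunks (the decoupling step)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def sum_of_squares(chunk):
    """Work done on one chunk; it needs no data from any other chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, chunk_size=1000, workers=4):
    """Process chunks with a pool of workers and combine the partial results.

    Using more chunks than workers lets a worker that finishes early take
    the next pending chunk, so no worker sits idle while work remains.
    """
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10)), chunk_size=3, workers=2))
```

Smaller chunks balance the load better but add scheduling overhead, so the chunk size is a tuning parameter rather than a fixed rule.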
1. Parallel machine systems
Parallel machines have developed from SIMD to MIMD, and four classic architectural patterns have emerged along the way:
- SMP (symmetric shared-memory multiprocessor): for example, the commonly used multi-core machine. Scalability is poor, with typically 8~16 processors.
- DSM (distributed shared memory): physical memory is distributed across the processing nodes, but a single logical address space provides unified addressing, so it still counts as shared storage. Access time is limited by network bandwidth.
- MPP (Massively Parallel Processor): a large-scale system consisting of hundreds of processors, regarded as a symbol of a country's comprehensive strength.
- Cluster: an interconnected set of homogeneous or heterogeneous independent computers. Each node has its own memory, I/O, and operating system and can be used as a standalone machine. Nodes are connected by a commodity network, which makes the system flexible.
Hardware: multi-core CPUs (Intel, AMD), GPUs (NVIDIA), Cell BE (Sony, Toshiba, and IBM; includes one main processing unit and 8 co-processing units)
Concepts: data bus, address bus, control bus, (register) bit width
2. Parallel programming models and tools
–MPI–
MPI (Message Passing Interface) is a message-passing programming model that serves inter-process communication. It is not a particular implementation but a standard and specification. It describes a library rather than a language, and it is easy to use and highly portable. Put bluntly, it is a set of programming interfaces.
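The message-passing idea can be sketched without MPI itself. The Python example below is only an analogy and does not use the MPI API: threads stand in for processes, each simulated "rank" owns one private value, and the only way data moves is through explicit send (`put`) and blocking receive (`get`) on per-rank mailboxes. The function name `chain_sum` and the mailbox layout are assumptions made for illustration; a real MPI program would use calls such as `MPI_Send` and `MPI_Recv` between separate processes.

```python
import queue
import threading


def chain_sum(values):
    """Sum one private value per simulated rank by passing a running total.

    Rank r waits for a message, adds its own value, and forwards the new
    total to rank r + 1. The last rank delivers the final result.
    """
    n = len(values)
    if n == 0:
        return 0

    inboxes = [queue.Queue() for _ in range(n)]  # one mailbox per rank
    result = queue.Queue()

    def rank_main(rank):
        total = inboxes[rank].get()       # blocking receive
        total += values[rank]             # touch only this rank's own data
        if rank + 1 < n:
            inboxes[rank + 1].put(total)  # send to the next rank
        else:
            result.put(total)             # last rank reports the answer

    ranks = [threading.Thread(target=rank_main, args=(r,)) for r in range(n)]
    for t in ranks:
        t.start()
    inboxes[0].put(0)                     # initial message starts the chain
    for t in ranks:
        t.join()
    return result.get()


if __name__ == "__main__":
    print(chain_sum([1, 2, 3, 4]))
```

No rank reads another rank's value directly; every exchange is an explicit message, which is the property that lets the same program structure run on distributed memory.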
–OpenMP–
OpenMP (Open Multi-Processing) is a portable parallel programming model for shared-memory multiprocessor architectures. The interface was initiated by SGI.
It consists of three parts: compiler directives, a runtime library, and environment variables. It offers serial equivalence (a program gives the same result whether it runs with one thread or with many, which makes it easier to maintain and understand) and incremental parallelism (the programmer starts with a serial program and then looks for the fragments that are worth parallelizing).
OpenMP uses a fork-join execution model, that is, a master thread plus worker (slave) threads. This reduces the difficulty and complexity of parallel programming.
Compiler directives, which Visual Studio supports, allow an OpenMP program to be treated either as a parallel program or as a serial program. They also make it easy to rewrite a serial program as a parallel one while leaving the serial parts intact.
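The fork-join model and serial equivalence can be illustrated with a small Python analogy. This is not OpenMP: the names `scale_serial` and `scale_fork_join` are hypothetical, and plain threads replace the compiler directives. The main thread forks a team of workers, each worker handles a fixed share of the loop iterations, and the main thread joins them before continuing. The result should match the serial loop for any thread count.

```python
import threading


def scale_serial(values, factor):
    """Reference serial loop."""
    return [v * factor for v in values]


def scale_fork_join(values, factor, num_threads=4):
    """Same loop, with the iterations divided among a team of threads."""
    out = [None] * len(values)

    def worker(tid):
        # Static schedule: thread tid handles indices tid, tid + num_threads, ...
        for i in range(tid, len(values), num_threads):
            out[i] = values[i] * factor  # each index is written by one thread only

    team = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in team:
        t.start()  # fork: the team runs the loop body concurrently
    for t in team:
        t.join()   # join: the main thread waits for the whole team
    return out


if __name__ == "__main__":
    data = list(range(8))
    print(scale_fork_join(data, 3) == scale_serial(data, 3))
```

Because every iteration writes to a distinct index, no locking is needed, and that independence is what makes the loop safe to parallelize incrementally.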
–MapReduce–
From Google. Example uses: PageRank and the construction of inverted indexes.
Map turns the input into intermediate key/value pairs; reduce merges the values for each key and produces the final output.
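As a rough illustration of the map and reduce steps, the following Python sketch builds an inverted index in a single process. The function names (`map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`) and the sample documents are assumptions for this example, not part of any MapReduce library. A real MapReduce system would run many map and reduce tasks across a cluster and perform the grouping step itself.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one intermediate (word, doc_id) pair per word occurrence."""
    return [(word.lower(), doc_id) for word in text.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, doc_ids):
    """Reduce: merge all values for one key into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Run map, shuffle, and reduce in sequence over a dict of doc_id -> text."""
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_phase(doc_id, text))
    return dict(reduce_phase(word, ids) for word, ids in shuffle(intermediate).items())


if __name__ == "__main__":
    docs = {"d1": "parallel computing", "d2": "Parallel models"}
    print(build_inverted_index(docs))
```

Each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on its own key, which is why both phases can be spread over many machines.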
–Hadoop–
An open-source implementation of MapReduce. Key terms: HDFS, NameNode (JobTracker), DataNode (TaskTracker), cluster architecture.
–CUDA–
A GPU parallel computing tool developed by NVIDIA.
–Cell BE–
The main goal of Cell BE was to deliver 10 times the processor performance of the PlayStation 2. In 2006, IBM introduced the Cell blade computer system.
References: Fundamentals of parallel computer Programming & CUDA Course
 
Parallel Computing Fundamentals & programming models and tools