Those of us who work in high-performance computing are presumably already quite familiar with how CPUs execute code. Modern high-end CPUs typically use superscalar pipelines, which allow several mutually independent instructions to execute in parallel; this is called instruction-level parallelism (ILP). SIMD extensions such as SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) on x86, and NEON on ARM, belong instead to data-level parallelism (DLP). The way a GPGPU executes code differs from a CPU in many respects. To better understand and use OpenCL, this post discusses the execution model of the current mainstream GPGPUs used for supercomputing.
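To make the distinction between the two kinds of CPU parallelism concrete, here is a minimal C sketch. The function names `sum_ilp` and `add_dlp` are hypothetical and chosen only for illustration. Whether the hardware actually overlaps the additions, or the compiler actually emits SSE/AVX/NEON instructions, depends on the CPU, the compiler, and the optimization flags, so treat this as an illustration of the idea rather than a guaranteed outcome.

```c
#include <stddef.h>

/* ILP sketch: four independent accumulators. Because s0..s3 do not
 * depend on one another, a superscalar core is free to keep several
 * additions in flight at the same time. */
float sum_ilp(const float *x, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    /* Handle the remaining 0..3 elements. */
    for (; i < n; ++i) {
        s0 += x[i];
    }
    return (s0 + s1) + (s2 + s3);
}

/* DLP sketch: the same operation applied to every element. A
 * vectorizing compiler may map this loop onto SIMD instructions
 * (SSE/AVX on x86, NEON on ARM), processing several floats per
 * instruction. */
void add_dlp(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

In the first function the parallelism comes from independent instructions; in the second it comes from one instruction operating on many data elements. The GPGPU execution model discussed in this post is much closer in spirit to the second.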
The discussion below focuses mainly on NVIDIA's Fermi architecture and on AMD's TeraScale 3 (Radeon HD 6900 series) and GCN architectures.
NVIDIA's Next Generation CUDA™ Compute Architecture: Fermi
AMD Accelerated Parallel Processing OpenCL Programming Guide (access may require a VPN)
AMD Ultra-count Special card kingdoms 
NVIDIA GPGPU vs. AMD Radeon HD Graphics: execution model comparison