Due to limits of ability and time, there are surely errors in this text; you are welcome to point them out by email to [email protected], and I look forward to the discussion. Most of this is original work, and where material is copied the source is indicated (please point out any omissions), so please credit the source when reprinting: http://www.cnblogs.com/e-shannon/
http://www.cnblogs.com/e-shannon/p/7495618.html

4 Open Coherent Acceleration Interfaces
As mentioned in Section 1, to meet the demand for acceleration the industry has defined open standards for high-performance coherent CPU interfaces; in 2016 three such open standards appeared: OpenCAPI, Gen-Z, and CCIX.
The three standards have similar goals, with slight differences in emphasis, and their memberships overlap; some companies even belong to all three groups (Intel, of course, belongs to none of them). It is worth mentioning that these acceleration interfaces do more than accelerate the CPU: they also provide high-speed interfaces for the future, such as high-speed memory, networked storage, and high-speed networking, forming efficient computer clusters.
Their respective URLs: www.ccixconsortium.com
http://genzconsortium.org/
www.opencapi.org
Excerpted from "CCIX, Gen-Z, OpenCAPI: Overview & Comparison" (PDF)
CCIX's physical medium is based on PCIe 3.0; it provides full cache coherency between processors and accelerators, for low-latency memory expansion, CPU acceleration, and networked storage.
Gen-Z, by contrast, focuses on coherent interconnect between racks; it supports the PCIe physical layer as well as an adapted 802.3 electrical layer, and of course also claims to support memory, network devices, etc. "Gen-Z's primary focus is a routable, rack-level interconnect to give access to large pools of memory, storage, or accelerator resources."
OpenCAPI is supported by IBM's POWER9 through the BlueLink high-speed interface (roughly 25 Gb/sec per lane), with low latency and ultra-high bandwidth matching main memory bandwidth. "OpenCAPI will be concerned primarily with attaching various kinds of compute to each other and to network and storage devices that have a need for coherent access to memory across a hybrid compute complex. With OpenCAPI, I/O bandwidth can be proportional to main store bandwidth, and with very low nanosecond latencies, you can put storage devices out there, or big pools of GPU or FPGA accelerators, and let them have access to main store and communicate with it seamlessly. OpenCAPI is a ground-up design that enables extreme bandwidth that is on par with main memory bandwidth."
Gen-Z information is easy to obtain; CCIX is the hardest, since access requires membership.
YXR Note: I have not carefully studied the differences among the three, so I simply copied the material above, for reference only; my impression is that the cited author did not dig deeply either.
This article focuses on OpenCAPI, because POWER9 supports it.

4.1 OpenCAPI
IBM's original intent was to design an open acceleration interface independent of any CPU architecture, so OpenCAPI was split out of OpenPOWER so that other CPU vendors could also join (it is unclear whether Intel will join; after all, the two companies have cooperated before, on InfiniBand). The article https://www.nextplatform.com/2016/10/17/opening-server-bus-coherent-acceleration/ even calls OpenCAPI "CAPI 3.0", which leaves me speechless.
OpenCAPI's layering is similar to PCIe's, with three layers in total: the PHY layer, the DL (data link) layer, and the TL (transaction layer). Unlike PCIe, however, OpenCAPI is agnostic to (i.e., independent of) the processor architecture: the PHY layer is not defined by the specification but left to the implementer. IBM's POWER9 uses BlueLink for it, which can be shared with NVLink (see the POWER9 diagram). OpenCAPI defines only the DL and TL; the TL layer uses credits for flow control, and OpenCAPI addresses memory by virtual address. Compared with PCIe, its biggest advantage is optimized latency and a simplified design, and its power and area are also better than PCIe's.
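As a rough illustration of what credit-based flow control means at the transaction layer, here is a minimal C sketch. This is a generic pattern, not the actual OpenCAPI TL state machine; the structure fields, credit types, and function names are all invented for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical credit pool: the receiver advertises how many command and
       data slots it can buffer; the sender spends one credit per flit sent
       and regains credits only when the receiver returns them. */
    typedef struct {
        uint32_t cmd_credits;   /* credits for command flits */
        uint32_t data_credits;  /* credits for data-carrier flits */
    } credit_pool_t;

    /* Sender side: transmit only if a command credit is available. */
    bool tl_send_command(credit_pool_t *pool) {
        if (pool->cmd_credits == 0)
            return false;       /* stall: no buffer space at the receiver */
        pool->cmd_credits--;
        /* ... serialize the command onto the link here ... */
        return true;
    }

    /* Receiver side: after consuming buffered flits, return the credits. */
    void tl_on_credit_return(credit_pool_t *pool, uint32_t cmd, uint32_t data) {
        pool->cmd_credits  += cmd;
        pool->data_credits += data;
    }

The point of such a scheme is that the sender can never overrun the receiver's buffers, so nothing is dropped for lack of space and no retry round trip is needed for flow control.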
PCIe's architectural limitations are that its latency and bandwidth cannot keep up with memory bandwidth, and that it lacks coherency. POWER9's OpenCAPI therefore uses the BlueLink physical interface, 25 Gbps x 48 lanes, which can also run NVLink 2.0 on top, supporting NVIDIA GPU acceleration. This is why Google and server maker Rackspace adopted OpenCAPI ports on their "Zaius" [DREAM1] server, and why Xilinx launched supporting IP: "The PCI-Express stack is a limiter in terms of latency, bandwidth, and coherence. This is why Google and Rackspace are putting OpenCAPI ports on their co-developed POWER9 system, and why Xilinx will add them to their FPGAs, Mellanox to their 200 Gb/sec InfiniBand cards, and Micron to their flash and 3D XPoint storage."
YXR Note: Since OpenCAPI does not define the PHY layer, other CPU vendors (ARM, AMD, Intel) can also define their own PHYs and run NVLink 2.0 and OpenCAPI on top of them.
The following are OpenCAPI's comparative advantages; recall that PCIe's round trip latency is on the order of 100 ns (Gen-Z seems to target the same figure).
1. Server memory latency is a critical TCO factor
Differential solutions must provide ~equivalent effective latency to DDR standards
POWER8 DMI round trip latency: ~10 ns
Typical PCIe round trip latency: ~100s of ns
Why is DMI latency so low?
DMI was designed from the ground up for minimum latency, due to LD/STR requirements
OpenCAPI key concept:
Provide DMI-like latency, but with the enhanced command set of CAPI
2. Virtual address based cache
Eliminates kernel and device driver software overhead
Improves accelerator performance
Allows the device to operate directly on application memory without kernel-level data copies or pinned pages
Simplifies the programming effort to integrate accelerators into applications
The virtual-to-physical address translation occurs in the host CPU
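To make the virtual-address point concrete, here is a minimal C sketch of the difference it makes to software. The descriptor layout and field names are invented for illustration; the idea is only that the application hands the accelerator pointers straight out of its own address space, with translation happening on the host CPU side.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical work descriptor: with a virtual-address-based interface
       the application can hand the accelerator a pointer straight out of its
       own address space; translation happens in the host CPU, so no kernel
       call, no page pinning, no bounce buffer. (Field names are invented.) */
    typedef struct {
        uint64_t src_va;    /* source: application virtual address */
        uint64_t dst_va;    /* destination: application virtual address */
        uint64_t len;
    } afu_work_desc_t;

    int main(void) {
        static char src[4096] = "input data";
        static char dst[4096];

        afu_work_desc_t desc = {
            .src_va = (uint64_t)(uintptr_t)src,  /* no virt-to-phys step */
            .dst_va = (uint64_t)(uintptr_t)dst,
            .len    = sizeof(src),
        };

        printf("descriptor: src=0x%llx dst=0x%llx len=%llu\n",
               (unsigned long long)desc.src_va,
               (unsigned long long)desc.dst_va,
               (unsigned long long)desc.len);
        return 0;
    }

With a traditional PCIe device, the same operation would instead require a system call so the kernel could pin the pages and program physical addresses into a DMA engine.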
YXR Note: Some material mentions that a disadvantage of OpenCAPI is that full cache coherency will only be achieved in OpenCAPI 4.0; for now it is memory coherent, which I do not understand (see question 4 below).

4.1.1 DL
The OpenCAPI data link layer runs each lane at 25 Gbps serial, with a base configuration of 8 lanes at a line rate of 25.78125 GHz per lane (so a base link carries 8 x 25 Gbps = 200 Gbps of payload in each direction). On the host side this layer is called the DL; on the OpenCAPI device side it is called the DLX.
Link training starts from an out-of-band OCDE reset signal, and training is divided into three parts: PHY training, PHY initialization, and DL training.
Training accomplishes speed matching, clock alignment, link synchronization, and the exchange of lane information.
DL transfers are organized as flit packets; error handling at this layer is presumably completed via an ack/replay mechanism.
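The ack/replay mechanism inferred here is a standard link-layer pattern: the transmitter holds every unacknowledged flit in a replay buffer and retransmits from the failed sequence number when the receiver reports a CRC error. A minimal sketch, with invented buffer depth and function names (not the exact OpenCAPI DL rules):

    #include <stdint.h>
    #include <string.h>

    #define REPLAY_DEPTH 64       /* invented depth */
    #define FLIT_BYTES   64       /* one 64-byte flit */

    /* Transmitter state: unacked flits stay buffered until acknowledged. */
    typedef struct {
        uint8_t  flits[REPLAY_DEPTH][FLIT_BYTES];
        uint32_t next_seq;        /* sequence number of the next flit to send */
        uint32_t acked_seq;       /* everything below this has been acked */
    } replay_buf_t;

    void dl_send(replay_buf_t *rb, const uint8_t flit[FLIT_BYTES]) {
        memcpy(rb->flits[rb->next_seq % REPLAY_DEPTH], flit, FLIT_BYTES);
        rb->next_seq++;
        /* ... transmit the flit with its sequence number and CRC ... */
    }

    void dl_on_ack(replay_buf_t *rb, uint32_t seq) {
        rb->acked_seq = seq;      /* slots below seq may now be reused */
    }

    void dl_on_nack(replay_buf_t *rb, uint32_t bad_seq) {
        /* Replay every flit from the corrupted one onward. */
        for (uint32_t s = bad_seq; s < rb->next_seq; s++) {
            /* ... retransmit rb->flits[s % REPLAY_DEPTH] ... */
            (void)s;
        }
    }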
The DL uses 64b/66b encoding with LFSR scrambling (the specific polynomial is not given in my source).
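Since the source does not give the scrambler polynomial, the sketch below uses x^58 + x^39 + 1, the self-synchronizing scrambler commonly used by IEEE 802.3 64b/66b links; whether OpenCAPI's DL uses the same polynomial is purely an assumption here. The program scrambles one 64-bit payload block and descrambles it to verify the round trip.

    #include <stdint.h>
    #include <stdio.h>

    /* Self-synchronizing scrambler over the 64-bit payload of a 66-bit block,
       polynomial x^58 + x^39 + 1 (as in 802.3 64b/66b; assumed here, not
       confirmed for OpenCAPI). State holds the last 58 scrambled bits. */
    static uint64_t scramble64(uint64_t data, uint64_t *state) {
        uint64_t out = 0;
        for (int i = 0; i < 64; i++) {
            uint64_t in_bit = (data >> i) & 1;
            uint64_t fb = ((*state >> 38) ^ (*state >> 57)) & 1; /* taps 39, 58 */
            uint64_t s = in_bit ^ fb;
            out |= s << i;
            *state = (*state << 1) | s;   /* shift scrambled bit into state */
        }
        return out;
    }

    int main(void) {
        uint64_t tx_state = 0, rx_state = 0;
        uint64_t block = 0x0123456789ABCDEFULL;
        uint64_t scrambled = scramble64(block, &tx_state);

        /* Descrambler mirrors the taps but builds state from received bits,
           which is what makes the scheme self-synchronizing. */
        uint64_t out = 0;
        for (int i = 0; i < 64; i++) {
            uint64_t in_bit = (scrambled >> i) & 1;
            uint64_t fb = ((rx_state >> 38) ^ (rx_state >> 57)) & 1;
            out |= (in_bit ^ fb) << i;
            rx_state = (rx_state << 1) | in_bit;
        }
        printf("roundtrip ok: %d\n", out == block);
        return 0;
    }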
4.1.2 TL (to be continued)
This is the core part, and the more complex one.
4.2 Comparison of OpenCAPI and CAPI
As shown in the figure, OpenCAPI moves the PSL into the CPU side. The advantage is that OpenCAPI is no longer tied to the CPU architecture (OpenPOWER), which makes adoption by other CPU vendors easier, since the cache and coherency logic are encapsulated in the CPU. The physical layer uses BlueLink, which reduces PCIe's latency and increases bandwidth. CAPI's advantage, of course, is that PCIe is used by a large number of manufacturers; putting the PSL (including the cache) on the AFU side was also a way to overcome PCIe's limitations.
"While CAPI was governed by IBM and metered out across the OpenPOWER Consortium, OpenCAPI is completely open, governed by the OpenCAPI Consortium led by the companies listed above. The OpenCAPI Consortium says they plan to make the OpenCAPI specification fully available to the public at no charge before the end of the year. Mellanox Technologies, Micron, and Xilinx were CAPI supporters and OpenPOWER members, and are now part of OpenCAPI. NVIDIA and Google were part of OpenPOWER and are now OpenCAPI members."
4.3 Self Q&A
These questions are mainly my own doubts from studying this material, together with my own guessed answers, shared here.
1) Q: Since OpenCAPI is so excellent, is it still necessary to keep upgrading CAPI?
A by YXR: A guess: CAPI is IBM-led and tied to OpenPOWER, while OpenCAPI is independent of the CPU ISA. IBM can focus on the OpenPOWER architecture and move CAPI forward independently, without worrying too much about the rest. Of course, some people call OpenCAPI "CAPI 3.0", so CAPI might simply be replaced.
2) Q: Why did CAPI need to put the PSL on the accelerator side, even packaging it as IP? The PSL contains the cache and, together with the CAPP, is responsible for cache coherency. So the real difference from OpenCAPI is whether the cache and address translation are placed on the CPU side or the accelerator side. Why did CAPI not consider putting the PSL on the CPU side?
A by YXR: A guess. First, if the PSL were placed on the CPU side, the chip area would grow because of the cache; and viewing the accelerator as a peer of the CPU, the cache should sit right next to it for the efficiency gain to be significant (that is the whole point of a cache). Having the accelerator hit only its local cache also avoids the problem of PCIe's excessive round trip latency. Of course, once the access latency of the OpenCAPI physical link (BlueLink) is low enough, the cache can be placed on the CPU side without compromising performance.
This leads to the third question, about CCIX, which uses PCIe as its physical link: I do not see how it avoids the problem of excessive latency!
3) Q: CAPI and CCIX both use PCIe as the physical link, so latency will inevitably be large; how does CCIX overcome this? On which side is the CCIX cache? What are the differences between the two, and their relative merits?
4) Q: Does OpenCAPI 3.0 not implement cache coherency, but only memory coherency? What is memory coherency?
A by YXR: This surprised me; I had always thought that "coherent" meant cache coherent. The so-called memory coherency here may refer to how memory is kept consistent across cross-rack computer clusters (such as Hadoop).
Follow-up plans
1) Become familiar with the OpenCAPI protocol layers, especially the TL, focusing on how the PSL functionality is placed on the host side and how cache coherency is completed.
2) Focus on the differences between its protocol interface and CAPI's.
[DREAM1] Zaius is a dual-socket platform based on the IBM POWER9 scale-out CPU. It supports a host of new technologies including DDR4 memory, PCIe Gen4 and the OpenCAPI interface. It's designed with a highly efficient 48V-POL power system and will be compatible with the 48V Open Rack V2.0 standard. The Zaius BMC software is being developed using OpenBMC, the framework for which we've released on GitHub. Additionally, Zaius will support a PCIe Gen4 x16 OCP 2.0 mezzanine slot NIC.
Brief discussion on CAPI and its use (4)