Zhang Dong: Why is OpenPOWER CAPI so fast? (Part 2)
Zhang Dong, data center storage architect at PMC
How does the CAPI FPGA work?
First, meet the three roles in the system:
AFU (Accelerator Function Unit): the main acceleration logic on the FPGA accelerator chip. Users can write their own acceleration logic and firmware into it.
PSL (Power Service Layer): provides the AFU with an interface for reading and writing main memory and for virtual-to-physical (V2P) address translation (it uses the same page table as the CPU side and contains a TLB). It is also responsible for handling probes from the CAPP to implement global cache coherence (cc), and it provides a cache. IBM supplies the PSL to FPGA developers as hard-core IP.
CAPP (Coherent Attached Processor Proxy): the equivalent of the FPGA side's cc agent, but placed on the CPU side. It maintains a filter directory, accepts probes from other CPUs, and forwards to the PSL any probe that is not filtered out.
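The CAPP's filtering role can be illustrated with a toy model. This is only a sketch under stated assumptions: the class names `CAPP` and `PSL`, the methods `note_line_cached`, `probe`, and `handle_probe`, and the use of a plain set as the filter directory are all hypothetical and do not reflect the real hardware design.

```python
class PSL:
    """Toy stand-in for the FPGA-side PSL: it just records the probes it receives."""

    def __init__(self):
        self.received = []

    def handle_probe(self, line_addr):
        self.received.append(line_addr)


class CAPP:
    """Toy stand-in for the CPU-side proxy that owns a filter directory."""

    def __init__(self, psl):
        self.psl = psl
        self.directory = set()  # cache lines the accelerator may currently hold

    def note_line_cached(self, line_addr):
        # Hypothetical bookkeeping: remember that the accelerator cached this line.
        self.directory.add(line_addr)

    def probe(self, line_addr):
        """Handle a probe from another CPU; return True if it was forwarded."""
        if line_addr not in self.directory:
            return False  # filtered out: the accelerator cannot hold this line
        self.psl.handle_probe(line_addr)
        return True


psl = PSL()
capp = CAPP(psl)
capp.note_line_cached(0x1000)
forwarded = [capp.probe(addr) for addr in (0x1000, 0x2000)]
```

Only the probe for the line the accelerator may hold crosses over to the PSL; the other one is answered on the CPU side, which is the point of keeping the proxy there.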
The key operating points can be briefly summarized in six points:
It targets dedicated scenarios and is optimized for dedicated PCIe accelerator cards;
The FPGA directly accesses the entire virtual address space of the current process, without having to translate it into PCIe addresses;
The accelerator card can use the cache, and the probe operations of the CAPP keep it cache coherent (cc) with main memory automatically;
The accelerator card and the CPU see the same address space, and that view is cache coherent;
An API is provided, covering opening the device, passing task description information, and so on; it is the equivalent of a driver;
The PSL is provided by IBM as hard-core IP, and the AFU sends and receives data through the PSL using opcodes.
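The shared address space in the points above rests on the PSL translating virtual addresses with the same page table the CPU uses, helped by a TLB. A minimal sketch of that idea follows. The `Translator` class, the 4 KiB page size, and the dictionary-based page table are assumptions made for illustration, not the PSL's actual structures.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class Translator:
    """Toy virtual-to-physical translator with a TLB in front of a shared page table."""

    def __init__(self, page_table):
        self.page_table = page_table  # shared mapping: virtual page number -> physical page number
        self.tlb = {}                 # cached translations
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.hits += 1
        else:
            self.misses += 1
            # Page-table walk; a missing entry raises KeyError, standing in for a fault.
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn] * PAGE_SIZE + offset


page_table = {0x10: 0x80, 0x11: 0x03}  # the same table the "CPU side" would use
translator = Translator(page_table)
paddr_a = translator.translate(0x10 * PAGE_SIZE + 0x24)   # first touch: TLB miss
paddr_b = translator.translate(0x10 * PAGE_SIZE + 0x100)  # same page: TLB hit
```

Because the accelerator works on the process's own virtual addresses, the host can hand it ordinary pointers instead of first mapping buffers into PCIe address space.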
In this process, CAPI is committed to giving the FPGA and the CPU the same view: what the FPGA sees is no longer PCIe space, so the address mapping step is omitted. The FPGA side also has its own cache.
The FPGA now has direct access to main memory, but it does not access the entire physical address space: under CAPI 1.0 it works within the virtual address space of the single process that is using the AFU.
How much performance can be improved?
The hardware configuration is this:
IBM POWER8 server, S822L
Ubuntu, kernel 3.18.0-14-generic
Nallatech 385 CAPI card
Samsung SM1715 1.6TB NVM Express SSD
For the test, PMC engineers built a text search engine on the FPGA.
During the test, the main program on the host reads data from the NVMe SSD and generates a linked list of task descriptions. The AFU polls main memory to fetch the task description list and performs the search tasks. A snooper is used for debugging and performance monitoring.
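The division of labor described above, where the host builds a linked list of task descriptions and the AFU polls it, can be sketched in software as follows. Everything here is hypothetical: the `Task` fields, `build_task_list`, and `afu_poll_once` are illustrative names, and the substring count stands in for whatever search logic the real AFU implements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    text: str                       # data the host read from storage
    pattern: str                    # what to search for
    done: bool = False              # set by the "AFU" once the task is finished
    match_count: int = 0            # result written back into the descriptor
    next: Optional["Task"] = None   # link to the next task description


def build_task_list(chunks, pattern):
    """Host side: build a linked list of task descriptions, preserving order."""
    head = None
    for chunk in reversed(chunks):
        head = Task(text=chunk, pattern=pattern, next=head)
    return head


def afu_poll_once(head):
    """AFU side: one polling pass over the list; returns how many tasks it ran."""
    processed = 0
    task = head
    while task is not None:
        if not task.done:
            task.match_count = task.text.count(task.pattern)
            task.done = True
            processed += 1
        task = task.next
    return processed


head = build_task_list(["capi capi", "fpga", "open capi"], "capi")
first_pass = afu_poll_once(head)   # picks up all three new tasks
second_pass = afu_poll_once(head)  # finds nothing new to do

counts = []
task = head
while task is not None:
    counts.append(task.match_count)
    task = task.next
```

In the real system the list lives in the process's memory and the AFU reads it through the PSL, so the host never copies descriptors into a device-side buffer.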
Performance: P8 <-> AFU
Once the queue depth is large enough, throughput reaches its limit, close to 6 GB/s of bandwidth, which is very high.
Latency is also very low, only about 1.5 microseconds: 90% of reads and writes complete within 1.5 microseconds.
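A rough back-of-envelope calculation shows why queue depth matters for these numbers. It applies Little's law (data in flight = throughput x latency) and assumes that the throughput figure means 6 gigabytes per second, that 1.5 microseconds is the per-request round-trip latency, and that transfers are 128-byte cache lines (the POWER8 line size). None of these readings are confirmed by the test report, so treat the result as an estimate only.

```python
# Assumed readings of the reported figures (see the caveats above).
bandwidth_bytes_per_s = 6e9   # "close to 6 GB/s"
latency_s = 1.5e-6            # "1.5 microseconds"
line_bytes = 128              # POWER8 cache line size

# Little's law: bytes that must be outstanding to sustain the bandwidth.
bytes_in_flight = bandwidth_bytes_per_s * latency_s

# The same quantity expressed as outstanding cache-line requests.
lines_in_flight = bytes_in_flight / line_bytes

print(f"about {bytes_in_flight / 1000:.0f} KB in flight, "
      f"about {lines_in_flight:.0f} cache lines outstanding")
```

Under these assumptions the link only saturates when several dozen requests are outstanding at once, which fits the observation that peak throughput appears only at sufficient queue depth.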
Things CAPI 1.0 cannot do yet
A CPU thread currently cannot see the address space on the AFU (except for the MMIO control register addresses). Moreover, an AFU can only be used by one process. Would it be faster if, in the future, the FPGA could be plugged directly into the CPU's FSB?