IBM Cell BE workshop content and highlights

Last Update:2018-12-07 Source: Internet

Author: User


I had the honor of attending this IBM training, although many people never found out where it was being held because of organizational mistakes on IBM's side. I attended one and a half days of the two-day session, which was taught in turn by two IBM lecturers from Beijing. Enough small talk; let's get to the topic.

First, the Cell architecture, which should already be familiar. Even if you don't follow IBM, the PS3 hardware spec says the same thing: one PPE plus eight SPEs, for nine cores and ten hardware threads. The PPE is a conventional 64-bit Power Architecture core, so it can be used like an ordinary Power processor and existing PowerPC code runs on it without problems. It has two levels of cache, L1 and L2. Each of the eight SPEs contains an SPU, a vector processor designed for SIMD work, with 128-bit general-purpose registers (and a lot of them), dual-issue pipelines, and no hardware cache, which makes it very fast. Each SPE also contains a 256 KB local store (LS) that serves as the SPU's local memory with very low latency, and an MFC (Memory Flow Controller) that handles DMA. The EIB (Element Interconnect Bus) is the high-speed bus connecting the PPE and the SPEs.

Next, the development environment. The Cell SDK is currently at version 2.1 and runs on Linux. Release 3.0 is said to be due in September, bringing some strong optimization features for xlc and an SPE version of FDPR-Pro. However fancy the future versions may be, let's look at what we have at hand. First, the toolchain and compilers: C/C++, of course, with one set for the PPU and one for the SPU, so two kinds of executables are built. You can use either GCC or IBM xlc as the compiler, and later versions of xlc are said to gain features such as auto-vectorization. The rest of the toolchain matches what you would find on Linux on an ordinary x86 machine, and basic usage is the same.

In addition, IBM's engineers provide a number of profiling and optimization tools. The static SPU timing tool generates a timing diagram from the compiled code, and FDPR-Pro optimizes binary code without human intervention, although it currently supports only the PPU. To make it easier to exploit Cell's strengths, IBM's lab in China has also developed ALF (Accelerated Library Framework), which is said to greatly simplify the programmer's job of splitting up the workload: you define the workload in a script (or a GUI? I forget) and the framework then optimizes the processing time automatically. The training did not cover it, but it looks worth keeping an eye on.

Finally, there are some Eclipse plug-ins and the official Cell simulator. The simulator is very important for runtime profiling: you can see the execution state of each SPU cycle by cycle, along with key profiling data such as CPI (cycles per instruction; ordinary programs can reportedly be optimized to about 1.0, and extreme compute kernels to 0.6-0.7), branch misses, and the dual-issue rate. Its drawback is that it is far too slow. In the IBM classroom I was on an old P4 machine running Win2k, with FC6 in VMware on top of that, the simulator running inside FC6, and yet another Linux system running inside the simulator. Even in fast mode the simulator was very slow, and in cycle mode a 5-second program is said to take four hours.
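To make the CPI and simulator-speed figures above concrete, here is a small worked calculation in Python. The cycle and instruction counts are invented for illustration and are not measurements from the training. The only architectural fact used is that a dual-issue core can retire at most two instructions per cycle, so its CPI cannot drop below 0.5.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction: lower means the pipelines are kept busier."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def slowdown(program_seconds: float, wall_clock_seconds: float) -> float:
    """How many times slower a simulated run is than the program itself."""
    return wall_clock_seconds / program_seconds


# Best case for a core that can issue two instructions every cycle.
BEST_DUAL_ISSUE_CPI = 1 / 2

# Hypothetical counts: the same 1,000 instructions before and after tuning.
before = cpi(cycles=1_000, instructions=1_000)   # the "about 1.0" case
after = cpi(cycles=650, instructions=1_000)      # inside the 0.6-0.7 range

# The quoted cycle-mode figure: a 5-second program taking four hours.
cycle_mode_slowdown = slowdown(5, 4 * 3600)

print(before, after, BEST_DUAL_ISSUE_CPI, cycle_mode_slowdown)
```

Read this way, the 0.6-0.7 figure is already close to the 0.5 floor, and the cycle-mode number corresponds to a slowdown of roughly three thousand times.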

Those are the things IBM can offer Cell developers at present. You can also visit alphaWorks to see what else is new.

Next, let's talk about how to get the most out of Cell. Basically, the PPE acts as the program's coordinator: it schedules all the work of the SPEs and prepares their data. The bulk of the data-intensive computing falls on the eight SPUs. An SPU is a high-speed vector machine. All of its instructions are vector instructions, and even scalar calculations have to be expressed as vector instructions. What does that remind you of? A GPU, right.

I personally feel that Cell sits between a traditional CPU and a GPU. Its pipelining requirements are not as extreme as a GPU's, and there is no shader standard tying your hands when you do general-purpose computing, while an ordinary general-purpose processor has nothing like its vector processing power. Cell is flexible, but its efficiency may not match a GPU's; GPUs are, after all, highly specialized (Folding@home on the PS3 still falls short of the GPU version, but beats the PC CPU version).

But I digress. Whether the SPUs can deliver their potential depends mainly on two things. One is whether the PPE can communicate effectively with the SPEs; the other is how the SPE code uses vector instructions and optimizes its own algorithm. Both are huge topics, so here is only a brief introduction.

There are three main communication mechanisms between the PPE and the SPEs: mailbox, signal, and DMA. Mailbox and signal look somewhat similar. Both are synchronization objects, like semaphores, but implemented in hardware. Since they are kept separate they must serve different purposes, which I still need to study. DMA is very intuitive: with mailbox and signal providing the synchronization, DMA can transfer data safely. Because an SPU can only operate on data in its LS, a lot of the code ends up dealing with synchronization and DMA.
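The interplay of mailbox and DMA can be modeled in a few lines of Python. This is a conceptual sketch only, not the Cell SDK API: the names `inbound_mailbox`, `outbound_mailbox`, `dma_get`, `dma_put`, `spe_worker`, and `ppe_main` are all hypothetical. Threads stand in for the PPE and one SPE, queues stand in for the hardware mailboxes, and list copies stand in for DMA transfers. The queue sizes are illustrative.

```python
import queue
import threading

# Stand-in for system memory, visible to the "PPE" and reachable by "DMA".
main_memory = {"input": [1.0, 2.0, 3.0, 4.0], "output": None}

# Hypothetical mailboxes: small blocking queues carrying short messages.
inbound_mailbox = queue.Queue(maxsize=4)   # PPE -> SPE
outbound_mailbox = queue.Queue(maxsize=1)  # SPE -> PPE


def dma_get(key):
    """Model a DMA transfer from main memory into local store (a copy)."""
    return list(main_memory[key])


def dma_put(key, local_buffer):
    """Model a DMA transfer from local store back to main memory."""
    main_memory[key] = list(local_buffer)


def spe_worker():
    key = inbound_mailbox.get()          # block until the PPE says data is ready
    local_store = dma_get(key)           # the SPU can only compute on LS data
    local_store = [2.0 * x for x in local_store]
    dma_put("output", local_store)       # write the result back
    outbound_mailbox.put("done")         # tell the PPE the result is in place


def ppe_main():
    spe = threading.Thread(target=spe_worker)
    spe.start()
    inbound_mailbox.put("input")         # announce which buffer to fetch
    status = outbound_mailbox.get()      # wait for completion before reading
    spe.join()
    return status, main_memory["output"]


result = ppe_main()
print(result)
```

The point of the model is the ordering: the PPE never reads `main_memory["output"]` until the completion message arrives, and the worker never touches main memory except through the two copy functions.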

As for code vectorization, IBM's best practice is this: if you are not yet familiar with Cell or with the algorithm you want to implement, first write a scalar version on the PPE and then port it to the SPE step by step. First align the memory, then convert the scalar operations into vector operations, and finally review the data dependencies in the algorithm to see whether operations can run in parallel and whether the work can be split across other SPEs. Of course, if you are a Cell master you can write the final version directly, but it seems no such master has been born yet. RapidMind also offers a very high-level solution and claims you can write multi-core programs without thinking about the hardware architecture, but the IBM people did not seem to think much of it and still prefer their own ALF. IBM did admit, though, that this is a big problem that cannot be solved overnight (the free lunch is over).
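The porting steps described above can be sketched in Python on a toy kernel (`a * x + y`). This is only a model: lists of four floats stand in for 128-bit registers, padding stands in for memory alignment, and every function name here is hypothetical. Real SPE code would use the SDK's C vector types and intrinsics instead.

```python
LANES = 4  # four 32-bit floats fit in one 128-bit register


def saxpy_scalar(a, xs, ys):
    """Step 1: the reference scalar version, as you would first write it."""
    return [a * x + y for x, y in zip(xs, ys)]


def pad_to_lanes(values, fill=0.0):
    """Step 2: pad to a whole number of vectors (stand-in for alignment)."""
    remainder = len(values) % LANES
    if remainder == 0:
        return list(values)
    return list(values) + [fill] * (LANES - remainder)


def vec_madd(a_vec, x_vec, y_vec):
    """Model of one vector instruction: multiply-add on all lanes at once."""
    return [a * x + y for a, x, y in zip(a_vec, x_vec, y_vec)]


def saxpy_vector(a, xs, ys):
    """Step 3: the same kernel expressed as whole-vector operations."""
    n = len(xs)
    xs_p, ys_p = pad_to_lanes(xs), pad_to_lanes(ys)
    a_vec = [a] * LANES  # the scalar is replicated into every lane
    out = []
    for i in range(0, len(xs_p), LANES):
        out.extend(vec_madd(a_vec, xs_p[i:i + LANES], ys_p[i:i + LANES]))
    return out[:n]  # drop the padding lanes


def partition(n_items, n_workers):
    """Step 4: split independent work into contiguous, vector-sized chunks."""
    vectors = -(-n_items // LANES)            # ceiling division
    per_worker = -(-vectors // n_workers)
    chunks = []
    for w in range(n_workers):
        start = min(w * per_worker * LANES, n_items)
        stop = min(start + per_worker * LANES, n_items)
        if start < stop:
            chunks.append((start, stop))
    return chunks


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
print(saxpy_vector(2.0, xs, ys) == saxpy_scalar(2.0, xs, ys))
print(partition(10, 2))
```

Keeping the scalar version around as a reference, as IBM recommends, is what makes the comparison on the second-to-last line possible at every porting step.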

That was basically the content of the first day. I only attended the morning of the second day, but the software programming models were well worth hearing. Here are several useful and common ones.

Streaming lets several SPUs process unrelated data in parallel; use it when a single SPU has enough processing power for the task. Pipelining has several SPUs process related data in sequence, but load balancing is difficult; consider it when the data dependencies are strong and a single SPU's processing power is insufficient.

When the executable or the data exceeds the 256 KB limit of the LS, you can use manual DMA, overlays, or the software cache library to bridge the gap between the LS and main memory. Manual DMA control is better suited to strongly coherent data, for example when the programmer knows it will be accessed in large blocks; in that case it is the best method. The software cache suits programs that need random access to a large amount of main-memory data and know nothing about its coherence. Because the SPE has no hardware cache, the software cache library fills that role. Once your code size exceeds the LS, you must use overlays. You define the overlay regions in a linker script; you do not need to touch the source code or worry about overwriting program space.

There are also some other models, such as kernel-managed SPU threads, which are still too far from practical use (though VxWorks seems to be working on it). Managing cooperative multitasking on the SPUs yourself, however, has real practical value.
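To illustrate the software cache idea mentioned above, here is a minimal direct-mapped, read-only cache in Python. It is a sketch under stated assumptions, not the SDK's software cache library: the class name `SoftwareCache`, the line size, and the number of lines are all invented, and a list slice stands in for the DMA transfer that would fetch a line from main memory into the LS.

```python
class SoftwareCache:
    """Direct-mapped, read-only cache of fixed-size lines kept in 'local store'."""

    def __init__(self, main_memory, line_size=4, num_lines=2):
        self.main_memory = main_memory
        self.line_size = line_size
        self.num_lines = num_lines
        self.tags = [None] * num_lines    # which memory line each slot holds
        self.lines = [None] * num_lines   # the cached data itself
        self.hits = 0
        self.misses = 0

    def read(self, index):
        line_number = index // self.line_size
        slot = line_number % self.num_lines
        if self.tags[slot] != line_number:
            # Miss: fetch the whole line (this models one DMA transfer).
            self.misses += 1
            start = line_number * self.line_size
            self.lines[slot] = self.main_memory[start:start + self.line_size]
            self.tags[slot] = line_number
        else:
            self.hits += 1
        return self.lines[slot][index % self.line_size]


main_memory = list(range(100))            # far larger than the cache
cache = SoftwareCache(main_memory)
values = [cache.read(i) for i in (0, 1, 2, 3, 40, 41, 0)]
print(values, cache.hits, cache.misses)
```

In this access pattern the first four reads share one fetched line, the read at index 40 evicts it because both lines map to the same slot, and the final read of index 0 has to fetch it again. That is the cost of not knowing the access pattern, and it is why manual DMA wins when the programmer does know it.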

The session also covered how to use the SPU static timing tool; the file it generates shows the static timing clearly. The simulator, apart from being too slow, is useful for obtaining actual runtime data from dynamic execution.

In the end I had to go back to my company, so I missed a case study on optimizing a real-time MPEG-2 to H.264 transcoding streaming media system. Time to buy a PS3.

A few days later, I found out that wenle and Pan Da were classmates... how embarrassing.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
