Meet the experts: Alex Chow on Cell Broadband Engine programming models

Please do feed the models


Level: Intermediate

Power Architecture editors, developerWorks, IBM

22 Nov 2005

A critical component of programming for the Cell Broadband Engine™ (Cell BE) processor is understanding the workload in order to choose the right programming model. Alex Chow of IBM recently proposed several programming models, ranging in complexity from a small single-SPU program to a large interconnected multi-SPU program. developerWorks talks with Alex about some of the programming models he proposed.

Alex is a senior programmer and a software development manager in the IBM Cell Processor Design Center in Austin, Texas, where he leads a team developing workloads, libraries, demos, and samples for Cell Broadband Engine (Cell BE) processor chip bring-up. He is also the author of "Unleashing the power of Cell Broadband Engine: A programming model approach" (see Resources), and it is that same topic -- programming models for the Cell BE -- that is the subject of today's interview.

developerWorks: Alex, thank you for joining us. Let's start with a few of the terms from your Fall Processor Forum presentation. Now, is CESOF a sub-format within the ELF specification?

Alex Chow: It is an application of the ELF [specification], meaning it does not change the ELF specification itself. It simply makes use of the ELF specification to achieve some of the things that we need. But because of that usage, we need to standardize how it is used so that everything built on the same convention can interoperate. So we made sure that CESOF is a usage convention rather than a change to the specification, to avoid much change in the existing toolchain.

dW: The approach that the documentation seems to keep pointing to is that the SPE executable is embedded within the ELF file as a section. Is there any model development toward having the SPE executable as its own stand-alone file that can be read in by a framework application?

Chow: Yes, we do have that. We have a runtime environment that can take the stand-alone SPU executable as-is and run it on an SPU. We call such a stand-alone SPU executable an SPUlet.

dW: Is that built on top of, like, the SPUFS model (see Resources) that was developed in Boeblingen?

Chow: SPUFS is a resource representation model that the kernel uses to manage SPU-related resources. Though the representation provides a clean boundary between what user-level applications can do and what the kernel does, it is not convenient for an application programmer to use directly. Our additional runtime environment provides two different models for the Cell programmer. One is the SPUlet model: a programmer simply creates an SPU executable, and our runtime environment can load it onto an SPU and start its execution. In this case, the SPU executable is a stand-alone file; it does not need to be included in a CESOF file. The other model is called the SPE-thread model. It is also built on top of the SPUFS representation. A programmer using the whole Cell resource usually programs to this SPE-thread model, which allows the programmer to manage more than one SPE program or thread. We use CESOF to piece together these different SPE programs and PPE programs into a single executable image.
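
[For illustration, a minimal PPE-side sketch of the SPE-thread model, assuming the libspe-style interface (spe_create_thread/spe_wait) shipped with early Cell SDKs; the embedded image name spe_hello is hypothetical. -eds]

    /* PPE side: create an SPE thread and wait for it to run to
     * completion. Assumes a libspe 1.x-style interface and an SPU
     * image embedded via CESOF under the (hypothetical) name spe_hello. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <libspe.h>

    extern spe_program_handle_t spe_hello;  /* SPU image linked in via CESOF */

    int main(void)
    {
        int status = 0;

        /* Mask of -1: let the kernel pick any available physical SPE. */
        speid_t spe = spe_create_thread(0, &spe_hello, NULL, NULL, -1, 0);
        if (spe == NULL) {
            perror("spe_create_thread");
            return EXIT_FAILURE;
        }

        spe_wait(spe, &status, 0);   /* block until the SPE program exits */
        printf("SPE exited, status 0x%x\n", status);
        return 0;
    }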

dW: And another term that I noticed you used in your paper was "double-buffering." I'm aware of the use of the term "double-buffering" in video composition, but is it also the correct term in this application?

Chow: It is a term and technique frequently used by PS2 game programmers. Of course, there is a similar concept in video composition. Because the DMA engine and the CPU can operate independently, we start the transfer of a new piece of data while we are processing the old piece of data. In video composition terms, we are drawing the next frame while we are showing the previous frame. A similar technique used in MPI programming is called "prefetching."
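
[A sketch of the double-buffering pattern on the SPU side, using the MFC DMA intrinsics from spu_mfcio.h; CHUNK, process(), and the effective address are illustrative placeholders. -eds]

    /* SPU side: double-buffered input. While the SPU processes buf[cur],
     * the MFC is already fetching the next chunk into buf[nxt]. */
    #include <spu_mfcio.h>

    #define CHUNK 4096
    volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process(volatile char *data);  /* illustrative compute step */

    void consume(unsigned long long ea, int nchunks)
    {
        int cur = 0, nxt = 1, i;

        /* Prime the pipeline: fetch the first chunk and wait for it. */
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        for (i = 0; i < nchunks; i++) {
            if (i + 1 < nchunks)     /* start the next transfer early... */
                mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, nxt, 0, 0);

            process(buf[cur]);       /* ...so compute overlaps the DMA */

            if (i + 1 < nchunks) {   /* wait only for the prefetched buffer */
                mfc_write_tag_mask(1 << nxt);
                mfc_read_tag_status_all();
            }
            cur ^= 1; nxt ^= 1;
        }
    }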

dW: I see a lot of "mailboxes," references to input and output "mailboxes." Is that basically a DMA address?

Chow: No. The mailbox is a special set of registers on the SPE and PPE. Each SPE can communicate with the PPE through these registers. A PPU program can simply send 32-bit data to, or receive it from, an SPU. It can be used as a very simple synchronization mechanism between PPU and SPU programs.

dW: Is it polled or interrupt-driven?

Chow: Both -- the architecture can support both.
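
[A minimal SPU-side sketch of the mailbox handshake just described, using the channel intrinsics from spu_mfcio.h; on the PPE side, a libspe-style call such as spe_write_in_mbox() would feed the inbound mailbox. -eds]

    /* SPU side: wait for a 32-bit command from the PPE, then report a
     * result back. spu_read_in_mbox() stalls until data arrives, which
     * is what makes the mailbox usable as a simple synchronization
     * primitive. */
    #include <spu_mfcio.h>

    int main(void)
    {
        unsigned int cmd = spu_read_in_mbox();   /* blocks until PPE writes */
        unsigned int result = cmd + 1;           /* stand-in for real work */
        spu_write_out_mbox(result);              /* PPE polls or takes an
                                                    interrupt to read this */
        return 0;
    }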

dW: Now, in your programming models article based on your Fall Processor Forum presentation (see Resources), you propose increasingly complex models. You've got the streaming, pipeline, and I/O data models, which are basically at the same level as far as approaches go. Is there a danger that there's going to be a de facto programming model implemented and the others go by the wayside? Is there a de facto programming model?

Chow: There is no de facto programming model. We experimented with different flavors of programming models. Each type of workload may use a certain combination of models. We try not to force the programmer into one specific model. The architecture supports all different kinds of programming models. A programmer can decide on one over another considering development efficiency and performance.

dW: But the programming model has to be supported by the operating system itself, correct?

Chow: No, only to a certain degree...

dW: No?

Chow: The current operating system exposes SPU resources through its SPUFS. So it's really up to the programmers to establish whatever additional models they want on the available resources. A high-level programming model may not need to be supported directly by the operating system itself. For example, the SPU runtime management library provides a higher-level SPU thread model on top of the SPUFS. One can similarly implement one's own runtime library to support another programming model.

dW: Well then, if we take your simple small single-SPU model, who determines whether or not that SPE thread actually gets to execute, and on which SPU does it execute?

Chow: That's a very good question, and the answer depends on whether you are a kernel programmer or an application programmer. From a kernel programmer's point of view, the kernel decides when to actually map an SPE task onto an available physical SPE. From the application programmer's point of view, the SPE thread is "conceptually" started after the SPE thread is created. Since the SPE thread model virtualizes the physical SPEs, the application programmer shouldn't need to know which SPU gets to execute a particular SPE thread.

dW: But, can the user still make the decision?

Chow: To a certain degree. Similar to the mmap function, the user may suggest a particular SPU to the kernel by using a mask called the SPU affinity mask.
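
[A variation on the earlier libspe-style sketch, passing an explicit affinity mask instead of -1; as with an mmap address hint, the kernel may or may not honor it, and the bit semantics shown are an assumption. -eds]

    /* Suggest physical SPUs 0 and 1 to the kernel via the mask
     * argument of spe_create_thread (a hint, not a guarantee). */
    unsigned long mask = (1UL << 0) | (1UL << 1);
    speid_t spe = spe_create_thread(0, &spe_hello, NULL, NULL, mask, 0);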

dW: Do the programming models require some pre-knowledge by the compiler?

Chow: The currently released compiler and toolchain do not have pre-knowledge of any programming model. Some other compilers -- for example, those for OpenMP and Cg -- define their own programming models through their languages.

dW: So I think the answer to the question is that all components -- from the operating system to the developer, and the compiler and toolchain in between -- have to work together to support a particular programming model?

Chow: Yes. Another way to say it is that if a programmer wants to use a specific programming model to help him develop an application, he has to use all the pieces from all the components: SPUFS, the SPE runtime management library, and CESOF. Together they provide enough low-level infrastructure for a programmer to establish his own specialized programming model.

dW: Let's say I'm writing an operating system that will run on a Cell processor. Can my operating system use something comparable to memory management functions or whatever to make it so that userland programs cannot directly access the SPEs, but have to make a system call to do it?

Chow: Yes. Currently, that barrier [in Linux] is established by the SPUFS. The SPU file system represents the SPE resources to the kernel and to the userland programs. So all the users access the SPE resources by operating on the SPU files. [Local store] is also represented as an SPU file. You can read and write that file to access the local store.

dW: So there are no unprotected instructions that bypass all of this and write directly to the processors?

Chow: That's right. In addition, we have to use system calls for some operations -- for example, to start an SPU program running.
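
[A sketch of reaching an SPE context through spufs from userland, the file-based boundary described above; the mount point /spu and the context name myctx are hypothetical, and a context would normally be set up through the spu_create system call or a runtime library. -eds]

    /* Read the start of an SPE context's local store through the
     * spufs "mem" file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char ls[256];
        ssize_t n;

        int fd = open("/spu/myctx/mem", O_RDONLY);  /* hypothetical path */
        if (fd < 0) { perror("open"); return 1; }

        n = pread(fd, ls, sizeof ls, 0);  /* first bytes of the 256KB LS */
        printf("read %zd bytes of local store\n", n);
        close(fd);
        return 0;
    }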

dW: Looking at the multitasking model -- which has the pre-emptive scheduling -- when it comes to a context change, what's the order of magnitude versus some of the effects like cache misses?

Chow: The context switch for the whole SPE takes about 20 microseconds. The majority of the latency of the context-switch save and restore is because you have to swap out the whole local store and then replace it with the new context. It is very expensive compared to native PowerPC® threads. For the PPE core, you don't have to save and restore the whole execution memory image of the PowerPC task. To context switch an SPE program properly, a lot of activities need to happen. We have to wait for the outstanding DMAs to complete, copy out all of the entries in the DMA queues, copy out all of the local store and register contents... By the way, it is a huge register file. Then you have to bring in the previous context or other context to be swapped back in. It takes a relatively long time to complete. That's the reason we usually favor the "run to completion" model.

dW: Okay. So an SPE executable that's less than 256K can be loaded entirely in the local store. That's going to be the least expensive if you can run it to completion, correct?

Chow: Yes.

dW: Okay, and then looking at a single, simple SPE executable that exceeds 256K, would that be your next most expensive arrangement or model?

Chow: Right. But this question is not exactly about context switching. If you run the large program to completion, it will still be the least expensive approach for that large program. The cost to context switch a small SPE program compared to a large overlay SPE program is the same. Software caching and code overlays have their own overhead during execution, but such overhead is not related to multitasking.

dW: It seemed like the pipeline model was heavily dependent on ensuring that you constantly had something going in. Will bubbles have an effect on performance?

Chow: Yes. The use of the pipeline elements has two things to consider: the regularity of the workload in the stream, and the regularity of the data feed. If the workload varies too much, then the job throughput will be bounded by the slowest stage in the pipeline. If you don't keep the input stream flowing, then a stage cannot put data through in time, and you will have bubbles in the pipeline.

dW: So you mean, when completion times vary, the model becomes bounded by the task that takes the longest to complete -- but in the meantime, the other SPUs are idle, having completed their tasks? Sort of like a big meal that takes a long time to digest?

Chow: Yes.

dW: So you have to have a very good understanding of what kind of workloads you need in order to determine what model you're going to use.

Chow: Yes, understanding the nature of the workload is very important.

dW: Can the pipeline model be emulated by the job queue model?

Chow: The answer is yes, and it's just an implementation decision. An interesting observation is that the pipeline programming model is actually neither the easiest nor the most popular model to use. A programmer would have to connect the SPEs' inputs and outputs and stage the tasks properly. We did have some experiments with the pipeline model, but as far as I know, they were eventually transformed into other models. It is still possible that a particular problem may be more efficiently implemented by a pipeline model, provided its data and its stages are very regular in nature.
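
[A sketch of the job-queue idea Alex describes: each SPE claims self-describing work items from a queue in main memory, and a pipeline is emulated by letting each descriptor name the stage to run. The descriptor layout, claim_next_job(), and run_stage() are illustrative inventions, not SDK interfaces. -eds]

    /* SPU-side worker loop for a hypothetical job queue. */
    #include <spu_mfcio.h>

    typedef struct {
        unsigned long long in_ea, out_ea;  /* data addresses in main memory */
        unsigned int stage;                /* which pipeline stage to apply */
        unsigned int size;
        unsigned int pad[2];               /* keep the descriptor DMA-able:
                                              32 bytes, a multiple of 16 */
    } Job __attribute__((aligned(16)));

    extern int claim_next_job(void);     /* e.g. an atomic ticket from a
                                            shared counter (illustrative) */
    extern void run_stage(const Job *j); /* dispatch on j->stage */

    void worker(unsigned long long jobs_ea, int njobs)
    {
        Job j __attribute__((aligned(16)));
        int idx;

        while ((idx = claim_next_job()) < njobs) {
            /* Pull the descriptor for the claimed job into local store. */
            mfc_get(&j, jobs_ea + (unsigned long long)idx * sizeof j,
                    sizeof j, 0, 0, 0);
            mfc_write_tag_mask(1 << 0);
            mfc_read_tag_status_all();

            /* Selecting the stage per job replaces wiring SPEs into a
             * fixed chain -- this is how the queue emulates a pipeline. */
            run_stage(&j);
        }
    }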

dW: Last question: a discussion that comes up with Cell a lot focuses on flops -- gigaflops, teraflops, benchmark kinds of things. On the Cell processor, is the limitation for flops the system memory bandwidth or the SPUs themselves?

Chow: It depends on the nature of the workload. In our experience, it really is different for different workloads. If an algorithm needs to take more time moving the data than processing the data, it will be bounded by the memory bandwidth of the chip. For example, the FFT [Fast Fourier Transform, see Resources -eds] is mostly bounded by the memory bandwidth. In many cases, we can use calculation to reduce the memory bandwidth requirement. In the large FFT workload, the twiddle factors are calculated on the SPE. This saves a significant chunk of the memory bandwidth requirement.

If you can squeeze a lot of computation onto a particular SPU and do a lot of calculation before throwing the result back out, then in those cases we can achieve very high gigaflops.
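
[A plain-C sketch of the bandwidth-for-computation trade Alex mentions: regenerating the FFT twiddle factors W_N^k = e^(-2*pi*i*k/N) on the SPE instead of streaming a precomputed table; a real SPE kernel would vectorize this with SIMD intrinsics. -eds]

    #include <math.h>

    /* Generate n/2 twiddle factors locally. Each complex value is
     * 8 bytes, so this avoids roughly 4*n bytes of DMA traffic per
     * FFT block at the cost of some SPE computation. */
    void make_twiddles(float *re, float *im, int n)
    {
        int k;
        for (k = 0; k < n / 2; k++) {
            float ang = -2.0f * (float)M_PI * (float)k / (float)n;
            re[k] = cosf(ang);   /* real part of W_n^k */
            im[k] = sinf(ang);   /* imaginary part */
        }
    }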

dW: Alex, thank you very much for talking to us today.

Attributions

Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.



Resources



About the author

The developerWorks Power Architecture editors welcome your comments on this article. E-mail them at dwpower@us.ibm.com.



