Cache Coherency Sanity Check

So I'm thinking about porting a multi-threaded program I'm working on (an interpreter for a dataflow language, if you're curious) to the 32X and/or Saturn, and I was trying to think of a sane way to use both processors in a reasonably efficient fashion. I think I have a reasonable solution, but I'd like some feedback to make sure I haven't missed anything obvious. So here goes:

Threads locked to the processor they started on (I'll probably stick with a single thread per processor for this particular application)

Code and stack accessed through cached memory region

Globals and heap accessed through non-cached memory region

The logic behind this is that code is read-only, so we don't have any coherency problems there. Each stack will only be touched by the single thread it belongs to, and since threads can't move between processors, only one processor will ever look at a given stack. Globals and the heap, on the other hand, are potentially shared by all threads, and manually flushing everything is going to be difficult to do properly and probably not very performant (apart from some special cases where the data is mostly read-only, but I can handle those as exceptions if there's enough performance to be gained).

Will this approach work? Does it sound like a good compromise between performance and complexity?

I also posted this over on the Spritesmind forum, but I figure you Saturn devs would have more experience dealing with the SH-2's unfortunate lack of cache coherency.
 
It is a pure software problem to ensure that both CPUs don't access the same memory region at the same time in an unintended manner (in particular, avoid both writing to the same memory). You will have to implement some handshaking. It is best when you know exactly which data is accessed by which CPU. At special points you could synchronize the CPUs, e.g. to perform variable updates.

So globals and heap don't need to be accessed cache-through in general. I would recommend using cache-through access only for a very few dedicated variables. Since you synchronize only a couple of times per frame, you could clear the cache lines on the reading CPU (the CSH library of the SBL), or clear the whole cache (slCachePurge() or something similar in the SGL) when synchronizing.

In general, data and message traffic between the CPUs should be as small as possible. The SGL treats the slave as a slave and lets it execute a whole function without any handshaking afterwards. Internally it uses the free-running timer for the handshake, which doesn't cause any memory traffic. The function and arguments are passed through memory, accessed via a cache-through pointer.

I have been using the SGL function slSlaveFunc() in the voxel demo, the texture coordinate stuff, and the GBC and SNES emus. Based on some SBL dual-CPU samples, I have reimplemented and extended the SGL's handling of the slave for the SGL replacement. You can have some code, if you like.
 
Accessing shared globals via uncached memory is not a bad option. Use GCC's attribute extension to assign them to one segment with a virtual load address in the uncached region.


For the heap, it depends on the data sizes and access patterns. If the data is small, then uncached is probably the way to go, especially if both cores will frequently write to the same data structure. When the data sizes get bigger it'll be more beneficial to invalidate the cache once a core gains access to the structure. The cutoff point can be discovered via experimentation, and on the Saturn will be different for workram-H and -L.


When going for maximum performance it's also worth remembering that cache invalidation and pulling "read-once"-type data into the cache will lower the cache utilization for the rest of your code and data. For best performance you'll need to analyze the code and remove all unnecessary synchronization. Also keep an eye out for inadvertently passing the address of stack objects between cores.
 
antime said:
Accessing shared globals via uncached memory is not a bad option. Use GCC's attribute extension to assign them to one segment with a virtual load address in the uncached region.

For people who don't know how to do it that way (in the linker script, like me), it also works by having a pointer pointing to the address of the variable. That pointer can be made cache-through just by adding 0x20000000.
 
It's safer/more correct to do
Code:
uncached_address = (address & 0x1FFFFFFF) | 0x20000000;
This way you can't do the addition twice by mistake.


However, learning the linker script syntax is worth it IMO. First, if you discover that a variable isn't shared, you can change it to be cached just by modifying its attribute, rather than finding and changing all the code that accesses it. Second, grouping all uncached variables together in memory means they won't pollute the cache just by sharing a cache line with a cached variable. Third, you will need to learn these features anyway if you want to use overlays, run code from on-chip memory, etc.
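Roughly, the two pieces fit together like this. A sketch only: the section name `.uncached` and the addresses are my own placeholder choices, not anything mandated by the toolchain.

```c
/* C side: tag a shared global with a hypothetical ".uncached" section. */
volatile int frame_counter __attribute__((section(".uncached")));
```

```
/* Linker script fragment (sketch): give the section a virtual address
   in the 0x20000000 cache-through mirror, while loading it at the
   corresponding physical workram-H address via AT(). */
.uncached 0x26004000 : AT(0x06004000) { *(.uncached) }
```

All code then accesses `frame_counter` through the cache-through mirror automatically, with no pointer gymnastics at the call sites.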


(This doesn't apply to the Saturn, but some cache types write back an entire cache line into memory at once. In these cases the uncached variables must be placed separately from the cached ones, or you may write back stale data. I don't know what core the 32X uses.)
 
RockinB said:
It is a pure software problem to ensure that both CPUs don't access the same memory region at the same time in an unintended manner (in particular, avoid both writing to the same memory). You will have to implement some handshaking. It is best when you know exactly which data is accessed by which CPU. At special points you could synchronize the CPUs, e.g. to perform variable updates.

So globals and heap don't need to be accessed cache-through in general. I would recommend using cache-through access only for a very few dedicated variables. Since you synchronize only a couple of times per frame, you could clear the cache lines on the reading CPU (the CSH library of the SBL), or clear the whole cache (slCachePurge() or something similar in the SGL) when synchronizing.

If I were writing a piece of software from scratch for the Saturn I would certainly take this approach, but in this case I'm thinking about porting a piece of software I wrote with modern multi-processor/multi-core setups (which have hardware cache coherency) in mind to the Saturn. It's mostly just an interesting exercise. One of the design goals of my little programming language is to simplify the utilization of multiple cores/processors, and I thought it would be interesting to see how well it would handle the rather odd setup in the Saturn/32X. Of course, since the current implementation is just an interpreter, the performance will suck compared to C code running on even one processor, but I can still compare the performance of the interpreter running on one processor (with locking removed and nothing accessed via the cache-through area) vs. two.

I have been using the SGL function slSlaveFunc() in the voxel demo, the texture coordinate stuff, and the GBC and SNES emus. Based on some SBL dual-CPU samples, I have reimplemented and extended the SGL's handling of the slave for the SGL replacement. You can have some code, if you like.

I appreciate the offer, but I'm a masochist and like to target the bare metal. I have implementations of most of the parts of the C library that I use, with the notable exception of the file functions, for my Sega CD port (though my malloc/free implementation is crappy; I really should write a nice slab allocator).

Use GCC's attribute extension to assign them to one segment with a virtual load address in the uncached region

So you can give a variable a custom attribute and use that to put it in different parts of memory? Nifty. I don't think I'll have much reason to use it, though. There aren't a lot of globals in the code, and I'd be hard-pressed to think of any that would really benefit from being cached given the Saturn's setup. I'll probably just set up my linker script so that the .data and .bss sections are accessed via the cache-through region.

and on the Saturn will be different for workram-H and -L.

I'm assuming that's because access to one of those is faster than the other? Which is which?

When going for maximum performance it's also worth remembering that cache invalidation and pulling "read-once"-type data into the cache will lower the cache utilization for the rest of your code and data.

That's part of what got me thinking along these lines in the first place. Just the code itself is easily much bigger than the cache (my Sega CD port currently compiles to about 36 or 37K with -O0) so my hope is that some of the performance loss of heap data being uncached will be offset by better cache utilization for the code itself.

This doesn't apply to the Saturn, but some cache types write back an entire cache line into memory at once.

Yeah, write-back cache without hardware cache coherency would be a total nightmare.

I don't know what core the 32x uses.

I'm pretty sure the 32X SH-2s use write-through cache just like the Saturn. I was under the impression that the 32X uses the same part (or at least a very similar one) just clocked a little slower.
 
Mask of Destiny said:
I'm thinking about porting a piece of software I wrote with modern multi-processor/multi-core setups (which have hardware cache coherency) in mind to the Saturn.
Strong coherency and ordering as provided by the x86 is really the odd one out. Most architectures have much looser memory models, both for performance and for implementation simplicity. E.g. the PPC architecture manual says that even if a memory page is marked as requiring memory coherency, you still have to explicitly execute a "sync" instruction before the results are visible to other CPUs.


So you can give a variable a custom attribute and use that to put it in different parts of memory?
Yes, and it works with functions as well. Look for "Output section attributes" in the ld manual.


I'm assuming that's because access to one of those is faster than the other? Which is which?

Workram-L uses DRAM and workram-H uses faster SDRAM. That is also why SCU DMA only works with workram-H.
 
antime said:
Strong coherency and ordering as provided by the x86 is really the odd one out. Most architectures have much looser memory models both for performance and implementation simplicity.

I probably should have said modern desktop systems (which are almost all x86 systems these days). I can't think of how you'd manage SMP on a modern CPU without reasonable cache coherency support.

E.g. the PPC architecture manual says that even if a memory page is marked as requiring memory coherency, you still have to explicitly execute a "sync" instruction before the results are visible to other CPUs.

Even on x86 you need to use one of the fence instructions to make sure that all pending writes have actually occurred, but that's usually only an issue when you're implementing locking primitives or trying to do atomic updates. Those kinds of limitations are annoying, but at least they can be managed by your locking primitives. It's having to manually flush cache lines that my code really can't handle.

Yes, and it works with functions as well. Look for "Output section attributes" in the ld manual.

Good to know. Might come in handy for Sega CD work as well.

Workram-L uses DRAM and workram-h uses faster SDRAM. That is also why SCU DMA only works with workram-H.

Also good to know. Thanks!
 
Mask of Destiny said:
I can't think of how you'd manage SMP on a modern CPU without reasonable cache coherency support.
It doesn't really require more than invalidating the cache after you've acquired the lock that protects the shared resource. If you use a language that's designed with multithreading/multiprocessing in mind it can even be done automatically. It's only really the C model that's problematic, but it has all kinds of problems with the modern computing environment anyway.


Those kind of limitations are annoying, but at least they can be managed by your locking primitives. It's having to manually flush cache lines that my code really can't handle.
As long as all shared data protected by one lock can be packed into a single object it shouldn't be a big task to write locking functions or macros that do the flushing automatically.
 
antime said:
It doesn't really require more than invalidating the cache after you've acquired the lock that protects the shared resource.

Not if the processor uses a write-back cache, which AFAIK is pretty much the norm for processors that are likely to be used in an SMP setup these days. The "distance" between the processor and memory has gotten too great for write-through to be reasonably performant anymore.

It's only really the C model that's problematic, but it has all kinds of problems with the modern computing environment anyway.

Indeed.

As long as all shared data protected by one lock can be packed into a single object it shouldn't be a big task to write locking functions or macros that do the flushing automatically.

It gets complicated if the object has pointers to separately allocated heap objects/structs/arrays. I suppose you could require that the separately allocated object has its own lock, but that's not necessarily a good idea from a performance point of view. Locking overhead is enough of a problem as it is.
 