Saturn SCU DSP Notes

Ponut

I wanted to make this a formal write-up summary, going over the uses of the SCU-DSP & the things I've bothered to do with it, in detail. However, I'm running short on time in my life so I'll have to make it quick.

First: Should you be reading this? What is your level of understanding?

This is a question that I start with just to try and touch base with you, the reader. I am going to talk about computer science stuff: architecture, pipeline, bits, bytes, data types, memory bus width, even/odd addresses, and pointers.
If you do not have a computer science background, this post will not be useful to you. Either close the tab now, or read on with caution.
And even if you do have a computer science background, I have a further filter for you:

Do you know what these are expressing? Don't have to answer in the comments, just think about it.
If you get it, that's good. You have a level of C mastery that means you might actually be able to make use of the DSP.

void function(void (*thing)(void));
unsigned int notiCommand = (0x4Au << 25) | (((unsigned int)dsp_noti_addr) >> 3);

I am being pedantic about this just because I want to make a point: the SCU-DSP is difficult to use.
It is so difficult that fine-tuning your program written for the SH2s in C is going to net you a far greater gain than learning the SCU-DSP.
I would even argue that, even once your code is logically where you want it to be, messing with compiler flags is a better use of your time.

This of course assumes you just want to "GSD", as it were. If you want to learn something unique and fun in programming, by all means, go ahead.

Second: What is the SCU-DSP?

The Saturn actually has two processors referred to as "the DSP". There is the DSP inside the Saturn Control Unit / System Control Unit (SCU), and the DSP inside of the Saturn Custom Sound Processor (SCSP). These two processors are COMPLETELY DIFFERENT. They do not function even remotely the same way. It is extremely important to specify which DSP you are talking about. Today, we are talking about the SCU-DSP. The SCSP-DSP is a piece of hardware covered by a different discipline, that of the digital audio engineer.

Capture.JPG

What you see above here is a logical diagram of the SCU-DSP.

...

Yes, that's really it. That's the whole thing. I'm not kidding, it is that simple. Some of you of course are saying, "What?! SIMPLE?!"
Well, trying to fit the logical diagram of something as slow as an M68K onto a single sheet of paper is a challenge. Go look up a diagram of one; it's probably going to span multiple pages. A bigger point is that the manual for a 68K might run to hundreds of pages, whereas the SCU-DSP is one part of the SCU manual, taking up 86 pages. The point of this is that while using the SCU-DSP is complicated, it is not because the processor is complicated. It just requires intricate management by the programmer to function.

The core feature highlights of the SCU-DSP are as follows:
A 48-bit MAC unit and an ALU (accumulator) unit, with the processor as a whole running at half the SH2 clock: 13.4 or 14.3 MHz.
A simplified RISC-like instruction set where one instruction is executed in one clock cycle.
1KB of program RAM, at four bytes per instruction, allowing a single program to be up to 256 instructions long (program changing is covered later).
Technically 1KB of on-chip "RAM". This, however, is not freely addressable RAM; it is indexed-access memory, split across four banks of 64 four-byte entries.
A very wide instruction bus, with the capacity for a single 4-byte instruction to execute five simultaneous actions.
A "pipeline" wherein the instruction after the one currently being executed is pre-loaded and prepared for execution in sequence.

Due to the extremely wide instruction bus, the SCU-DSP can only be programmed in its own assembly language.
This is not assembly like you will see for most other processors; however, if you've worked with a 56K before, it will probably look familiar.

Here is a primitive program example written in the SCU-DSP's assembly, highlighted via a language plugin for Notepad++ that color-codes the various instruction buses of this processor.

cap1JPG.JPG


What you will notice in this primitive (which should run in the DSP simulator), is a conditional jump. Unlike the SCSP-DSP, the SCU-DSP has a feature of critical usefulness and that is a number of logic flags and conditional instructions. Of course, you probably haven't looked at the SCU-DSP's instruction set yet, so that screenshot doesn't make sense. I am just using it to highlight that the SCU-DSP is a fully functional logical processor. In other words, this chip is more like a CPU than a DSP.

I will now insert a link to Antime's Saturn page, where you can find the SCU-DSP manual.
I have no idea where the hell I found the tools, but they could be on Antime's page too. Be aware that these are DOS tools.
These tools (the dsp-simulator and the dsp-assembler) are needed to test and assemble DSP programs.
They must be run in DOSBox. Also, you'll need their respective manuals. (See Attached Files)

Okay, those are the basic footnotes about the SCU-DSP. I will not cover its explicit architecture as GameHut already did a video on that. I will also attach a file that is a sample program demonstrating the SCU-DSP's logical capabilities in a program that allows the DSP to divide numbers using recursive logic and a root seeking algorithm (i just googled the method, nothing special). If you intend to understand how it actually works & write code for it, bouncing back and forth between sample code and the manual whilst writing your own program is more helpful than what I could cram in here.
 

Attachments

  • DSPSIM.zip
    75.8 KB · Views: 144
  • unlimit_no_dsp.zip
    4.1 MB · Views: 153
Except for a few things...


Another major complication regarding the use of the SCU-DSP is its RAM access. That is, its lack of access.

It cannot be overstated that the SCU-DSP cannot directly address memory from its own instructions. It can only select an index into its own memory banks, and that selection takes effect on the next instruction. Any instruction that moves data to or from its memory banks specifies which bank to use and whether or not to index that bank (MC,X or M,X).

<- edited out misinformation , see fafling reply ->
A quick aside, covered in fafling's reply, is that HWRAM addresses are only going to be 27 bits long, byte-wise. DSP-DMA addresses on a four-byte boundary. To turn a 27-bit bytewise address into a four-bytewise address, you have to divide it by four, or shift it right twice. That results in a 25-bit value. This should fit in the SCU-DSP's move-immediate instruction which can handle up to 25 bits.

Sega's own manuals seem to indicate that using <mvi hwramaddress,RA0> is a functional method of directly moving a HWRAM address to the DMA address registers (RA0, WA0). However, I have tested it and can confirm this *does not work*. The reason this doesn't work is because the move-immediate instruction is for signed 25-bit data. In moving a signed 25-bit value to the address register, it sign-extends the 24th bit of your address out to bits 25-31, resulting in an incorrect address being used.
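To make the arithmetic concrete, here is a minimal C sketch of the problem. It only models the behavior described above (a signed 25-bit immediate widened to 32 bits); the function names are made up for illustration and nothing here is taken from the manual or from my DSP program.

```c
#include <assert.h>
#include <stdint.h>

/* Highest byte address of HWRAM, as quoted in this thread. */
#define HWRAM_END 0x060FFFFFu

/* True if v can be stored in the given number of bits (bits < 32). */
static int fits_in_bits(uint32_t v, unsigned bits)
{
    assert(bits < 32);
    return (v >> bits) == 0;
}

/* Convert a byte address to the 4-byte units that DSP DMA uses. */
static uint32_t to_dma_units(uint32_t byte_addr)
{
    assert((byte_addr & 3u) == 0);  /* must be 4-byte aligned */
    return byte_addr >> 2;
}

/* Model of a signed 25-bit immediate widened to 32 bits:
 * bit 24 is copied into bits 25-31. */
static uint32_t sign_extend_25(uint32_t imm25)
{
    imm25 &= 0x01FFFFFFu;
    if (imm25 & 0x01000000u)
        imm25 |= 0xFE000000u;
    return imm25;
}
```

With these, the start of HWRAM (0x06000000) becomes 0x01800000 in 4-byte units. That fits in 25 bits, but bit 24 is set, so the sign-extended result is 0xFF800000 rather than the address you wanted.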

cap9.JPG


Here are screenies of the test scenario:
craig.JPG

Let's not pretend that anyone can understand exactly what is happening from such isolated code snippets. The point is that the SH2 shifts the address right twice, making it a 4-byte aligned address, before putting it into the command. Then the DSP manipulates it into RAM3 58. I follow this weird procedure to put it in RAM3 58, instead of writing 'MVI NOTI,MC3', because changing a DSP program in a way that removes or adds instructions is way more difficult than just replacing a 'sl' with a 'nop' on a line.

The 'working' variant is pictured as follows (and makes the weird procedure make more sense):
craig.JPG


The scientists among you will no doubt have already noticed that I did not, in fact, use <mvi hwramaddress,RA0>.
The logic here is the same. The MVI instruction will sign-extend a 25-bit address into a 32-bit signed integer. That won't crash the DSP, but it's also not the address you intended to use.
To get a valid address for DMA, you can follow the pictured procedure: pass in a 24-bit value, then shift it left once to produce an 8-byte aligned address at the valid bit depth for the DSP to use for DMA. Alternatively, you can pass in the 25-bit address and follow a different procedure to mask out the bits you do not want to be high. Just be aware that the shifting instructions on the ALU also sign extend, and there's otherwise no immediate way to generate 25 high bits (since, you know, the move-immediate instruction will sign extend that). The other way is of course to use your DSP control library to write an address that has already been converted by the SH2 into the DSP's memory blocks. If you do that, just be aware that ANY access to the DSP's memory blocks will increment the memory counters. You have to keep very explicit control of the memory counters regardless (CT0, CT1, CT2, CT3).
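The shift-by-three-then-shift-left-once procedure can be sketched in C like this. The DSP side is only modelled here as a plain C shift, and both function names are hypothetical; treat it as an illustration of the bit math, not as the real program.

```c
#include <assert.h>
#include <stdint.h>

/* SH2 side: pre-shift an 8-byte aligned byte address right by 3.
 * For any HWRAM address the result fits in 24 bits, so bit 24 of the
 * 25-bit immediate stays clear and sign extension does no harm. */
static uint32_t sh2_encode_addr(uint32_t byte_addr)
{
    assert((byte_addr & 7u) == 0);  /* 8-byte alignment is the price paid */
    return (byte_addr >> 3) & 0x00FFFFFFu;
}

/* DSP side, modelled in C: one shift left turns the value back into
 * an address in 4-byte units, which is what DSP DMA expects. */
static uint32_t dsp_decode_addr(uint32_t imm)
{
    return imm << 1;
}
```

For example, 0x06012348 is passed in as 0x00C02469, and the shift left gives 0x018048D2, which is the same as 0x06012348 >> 2.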

The reason I point out this 'complication' with memory addresses is that it affects the organization of the game software as a whole. The DSP does not have enough program RAM to control an entire 'pipeline' of your software on its own. This means that, no matter what, the SH2 must allocate the HWRAM that the DSP is allowed to use, and must tell the DSP program the size of its workload, or perhaps where to look for the address of that workload. In my own software, this has a two-part solution. In the first part, the DSP program starts with a 'header' which, as written, contains dummy addresses. Before the program is loaded to the DSP, the SH2 overwrites those dummy addresses with the actual 8-byte aligned, 25-bit addresses that the DSP will use. The reason I didn't just pass that data into the DSP's RAM banks is that, for whatever reason, that didn't work. I suspect that a system reset (not a fresh power-on, but a reset) may leave the DSP's memory bank access counters in an unknown state, so a DSP program must first be loaded and run to set the CTs to a known state before data can be safely loaded. So rather than mess with that, I decided to just write the addresses I want to use into the instructions that the DSP will run. This was easy enough to do since the SCU manual lays out the instructions bitwise.
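As a rough idea of what patching the header looks like on the SH2 side, here is a minimal sketch. It assumes the header instructions are move-immediates whose low 25 bits hold the immediate (as in the notiCommand line near the top of this post) and that the address is passed pre-shifted by 3. The names patch_immediate and patch_header are hypothetical; this is not the code from my project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the immediate occupies the low 25 bits of the word. */
#define IMM25_MASK 0x01FFFFFFu

/* Replace the immediate field of one instruction word and keep the
 * upper 7 bits (opcode and destination) untouched. */
static uint32_t patch_immediate(uint32_t instr, uint32_t imm)
{
    assert((imm & ~IMM25_MASK) == 0);
    return (instr & ~IMM25_MASK) | imm;
}

/* Patch the first `count` words of a DSP program image that is still
 * sitting in SH2 memory, before it is loaded to the DSP. */
static void patch_header(uint32_t *program, const uint32_t *byte_addrs,
                         size_t count)
{
    for (size_t i = 0; i < count; i++) {
        assert((byte_addrs[i] & 7u) == 0);  /* 8-byte aligned */
        program[i] = patch_immediate(program[i], byte_addrs[i] >> 3);
    }
}
```

The dummy value in the header does not matter, since the whole immediate field is overwritten; only the upper bits of each word need to be correct in the assembled program.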


Oh, and if it wasn't obvious already, your DSP program must be loaded by the SH2 from HWRAM following a specific sequence. SBL or Yaul can take care of that sequence for you.


P.0: The DSP part
Capture.JPG


P.00: The C / SH2 Part
Capture.JPG



The SCU-DSP has DMA access to HWRAM and the B-Bus. Its internal bus is 32 bits wide, and it addresses memory in 32-bit (4-byte aligned) units.

It can't access LWRAM (sad horn). The SCU-DSP also has an issue regarding access to the B-bus:

A couple of things come together to mean that the SCU-DSP will only be able to access the first two bytes of every four bytes on any B-Bus address. This is because the B-Bus is a 16-bit bus and the SCU-DSP can only address on a four-byte alignment. The manual does have a procedure that indicates pulsing the DMA in succession will have the SCU correct the DMA such that the second DMA pulse will read to/from the other two bytes, but I wasn't able to get that to work. Other developers have said that it did work, but they otherwise ran into another critical issue: the SCU-DSP triggering a DMA to a B-Bus address has a random chance of just, you know, crashing the system. Either crashing/locking up the DSP, or hard-locking the Saturn. There is probably an untraced condition on which this will or will not happen, so if you are using Yaul, you may not run into that.

cap8.JPG


A final note on memory issues with the DSP is that the Master SH2, Slave SH2, and SCU-DSP all end up fighting for access to HWRAM. When bus contention from all of them adds up at once, either SH2 may wait as long as 30 cycles before the SCU releases the bus, due to the strangely slow behavior of SCU-DSP DMA. That said, it can be easy enough to schedule the SCU-DSP's RAM accesses to fall outside of the times either SH2 is heavily accessing RAM, especially if you are aware that the SCU-DSP should spend as little time as possible using its DMA (which means as little data as possible).


Part Four: What good is it then?


Objectively, what can the SCU-DSP do that the other CPUs in the Saturn can't?
Oh, but before I tell you that, let me remind you that the SCU-DSP has no arithmetic division operation.
However, with fixed-point numbers, you can manage a division by multiplying by a fraction using the MAC unit.
It also has both left and right shifts, both rotation shifts and normal shifts.
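Here is what "division by multiplying by a fraction" means, sketched in C with 16.16 fixed-point numbers. The 16.16 format and the function names are my choice for this illustration. The reciprocal is computed with an ordinary C division, standing in for wherever you actually get it from (the SH2, a table, or a root-seeking routine like the attached sample); only the multiply is what the DSP's MAC unit would do.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fix16;       /* 16.16 fixed point */
#define FIX16_ONE 65536

/* Multiply two 16.16 values through a 64-bit intermediate.
 * (Right-shifting a negative value is arithmetic on common compilers.) */
static fix16 fix16_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> 16);
}

/* 1/d in 16.16, computed somewhere that has a divider. */
static fix16 fix16_recip(fix16 d)
{
    assert(d != 0);
    return (fix16)(((int64_t)FIX16_ONE * FIX16_ONE) / d);
}
```

Dividing 10.0 by 4.0 this way is exact (2.5), because 1/4 is representable. Dividing 6.0 by 3.0 comes out two units short of 2.0 in the lowest bits, because 1/3 gets truncated. That loss of precision is the cost of the trick.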


To be blunt, the SCU-DSP does not give the Saturn any additional features. It only exists to add extra MIPS.


The SCU-DSP has two main advantages that, depending on the code chunk you are speaking of, can end up being a break from a CPU-bottlenecked game.


The first advantage has to do with logical branches: the SCU-DSP does not pipeline stall on a logical branch; it executes jumps in one cycle like any other instruction. Remember that the SCU-DSP has a tiny little 'pipeline' where it pre-loads the instruction after the one currently being executed. This applies to jumps too, both conditional and not. Depending on the circumstances, you can either lose 1 instruction to a jump (for not wanting or having no use for the instruction slot after the jump) or lose nothing. Usually, though, you end up losing 1 instruction because of how strict memory control must be.


This is an advantage compared to the time that the SH2 and M68K lose on logical branches. The SH2 will often experience a total pipeline reset upon a branch cut, which can cost up to 10 cycles in and of itself, not including the execution states of the jump instruction and the instructions within the regeneration of the pipeline. The DSP is clocked at half the rate of the SH2, though, so the comparison works out to an average of 4 SH2-equivalent cycles lost on the DSP against 10 cycles lost on the SH2. The M68K is dramatically slower in comparison; a branch on the 68K might take 10 cycles, and the 68K runs at a clock rate comparable to the SCU-DSP's.

cap5.JPG


cap6.JPG


It should be noted of course that an SH2 expert is going to understand what exactly the SH2 loses on a branch cut a lot more than I do. And that's why I say that studying & optimizing your SH2 code is going to get you further than the DSP will.
 

Attachments

  • hmap2.zip
    4.8 KB · Views: 133
The second advantage of the SCU-DSP is actually the reason it exists to begin with. Behold! The fully unlocked potential of the DSP's wide instruction bus!
cap3.JPG


This is an example primitive of what transforming a fixed-point vertex by a matrix might look like on the SCU-DSP.
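For anyone who can't read the DSP assembly, this is the same math written as plain C. It is a reference sketch only: the 16.16 format, the struct layout (3x3 rotation plus a translation) and the names are assumptions for illustration, and the accumulator here is 64 bits wide where the DSP's is 48.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fix16;       /* 16.16 fixed point */

typedef struct {
    fix16 rot[3][3];         /* rotation / scale part */
    fix16 pos[3];            /* translation part */
} matrix;

/* out = rot * in + pos. Each output component is three
 * multiply-accumulates followed by one shift back down to 16.16. */
static void transform(const matrix *m, const fix16 in[3], fix16 out[3])
{
    assert(in != out);       /* not safe to transform in place */
    for (int row = 0; row < 3; row++) {
        int64_t acc = 0;
        for (int col = 0; col < 3; col++)
            acc += (int64_t)m->rot[row][col] * in[col];
        out[row] = (fix16)(acc >> 16) + m->pos[row];
    }
}
```

That is nine multiply-accumulates per vertex, which is exactly the kind of work a MAC unit with a wide instruction word is built to chew through.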

This is also why Sega included the SCU-DSP in the Saturn in the first place; it was intended to perform this task to give the system the grunt necessary to be a truly 3D system. Most developer partners of Sega, and most developers at Sega themselves, knew the SCU-DSP for this purpose and this purpose only. If a retail game used the SCU-DSP, this is most likely what they used it for: matrix transformation.



Of course, the myth and the legend goes that after Sega learned of the Sony PlayStation's specifications, they were alarmed. They knew the Saturn, at that point having only one SH2 and the SCU-DSP, could not compete with Sony's hardware. To improve the Saturn's power to be a better match for the PlayStation, Sega added a second SH2 to the Saturn. At that point, the SCU-DSP was obsolescent. It was also probably at that point that hardware development on the SCU-DSP was stopped. Its feature-set shows that it was thought of as more than a maths unit (because a maths unit doesn't need logical branches or DMA access), but it wasn't developed to a point where it could serve nearly as well as a second SH2, simply due to how difficult it was to use and to coordinate with the rest of the program. With its proclivity to crash the system and a few other bugs (did I mention the DSP end interrupt, as triggered by ENDI, sometimes just doesn't work?) I imagine the SCU-DSP was finalized right after the second SH2 was added. Maybe. I don't know.



What I do know is that the SCU-DSP is fast at MACs, since it can do a 48-bit MAC in one cycle (or two SH2-equivalent cycles). It takes the SH2 an average of three cycles to do a 64-bit MAC. But listen, seriously. The SH2 can do its own set-up for said MAC. It can also do a 64-bit division in-line with the MAC thanks to the SH7604's DIVU. The SH2 can also arrange such that it can work with 8-bit or 16-bit vertices. Yadda yadda yadda. The complication means that your Master SH2 is very likely to lose more than one cycle per transformed vertex in simply setting up the DSP to do its work. In fairness, the DSP does not need to spend instructions rearranging a fixed-point operation back into 32-bits, whereas the SH2 will end up using the ``xtrct`` instruction a lot.

cap7.JPG


The SH2 version of point-by-matrix:
cap4.JPG




Though, none of this matters much when the main end goal of the SCU-DSP was always to add more MIPS, not to enable some specific new feature. Because of this, it is often the case that making use of the SCU-DSP will improve the performance of your software in CPU-bound scenarios, no matter what the SCU-DSP is doing. As long as it is taking enough of a load off of the SH2s that one or two milliseconds are saved, that might be the one or two milliseconds you needed to make that 33.3ms frametime.



Of course, if you are not CPU-limited, the SCU-DSP will do literally nothing for you. The attached "dsp_bench" ZIP file demonstrates this. It contains two versions of a 'game' of sorts with an unlocked frame-rate. If you performance test them, you'll notice that they perform exactly the same, yet one version of the game is using the DSP whereas the other is not. This is because that build never runs into a CPU bottleneck. I do know of course that if I manufacture a CPU bottleneck, the DSP version runs faster, but those are tests that I ran long ago to come to the conclusions I present here.



Part What: Program Switching



This is a short note. The SCU-DSP having only 256 instructions to be loaded in a single program is kind of a bummer. I thought it'd be dope if the SCU-DSP could run a program, load a new program at the end of that one, and continue running the new program that was loaded. And then that second program would re-load the first program and then enter a wait-state pending SH2 communication. I tried this, and it seemed buggy. Some emulators would let it run, but The Codex As- i mean Mednafen would not let it run more than a few times. Real hardware seemed to corroborate that the DSP would crash after running this loop back and forth a few times. Sometimes, it wouldn't run it more than once. It was confusing that it would sometimes work and sometimes not work.



My memories of it are foggy, since isolated test-runs with simplified programs (not running next to the real game) would not work at all, not even loop through once. I deleted all traces of it... I kind of feel like this one demands more study for an interested party, but really, does it? The DSP is hard enough to use as it is.



After reading the hardware manuals up and down for an explanation of why the DSP would stop itself after loading a new program, it seems this is actually the intended behavior, since loading a program into program RAM is supposed to always halt the PC. These rules are kind of implied by the DSP control port.

Of course, you have sections like this (which I definitely read & tried)

cap10.JPG


It's confusing, because it's like Sega intended you to do this, but doesn't tell you that the program stops after you load it. I suggest you research further if curious.

Conclusion

In an alternate universe, there exists a Saturn that released without a second SH2. In this same universe, a 32X did not release. In this universe, Sega contracted with Motorola to build the SCU instead of Yamaha (or was it Hitachi?). In doing so, Motorola was able to include a 56K inside the SCU. In this universe, the Saturn was able to do this but at an even higher frame-rate with the added support of VDP1:



Of course, said Saturn cost a lot more to manufacture.


/e: i got the clockspeeds wrong

We don't live in that universe, and frankly, we should all be thankful we do not. As we will soon learn this coming January, the Saturn does not need a DSP to do that, if that wasn't clear already. Sega's inclusion of a second SH2 as a reaction to the PlayStation was successful if short-sighted because what Sega needed more than anything was a Saturn that was easy to use. Two 28.6 MHz CPUs may not be as good as one 57.2 MHz CPU, but they sure as hell are better than one 28.6 MHz CPU and a 14.3 MHz enigma machine. And we got two 28.6 MHz CPUs, and a 14.3 MHz busted enigma!
 
Very interesting read @Ponut !
Another issue is that the immediate-data instruction can, at most, host a 25-bit number. This is a problem, because a DMA instruction takes a 4-byte aligned address. If you divide the total Saturn memory map by four (>>2), you get the addressing in terms of four-bytes aligned. That ends up being 30 bits; five bits more than the immediate-data instruction of the SCU-DSP can express. Unless you calculate a specific hard-coded address, or calculate addresses from an offset, a known address can only be known by the SCU-DSP on a 32-byte boundary instead of a 4-byte boundary
RAM addresses fit on 29 bits on Saturn. The 3 extra bits are used for the access space by the SH2s, but they're useless on the SCU.
1665486817625.png

And in fact, you can address the whole usable RAM space on Saturn with just 27 bits, as the highest address you need is the end of HWRAM at 0x60FFFFF. That's why the SCU DMA level 0-2 start address 32 bit registers only require the 27 lower bits to be set.
Since the SCU DSP DMA start address must be aligned on 4 bytes, 25 bits are enough to address the full usable Saturn memory space.
Two 25.4 MHz CPUs may not be as good as one 51 MHz CPU, but they sure as hell are better than one 25.4 MHz CPU and a 12.7 MHz enigma machine. And we got two 25.4 MHz CPUs, and a 12.7 MHz busted enigma!
A fine conclusion, however it seems you're trimming a bit on the frequencies: SH2s run at 26.8 or 28.6 MHz, so the SCU DSP runs at 13.4 or 14.3 MHz.
 
Since the SCU DSP DMA start address must be aligned on 4 bytes, 25 bits are enough to address the full usable Saturn memory space.

Should I edit that part? So you know, memory fuzzy and all.
The move immediate instruction is signed.
I did see in my code I only need to shift it left once to get the address I want.
Why didn't I just pass it through as if it were unsigned? I don't remember lol

EDIT: I remember now. I will edit the original post with an explanation.

Anyway, which processor does run at 25.4 MHz? I'm guessing none of them do.
 
A couple of things come together to mean that the SCU-DSP will only be able to access the first two bytes of every four bytes on any B-Bus address. This is because the B-Bus is a 16-bit bus and the SCU-DSP can only address on a four-byte alignment. The manual does have a procedure that indicates pulsing the DMA in succession will have the SCU correct the DMA such that the second DMA pulse will read to/from the other two bytes, but I wasn't able to get that to work. Other developers have said that it did work, but they otherwise ran into another critical issue: the SCU-DSP triggering a DMA to a B-Bus address has a random chance of just, you know, crashing the system. Either crashing/locking up the DSP, or hard-locking the Saturn. There is probably an untraced condition on which this will or will not happen, so if you are using Yaul, you may not run into that.

There could be an explanation for the B-bus access issue of the DSP (and maybe for the related crash) on p. 48 of the Sega Developers Conference Proceedings, March 5-7, 1996 (Sega Developer Technical Support, available on the Internet Archive):
1670624592367.png

So the assembler would be to blame, and you'd have to patch its output result to make it work.
 
A+ level digging there, @fafling ! That's sure to help some folks out.

Editing a single instruction is not too difficult, the manual lays them out byte-wise.
 
I would like to come back to this thread and report a few things.

Firstly, I have written a new DSP program which achieves great results in improving the performance of a 3D game, on the order of ~4ms of improvement for 441 vertices tested. The program uses a chirality (winding) check algorithm to see whether a vertex is in an on-screen space or not. It can then apply user-specified clip flags depending on whether the vertex is IN or OUT of the area, to achieve portal IN (window) or portal OUT (occlusion).
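For the curious, a winding check of this general kind can be sketched in C as below. This is a generic illustration, not the attached DSP program: it assumes a convex four-sided portal in 2D screen space, and the names and the CLIP_FLAG value are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { int32_t x, y; } point2;

#define CLIP_FLAG 0x01u      /* hypothetical flag bit */

/* Sign of the 2D cross product (b - a) x (p - a): tells which side
 * of the edge a->b the point p lies on. */
static int64_t edge_side(point2 a, point2 b, point2 p)
{
    return (int64_t)(b.x - a.x) * (p.y - a.y)
         - (int64_t)(b.y - a.y) * (p.x - a.x);
}

/* p is inside a convex quad if it is on the same side of all four
 * edges, whichever way the quad happens to be wound. */
static bool inside_portal(const point2 quad[4], point2 p)
{
    bool pos = false, neg = false;
    assert(quad);
    for (int i = 0; i < 4; i++) {
        int64_t s = edge_side(quad[i], quad[(i + 1) & 3], p);
        if (s > 0) pos = true;
        if (s < 0) neg = true;
    }
    return !(pos && neg);
}

/* Window mode flags vertices outside the portal; occlusion mode
 * flags vertices inside it. */
static uint8_t apply_clip(uint8_t flags, bool inside, bool flag_when_inside)
{
    return (inside == flag_when_inside) ? (uint8_t)(flags | CLIP_FLAG) : flags;
}
```

The per-edge test is two multiplies and a subtract, so it maps well onto a MAC unit; the branching on the sign is where the DSP's logic flags come in.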

However, in writing and integrating this program, I have gone back and discovered that the SH2 code (particularly the Slave SH2 code) also had issues which were causing frames to miss the 29ms target (yes, 29ms, harsh). So it goes to show that you're going to need to work on profiling the code to grasp performance issues before proceeding with the DSP to try and improve things.

Another issue was that while theoretically the DSP code would provide a 6-7ms performance boost, synchronization issues mean that significant time is lost over what might be theoretically possible; this is an issue that will always occur when using the DSP to perform a task in time-step parallel to the SH2. To be fair it exceeded my expectations after the synchronization code was entered.

In addition to that, I was able to get things done a lot faster thanks to "The Purist of Greed"'s / @buhman new Windows-compatible DSP Assembler, that you can find here:
Very big thanks to that.
 

Attachments

  • winder.zip
    10.6 KB · Views: 64
However, in writing and integrating this program, I have gone back and discovered that the SH2 code (particularly the Slave SH2 code) also had issues which were causing frames to miss the 29ms target (yes, 29ms, harsh). So it goes to show that you're going to need to work on profiling the code to grasp performance issues before proceeding with the DSP to try and improve things.
Perhaps interrupts have not been disabled for the Slave SH2. The Slave SH2 interrupts are described in "Sega Saturn technical bulletin #28".
 