AVR: A technique to use the SCSP DSP to process 3D geometry and more. More power to the Saturn! 1.0

This is a project I have been meaning to do for a long time, ever since 1996-1997, when I heard the rumors that AM2 was going all-out and using ALL the Saturn processors to create an amazing port of Virtua Fighter 3 for the Saturn.

This demo proves that it could in fact have worked to some extent, even with all the idiosyncrasies and limitations of the hardware and the method. The SCSP DSP MAC unit can be used to do some 3D work, and it can do that while the SH-2 does something else. What I did here (basic 3D vertex transformation) is just an example, which could be extended to do character skinning, lighting calculations, etc.



Music: "late late" provided by mobygratis

The demo you can download is the Akira character walking. For the YouTube video, I edited the three examples I wrote for the library into a single video.

Some technical details (I will write a longer write-up when I have time):

Features and limitations of the SCSP DSP (this is a big simplification, I don't have time to write a comprehensive description now):
  • 128 instructions always executing per step/sample, 44100 times per second. It cannot be paused or stopped without stopping the whole SCSP chip
  • The DSP can read and write Sound RAM through a configurable ring buffer, i.e. a block of memory
  • The memory location of the ring buffer can be changed anytime, even while audio is playing
  • The DSP can read and write memory only from addresses contained in 32 internal MADRS pointers
  • DSP memory accesses are either at fixed memory addresses (from the MADRS registers) or relative to a DEC register
  • The DEC register decreases by one at each sample (44100 times per second) and loops around based on the size of the ring buffer; this was clearly done to make effects like echo and reverb easy to create
  • The DEC register is not visible externally at all, and cannot be accessed by either the 68k or the SH-2
  • The DSP MAC unit has an accumulator, an X value and a Y value
  • It can do ACC = ACC + X * Y (give or take)
  • X can only come from memory (give or take)
  • Y can only come from one of the 64 internal COEF registers (give or take)
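As a rough illustration of the points above, here is a toy Python model of the addressing and the MAC. It is not bit-accurate and not taken from the library: the class and method names are made up for this sketch, values are plain floats instead of the DSP's fixed-point formats, and the address formula (MADRS offset plus DEC, wrapped to the ring size) is an assumption based on the description above.

```python
class ToyDsp:
    """Toy model of the DSP addressing and MAC described above (not bit-accurate)."""

    def __init__(self, ring_base, ring_size):
        self.ring_base = ring_base   # start of the ring buffer in Sound RAM (in words)
        self.ring_size = ring_size   # ring buffer length (in words)
        self.dec = 0                 # hidden counter, minus one per sample
        self.madrs = [0] * 32        # address registers
        self.coef = [0.0] * 64       # coefficient registers (the Y operand)
        self.acc = 0.0               # MAC accumulator

    def address(self, madrs_index, dec_relative=True):
        """Address used by one access: fixed, or offset by the current DEC."""
        offset = self.madrs[madrs_index]
        if dec_relative:
            # Assumed formula: MADRS + DEC, wrapped to the ring buffer size.
            offset = (offset + self.dec) % self.ring_size
        return self.ring_base + offset

    def mac(self, x_from_memory, coef_index):
        """ACC = ACC + X * Y, with X from memory and Y from a COEF register."""
        self.acc += x_from_memory * self.coef[coef_index]
        return self.acc

    def end_of_sample(self):
        """DEC decreases by one per sample and wraps at the ring size."""
        self.dec = (self.dec - 1) % self.ring_size


def addresses_over_samples(dsp, madrs_index, samples):
    """Addresses one DEC-relative access touches over consecutive samples."""
    visited = []
    for _ in range(samples):
        visited.append(dsp.address(madrs_index))
        dsp.end_of_sample()
    return visited


dsp = ToyDsp(ring_base=0x1000, ring_size=8)
dsp.madrs[0] = 3
print([hex(a) for a in addresses_over_samples(dsp, 0, 5)])
```

With the same MADRS value, each sample touches a different address, walking downwards and wrapping at the end of the ring buffer. That is what makes it possible to stream data through the DSP without rewriting the MADRS registers.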
How the technique works:
  • All vertex data needs to be preloaded into Sound RAM, and enough space needs to be reserved for it and for the output buffers that contain the transformed data
  • In order to stream the vertex data into the DSP, access to vertex data is DEC-relative, so that different memory locations are addressed at each sample, without having to change the values inside the MADRS registers all the time
  • Vertex x, y, z coordinates are therefore split into a structure-of-arrays (SoA) layout so that they can reuse the 32 MADRS registers, which are DEC-relative
  • A 1.0 constant that is needed for the w component is stored at a fixed address and retrieved by the DSP at an absolute address
  • The code processes 5 streams of vertex data, three components per stream. 3 components x 5 streams x 2 (reads/input and writes/output) use 30 MADRS addresses, plus 1 extra address for the 1.0 constant
  • The five transformation matrices (one per stream) are 3x4 values each, using 60 of the 64 COEF registers
  • For each DSP sample (44100 times per second) the DSP executes the 128 instructions that do:
    • 15 loads
    • 15 x 4 MACs (Vector/Matrix multiplication, including rotation and translation)
    • 15 stores
  • The maximum theoretical throughput is therefore 5x44100 vertices/second, or 220500 vertices per second. This is in practice not achievable due to all sorts of overheads.
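The per-sample work and the register budget in the list above can be sketched in Python as follows. This only mirrors the arithmetic (4 MACs per output component against a 3x4 matrix, with 1.0 as the w input); it is not the DSP program from the library, it uses floats rather than the DSP's fixed-point formats, and all names are made up for the example.

```python
STREAMS = 5          # vertex streams processed in parallel
COMPONENTS = 3       # x, y, z
SAMPLE_RATE = 44100  # DSP program runs per second


def transform_vertex(matrix, x, y, z):
    """Apply a 3x4 matrix (rotation + translation) as 4 MACs per output component."""
    inputs = (x, y, z, 1.0)           # the 1.0 constant stands in for w
    out = []
    for row in matrix:                # one row per output component
        acc = 0.0
        for coef, value in zip(row, inputs):
            acc = acc + value * coef  # ACC = ACC + X * Y
        out.append(acc)
    return tuple(out)


def process_sample(matrices, streams, index):
    """One DSP sample: one vertex from each SoA stream through its own matrix."""
    results = []
    for matrix, (xs, ys, zs) in zip(matrices, streams):
        results.append(transform_vertex(matrix, xs[index], ys[index], zs[index]))
    return results


# Resource budget, following the numbers in the list above.
madrs_used = STREAMS * COMPONENTS * 2 + 1    # inputs + outputs + the 1.0 constant
coef_used = STREAMS * COMPONENTS * 4         # five 3x4 matrices
macs_per_sample = STREAMS * COMPONENTS * 4   # 15 x 4 MACs
peak_vertices_per_second = STREAMS * SAMPLE_RATE

# Example: identity rotation with a (1, 2, 3) translation.
translate = [[1.0, 0.0, 0.0, 1.0],
             [0.0, 1.0, 0.0, 2.0],
             [0.0, 0.0, 1.0, 3.0]]
stream = ([10.0, 0.0], [20.0, 0.0], [30.0, 0.0])  # SoA: xs, ys, zs
print(process_sample([translate] * STREAMS, [stream] * STREAMS, 0)[0])
print(madrs_used, coef_used, macs_per_sample, peak_vertices_per_second)
```

The budget lines come out to 31 of the 32 MADRS registers, 60 of the 64 COEF registers, 60 MACs per sample and a theoretical peak of 220500 vertices per second, matching the figures above.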
What actually happens:
  • I have a concept of "parking" and "unparking" the DSP. Since the DSP cannot be started/stopped on demand without stopping the entire SCSP (which would make it impossible to play any sound while this thing is working), I use TWO ring buffers. One is sacrificial and contains garbage data I don't care about, the other contains the real data. Switching from one to the other and vice versa is instantaneous and can be done at any time. This allows brave souls (not me) in the future to use this technique and still use the SCSP to play audio at the same time.
  • At startup, we first need to find out the value of DEC, so that we can synchronize with it:
    • Load DSP sentinel program, which writes 1 at (DEC) and 0 at (DEC + 1)
    • Poll location 0 in Sound RAM ring buffer with the 68k to find when DEC is 0
    • Once DEC is found, we start SCSP Timer A synchronized with DEC and firing every 8 samples (to avoid overloading the 68k and creating SRAM bus contention). The ISR decrements a value in Sound RAM that in essence tracks DEC with a precision of ±8: a proxy for the real DEC register that both the 68k and SH-2 can see and use
    • The 68k needs to process two SCSP Timer ISRs, but it's otherwise free to do whatever else, e.g. run a sound driver
  • Then when we want to transform vertices we do:
    • DSP is parked
    • One-time: Load the geometry processing DSP program
    • One-time: SH-2 stores in the Sound RAM ring buffer the vertices to be processed, in SoA and 5 separate streams. These arrays can stay there permanently and get processed multiple times, to avoid copying data to Sound RAM over and over, or the data can be changed as often as needed
    • SH-2 reads DEC proxy value
    • SH-2 calculates DEC-relative addresses to set the MADRS registers in such a way that, when we unpark the DSP, the DSP will be reading from the exact location where the vertices are stored, and then write them to other addresses past the input arrays, in the same SoA format
    • The calculation performs a further time compensation to account for DEC drift while the SH-2 writes the MADRS and COEF registers
    • The SH-2 sets SCSP Timer B to fire at a time that is appropriately calculated to ensure the DSP will have processed all the vertices
    • DSP is unparked and starts to process the vertices
    • The SH-2 can do something else while it waits for the DSP to do its thing
    • When Timer B fires, the 68k auto-parks the DSP and signals the SH-2 that it’s done
    • The SH-2 sets up an indirect SCU DMA chain that pulls all the output vertex data out of SRAM in one shot into HWRAM
    • Repeat as needed with more vertices
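The DEC proxy and the MADRS calculation can be illustrated with a small Python sketch. Everything here is an assumption made for the example, not the library's actual code: the function names, the idea that the Timer A ISR subtracts 8 on each tick, the address formula (MADRS + DEC, wrapped to the ring size), and the parameter setup_samples standing for the time the SH-2 spends writing the MADRS and COEF registers. For simplicity the planning step treats the proxy as exact, while the real one is only accurate to within 8 samples.

```python
TIMER_A_PERIOD = 8  # samples between proxy updates, as described above


def real_dec(samples_since_sync, ring_size):
    """Hidden DEC value, counted from the moment the sentinel saw DEC == 0."""
    return (-samples_since_sync) % ring_size


def dec_proxy(samples_since_sync, ring_size):
    """Proxy kept by the Timer A ISR (assumed to subtract 8 on every tick)."""
    ticks = samples_since_sync // TIMER_A_PERIOD
    return (-ticks * TIMER_A_PERIOD) % ring_size


def plan_madrs(proxy, setup_samples, array_base, count, ring_size):
    """MADRS value so the first processed sample hits the array's last element.

    DEC counts down, so a DEC-relative access walks the array from its
    highest offset to its lowest.
    """
    dec_at_unpark = (proxy - setup_samples) % ring_size  # drift compensation
    first_offset = array_base + count - 1
    return (first_offset - dec_at_unpark) % ring_size


def offsets_visited(madrs, dec_at_unpark, count, ring_size):
    """Ring-buffer offsets read over `count` samples after unparking."""
    return [(madrs + dec_at_unpark - n) % ring_size for n in range(count)]


RING = 64
madrs = plan_madrs(proxy=10, setup_samples=3, array_base=20, count=4, ring_size=RING)
print(madrs, offsets_visited(madrs, dec_at_unpark=7, count=4, ring_size=RING))
```

In this model the proxy never lags the real DEC by 8 samples or more, and the planned MADRS value makes the DSP sweep exactly the 4 array slots (offsets 23 down to 20) once it is unparked. A real implementation would also have to absorb the proxy's uncertainty, for example with some slack in the Timer B interval.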
Both Mednafen and Ymir play the demo correctly (kudos to the level of SCSP DSP emulation they achieved).

The full source code of the library and examples is here: https://github.com/Jollyrogerxp/AVR

I wish to thank XL2 for the great chats about quirky Saturn optimization techniques.

Jollyroger
Author
jollyroger