SCU DSP for matrix transformation?

XL2

Established Member
So I've started to work on a BSP/PVS solution (very early) for my game, which made me realize that I could also save several clock cycles by changing the 3d implementation to skip the z-sort (I'm not saying I will do it, just that it would be interesting to look at).
According to Sega's documentation, it's possible to use the SCU DSP for matrix transformation and Sega suggested back in 1995 to use the SCU DSP for the matrix and the SH2 for the polygon processing in parallel.
Sega even gives an example of assembly code for the matrix transformation (https://antime.kapsi.fi/sega/files/ST-240-A-042795.pdf).
Now, I know that SGL doesn't support matrix transformation with the SCU DSP, and the SGL functions for the SCU DSP seem pretty much useless for almost everything, but before I waste too much time on this, has anyone tried doing it?
The SCU DSP doesn't support division, but it can do multiplications and additions, so it should be fine for matrix transformation. Even if it's slower than the SH2, it won't need to do the slower operations such as near clipping, light normal processing or Gouraud shading, so it might still be possible to keep it in sync.
It seems like it can hold 256 sint32 values, which is more than what my maps have on each quad plane (since each quad is about 64 pixels wide and each plane is 256 pixels wide, I should be fine).
It would require writing a new 3d implementation or using the obscure one from SBL that nobody used (AFAIK), but it would still be nice to know about others' experience with it.
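To make the "multiplications and additions only" point concrete, here is a minimal C sketch of the transform step. It is not Sega's DSP code or the SBL implementation: the 16.16 fixed-point format, the 3x4 matrix layout and every name in it (fixed_t, matrix_t, transform_points) are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;                       /* assumed 16.16 fixed point */
#define FX(n) ((fixed_t)((n) * 65536))

typedef struct { fixed_t x, y, z; } point_t;

/* Assumed layout: 3 rows of 4 (3x3 rotation plus a translation column). */
typedef struct { fixed_t m[3][4]; } matrix_t;

/* 16.16 multiply: widen to 64 bits, then drop the extra 16 fraction bits. */
static fixed_t fx_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) / 65536);
}

/* One point: 9 multiplies and 9 adds, no division anywhere. */
point_t transform_point(const matrix_t *mt, point_t p)
{
    point_t out;
    out.x = fx_mul(mt->m[0][0], p.x) + fx_mul(mt->m[0][1], p.y)
          + fx_mul(mt->m[0][2], p.z) + mt->m[0][3];
    out.y = fx_mul(mt->m[1][0], p.x) + fx_mul(mt->m[1][1], p.y)
          + fx_mul(mt->m[1][2], p.z) + mt->m[1][3];
    out.z = fx_mul(mt->m[2][0], p.x) + fx_mul(mt->m[2][1], p.y)
          + fx_mul(mt->m[2][2], p.z) + mt->m[2][3];
    return out;
}

/* Transform a small list of points, e.g. the points of one map section. */
void transform_points(const matrix_t *mt, const point_t *in,
                      point_t *out, int count)
{
    assert(count >= 0);
    for (int i = 0; i < count; i++)
        out[i] = transform_point(mt, in[i]);
}
```

Everything that needs a division (the projection) is left out on purpose, since that part would stay on the SH2.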

Thanks!
 
It's not 100% clear to me what you want the SCU DSP to do.

Do you want the SCU DSP to spit out 4 projected points while the CPU walks the BSP tree and feeds the SCU DSP more quads (points)?

I would also really look at the assembly output and see if you can optimize on the CPU end before thinking about the DSP. More specifically, how well the CPU cache is being used, and if any of the SH-2 DSP instructions are being used effectively (reordering to avoid pipeline stalls).

It's a pain in the ass...

I was also thinking that you could split the slave CPU cache into two and have it spin for jobs. The jobs would be batches (< 2KiB) worth of points to project. When done, DMA to HWRAM.
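A rough C sketch of the batching part of that idea, assuming 12-byte points (three int32 values) and a 2 KiB budget per job. The job_t struct and make_jobs are made-up names, and the actual slave CPU hand-off and DMA are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int32_t x, y, z; } point_t;   /* 12 bytes per point */

enum { BATCH_BYTES = 2048 };                   /* assumed budget: stay under 2 KiB */

/* One job: a contiguous run of points in the source array. */
typedef struct { size_t first; size_t count; } job_t;

/* 2048 / 12 = 170 points, i.e. 2040 bytes per full job. */
size_t points_per_job(void)
{
    return BATCH_BYTES / sizeof(point_t);
}

/* Split total_points into jobs; returns how many jobs were written. */
size_t make_jobs(size_t total_points, job_t *jobs, size_t max_jobs)
{
    assert(jobs != NULL || max_jobs == 0);
    size_t per_job = points_per_job();
    size_t n = 0;
    for (size_t first = 0; first < total_points && n < max_jobs; first += per_job) {
        size_t left = total_points - first;
        jobs[n].first = first;
        jobs[n].count = left < per_job ? left : per_job;
        n++;
    }
    return n;
}
```

For example, 400 points would be split into jobs of 170, 170 and 60 points.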
 
Well, quads and points are 2 different things since you can use the same point in a couple of quads. Each point is also 3 int32 values (x,y,z).
The idea would be to transform a small list of points (like 10) from one BSP leaf while the CPU does other things, like uncompressing the BSP tree data, frustum culling, walking the tree, calculating lighting and Gouraud shading, stuff like that.
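As a sketch of why points and quads are separate things: quads can store indices into a shared point list, so each point is transformed once no matter how many quads use it. The struct names and the simple offset "transform" below are placeholders, not the real matrix code or any SGL/SBL structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int32_t x, y, z; } point_t;   /* 3 int32 values per point */
typedef struct { uint16_t v[4]; } quad_t;      /* 4 indices into the point list */

/* Stand-in for the real transform: just adds an offset to every point.
 * Each point is processed exactly once, however many quads share it. */
void transform_leaf(const point_t *pts, point_t *out, size_t npts, point_t offset)
{
    for (size_t i = 0; i < npts; i++) {
        out[i].x = pts[i].x + offset.x;
        out[i].y = pts[i].y + offset.y;
        out[i].z = pts[i].z + offset.z;
    }
}

/* Gather the 4 already-transformed corners of one quad. */
void quad_corners(const quad_t *q, const point_t *transformed, size_t npts,
                  point_t corners[4])
{
    for (int i = 0; i < 4; i++) {
        assert(q->v[i] < npts);
        corners[i] = transformed[q->v[i]];
    }
}
```

Two quads sharing an edge need 6 points instead of 8, and the saving grows with bigger meshes.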

Sega suggested using the SCU DSP for matrix transformation (so, the points), so I'm wondering if anyone tried and if so, does it create issues.

Like you said, optimizing the CPU code first is the most important step, but before I move too far in one direction I'd like to know my options and try to plan a use for the SCU DSP.
 
To really answer your question, I don't think anyone has really tried the SCU DSP. Except for maybe Rockin'-B, but he's been MIA for a few years.

How about also doing some tests to see if it's worth all the trouble? Maybe it's faster to do it on the slave CPU, since you still have to do the perspective divide on the CPU.
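For reference, the perspective divide is the part that has to stay on the CPU side. A minimal C sketch, assuming 16.16 fixed-point view-space points and a focal length in the same format; project_point and its parameters are hypothetical names, not an SGL or SBL call.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;                       /* assumed 16.16 fixed point */
#define FX(n) ((fixed_t)((n) * 65536))

typedef struct { fixed_t x, y, z; } point_t;   /* already transformed to view space */
typedef struct { int16_t x, y; } screen_t;     /* integer offset from screen center */

/* Returns 0 if the point is in front of the near plane (too close or behind
 * the camera), otherwise writes the projected position and returns 1. */
int project_point(point_t p, fixed_t focal, fixed_t near_z, screen_t *out)
{
    assert(near_z > 0);                        /* also guarantees no divide by zero */
    if (p.z < near_z)
        return 0;

    /* 16.16 * 16.16 gives 32.32; dividing by a 16.16 z brings it back to 16.16. */
    int64_t sx = ((int64_t)focal * p.x) / p.z;
    int64_t sy = ((int64_t)focal * p.y) / p.z;

    out->x = (int16_t)(sx / 65536);            /* keep only the integer part */
    out->y = (int16_t)(sy / 65536);
    return 1;
}
```

With a focal length of 160, a point at (10, 5, 20) lands at (80, 40) relative to the screen center.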
 
Yeah, if nobody tried it I might, but I need to plan ahead.
I'll dig deeper in Sega's documentation.
If I ever make it work I'll make sure to update the fps demo with it.
 
I don't know who said that Quake on Saturn didn't use the SCU DSP, but it seems like it does actually according to Yabause.
I've yet to learn assembly, so I'm not sure what it is exactly, but it seems to involve several multiplications/additions.
Could it be matrix transformation?
 
Can you dump the 1KiB and disassemble it using antime's SCU DSP disassembler?

You can probably use the DSP as a non-VLIW arch at the beginning.

I don't fully understand how the memory is segmented myself.
 
Code:
000: 00001c10   nop  nop                   nop                   mov 10,CT0
001: 81000000   mvi 1000000,MC0
002: 00001e01   nop  nop                   nop                   mov 1,CT2
003: 88010000   mvi 10000,MC2
004: 00001e03   nop  nop                   nop                   mov 3,CT2
005: 88000001   mvi 1,MC2
006: 88000100   mvi 100,MC2
007: 88010000   mvi 10000,MC2
008: 94800000   mvi 800000,PL
009: 10040000   add  nop                              mov ALU,A  nop
00a: 28003209   sl   nop                   nop                   mov ALL,MC2
00b: 00001c00   nop  nop                   nop                   mov 0,CT0
00c: 00003600   nop  nop                   nop                   mov M0,RA0
00d: 00001c01   nop  nop                   nop                   mov 1,CT0
00e: c001000f   dma2 D0,MC0,f
00f: d340000f   jmp T0,f
010: 00001c0e   nop  nop                   nop                   mov e,CT0
011: 00003604   nop  nop                   nop                   mov MC0,RA0
012: 00003704   nop  nop                   nop                   mov MC0,WA0
013: 00001c0d   nop  nop                   nop                   mov d,CT0
014: 00823500   nop                                   clr A      mov M0,PL
015: 08040000   or   nop                              mov ALU,A  nop
016: d208001a   jmp NZ,1a
017: 00000000   nop  nop                   nop                   nop
018: f8000000   endi
019: 00000000   nop  nop                   nop                   nop
01a: 00001e02   nop  nop                   nop                   mov 2,CT2
01b: 00801514   nop                        nop                   mov 14,PL
01c: 14040000   sub  nop                              mov ALU,A  nop
01d: d2100023   jmp NS,23
01e: 00000000   nop  nop                   nop                   nop
01f: 00023200   nop  nop                              clr A      mov M0,MC2
020: 00003009   nop  nop                   nop                   mov ALL,MC0
021: d0000025   jmp 25
022: 00000000   nop  nop                   nop                   nop
023: 00003009   nop  nop                   nop                   mov ALL,MC0
024: 88000014   mvi 14,MC2
025: 00001d00   nop  nop                   nop                   mov 0,CT1
026: c001010f   dma2 D0,MC1,f
027: d3400027   jmp T0,27
028: 00001e03   nop  nop                   nop                   mov 3,CT2
029: 00001d00   nop  nop                   nop                   mov 0,CT1
02a: 00001f00   nop  nop                   nop                   mov 0,CT3
02b: 00001c10   nop  nop                   nop                   mov 10,CT0
02c: 00071a0e   nop  nop                              mov MC0,A  mov e,LOP
02d: 00001b30   nop  nop                   nop                   mov 30,TOP
02e: 02598000   nop  mov MC1,X             mov MC2,Y             nop
02f: 01098000   nop             mov MUL,P  mov MC2,Y             nop
030: 0509b309   and             mov MUL,P  mov MC2,Y             mov ALL,MC3
031: 0509b309   and             mov MUL,P  mov MC2,Y             mov ALL,MC3
032: 00001e03   nop  nop                   nop                   mov 3,CT2
033: 0759b309   and  mov MC1,X  mov MUL,P  mov MC2,Y             mov ALL,MC3
034: e0000000   btm
035: 0509b309   and             mov MUL,P  mov MC2,Y             mov ALL,MC3
036: 00001e04   nop  nop                   nop                   mov 4,CT2
037: 00001d00   nop  nop                   nop                   mov 0,CT1
038: 00021f00   nop  nop                              clr A      mov 0,CT3
039: 00001a0e   nop  nop                   nop                   mov e,LOP
03a: 00001b3d   nop  nop                   nop                   mov 3d,TOP
03b: 02788000   nop  mov MC3,X             mov M2,Y              nop
03c: 03700000   nop  mov MC3,X  mov MUL,P  nop                   nop
03d: 1b70310a   ad2  mov MC3,X  mov MUL,P  nop                   mov ALH,MC1
03e: 1b70310a   ad2  mov MC3,X  mov MUL,P  nop                   mov ALH,MC1
03f: 1b70310a   ad2  mov MC3,X  mov MUL,P  nop                   mov ALH,MC1
040: e0000000   btm
041: 1b70310a   ad2  mov MC3,X  mov MUL,P  nop                   mov ALH,MC1
042: 94000001   mvi 1,PL
043: 00001c01   nop  nop                   nop                   mov 1,CT0
044: 00001e02   nop  nop                   nop                   mov 2,CT2
045: 00069d00   nop  nop                              mov M2,A   mov 0,CT1
046: 14041f00   sub  nop                              mov ALU,A  mov 0,CT3
047: 00003a09   nop  nop                   nop                   mov ALL,LOP
048: 00001b4b   nop  nop                   nop                   mov 4b,TOP
049: 00001e00   nop  nop                   nop                   mov 0,CT2
04a: 00821500   nop                                   clr A      mov 0,PL
04b: 12593209   add  mov MC1,X             mov MC0,Y             mov ALL,MC2
04c: 035b1e00   nop  mov MC1,X  mov MUL,P  mov MC0,Y  clr A      mov 0,CT2
04d: 1b5d3d06   ad2  mov MC1,X  mov MUL,P  mov MC0,Y  mov ALU,A  mov MC2,CT1
04e: 1b2d0000   ad2  mov M2,X   mov MUL,P  mov MC0,Y  mov ALU,A  nop
04f: 1b5d1e00   ad2  mov MC1,X  mov MUL,P  mov MC0,Y  mov ALU,A  mov 0,CT2
050: 1b5b330a   ad2  mov MC1,X  mov MUL,P  mov MC0,Y  clr A      mov ALH,MC3
051: 1b5d3d06   ad2  mov MC1,X  mov MUL,P  mov MC0,Y  mov ALU,A  mov MC2,CT1
052: 1b2d0000   ad2  mov M2,X   mov MUL,P  mov MC0,Y  mov ALU,A  nop
053: 1b5d0000   ad2  mov MC1,X  mov MUL,P  mov MC0,Y  mov ALU,A  nop
054: 1b5b330a   ad2  mov MC1,X  mov MUL,P  mov MC0,Y  clr A      mov ALH,MC3
055: 1b5d0000   ad2  mov MC1,X  mov MUL,P  mov MC0,Y  mov ALU,A  nop
056: 1b2d1e00   ad2  mov M2,X   mov MUL,P  mov MC0,Y  mov ALU,A  mov 0,CT2
057: 19041c01   ad2             mov MUL,P             mov ALU,A  mov 1,CT0
058: 1900330a   ad2             mov MUL,P  nop                   mov ALH,MC3
059: e0000000   btm
05a: 00869503   nop                                   mov M2,A   mov 3,PL
05b: 00001f00   nop  nop                   nop                   mov 0,CT3
05c: c001133c   dma2 MC3,D0,3c
05d: d340005d   jmp T0,5d
05e: 00000000   nop  nop                   nop                   nop
05f: d0000013   jmp 13
060: 00000000   nop  nop                   nop                   nop


(after 060 it's just end code)


The code seems to stay the same every time I look at it in-game, which (I guess) means it's always using the same function.
 
OK, I feel dumb: SBL (Sega Basic Library) already has all the functions in it, with source code.
Including 3d processing using the SH2 and SCU DSP working in parallel.
I don't know the performance level, so maybe SGL is still faster, but it does include the source code.

From the SPR manual :

(3) USE_DSP

When USE_DSP is defined, the coordinate transform matrix
calculations are done with the DSP in parallel the SH side.
Commenting the define out disables this feature.
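Roughly, a switch like that works as in the C sketch below. This is not the SBL source: transform_on_dsp, transform_on_cpu and the counter are hypothetical stand-ins that only show how the define selects a code path at compile time.

```c
#include <assert.h>
#include <stdint.h>

#define USE_DSP   /* comment this line out to fall back to the CPU-only path */

typedef struct { int32_t x, y, z; } point_t;

int dsp_jobs_submitted = 0;                    /* only here so the path taken is visible */

/* Hypothetical CPU path: a plain copy stands in for the real matrix math. */
void transform_on_cpu(const point_t *in, point_t *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i];
}

/* Hypothetical DSP path: on real hardware this would hand the points to the
 * SCU DSP and let the SH2 continue; here it does the same copy and counts. */
void transform_on_dsp(const point_t *in, point_t *out, int n)
{
    dsp_jobs_submitted++;
    for (int i = 0; i < n; i++)
        out[i] = in[i];
}

/* Callers use one entry point; the define picks the implementation. */
void transform_points(const point_t *in, point_t *out, int n)
{
    assert(n >= 0);
#ifdef USE_DSP
    transform_on_dsp(in, out, n);
#else
    transform_on_cpu(in, out, n);
#endif
}
```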
 
Is USE_DSP defined by default in one of the required libraries, and the developers comment it out if they don't want to use it?

Or do they have to define it explicitly in their project?

I just loaded up Elan Doree in Yabause and checked the SCU-DSP debug and it just has 20 lines saying "END" so that game isn't using it.

How do we tell whether a game is using SGL or not?
 
You just need to define it and call the proper functions, but I'm pretty sure SGL is faster as they already stopped supporting the SBL 3d functions in 1995/1996.

You can't tell if a game is using SGL, but only a few games do, as it arrived too little, too late.

NiGHTS is supposedly using SGL, and isn't using the SCU DSP.

Games using the SCU DSP that I know of: Sonic R, Quake and Burning Rangers. What they all have in common is that they are late games and look amazing.
 
Have you been able to understand what Quake is doing?
It's almost 1:1 the same code as the matrix transformation example in the SCU DSP manual / SBL SCU DSP functions, so I guess it's matrix transformation.

It could also be related to lighting.
 
I got out Fighters Megamix and tried it in Yabause and it looks like that is using the SCU DSP. That was released Dec 1996 in Japan. The debug code list didn't seem as dense as what you posted for Quake though.

Can you point me to the manual that has the USE_DSP and functions etc? I was looking through some of the antime list of stuff the other day but didn't see it yet.

Thanks
 
It's in the SBL folder under the MAN (manual) folder if I remember correctly. The 3d functions are under SPR in the Segalib folder (again, if I remember right).
USE_DSP is simply defined in the SBL code, so it's always on.
I never tried to display a quad in SBL, but I will look at it and maybe try to modify it a bit just to see if it could be improved.
 
Can you cheat the quad 3D system by having 2 points share the same coordinates to get triangles? Obviously there's no speed advantage, but it would give more flexibility in 3D model design if possible.
 
Yes, you can. My map tool for Sonic Z-Treme does it when I detect a triangle.
But the textures get all squished and it just looks bad, unless you use untextured triangles, so I would avoid it.
It's also easy to merge 2 triangles in Blender to make a quad, so you can avoid it most of the time.
Rockin'-B also made a texture mapping demo, but I never looked at how he did it.
It will be slower than just using sprites, so I would avoid it too.
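For what it's worth, the triangle trick boils down to repeating one corner. A small C sketch with made-up names (this is not the actual code from my map tool):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint16_t v[4]; } quad_t;      /* 4 indices into a point list */

/* Build a degenerate quad from a triangle by repeating the last corner. */
quad_t quad_from_triangle(uint16_t a, uint16_t b, uint16_t c)
{
    quad_t q = { { a, b, c, c } };
    return q;
}

/* A quad is really a triangle if two neighbouring corners use the same point. */
int quad_is_triangle(const quad_t *q)
{
    assert(q != NULL);
    for (int i = 0; i < 4; i++) {
        if (q->v[i] == q->v[(i + 1) % 4])
            return 1;
    }
    return 0;
}
```

A check like quad_is_triangle is handy for flagging those faces as untextured, since that's where the squished textures show up.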
 
On a side note, mixing maps from different games can lead to weird moments:


[Images: Sonic_Quake_treme_2.png, Sonic_Quake_treme_4.png, Sonic_Quake_treme.png]

At least nobody is shooting at Sonic...
 