Reading from the framebuffer?

Discussion in 'Saturn Dev' started by XL2, Aug 23, 2018.

  1. XL2

    XL2 Member

    Here is one idea I have to reduce the overdraw in my game:
    Since I'm using RGB codes for objects close to the camera and palette codes for objects further away, I thought about simply using the framebuffer as a pseudo z-buffer.
    Since I know pixels starting with a 1 (RGB code) are closer to the camera than pixels starting with a 0 (palette code), I could just test palette objects against the previous frame to reject them (i.e. if the area is entirely covered by 1s, the object is occluded and doesn't need to be rendered).
    I'm not sure I could do it quickly enough to make it worth it, since it would require perspective divisions for the whole mesh. I could also just test the buffered sprite commands and insert a skip command when one fails the test. Either way, it's worth testing.

    But here is the main issue: reading from the framebuffer is really slow, judging from the few tests I made.
    Of course, I don't transfer the whole buffer; I only tested with a smaller buffer (64x32 or so) and skipped pixels.
    Is it possible to do it indirectly (like an SCU DMA transfer) so that the CPU can continue doing other things?
    What would be the best moment to do it? V-blank in?
    And since I'm reading that buffer anyway, if I wanted to transfer it to a sprite or scroll layer, what would be the best time to do it?
    SGL has a function to get the framebuffer, which is what I'm using, but there is very little detail on how to properly use it or what it does internally, so I'm not even sure I'm doing it right.

    Right now it's so slow that I would be better off just creating my own z-buffer.
  2. mrkotfw

    mrkotfw Member

    What is the name of the function to read from the framebuffer? This would be a per-pixel test? Do you control the swapping of the framebuffers yourself?

    I'm not really sure I understand the idea.

    What about instead of having a pseudo Z-buffer, use LoD to at least generate billboards, or non-textured versions of the objects/mesh?
  3. XL2

    XL2 Member

    I'm already using LOD, reducing the texture quality and geometry, but it's still not enough as there is a lot of overdraw.
    Some pixels get overwritten 5 times or more.
    It works OK in single player, but when I add gouraud shading (on the high quality model only) it becomes a bit too much and the framerate drops often.
    The function is slGetFrame or something like that.

    The idea is that my LOD models use palette codes, so their pixels start with a 0, while objects closer to the camera use RGB codes, so their pixels have a 1 as MSB.
    So I could just test the image space data against the lower-res buffer from last frame.
    If the quad sees only 1s in the area it covers, that means there was a closer object over there last frame, so I can discard that quad. Since the LOD quads are large, it might reduce the artifacts caused by this technique, and of course as the framerate drops it becomes less reliable.
    That buffer would be low res, so I guess it could be cache friendly and allow quick tests.
    It's not perfect, and the PVS would still be my main solution, but the PVS won't work well in some situations and I will need something else.
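
    A minimal sketch of the test described above, assuming the low-res copy of last frame's buffer is a plain array of 16-bit pixels. The 64x32 size, the function name and the inclusive-rectangle convention are illustrative assumptions, not anything SGL provides.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define LOWRES_W 64  /* assumed low-res buffer width  */
    #define LOWRES_H 32  /* assumed low-res buffer height */

    /* Returns true when every pixel inside the inclusive rectangle
     * [x0, x1] x [y0, y1] (low-res buffer coordinates) has its MSB set,
     * i.e. was an RGB-code pixel in the previous frame. */
    bool rect_fully_covered(const uint16_t *buf, int x0, int y0, int x1, int y1)
    {
        assert(buf != NULL);
        /* A rectangle that leaves the buffer or is inverted counts as visible. */
        if (x0 < 0 || y0 < 0 || x1 >= LOWRES_W || y1 >= LOWRES_H ||
            x0 > x1 || y0 > y1)
            return false;
        for (int y = y0; y <= y1; y++) {
            const uint16_t *row = buf + y * LOWRES_W;
            for (int x = x0; x <= x1; x++) {
                if ((row[x] & 0x8000u) == 0)
                    return false;  /* palette pixel seen: not fully occluded */
            }
        }
        return true;
    }
    ```

    A palette-coded quad would only be skipped when this returns true for its whole bounding rectangle; a single palette pixel keeps it in the draw list.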
  4. antime

    antime Extra Hard Mid Boss

    SCU DMA can read from the VDP1 back buffer. Since bandwidth isn't free, it'll cost some performance.

    EDIT: Maybe it would be possible to use the SCU DSP to generate a subsampled version of the backbuffer? There's also still a lot of ca. mid- to late 90s material on occlusion culling available, that may fit the Saturn's limits better than a lot of the material that's presented today.
    Last edited: Aug 24, 2018
  5. Ponut

    Ponut New Member

    So I have a general lack of experience in programming, but I would say a few things that might be helpful:

    1. Occlusion planes rather than "Z-sorting" would be more performant. My only thought of how that would be done is a distance test on the center of objects (meshes) and an intersection test with the occlusion plane (which would actually be a 2D line). With that you could determine if it is past the occlusion plane (in absolute distance) or not and whether or not it is actually intersecting the plane. A solution that can be done entirely on CPU. But this won't work for entire level meshes, only smaller game objects.

    2. Besides that, the frame-buffers are something like 256 KB? Or maybe even 512KB? In that case, that is too much to go through on a single frame. It might not even work because it hits the next copy before the first is finished. Not good!
    Ideally, you would set-up something that copies 1/10th of the frame-buffer at a time and performs your sorting on that. Not ideal, but it should work.
    My idea is like this:

    static Uint8 copytimer = 0; // Which tenth of the framebuffer to process this frame
    if(copytimer >= 10){
        copytimer = 0; // Wrap around after the last slice
    }
    slDMACopy(framebuffer + (copytimer * 26215), workarea, 26215);
    ztSortPolygons(workarea); // Assume this does its work on 26215 bytes of buffer at a time
    copytimer++;
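
    Point 1 above could be read as the following top-down test. This is only one interpretation: the wall is a 2D segment, the object is a circle around its center, and the object counts as hidden when the circle lies entirely behind the wall and inside the wedge the wall spans from the camera. All names are hypothetical, and coordinates are assumed to stay within roughly ±4096 units so the 64-bit products cannot overflow.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { int32_t x, y; } Vec2;   /* top-down (2D) position */

    /* 2D cross product of (a - o) and (b - o); positive when b is to the
     * left of the directed line o->a. */
    int64_t cross2(Vec2 o, Vec2 a, Vec2 b)
    {
        return (int64_t)(a.x - o.x) * (b.y - o.y)
             - (int64_t)(a.y - o.y) * (b.x - o.x);
    }

    /* True when p lies at least r units to the left of the line a->b.
     * Compares squared values to avoid a square root. */
    bool left_by(Vec2 a, Vec2 b, Vec2 p, int32_t r)
    {
        int64_t c = cross2(a, b, p);
        int64_t dx = b.x - a.x, dy = b.y - a.y;
        return c >= 0 && c * c >= (int64_t)r * r * (dx * dx + dy * dy);
    }

    /* True when the circle (center, radius) is fully hidden behind wall a-b. */
    bool hidden_by_wall(Vec2 cam, Vec2 a, Vec2 b, Vec2 center, int32_t radius)
    {
        assert(radius >= 0);
        int64_t side = cross2(a, b, cam);
        if (side == 0)
            return false;                        /* camera on the wall's line */
        if (side > 0) { Vec2 t = a; a = b; b = t; } /* keep camera on the right */
        return left_by(a, b, center, radius)     /* fully behind the wall    */
            && left_by(a, cam, center, radius)   /* inside the edge through a */
            && left_by(cam, b, center, radius);  /* inside the edge through b */
    }
    ```

    As noted in the post, this only suits small objects; a mesh larger than the wall will never pass all three checks.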
  6. XL2

    XL2 Member

    Thanks both of you.
    Would it be possible to use SCU DMA to retrieve only the MSBs and merge these bits into bytes?
    A 512x256 buffer would then only require 16 KB, right? And even less if you skip pixels and don't retrieve the whole thing?
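
    The 16 KB figure follows from one bit per pixel: 512 x 256 / 8 = 16384 bytes. A CPU-side sketch of the packing is below; it says nothing about whether SCU DMA could do any of this in hardware, and the function names and bit order are assumptions made for illustration.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Packs the MSB of `count` 16-bit pixels into a bitmask, 8 pixels per
     * byte. Bit 7 of each output byte holds the first of its eight pixels.
     * `count` must be a multiple of 8 and `out` must hold count / 8 bytes. */
    void pack_msb(const uint16_t *pixels, size_t count, uint8_t *out)
    {
        assert(count % 8 == 0);
        for (size_t i = 0; i < count; i += 8) {
            uint8_t bits = 0;
            for (size_t b = 0; b < 8; b++)
                bits = (uint8_t)((bits << 1) | (pixels[i + b] >> 15));
            out[i / 8] = bits;
        }
    }

    /* Size in bytes of a 1-bit-per-pixel mask for a width x height buffer. */
    size_t mask_bytes(size_t width, size_t height)
    {
        return width * height / 8;
    }
    ```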

    I have been scratching my head for months to solve the occlusion problem and find something that fits both Sonic X-Treme's totally inconsistent maps and something like Quake.
    A PVS is nice, but it's only as accurate as your map subdivision, which means you need a lot of memory and spend more time searching the nodes and doing frustum culling.
    A BSP with PVS or portals is nice too, but it doesn't work well for open world maps, and since the Saturn has no texture coordinates you end up with really weird walls and floors (like the Slavedriver games) and lots of vertices and quads to deal with.
    A portal system requires a lot of manual work and is good only for interior maps.

    I also thought of placing some occlusion walls manually, or even generating them automatically, but that can quickly kill performance, so you are restricted to only a few walls (unlike a portal system) and it requires more manual work than automated techniques. I will consider your idea though, Ponut.
    A PVS + "depth" buffer seems like a good fit since the game would generate its own (limited) occlusion map from last frame, but it's by no means simple or super fast.
    Any better ideas?
    Last edited: Aug 24, 2018
  7. mrkotfw

    mrkotfw Member

    In terms of fetching the MSB, I believe that you can use the update features of the SCU DMA to fetch a byte then skip a byte.

    With the SCU DSP, it would mean multiple transfers. From the back buffer to HWRAM to the DSP data banks.
  8. XL2

    XL2 Member

    I guess the SGL function is doing something like that since you can get a lower-res version of the buffer (like 64x32), but again there are few to no details for these functions.
    I'm not even sure if it's the SH2 reading the buffer or an SCU indirect transfer.
    Either way, when should I call it?
    During v-blank in?

    I tried the PVS technique, but with all the raytracing it takes hours to build a map, since these aren't corridor maps and it needs to take into account the many areas the camera can go to.
    I could probably speed it up, but it's a bit insane!
  9. mrkotfw

    mrkotfw Member

    Here is what slGetFrameData does. I haven't verified, but I see no calls to SCU DMA.

    sglC24.o:     file format coff-sh
    Disassembly of section SLPROG:
    00000000 <_slGetFrameData>:
       0:    2f 86          mov.l    r8,@-r15
       2:    c4 b0          mov.b    @(176,gbr),r0
       4:    2f b6          mov.l    r11,@-r15
       6:    2f a6          mov.l    r10,@-r15
       8:    2f 96          mov.l    r9,@-r15
       a:    63 03          mov    r0,r3
       c:    c6 20          mov.l    @(128,gbr),r0
       e:    eb ff          mov    #-1,r11
      10:    4b 18          shll8    r11
      12:    68 09          swap.w    r0,r8
      14:    48 28          shll16    r8
      16:    1b 50          mov.l    r5,@(0,r11)
      18:    e1 00          mov    #0,r1
      1a:    1b 14          mov.l    r1,@(16,r11)
      1c:    1b 85          mov.l    r8,@(20,r11)
      1e:    40 28          shll16    r0
      20:    e2 10          mov    #16,r2
      22:    23 28          tst    r2,r3
      24:    89 00          bt    28 <gtfd_00>
      26:    40 01          shlr    r0
    00000028 <gtfd_00>:
      28:    58 b7          mov.l    @(28,r11),r8
      2a:    1b 60          mov.l    r6,@(0,r11)
      2c:    1b 14          mov.l    r1,@(16,r11)
      2e:    1b 05          mov.l    r0,@(20,r11)
      30:    da 38          mov.l    114 <IMM_FrameBuffer>,r10    ! 25c80000
      32:    d2 37          mov.l    110 <IMM_SPR_EDSR>,r2    ! 25d00010
      34:    67 83          mov    r8,r7
      36:    47 01          shlr    r7
      38:    59 b7          mov.l    @(28,r11),r9
    0000003a <gtfd_10>:
      3a:    60 21          mov.w    @r2,r0
    0000003c <gtfd_11>:
      3c:    c8 02          tst    #2,r0
      3e:    8f 0a          bf.s    56 <gtfd_20>
      40:    e1 7f          mov    #127,r1
    00000042 <gtfd_12>:
      42:    41 10          dt    r1
      44:    8b fd          bf    42 <gtfd_12>
      46:    c4 13          mov.b    @(19,gbr),r0
      48:    e1 80          mov    #-128,r1
      4a:    23 18          tst    r1,r3
      4c:    8f f5          bf.s    3a <gtfd_10>
      4e:    40 11          cmp/pz    r0
      50:    8b 32          bf    b8 <gtfd_99>
      52:    af f3          bra    3c <gtfd_11>
      54:    60 21          mov.w    @r2,r0
    00000056 <gtfd_20>:
      56:    6b 93          mov    r9,r11
      58:    4b 01          shlr    r11
      5a:    e0 08          mov    #8,r0
      5c:    23 08          tst    r0,r3
      5e:    8f 30          bf.s    c2 <gtfd_30>
      60:    45 09          shlr2    r5
    00000062 <gtfd_21>:
      62:    61 b3          mov    r11,r1
      64:    41 29          shlr16    r1
      66:    41 18          shll8    r1
      68:    41 08          shll2    r1
      6a:    31 ac          add    r10,r1
      6c:    62 73          mov    r7,r2
      6e:    63 53          mov    r5,r3
      70:    71 0a          add    #10,r1
    00000072 <gtfd_22>:
      72:    60 23          mov    r2,r0
      74:    40 29          shlr16    r0
      76:    40 00          shll    r0
      78:    00 1d          mov.w    @(r0,r1),r0
      7a:    32 8c          add    r8,r2
      7c:    81 40          mov.w    r0,@(0,r4)
      7e:    60 23          mov    r2,r0
      80:    40 29          shlr16    r0
      82:    40 00          shll    r0
      84:    00 1d          mov.w    @(r0,r1),r0
      86:    32 8c          add    r8,r2
      88:    81 41          mov.w    r0,@(2,r4)
      8a:    60 23          mov    r2,r0
      8c:    40 29          shlr16    r0
      8e:    40 00          shll    r0
      90:    00 1d          mov.w    @(r0,r1),r0
      92:    32 8c          add    r8,r2
      94:    81 42          mov.w    r0,@(4,r4)
      96:    60 23          mov    r2,r0
      98:    40 29          shlr16    r0
      9a:    40 00          shll    r0
      9c:    00 1d          mov.w    @(r0,r1),r0
      9e:    32 8c          add    r8,r2
      a0:    81 43          mov.w    r0,@(6,r4)
      a2:    43 10          dt    r3
      a4:    8f e5          bf.s    72 <gtfd_22>
      a6:    74 08          add    #8,r4
      a8:    46 10          dt    r6
      aa:    8f da          bf.s    62 <gtfd_21>
      ac:    3b 9c          add    r9,r11
      ae:    69 f6          mov.l    @r15+,r9
      b0:    6a f6          mov.l    @r15+,r10
      b2:    6b f6          mov.l    @r15+,r11
      b4:    00 0b          rts  
      b6:    68 f6          mov.l    @r15+,r8
    000000b8 <gtfd_99>:
      b8:    69 f6          mov.l    @r15+,r9
      ba:    6a f6          mov.l    @r15+,r10
      bc:    6b f6          mov.l    @r15+,r11
      be:    00 0b          rts  
      c0:    68 f6          mov.l    @r15+,r8
    000000c2 <gtfd_30>:
      c2:    61 b3          mov    r11,r1
      c4:    41 29          shlr16    r1
      c6:    41 18          shll8    r1
      c8:    41 08          shll2    r1
      ca:    31 ac          add    r10,r1
      cc:    62 73          mov    r7,r2
      ce:    63 53          mov    r5,r3
    000000d0 <gtfd_32>:
      d0:    60 23          mov    r2,r0
      d2:    40 29          shlr16    r0
      d4:    00 1c          mov.b    @(r0,r1),r0
      d6:    32 8c          add    r8,r2
      d8:    80 40          mov.b    r0,@(0,r4)
      da:    60 23          mov    r2,r0
      dc:    40 29          shlr16    r0
      de:    00 1c          mov.b    @(r0,r1),r0
      e0:    32 8c          add    r8,r2
      e2:    80 41          mov.b    r0,@(1,r4)
      e4:    60 23          mov    r2,r0
      e6:    40 29          shlr16    r0
      e8:    00 1c          mov.b    @(r0,r1),r0
      ea:    32 8c          add    r8,r2
      ec:    80 42          mov.b    r0,@(2,r4)
      ee:    60 23          mov    r2,r0
      f0:    40 29          shlr16    r0
      f2:    00 1c          mov.b    @(r0,r1),r0
      f4:    32 8c          add    r8,r2
      f6:    80 43          mov.b    r0,@(3,r4)
      f8:    43 10          dt    r3
      fa:    8f e9          bf.s    d0 <gtfd_32>
      fc:    74 04          add    #4,r4
      fe:    46 10          dt    r6
    100:    8f df          bf.s    c2 <gtfd_30>
    102:    3b 9c          add    r9,r11
    104:    69 f6          mov.l    @r15+,r9
    106:    6a f6          mov.l    @r15+,r10
    108:    6b f6          mov.l    @r15+,r11
    10a:    00 0b          rts  
    10c:    68 f6          mov.l    @r15+,r8
    00000110 <IMM_SPR_EDSR>:
    110:    25 d0          mov.b    r13,@r5
    112:    00 10          .word 0x0010
    00000114 <IMM_FrameBuffer>:
    114:    25 c8          tst    r12,r5
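
    Reading the listing, the inner loops (gtfd_22 and gtfd_32) appear to walk the framebuffer with a 16.16 fixed-point position: `shlr16` extracts the integer part and `add r8,r2` advances by a fractional step, which amounts to a nearest-neighbour subsample done by the CPU. That is an interpretation of the disassembly, not something verified on hardware. The same idea in portable C, on a plain array, with a hypothetical function name and parameters:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Nearest-neighbour subsample of a 16-bit image using 16.16 fixed-point
     * stepping. src_stride is the source row pitch in pixels. */
    void subsample16(const uint16_t *src, int src_w, int src_h, int src_stride,
                     uint16_t *dst, int dst_w, int dst_h)
    {
        assert(dst_w > 0 && dst_h > 0 && src_w >= dst_w && src_h >= dst_h);
        uint32_t step_x = ((uint32_t)src_w << 16) / (uint32_t)dst_w;
        uint32_t step_y = ((uint32_t)src_h << 16) / (uint32_t)dst_h;
        uint32_t fy = step_y >> 1;          /* start mid-way into the first cell */
        for (int y = 0; y < dst_h; y++, fy += step_y) {
            const uint16_t *row = src + (fy >> 16) * (uint32_t)src_stride;
            uint32_t fx = step_x >> 1;
            for (int x = 0; x < dst_w; x++, fx += step_x)
                dst[y * dst_w + x] = row[fx >> 16];  /* integer part picks the pixel */
        }
    }
    ```

    Every destination pixel still costs one read from the source, so on real hardware the time is dominated by how slow framebuffer reads are rather than by the arithmetic.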
  10. antime

    antime Extra Hard Mid Boss

    If the function used DMA, it would almost certainly be documented, to prevent conflicts. It doesn't look like you can copy bytes, the read address increment options are 0 and 4, and the write address increment options do not include one byte.
  11. XL2

    XL2 Member

    Wow, thanks a lot, amazing!
    How did you disassemble the function?
    I guess I should really try to learn assembly...

    AFAIK, with SGL, SCU DMA channel 0 is free for the user while the rest are used by SGL.

    I guess I could just let the slave do it at the start of the game loop while the main CPU is preparing the frustum and other stuff?
    Even if it doesn't work well for occlusion, sending the framebuffer to a sprite is also a nice effect, so nothing would be lost.

    Edit: I did manage to increase the speed of the PVS building quite a bit, so it might be a viable solution with RLE compression.
  12. antime

    antime Extra Hard Mid Boss

    Objdump can disassemble object files. Other useful binutils tools include ar and nm.
  13. mrkotfw

    mrkotfw Member

    On MinGW/Cygwin/Unix:

    mkdir libsgl
    cd libsgl
    sh-elf-ar x ../libsgl.a    # x = extract the objects; adjust the path to wherever libsgl.a lives
    for obj in *.o; do sh-elf-objdump -d "${obj}" > "${obj%%.o}.s"; done
    I've attached a .zip file for you that includes the source.
  14. XL2

    XL2 Member

    Thanks a lot,
    I guess either using the slave to do it or using an SCU DSP transfer would be my best options.
    I will be taking a look at these functions later this week.
    Thanks again
  15. XL2

    XL2 Member

    I think the attachment will give you a better idea of what I'm thinking of doing. It was very easy to implement (but it's still a bit slow).
    It would complement the PVS, and hopefully I will find a way to subdivide the quads close to the camera to prevent such bad clipping. Anyway, you can see the weirdly colored pixels: those are color bank pixels whose MSB I flipped so they could be displayed using a 16-bit sprite.
    These would be the occludees, while the correctly colored quads would be the occluders.
    So these huge objects blocking the camera and both sides of a node could at least block some extra geometry.
    This buffer is currently 88x56, which seems like it could work if the algorithm is conservative (e.g. checking a bit beyond the quad's boundaries to avoid rejecting too much). Making it fast is a whole other thing and I'm not sure it can be done, but whatever.
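
    One way to read "conservative" here, as a sketch: map the quad's screen-space bounding box to buffer cells, then grow it by a margin before testing, so a quad is only rejected when its surroundings are covered too. The 352x224 screen size used in the usage note is an assumption (it is simply 4x the 88x56 buffer), and the type and function names are hypothetical.

    ```c
    #include <assert.h>

    typedef struct { int x0, y0, x1, y1; } CellRect;  /* inclusive cell bounds */

    /* Clamps v to the range [lo, hi]. */
    int clampi(int v, int lo, int hi)
    {
        return v < lo ? lo : (v > hi ? hi : v);
    }

    /* Maps an inclusive screen-space box to low-res buffer cells, grows it
     * by `margin` cells on every side, and clamps it to the buffer. */
    CellRect screen_to_cells(int sx0, int sy0, int sx1, int sy1,
                             int screen_w, int screen_h,
                             int buf_w, int buf_h, int margin)
    {
        assert(screen_w > 0 && screen_h > 0 && buf_w > 0 && buf_h > 0);
        assert(margin >= 0 && sx0 <= sx1 && sy0 <= sy1);
        CellRect r;
        /* Clamp to the screen first so the integer divisions round down. */
        r.x0 = clampi(sx0, 0, screen_w - 1) * buf_w / screen_w - margin;
        r.y0 = clampi(sy0, 0, screen_h - 1) * buf_h / screen_h - margin;
        r.x1 = clampi(sx1, 0, screen_w - 1) * buf_w / screen_w + margin;
        r.y1 = clampi(sy1, 0, screen_h - 1) * buf_h / screen_h + margin;
        r.x0 = clampi(r.x0, 0, buf_w - 1);
        r.y0 = clampi(r.y0, 0, buf_h - 1);
        r.x1 = clampi(r.x1, 0, buf_w - 1);
        r.y1 = clampi(r.y1, 0, buf_h - 1);
        return r;
    }
    ```

    For example, with a 352x224 screen and a margin of 1, a box from (40,20) to (79,59) maps to cells (9,4) through (20,15). A larger margin rejects fewer quads but is safer against a one-frame-old buffer.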
  16. mrkotfw

    mrkotfw Member

    Thanks, that gives me a better idea.

    There are a few things here...
    1. Quad subdivision. I'm curious to know what algorithms are available for subdividing quads.
    2. Are you positive that your bottleneck (currently) is the VDP1 and not something else?
    I know this is outside the realm of your original question, but what about the way command lists are being passed to the VDP1? Is it that SGL processes a large command list, then triggers the VDP1 to draw, or does it keep the VDP1 fed as much as possible while processing other command lists? As in, process a small batch, have the VDP1 render, and in parallel, process the next batch?

    Are there other areas to improve on performance? Have you timed your code with the CPU FRT? Have you timed how long it takes to render?

    The framebuffer idea seems wild. With the DSP, you have 4 data banks, each 1024 bytes. The smallest access is 4 bytes. You have the ability to DMA straight from the DSP and into its 4 data banks (and 1 program bank). Though, I've tried to DMA from LWRAM and the Saturn would lock up, so I'm not sure if you'd be able to DMA straight from the B-bus to the DSP data banks. I believe it can do 4 loads in parallel, though, some into non-general-purpose registers (A, X, Y, etc.). I just don't know how you'd use the DSP for this purpose.

    Then there's the slave CPU. You DMA from VDP1 FB to HWRAM. Then you have to keep the slave off the CPU bus, so you manually copy chunks of the DMA'd FB into the slave's split cache. You have about 2KiB there.

    I'm just throwing ideas out there. I don't know if you've done this already, but getting some way to objectively profile the game would be a really good step to take soon.
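
    For the timing part, a small sketch assuming the 16-bit free-running counter has already been sampled before and after the code under test. Reading the counter itself is hardware-specific and not shown; the helper names are hypothetical and the clock and prescaler values are parameters to fill in, not values taken from this thread.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Ticks elapsed between two samples of a free-running 16-bit counter.
     * Unsigned subtraction absorbs a single wrap-around; anything longer
     * than 65535 ticks aliases and needs the overflow flag or a coarser
     * prescaler. */
    uint16_t frt_elapsed(uint16_t start, uint16_t end)
    {
        return (uint16_t)(end - start);
    }

    /* Converts ticks to microseconds for a given CPU clock (Hz) and
     * counter prescaler (CPU clocks per tick). */
    uint32_t frt_ticks_to_us(uint32_t ticks, uint32_t cpu_hz, uint32_t divider)
    {
        assert(cpu_hz > 0 && divider > 0);
        return (uint32_t)(((uint64_t)ticks * divider * 1000000u) / cpu_hz);
    }
    ```

    Sampling around the transform code and again around the wait for VDP1 to finish would give separate CPU and render figures to compare.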
  17. XL2

    XL2 Member

    Quad subdivision is tricky, unless of course you just store different textures and polygons/vertices. That would be the fastest way for sure, but it takes way too much memory, both RAM and VRAM. It's easy to subdivide a sprite along its height (it's what I do for the water animation, I just "scroll" the starting address). I guess doing something like Quake 2 on PS1 could be one workaround, but it's annoying to always encounter loading screens. Creating new vertices and polygons in realtime could also be done, and is what the PS1 does with its SDK afaik, even if it's slower than just storing them in RAM, but I'm not sure how to subdivide a sprite horizontally in VRAM.
    If you change the width, you will just end up with a sprite that will just alternate lines with the other horizontal part, and changing the pointers will lead to the same problem.

    As for performance, my CPU code can certainly be improved a lot; I'm not doubting that. The way SGL handles draw commands is that it stores everything in a few buffers: vertex buffer, polygon buffer, z-sort buffer and draw command buffer. When you sync, it just DMAs everything in one batch to VRAM and the CPU moves on.

    Afaik, since LWRAM is on the A-bus, I guess you can't directly DMA to the DSP, but I could be wrong.

    As for how I know it's the VDP1 that is the bottleneck, I simply have different debug modes: untextured polygons only, wireframe only, gouraud shaded textured polygons, etc.
    The gouraud shaded polygons lead to many slowdowns, which don't happen in the other modes.
    Of course, with better CPU optimizations, I could probably do more, but at the same time, all this overdraw is also increasing the CPU load (more vertices and polygons to process for nothing).

    But anyway, the reaction to the Sage demo has been negative overall so far (some people even complain that I'm using 3D models instead of sprites!). Most people just try the demo on their slow PCs with emulators, don't even bother plugging in a controller, and then complain online that it doesn't control well or that it slows down. So I'll just stop wasting time on Sonic Z-Treme and move on to the FPS game, which means that a simple portal system could be implemented and I don't need to overthink an all-around solution.

    That solves many issues and involves less work in the end since I don't need to try to mimic a game not even built for the Saturn, but I might still play with the framebuffer to add some cool effects.
  18. Ponut

    Ponut New Member

    I know it's off-topic, but wow. I would say they have high standards but maybe in that case I am confusing high with low.
    (I would say something about performance but I have an i7 4770K @ 4.5 GHz..)

    As far as performance goes, I know I don't have much to add. Have you explored the option of only partially calculating the occlusion/PVS each frame?
    (The idea being the occlusion is a "buffer" of occluded polygons that is filled partially each frame)
  19. XL2

    XL2 Member

    With a portal system I could easily precalculate the PVS, so at runtime all you need to do is uncompress the PVS for your current node/leaf, flag the visible nodes with the current tick count and then run your BSP/octree normally, skipping nodes that aren't potentially visible. It speeds up the CPU calculations quite a lot and it partially solves the occlusion problem. You can also do like the Slavedriver engine and add user clipping draw commands to prevent even more overdraw, but you would need to clip these against the portals, so I'm not 100% sure it's worth the extra CPU load. SGL has a sorting option where it draws all polygons in front of the previous polygons within the same pdata, so you could always include a user clip command first for each plane and use the "sort before" option for all the following polygons, which minimizes the overdraw as much as you can on Saturn.
    Anyway, I will take my time on this to properly write a bsp compiler and portal generator, so it might take a few months.
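
    The runtime steps above (uncompress the PVS, stamp visible nodes with the current tick) could look roughly like this. The compression format is an assumption: a Quake-style scheme where a non-zero byte is stored as-is and a zero byte is followed by a run length. The thread only says "RLE", so the real format may differ, and all names here are hypothetical.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Expands a zero-run RLE bit vector into out_bytes bytes.
     * Assumed format: a non-zero byte is copied through; a zero byte is
     * followed by the number of zero bytes to emit.
     * Returns the number of input bytes consumed. */
    size_t pvs_decompress(const uint8_t *in, uint8_t *out, size_t out_bytes)
    {
        const uint8_t *p = in;
        size_t o = 0;
        while (o < out_bytes) {
            if (*p != 0) {
                out[o++] = *p++;
            } else {
                size_t run = p[1];
                assert(run > 0);   /* a zero-length run would be malformed */
                p += 2;
                while (run-- > 0 && o < out_bytes)
                    out[o++] = 0;
            }
        }
        return (size_t)(p - in);
    }

    /* Stamps every node whose bit is set with the current tick. During
     * traversal a node is potentially visible iff visible_tick[n] == tick. */
    void pvs_mark_visible(const uint8_t *bits, size_t node_count,
                          uint32_t *visible_tick, uint32_t tick)
    {
        for (size_t n = 0; n < node_count; n++)
            if (bits[n >> 3] & (1u << (n & 7)))
                visible_tick[n] = tick;
    }
    ```

    Stamping with a tick instead of a boolean means the flags never have to be cleared between frames.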
  20. mrkotfw

    mrkotfw Member

    Yeah, that's why I asked. Quad division is really tricky, not including the fact that there's no hardware UV texture support.

    Okay, that makes sense. I guess that keeps you from idling both on the VDP1 and CPU.

    It's on the CPU-bus, sadly.

    That's insane. Where is this negative feedback coming from? With anything, you're going to get your percentage of idiots who don't know what they're talking about. You know you've made it to the big leagues when you start getting death threats. Don't let that discourage you. Really, work on what makes you happy.

    Another thing is that the game looks like a vertical slice rather than a tech demo. Some people may have a hard time understanding that. If it was more of a prototype, it might allow people to have a better understanding that the game is of course still in progress. But then again, people are stupid.
