
Saturn's 3D capabilities

Discussion in 'Saturn Dev' started by VladR, Sep 28, 2018.

  1. VladR

    VladR New Member

    I encountered this HW comparison between Saturn/PS1/N64, just not sure if I can fully trust it:
    https://segaretro.org/Sega_Saturn/Hardware_comparison

    I don't really care about the comparison itself (e.g. against, say, PS1), but I quite like that all the benchmark numbers are together at one page, giving me a really nice summary of Saturn's bandwidth and hardwired shading/texturing capabilities.

    Is that comparison a bogus fanboy site, or is it legit ?

    There is one area that doesn't appear to make much sense, however: Flatshading (Both 8-bit and 15-bit) shows identical value : 28 MPixels/s via VDP1.

    I don't know how it's implemented internally in silicon, but there's half the amount of data being written for 8-bit color, so it should be roughly double the value of the 16-bit case (minus the scanline traversal overhead, of course, which is identical for both cases), given that all other benchmarks show differences at different color depths.

    Another thing: Gouraud shading: 16 MPixels/s (10x10). According to note No 45, Saturn can shade 164,576 polygons/s (10x10 pixels). That would be roughly 2,743 quads at 60 fps.
    Given that these are theoretical numbers, where all performance is spent on that single feature alone, it sounds kind of low, no?
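
    As a quick sanity check on the arithmetic above, here is a minimal sketch that turns the quoted polygons-per-second figure into a per-frame budget and an implied fill rate. The function names are made up for illustration, and the inputs are the theoretical numbers from the comparison page, not measurements.

```python
# Theoretical figures quoted from the comparison page (not measured values).
POLYGONS_PER_SECOND = 164_576   # Gouraud-shaded 10x10 polygons per second
PIXELS_PER_POLYGON = 10 * 10


def polygons_per_frame(polygons_per_second: int, fps: int) -> float:
    """Polygon budget for one frame at the given frame rate."""
    return polygons_per_second / fps


def pixels_per_second(polygons_per_second: int, pixels_per_polygon: int) -> int:
    """Implied fill rate if every polygon covers the same pixel count."""
    return polygons_per_second * pixels_per_polygon


if __name__ == "__main__":
    for fps in (60, 30, 20):
        budget = polygons_per_frame(POLYGONS_PER_SECOND, fps)
        print(f"{fps} fps: about {budget:.0f} polygons per frame")
    fill = pixels_per_second(POLYGONS_PER_SECOND, PIXELS_PER_POLYGON)
    print(f"implied fill rate: {fill / 1e6:.1f} MPixels/s")
```

    The implied fill rate comes out at about 16.5 MPixels/s, which is consistent with the 16 MPixels/s Gouraud figure, so the two numbers on the page at least agree with each other.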

    Those numbers are, obviously, still an awesome boost over Jaguar, let alone the fact that on the Jag you have to do scanline traversal and compute endpoint data for the Blitter for every scanline (basically killing 90% of the RISC's performance on just that). Here, it appears, you just give the Saturn coordinates and colors, and it'll shade and interpolate the whole polygon for you, automagically.

    Which is, obviously, pretty awesome, as you suddenly also gain 90% of GPU's performance for other stuff and effects, as you don't have to handhold Blitter for each scanline of the polygon...

    What's the most detailed Gouraud shaded game on Saturn ?
     
  2. Ponut

    Ponut New Member

    What I know is you can realistically expect about 750 flat shaded, 32x32 textured polygons at 30 FPS with some special effects running alongside that (music, sound, shading, animation). This isn't just based on XL2's demos, it's also based on my own tests.

    The key point to take away here is that the Saturn is not very good at textures in particular. If your polygons have no textures, and are flat color flat shaded, you can encroach upon the 1200-1300 polygon limit that is in place on VDP1 itself (you can't exceed that number of polygons, forget what it is exactly, more polygons simply won't display).

    Another thing to take away is using any real-time shading takes down the number of polygons you can realistically render, even if that's only being applied to some polygons.

    This is all in reference to SGL, not SBL or any other development library. There is hardware overhead that comes from development libraries that means you just can't reach what the hardware is capable of in a raw sense. I wouldn't trust that website's hardware comparisons for graphics, but what do I know? Only the basics of what the hardware can *actually* do, in code, in a game.

    The game that uses the most Gouraud shading is probably Quake. You should also take a look at GunGriffon [and/or II].
     
    Last edited: Sep 29, 2018
    VladR likes this.
  3. antime

    antime Extra Hard Mid Boss

    The VDP1 framebuffer is 16bpp, regardless of source data format (in high-res and HDTV formats it's 8bpp, but there's no indication that's what is meant). There's also just a single type of untextured quad drawing command, where the color is given as a 16 bit value. I suspect the 8bpp entry is inserted because on other hardware there is a difference.

    It also looks like the whole table uses theoretical values, calculated from numbers given in various documentation, rather than measured, so it should be taken with a huge grain of salt.
     
    VladR and Ponut like this.
  4. XL2

    XL2 Member

    The quad count isn't as important as the amount of pixels written.
    I did a test with Quake's maps in Sonic Z-Treme and reached 1000 drawn quads at 30 fps, including about 100 with gouraud shading, with higher quality textures than in Sonic X-Treme maps.
    Many things influence how much you can draw.
    Texture quality has a huge impact (16x16 vs 64x64 vs flat), gouraud shading too, color calculation effects, etc.
    Also, the distortion of quads has a really huge impact since it creates a lot of overdraw, sometimes almost doubling the number of pixels getting written.
    I think using 16 colors lookup tables might be a bit slower than using a palette code since the vdp2 processes the palette codes (the actual color) instead of the vdp1, but I could be wrong.
    A draw command takes something like 36 cycles to process (not drawing, just reading it), so it's not that bad.
    The reason I can "only" pull 750-800 quads at 30 fps in Sonic Z-Treme is because I merge several quads, sometimes 16 quads become 1, and generate a new texture at half width and half height (like 32x32 becomes 16x16), so the number of drawn pixels is in fact about the same as if you draw these 16 quads with also reduced texture quality.
    The biggest issue with really high quad count is mainly about the cpu, since you need to transform the vertices, check if the quad is in the view frustum, create and send the draw commands, etc.

    According to corvus, who analysed most Saturn games, some games can reach 2000 quads at 30 or 25 fps.

    So in theory maybe the Saturn is faster on paper, but because of the overdraw issue when using distorted sprites the performance is usually worse than on PS1. And the Saturn is terrible at transparency, also because of that overdraw issue and because of the architecture (vdp1 vs vdp2).

    About gouraud shading, many games use gouraud shading almost everywhere, like Tomb Raider, Burning Rangers and Nights.
    Quake is one of the most impressive games for its VDP1 usage, but sadly the technique they used seems slow for the cpu (bsp and lots of portals). It also doesn't use gouraud shading on enemies.
     
    Last edited: Sep 29, 2018
  5. VladR

    VladR New Member

    Thanks. I presume that number is not primarily transform-bound ? But, it's a reflection of a total system load (VDP bandwidth, DMA, Audio bandwidth, CPU cost), correct ?
    What kind of game type are we talking here?

    Well, i suppose the definition of "very good" depends on where you're coming from. I'm coming from Jaguar (which can do only single-line texture spans), so :hehehe:
    Why wouldn't VDP process more polygons than those ~1,200 ? I presume it just won't be 60 fps anymore (which is OK for plenty games).
    I presume, that unlike ObjectProcessor in Jaguar, which forces 60-fps upon you (otherwise you encounter very nasty screen tear glitch), VDP simply processes polygons in batch (from my current understanding of the docs).
    So, if it takes 10 frames to render them (resulting in 60/10 = 6 fps), then VDP will not glitch, correct ?

    Or is there some arbitrary limit placed on the length of Command Table ? It appears to me you can freely place jumps and routines into it, so if one wanted, he could submit a 20,000 polygon batch (resulting in ~1 fps), correct ?
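
    The 60/10 = 6 fps reasoning can be sketched as below. This assumes the finished frame can only be swapped in on a display refresh, so a batch always costs a whole number of refreshes; that assumption and the helper names are illustrative, and the throughput plugged in is the theoretical 10x10 Gouraud figure, which a real workload would not reach.

```python
def vblanks_needed(polygons: int, polygons_per_second: int, display_hz: int = 60) -> int:
    """Whole display refreshes needed to draw one batch (rounded up, minimum 1)."""
    ticks = -(-polygons * display_hz // polygons_per_second)  # ceiling division
    return max(1, ticks)


def effective_fps(vblanks_per_batch: int, display_hz: int = 60) -> float:
    """Frame rate when every rendered frame occupies this many refreshes."""
    return display_hz / vblanks_per_batch


if __name__ == "__main__":
    # The example from the post: a batch that takes 10 refreshes to draw.
    print(effective_fps(10))
    # A 20,000 polygon batch at the theoretical Gouraud rate.
    ticks = vblanks_needed(20_000, 164_576)
    print(ticks, effective_fps(ticks))
```

    The absolute result depends entirely on the throughput you plug in; with realistic (much lower) polygon rates the same batch lands far closer to the ~1 fps guessed above.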



    Wait, do you mean that if you have 900 polys untextured and 100 polys textured, you still pay the performance price of texturing even on those 900 untextured ones ? That would be quite weird.




    Yes, the HW isn't doing things for free, obviously. It's clocked at some frequency, that code has 2 nested loops (outer and inner), which can be BTW nicely benchmarked on Jaguar, because you can specify NOP Blitter operation, thus only the loop traversal is happening in HW.
    Having implemented texturing manually, I can certainly appreciate if HW is doing it for me, like is the case of Saturn.



    I don't follow. The theoretical number (that is often showcased upon new HW launch) is usually obtained by spending all resources on just one test.
    So, in case of our previously mentioned VDP1 Gouraud shading test, it would mean having a list of ~2,700 quads (10x10 pixels) per frame, thus resulting in ~2,743 * 60 = ~165,000 quads/s.


    Still, and this is an opinion that has been brewing in my mind for the last few years while I was merely considering doing something on Saturn too (finally taking action now), all those hypothetical numbers can actually be increased, as they merely measure the throughput of the VDPs.

    There's 2 SH-2s and a DSP in the system, giving you roughly ~71 MHz of RISC throughput.
    I have a pretty good idea what 26 MHz RISC can do in SW (in terms of flatshading - as I still keep both HW (Blitter-assisted) and SW codepaths in the build (via compile-time flags)), now if we ~triple that, we can get to what Saturn is capable of.

    And that's all just on top of VDP's polygon throughput. Of course, what I don't know, is how much bandwidth is left in the system (for SH-2s, to draw into the framebuffer) after VDP's done, but unlike Jaguar, which has just 1 bus shared with everything, Saturn has several buses, with DMA access, so I think those 71 MHz could impose some serious damage on top of VDP and bring scene complexity from ~PS2 to Saturn.

    Obviously, that's at the cost of immense technological effort :hehehe: Just dividing the rendering effort between 2 processors without incurring long waiting is hard enough. Adding third processor to the mix (DSP) would certainly make things even more complex, but oh, man - what an absolutely wonderful challenge!


    Basically, my idea of a hybrid HW+SW engine is this:
    1. Render Background 3D scene via VDP (in HW)
    2. Render Foreground 3D scene via SH-2s + DSP (in SW)

    I already have the working RISC triangle flatshader in SW, with clipping and everything. It's jaguar's RISC, but that shouldn't be too much effort to port to SH-2 (from quick glance at the SH-2 instruction set).

    It's not gonna work for any generic 3D game type, but racing, platformers, top-down RPGs could totally work like that just fine (as there's not huge Z-buffer issues that way).

    Let's try to imagine a simple example - imagine a Diablo 3-style camera:
    1. VDP: 3D Environment (HW rendering)
    2. SH-2s: 3D Characters (SW rendering) - of course limited to few characters, not dozens :)
    Hope that makes sense...
     
  6. VladR

    VladR New Member

    That's fine, I am taking those numbers only in the context of a theoretical benchmark, outside of real-world gameplay scenario.

    Wow! Full. Stop.

    Does it mean, that for CPU direct access of the framebuffer, 320x224 at 4-bit color actually stores 16-bits per every pixel ? So, if I was flatshading at 4-bit, one 16bit write wouldn't fill 4 pixels, but just one ? OMG...
    So, all the sub-16-bit coloring on Saturn is merely indirect, internally ?

    I am asking, as on Jaguar, framebuffer 320x200 at 4-bit color takes 32 KB, but at 16-bit color it takes 128 KB. That's a huge difference in terms of bandwidth, and how much performance is left for the system.

    There are many 16-color scenarios where its much higher performance would be quite useful, but if 16-bit is forced upon coder, that really sucks...
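
    For reference, the framebuffer sizes being compared here follow from a one-line calculation. This is only a sketch with an illustrative helper name; it assumes tightly packed pixels with no padding between lines.

```python
def framebuffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Size of a tightly packed framebuffer in bytes."""
    return width * height * bits_per_pixel // 8


if __name__ == "__main__":
    for bpp in (4, 8, 16):
        size = framebuffer_bytes(320, 200, bpp)
        print(f"320x200 at {bpp} bpp: {size} bytes")
    # The Saturn resolution mentioned above, stored at 16 bits per pixel.
    print(f"320x224 at 16 bpp: {framebuffer_bytes(320, 224, 16)} bytes")
```

    The 4-bit and 16-bit results (32,000 and 128,000 bytes) are the 32 KB and 128 KB figures quoted for the Jaguar, a 4x difference in data written per frame.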
     
  7. VladR

    VladR New Member

    Yeah, but your engine is BSP/portal, right ? That's a lot of CPU overhead every single frame. Still, a very sexy number. How much usage of slave SH2 do you do ?

    Wait, does it mean that internally VDP still processes each pixel of the bounding box of such a quad ? Because in the worst case (when a rectangle is skewed at a 45° angle), about half of them are transparent, as they don't belong to the quad in the first place (they're outside of the edges).
    I was hoping VDP would internally do scanline-based rasterization via edge tracing...

    Only 2000, eh ? OK, nice to know, but since it's likely textured, it's not a bad number...

    Yeah, BSP and portals are a CPU killer unfortunately, which is why I stay away from those types of engines personally...
     
  8. Ponut

    Ponut New Member

    don't worry about me, just idle conjecture of someone with far less programming experience than anyone else here...

    That can actually be how it ends up working if you don't organize your commands right.
    I obviously don't, because I have no idea what I'm doing.

    IIRC it's not a hardware limit, rather it is an SGL limit. If there are more polygons than the limit, they simply do not display with SGL default behavior. Other folks should know how to get past this limit, where the limit is, or if I am just seeing some other glitch.
    Other folks say there are games that exceed that and I certainly believe it.

    2. Not really a game yet :(
    1. No, it's not transform-bound. It's bound based on B-bus saturation (which is on the same communication bus as the SCSP [sound], VDP1, and VDP2). I think if you used MIDIs you could get good improvements.

    I'm the least experienced and least accomplished here, so please, defer to those with more experience. Thanks for reading. (I'm aware based on other's more detailed answers I should not have even posted :p )
     
    Last edited: Sep 29, 2018
  9. antime

    antime Extra Hard Mid Boss

    No such thing.

    Each framebuffer pixel consists of 1 bit to indicate format (RGB or palette data), and then either a RGB555 colour, or a format-dependent mix of palette address and palette index.
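
    A rough sketch of what reading such a pixel might look like in software. The exact bit positions are assumptions for illustration (format flag in the top bit, red in the lowest five bits), and the palette case is left undecoded because its layout is format-dependent, as noted above.

```python
RGB_FLAG = 0x8000  # assumption: the top bit marks a direct-colour (RGB) pixel


def decode_pixel(word: int):
    """Split one 16-bit framebuffer word into a format tag and payload."""
    word &= 0xFFFF
    if word & RGB_FLAG:
        # assumption: 5 bits per channel, red lowest, then green, then blue
        r = word & 0x1F
        g = (word >> 5) & 0x1F
        b = (word >> 10) & 0x1F
        return ("rgb", (r, g, b))
    # Palette data: the split between palette address and index depends on
    # the colour mode, so the raw 15 bits are returned untouched.
    return ("palette", word & 0x7FFF)


if __name__ == "__main__":
    print(decode_pixel(0x8000 | 0x1F))
    print(decode_pixel(0x0123))
```

    Either way, one pixel always occupies one 16-bit word, which is why a lower source color depth does not shrink the framebuffer traffic.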
     
  10. VladR

    VladR New Member

    Ouch, what a faceplant :(
    And here I was, thinking what kind of super hipoly scenes I could do at 25 % bandwidth like on jag.

    That would explain why that comparison web shows identical numbers for 8&16 bit.

    What a waste of performance. Now Saturn is fast, but not 4x as fast to compensate for 4x more data...

    So, basically one is indeed forced to 16-bit, unless in hires?

    So, jaguar is the last 32bit machine to offer fast flatshading at 4-bit color depth. It's a shame, as flatshading looks awesome clean and sharp at higher resolution.
     
  11. XL2

    XL2 Member

    I didn't read all the posts after mine, but the framebuffer is 16 bpp, you just use color lookup tables and palettes to index actual 16 bits colors (or 8 bits in high res).
    The only speed you get using palettes is that the vdp1 doesn't need to do the lookup, just the vdp2 which is very fast (really small gain, if any).
    About my engine, it's currently just an octree with LOD and mipmaps, so it's not that hard on the cpu.
    I'm now working on a bsp compiler with pvs, but no portals ingame (I accept some overdraw to reduce cpu usage since the mipmapping and LOD help a lot).
    I'm not done writing it so I can't compare the performances yet.
    About why distorted sprites have huge overdraw issue, it's because of how it does some kind of antialiasing while writing pixels to prevent "holes" in the texture.
    But it means that in some extreme situation you draw all the pixels twice per sprite.
    It also means that you can forget transparency on distorted quads.

    Also, the Saturn, like the 3DO, has no notion of UV coordinates, so all the quads have 4 vertices with implicit texture coordinates : 0,0 1,0 1,1 0,1
    Of course, if you use a SW renderer you can forget all that.
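
    To make the implicit texture coordinates and the overdraw concrete, here is a toy forward-mapping model. It is not the VDP1's actual algorithm, and it does not model the hole-filling extra pixels described above: it simply walks lines between the left edge (A to D) and the right edge (B to C), with the corners tied to texture coordinates 0,0 1,0 1,1 0,1, and counts how many pixel writes land on already written pixels. All names are illustrative.

```python
def lerp(p, q, t):
    """Linear interpolation between two 2D points."""
    return (p[0] + (q[0] - p[0]) * t, p[1] + (q[1] - p[1]) * t)


def span(p, q):
    """Length of an edge in pixel steps (longest axis)."""
    return max(abs(q[0] - p[0]), abs(q[1] - p[1]))


def forward_map(quad):
    """Return (pixel_writes, unique_pixels) for quad = [A, B, C, D].

    A, B, C, D correspond to texture corners (0,0), (1,0), (1,1), (0,1).
    """
    a, b, c, d = quad
    lines = max(span(a, d), span(b, c))  # one line per step of the longer side
    writes = 0
    touched = set()
    for j in range(lines + 1):
        v = j / lines if lines else 0.0
        lx, ly = (round(k) for k in lerp(a, d, v))
        rx, ry = (round(k) for k in lerp(b, c, v))
        steps = max(abs(rx - lx), abs(ry - ly))
        for i in range(steps + 1):
            u = i / steps if steps else 0.0
            # (u, v) is the implicit texture coordinate sampled for this write.
            x = round(lx + (rx - lx) * u)
            y = round(ly + (ry - ly) * u)
            touched.add((x, y))
            writes += 1
    return writes, len(touched)


if __name__ == "__main__":
    square = [(0, 0), (7, 0), (7, 7), (0, 7)]
    skewed = [(0, 0), (7, 0), (7, 7), (0, 3)]  # left edge shorter than right
    print("square:", forward_map(square))
    print("skewed:", forward_map(skewed))
```

    In this model the axis-aligned square writes every pixel exactly once, while the skewed quad writes the same number of pixels but covers fewer unique ones, because lines converge along the shorter edge. With transparency, each of those repeated writes would blend again, which is one way to picture why it breaks on distorted quads.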

    For the slave usage, SGL makes the slave process the transformations and drawing routine (on cpu side at least), so while it's not optimal it's quite good.
    But I suggest you use the hardware, you could pull way more this way.
    Just by playing a bit with SGL you will see how quickly you can get good results.
    But you could make good use of a small software renderer for effects like the transparency, just writing in a NBG0 or NBG1 bitmap layer and let the vdp2 pull the transparency.
     
  12. VladR

    VladR New Member

    Well, I guess Jaguar had a pretty smart design in that particular regard, as it was doing the translation to 16-bit color (from any bit depth : 1,2,4,8 bit) at runtime, during drawing of each picture line. It was a separate chip on ObjectProcessor.
    Especially for 4-bit, since it was natively reading 64-bits per one read (cycle, really), it meant reading 16 pixels per cycle, which is phenomenal throughput.
    And for flatshading, 16 colors can give you some nice base colors, so it is actually quite useable, and speed is just phenomenal.
    But, I'll shut up about 4-bit now...
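
    The 16-pixels-per-read figure is just the read width divided by the color depth. A tiny sketch, with the 64-bit read width taken from the post and the helper name invented for illustration:

```python
def pixels_per_read(bus_bits: int, bits_per_pixel: int) -> int:
    """How many whole pixels fit in one read of the given width."""
    return bus_bits // bits_per_pixel


if __name__ == "__main__":
    for bpp in (1, 2, 4, 8, 16):
        print(f"{bpp} bpp: {pixels_per_read(64, bpp)} pixels per 64-bit read")
```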


    Yeah, I did the same kind of thing on PC, around ~2002, but it wasn't ~30 MHz CPU (more like 600 Athlon at the time), so I'm not sure I wouldn't consider octree a pretty hard load on sub-30 MHz CPU :)


    Yeah, my idea was to merge both HW&SW rasterizer, given that I already spent a great deal of effort on a RISC-based rasterizer on jaguar, so the code should be totally transferable.
    For generic texturing, I'd obviously leave it to VDP, but I also did some perspective-texturing stuff for axis-aligned quads (walls of buildings and floor/ceiling) that is running completely in SW, rasterizing picture line by line, preparing the current scanline within 4 KB cache, and in parallel drawing previous scanline via Blitter to Framebuffer.
    Given that it's not real texturing on Saturn, let alone perspective-correct, I suppose I would totally reuse that texturing code too.
    Still, for arbitrary textured polygons, I'd defer to VDP.


    Wait, so the slave SH2 is not fully available ? I just assumed its load was zero.
    So, Sega actually had some baseline multithreaded codebase for developers ? WOW, that's quite a pipe dream in jaguar land...



    Oh, yeah. The power of SW rasterizer, where you can do anything you imagine - just must be willing to pay the development cost :)

    Right this moment, I'm working on a 16-bit texturing that automatically applies antialiasing along the edges - as the code has to draw scanline by scanline, which is what takes majority of performance, you might as well do few additional reads and just apply antialiasing at a minimal cost.
    It also made me realize (as until now I was just working in 4-bit and 8-bit color space), that I can quickly adjust my line drawing routine to apply antialiasing there too.

    No such freedom with HW-based functionality, but then again - submitting an array of polys totally beats rolling your own 4 KB rasterizer in RISC :)
     
  13. XL2

    XL2 Member

    For software rendering, just use a NBG0 or NBG1 bitmap layer at 4 bpp, you can do the same thing you did with the Jaguar.
    The VDP2 will take care of filling the framebuffer.
    No need to write directly to the framebuffer.
     
    Ponut likes this.
  14. Absolutely agree with this data. I would add something important: on the PSX side, the numbers are the same. The SS and PSX really were very similar machines at the time. The data on SegaRetro is generally amazing work. In some cases it is not very accurate, or is slanted to make one side win, but I feel very thankful for the amount of data in this wiki.

    Finally, all this "war of numbers" is very boring. The 3DO, Jaguar, SS and PSX are all great pieces of electronic technology for their time, and we cannot judge their value through today's view of 3D hardware. The texel rate is almost impossible to calculate for these graphics chips/systems, and the same goes for the effective pixel rate. The real polygon counts are there, in the games that were made. Is it possible to get somewhat higher numbers? Yes, absolutely, but not double, and not the millions of polygons that were claimed in the past, because those numbers were a big lie.

    For follow the upcoming updates:
    http://forum.jo-engine.org/index.php?topic=854.0

    258 games analyzed right now.
     
    Last edited: Oct 5, 2018
