My new project is up!

mrkotfw · Mar 9, 2012

Chilly Willy said:
My toolchain is just a "standard" build of gcc. It's just most folks have no idea how to build cross-compilers, so I made a makefile to help build the two separate cross-compilers targeted at the right processors in a uniform manner. Then I coupled that with custom linker scripts and some skeleton code so that you have the basis for C and C++ programs ready to go. I just have a minimum of code... basic functions to init the screen and print to it... that sort of thing. So you have a nice up-to-date compiler for the CPUs, and skeleton code for use in porting or making your own code for the two main languages. Oh, my toolchain builds the obj-c and objc++ compilers, but I don't yet have examples for them. I should really do that some time.

If you check the thread over in the Genesis/SCD/32X forum, you'll see I have a step-by-step tutorial on building the cross-compilers, then the basic examples (TicTacToe in C and C++ for Genesis and 32X), and then a few ports I've done that can be compiled with the toolchain as a more complete example than the basic examples.

So for the Saturn, I'd need to alter my 32X linker scripts a little, change my crt0.s file a little, and allow for the difference in hardware for things like setting up the screen and printing (or more complicated things for the ports). A console like the Saturn is more complex on the hardware side of things, so I've held off a bit while I checked out the SDK available on the Saturn. I don't want to use the old SEGA SDKs (SGL/SBL) mainly because they're just object files compiled under really old compilers, making them incompatible with my toolchain. That and they're proprietary. Hence the need for Iapetus or libyaul.

I have a crt0.s and linkerscript already in the repository. Unless there's things that you have in your linkerscript that I don't...

What plans/ideas do you have for the Saturn?

I'm looking to really create a solid 3D engine that avoids the issue all Saturn games had (that texture warping)

Chilly Willy · Mar 9, 2012

Piratero said:
I have a crt0.s and linkerscript already in the repository. Unless there's things that you have in your linkerscript that I don't...

What plans/ideas do you have for the Saturn?

I'm looking to really create a solid 3D engine that avoids the issue all Saturn games had (that texture warping)

The crt0.s and linkerscript work together to init the data and bss (if needed for the platform) and are particularly important for C++ startup, handling the execution of constructors (which also requires another file I include with the C++ source called crtstuff.c). In general, if a platform I'm looking at already has a crt0.s and linkerscript, I'll work from those as needed. I find that the crt0.s file is usually pretty good already (other than not having .init/.fini support), but the linker scripts are woefully out of date.

So I may or may not use the crt0.s and linker script you have... I figure they will probably need at least a few minor changes. Maybe they won't need any changes at all - I won't know until I check.
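As a rough illustration of the data/bss part of that startup work, here is a minimal C sketch. The function name `startup_init_ram` and its parameters are hypothetical; a real crt0.s does this in assembly using the symbols exported by the linker script, and the copy step is only needed when .data is stored somewhere other than where it runs.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper mirroring what a crt0 does before main():
 * copy the initialized data from its load address to its run
 * address (skipped when they are the same), then zero the bss.
 */
void startup_init_ram(const void *data_load, void *data_start, void *data_end,
                      void *bss_start, void *bss_end)
{
    size_t data_size = (size_t)((char *)data_end - (char *)data_start);
    size_t bss_size  = (size_t)((char *)bss_end - (char *)bss_start);

    if (data_load != data_start)
        memmove(data_start, data_load, data_size);
    memset(bss_start, 0, bss_size);
}
```

On a target the arguments would come from linker symbols such as `__data_start`, `__data_end`, `__bss_start` and `__bss_end`.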

What I have planned for the Saturn... well, a few of the "standard" ports I always do for a platform, like Wolf3D and Doom. While the Saturn version of Doom is pretty good, it's still based on the PSX version, which means it's not quite the same as the PC version everyone loves so. So I'll make a nice PC port. It will require the 4M ram cart, but given how damn cheap the ARP carts are, that shouldn't be a problem.

I also plan to get my port of Tremor working. I figure it will be more useful on the Saturn than on the 32X.

I was also looking at a port of OpenJazz (Jazz Jackrabbit 1/2). There's a number of things that are possible on the Saturn due to the extra ram and the CD that aren't on the 32X due to lack of ram.

mrkotfw · Mar 10, 2012

Chilly Willy said:
The crt0.s and linkerscript work together to init the data and bss (if needed for the platform) and are particularly important for C++ startup, handling the execution of constructors (which also requires another file I include with the C++ source called crtstuff.c). In general, if a platform I'm looking at already has a crt0.s and linkerscript, I'll work from those as needed. I find that the crt0.s file is usually pretty good already (other than not having .init/.fini support), but the linker scripts are woefully out of date.

So I may or may not use the crt0.s and linker script you have... I figure they will probably need at least a few minor changes. Maybe they won't need any changes at all - I won't know until I check.

What I have planned for the Saturn... well, a few of the "standard" ports I always do for a platform, like Wolf3D and Doom. While the Saturn version of Doom is pretty good, it's still based on the PSX version, which means it's not quite the same as the PC version everyone loves so. So I'll make a nice PC port. It will require the 4M ram cart, but given how damn cheap the ARP carts are, that shouldn't be a problem.

I also plan to get my port of Tremor working. I figure it will be more useful on the Saturn than on the 32X.

I was also looking at a port of OpenJazz (Jazz Jackrabbit 1/2). There's a number of things that are possible on the Saturn due to the extra ram and the CD that aren't on the 32X due to lack of ram.

Hey, awesome idea! Add in C++ support to my linker script! That's what's severely lacking. My crt0.S suboptimally clears the BSS/SBSS sections.

At this point, the biggest limiting factor is the ease of testing. I have a USB Data Link, but testing (viewing) is what is stopping me.

Any (cheap/good) capture cards or LCD (extremely cheap) that I can use?

Chilly Willy · Mar 10, 2012

Piratero said:
Hey, awesome idea! Add in C++ support to my linker script! That's what's severely lacking. My crt0.S suboptimally clears the BSS/SBSS sections.

Well, the way I do it is to have .init and .fini sections in the linker script along with CTOR/DTOR lists. Then in crtstuff.c I put a function into .init and .fini that calls a function to go through the CTOR/DTOR lists. Then crt0.s needs to call __INIT_SECTION__ and __FINI_SECTION__.

crtstuff.c

Code:

/* C++ CTOR / DTOR handling */

#include <stdlib.h>

/* Used by exit procs */
void *__dso_handle = 0;

extern void __call_exitprocs (int code, void *ptr);

typedef void (*func_ptr) (void);

extern func_ptr __CTOR_LIST__[];
extern func_ptr __DTOR_LIST__[];

/* Do all constructors. */
static void __attribute__((used)) __do_global_ctors (void)
{
    do
    {
        unsigned int i, n = (unsigned int) __CTOR_LIST__[0];
        for (i = n; i >= 1; i--)
            __CTOR_LIST__[i] ();
    } while (0);
}

/* Do all destructors. */
static void __attribute__((used)) __do_global_dtors (void)
{
    do
    {
        unsigned int i, n = (unsigned int) __DTOR_LIST__[0];
        for (i = 0; i < n; i++)
            __DTOR_LIST__[i + 1] ();
    } while (0);
}

/* Add function to .init section.  */
static void __attribute__((used, section (".init"))) __std_startup (void)
{
    atexit (__do_global_dtors);         /* First added, last called.  */
    __do_global_ctors ();               /* Do all constructors. */
}

/* Add function to .fini section.  */
static void __attribute__((used, section (".fini"))) __std_cleanup (void)
{
    __call_exitprocs (0, NULL);
}

then in sh2_crt0.s after clearing the bss for the main SH2 startup

Code:

! do all initializers
        mov.l   _master_do_init,r0
        jsr     @r0
        nop

...

! purge cache, turn it on, and run main()
        mov.l   _master_cctl,r0
        mov     #0x11,r1
        mov.b   r1,@r0
        mov.l   _master_go,r0
        jsr     @r0
        nop

! do all finishers
        mov.l   _master_do_fini,r0
        jsr     @r0
        nop
2:
        bra     2b
        nop

...

_master_go:
        .long   _main
_master_do_init:
        .long   __INIT_SECTION__
_master_do_fini:
        .long   __FINI_SECTION__

and finally, the ldscript looks like this

Code:

OUTPUT_ARCH(sh)
EXTERN (_start)
ENTRY (_start)
__DYNAMIC  =  0;

/*
 * The memory map looks like this:
 * +--------------------+ <- 0x02000000
 * | .text              |
 * |                    |
 * |         __text_end |
 * +--------------------+
 * .                    .
 * .                    .
 * .                    .
 * +--------------------+ <- 0x06000000
 * | .data              | initialized data goes here
 * |                    |
 * |         __data_end |
 * +--------------------+
 * | .bss               |
 * |        __bss_start | start of bss, cleared by crt0
 * |                    |
 * |          __bss_end | start of heap, used by sbrk()
 * +--------------------+
 * .                    .
 * .                    .
 * .                    .
 * |            __stack | top of stack (for Master SH2)
 * +--------------------+ <- 0x0603FC00
 */

MEMORY
{
    rom (rx) : ORIGIN = 0x02000000, LENGTH = 0x00400000
    ram (wx) : ORIGIN = 0x06000000, LENGTH = 0x0003FC00
}

/*
 * Allocate the stack to be at the top of memory, since the stack
 * grows down
 */
PROVIDE (__stack = 0x0603FC00);

SECTIONS
{
  .text 0x02000000 :
  AT ( 0x00000000 )
  {
    __text_start = .;
    *(.text)
    *(.text.*)
    *(.gnu.linkonce.t.*)

    . = ALIGN(16);
    __INIT_SECTION__ = .;
    KEEP (*(.init))
    SHORT (0x000B)  /* rts */
    SHORT (0x0009)  /* nop */

    . = ALIGN(16);
    __FINI_SECTION__ = .;
    KEEP (*(.fini))
    SHORT (0x000B)  /* rts */
    SHORT (0x0009)  /* nop */

    *(.eh_frame_hdr)
    KEEP (*(.eh_frame))
    *(.gcc_except_table)
    KEEP (*(.jcr))

    . = ALIGN(16);
    __CTOR_LIST__ = .;
    ___CTOR_LIST__ = .;
    LONG((__CTOR_END__ - __CTOR_LIST__) / 4 - 2)
    KEEP (*(SORT(.ctors.*)))
    KEEP (*(.ctors))
    LONG(0)
    __CTOR_END__ = .;

    . = ALIGN(16);
    __DTOR_LIST__ = .;
    ___DTOR_LIST__ = .;
    LONG((__DTOR_END__ - __DTOR_LIST__) / 4 - 2)
    KEEP (*(SORT(.dtors.*)))
    KEEP (*(.dtors))
    LONG(0)
    __DTOR_END__ = .;

    *(.rdata)
    *(.rodata)
    *(.rodata.*)
    *(.gnu.linkonce.r.*)
    . = ALIGN(16);
    __text_end = .;
  } > rom
  __text_size = __text_end - __text_start;

  .data 0x06000000 :
  AT ( LOADADDR (.text) + SIZEOF (.text) )
  {
    __data_start = .;
    *(.data)
    *(.data.*)
    *(.gnu.linkonce.d.*)
    CONSTRUCTORS

    *(.lit8)
    *(.lit4)
    *(.sdata)
    *(.sdata.*)
    *(.gnu.linkonce.s.*)
    . = ALIGN(16);
    __data_end = .;
  } > ram
  __data_size = __data_end - __data_start;

  .bss :
  {
    __bss_start = .;
    *(.bss)
    *(.bss.*)
    *(.gnu.linkonce.b.*)
    *(.sbss)
    *(.sbss.*)
    *(.gnu.linkonce.sb.*)
    *(.scommon)
    *(COMMON)
    end = .;
    _end = ALIGN (16);
    __end = _end;
    __bss_end = .;
  } > ram
  __bss_size = __bss_end - __bss_start;
}
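To make the list layout concrete, here is a small host-side C sketch that simulates it: a count word, the entries, then a zero terminator, walked from the last entry to the first the way __do_global_ctors does. The names `run_ctor_list_demo`, `ctor_a`, `ctor_b` and `ctor_c` are made up for the illustration, and the table is a local array rather than a linker-generated section.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*func_ptr)(void);

static int call_order[3];
static int call_count;

static void ctor_a(void) { call_order[call_count++] = 1; }
static void ctor_b(void) { call_order[call_count++] = 2; }
static void ctor_c(void) { call_order[call_count++] = 3; }

/*
 * Host stand-in for the table the ldscript emits:
 *   word 0      entry count, i.e. (END - LIST) / word size - 2
 *   words 1..n  function pointers
 *   last word   0 terminator
 * Walks it from the last entry to the first, like __do_global_ctors.
 * Returns the number of constructors called.
 */
int run_ctor_list_demo(void)
{
    uintptr_t list[5];
    size_t words = sizeof list / sizeof list[0];
    size_t i, n;

    list[0] = words - 2;            /* what the LONG(...) expression computes */
    list[1] = (uintptr_t)ctor_a;
    list[2] = (uintptr_t)ctor_b;
    list[3] = (uintptr_t)ctor_c;
    list[4] = 0;

    call_count = 0;
    n = (size_t)list[0];
    for (i = n; i >= 1; i--)
        ((func_ptr)list[i])();
    return call_count;
}
```

The "- 2" in the count expression accounts for the count word itself and the terminator, which is why five words describe three constructors.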

At this point, the biggest limiting factor is the ease of testing. I have a USB Data Link, but testing (viewing) is what is stopping me.

Any (cheap/good) capture cards or LCD (extremely cheap) that I can use?

Hmm - I'd recommend hitting a few yard sales and looking for old, used TVs. A 20" color TV would work nicely with the Saturn, and I'd bet you could get one at a yard sale for $20 or less. LCDs capable of RF, composite, or S-Video in cost more than plain RGB LCD panels. I've seen plain RGB LCD panels on sale at NewEgg for $50, but then you'd need some way to convert the Saturn video to VGA for the panel.

mrkotfw · Mar 10, 2012

Great. I'll add in support sometime later today.

Chilly Willy · Mar 10, 2012

On the linker script, this would clearly all just load into high ram. My 32X linker script (you may have noticed) put .text in rom and .data/.bss in sdram. For the Saturn, it would be more like

Code:

OUTPUT_ARCH(sh)
EXTERN (_start)
ENTRY (_start)
__DYNAMIC  =  0;

/*
 * The memory map looks like this:
 * +--------------------+ <- 0x06002000
 * |                    |
 * |            __stack | top of stack (for Master SH2)
 * +--------------------+ <- 0x06004000
 * | .text              |
 * |                    |
 * |         __text_end |
 * +--------------------+
 * | .data              | initialized data goes here
 * |                    |
 * |         __data_end |
 * +--------------------+
 * | .bss               |
 * |        __bss_start | start of bss, cleared by crt0
 * |                    |
 * |          __bss_end | start of heap, used by sbrk()
 * +--------------------+
 * .                    .
 * .                    .
 * .                    .
 * +--------------------+ <- 0x06100000
 */

MEMORY
{
    ram (wx) : ORIGIN = 0x06004000, LENGTH = 0x000FC000
}

/*
 * The stack is just before the load address on the Saturn
 */
PROVIDE (__stack = 0x06004000);

SECTIONS
{
  .text 0x06004000 :
  AT ( 0x00000000 )
  {
    __text_start = .;
    *(.text)
    *(.text.*)
    *(.gnu.linkonce.t.*)

    . = ALIGN(16);
    __INIT_SECTION__ = .;
    KEEP (*(.init))
    SHORT (0x000B)  /* rts */
    SHORT (0x0009)  /* nop */

    . = ALIGN(16);
    __FINI_SECTION__ = .;
    KEEP (*(.fini))
    SHORT (0x000B)  /* rts */
    SHORT (0x0009)  /* nop */

    *(.eh_frame_hdr)
    KEEP (*(.eh_frame))
    *(.gcc_except_table)
    KEEP (*(.jcr))

    . = ALIGN(16);
    __CTOR_LIST__ = .;
    ___CTOR_LIST__ = .;
    LONG((__CTOR_END__ - __CTOR_LIST__) / 4 - 2)
    KEEP (*(SORT(.ctors.*)))
    KEEP (*(.ctors))
    LONG(0)
    __CTOR_END__ = .;

    . = ALIGN(16);
    __DTOR_LIST__ = .;
    ___DTOR_LIST__ = .;
    LONG((__DTOR_END__ - __DTOR_LIST__) / 4 - 2)
    KEEP (*(SORT(.dtors.*)))
    KEEP (*(.dtors))
    LONG(0)
    __DTOR_END__ = .;

    *(.rdata)
    *(.rodata)
    *(.rodata.*)
    *(.gnu.linkonce.r.*)
    . = ALIGN(16);
    __text_end = .;
  } > ram
  __text_size = __text_end - __text_start;

  .data :
  {
    __data_start = .;
    *(.data)
    *(.data.*)
    *(.gnu.linkonce.d.*)
    CONSTRUCTORS

    *(.lit8)
    *(.lit4)
    *(.sdata)
    *(.sdata.*)
    *(.gnu.linkonce.s.*)
    . = ALIGN(16);
    __data_end = .;
  } > ram
  __data_size = __data_end - __data_start;

  .bss :
  {
    __bss_start = .;
    *(.bss)
    *(.bss.*)
    *(.gnu.linkonce.b.*)
    *(.sbss)
    *(.sbss.*)
    *(.gnu.linkonce.sb.*)
    *(.scommon)
    *(COMMON)
    end = .;
    _end = ALIGN (16);
    __end = _end;
    __bss_end = .;
  } > ram
  __bss_size = __bss_end - __bss_start;
}

mrkotfw · Mar 10, 2012

Thanks! I had made those exact changes before your new post. Any particular copyright (your real name, e-mail?) you'd like? I'm also unsure of your licensing.

I just found a small (fast) library for malloc()/free() so I'm working on that as well. I'll have to add newlib hooks sometime.

I also have some ideas as to how to handle all VDP1 command tables (polygons, etc) using a simple mark/sweep garbage collector (for textures too). If you have any suggestions/ideas, I'd like to hear them.

Chilly Willy · Mar 10, 2012

Piratero said:
Thanks! I had made those exact changes before your new post. Any particular copyright (your real name, e-mail?) you'd like? I'm also unsure of your licensing.

Anything I do is MIT unless otherwise stated, or based off an existing base with its own license (obviously, I can't change GPL code to something else).

I've mentioned my licensing in a few places before, but I really should get around to sticking these things in the files. MIT or the new BSD license are fine with me - I want anything I do to be as useful to as many people as possible.

Joe Fenton <jlfenton65@gmail.com> is fine for the author/contact.

I just found a small (fast) library for malloc()/free() so I'm working on that as well. I'll have to add newlib hooks sometime.

Which one? I mostly use the standard allocator in libc, or msys - Simple Malloc & Free Functions - by malkia@digitald.com. I made two versions of msys that are identical, but meant to be used by the separate SH2s - msys for the Master SH2, and ssys for the Slave SH2. That avoids any cache coherency and locking needed to try to share allocators between the two processors.

I also have some ideas as to how to handle all VDP1 command tables (polygons, etc) using a simple mark/sweep garbage collector (for textures too). If you have any suggestions/ideas, I'd like to hear them.

Make a third allocator from msys called vsys for the VDP1, then just alloc blocks as needed, and reinit the zone to clear everything at once. Maybe FOUR allocators from msys: one for the Master SH2, one for the Slave SH2, one for VDP1, and one for VDP2.

I've thought of altering msys to take a zone for an input argument. Then you could have any number of zones. For VDP1, allocate a block from the main vram zone, then create a new zone using that block for tables and whatnot. Then you can clear everything associated with one zone without affecting the others. Zones inside zones...
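To show what "take a zone for an input argument" could look like, here is a minimal C sketch. It uses a plain bump allocator instead of msys itself, purely to keep the example short, and every name in it (`struct zone`, `zone_init`, `zone_alloc`, `zone_reset`, `zone_init_child`) is hypothetical.

```c
#include <stddef.h>

/* Hypothetical zone: a bump allocator over a caller-supplied block. */
struct zone {
    unsigned char *base;
    size_t size;
    size_t used;
};

void zone_init(struct zone *z, void *block, size_t size)
{
    z->base = block;
    z->size = size;
    z->used = 0;
}

/* Returns NULL when the zone is full; sizes are rounded up to 4 bytes. */
void *zone_alloc(struct zone *z, size_t size)
{
    void *p;

    size = (size + 3) & ~(size_t)3;
    if (size > z->size - z->used)
        return NULL;
    p = z->base + z->used;
    z->used += size;
    return p;
}

/* Forget every allocation in this zone at once. */
void zone_reset(struct zone *z)
{
    z->used = 0;
}

/* Carve a child zone out of a parent ("zones inside zones"). */
int zone_init_child(struct zone *child, struct zone *parent, size_t size)
{
    void *block = zone_alloc(parent, size);

    if (!block)
        return 0;
    zone_init(child, block, size);
    return 1;
}
```

Because all state lives in the `struct zone` passed in, each CPU or VDP can own its own zone without sharing globals, and resetting a child zone leaves the parent and its other children untouched.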

Chilly Willy · Mar 10, 2012

Here's MSYS in case you haven't seen it before.

EDIT: This version has a new function I added - MSYS_Set() that allows you to set the zone. MSYS_Init() was altered to take a pointer to a zone struct to return the values of the initialized zone. So this MSYS allows you to make zones inside of zones using those features.

EDIT 2: I also added MSYS_Realloc() - it takes care of both special cases that can occur with realloc(), but is otherwise pretty dumb, allocating a new block, copying the data, then freeing the old block. I also renamed MSYS_Alloc to MSYS_Malloc and added MSYS_Calloc.

msys.c

Code:

/* Simple Malloc & Free Functions - by malkia@digitald.com */

#include <string.h>

#define USED 1

typedef struct {
  unsigned size;
} UNIT;

typedef struct {
  UNIT* free;
  UNIT* heap;
} MSYS;

static MSYS msys;
static int msysNumAllocs;

static UNIT* compact( UNIT *p, unsigned nsize )
{
    unsigned bsize, psize;
    UNIT *best;

    best = p;
    bsize = 0;

    while( psize = p->size, psize )
    {
        if( psize & USED )
        {
            if( bsize != 0 )
            {
                best->size = bsize;
                if( bsize >= nsize )
                {
                    return best;
                }
            }
            bsize = 0;
            best = p = (UNIT *)( (unsigned)p + (psize & ~USED) );
        }
        else
        {
            bsize += psize;
            p = (UNIT *)( (unsigned)p + psize );
        }
    }

    if( bsize != 0 )
    {
        best->size = bsize;
        if( bsize >= nsize )
        {
            return best;
        }
    }

    return 0;
}

void MSYS_Free( void *ptr )
{
    if( ptr )
    {
        UNIT *p;

        p = (UNIT *)( (unsigned)ptr - sizeof(UNIT) );
        p->size &= ~USED;
    }
}

void *MSYS_Malloc( unsigned size )
{
    unsigned fsize;
    UNIT *p;

    msysNumAllocs++;

    if( size == 0 ) return 0;

    size  += 3 + sizeof(UNIT);
    size >>= 2;
    size <<= 2;

    if( msys.free == 0 || size > msys.free->size )
    {
        msys.free = compact( msys.heap, size );
        if( msys.free == 0 ) return 0;
    }

    p = msys.free;
    fsize = msys.free->size;

    if( fsize >= size + sizeof(UNIT) )
    {
        msys.free = (UNIT *)( (unsigned)p + size );
        msys.free->size = fsize - size;
    }
    else
    {
        msys.free = 0;
        size = fsize;
    }

    p->size = size | USED;

    return (void *)( (unsigned)p + sizeof(UNIT) );
}

void *MSYS_Calloc( unsigned cnt, unsigned size )
{
    void *ptr = MSYS_Malloc( cnt*size );
    if (ptr)
        memset(ptr, 0, cnt*size);
    return ptr;
}

void *MSYS_Realloc( void *mem, unsigned new_size )
{
    void *ptr;
    UNIT *p;
    unsigned old_size;

    // special cases
    if (!mem)
        return MSYS_Malloc( new_size );
    if (!new_size)
    {
        MSYS_Free( mem );
        return MSYS_Malloc( new_size );
    }

    // just alloc a new block and copy
    ptr = MSYS_Malloc( new_size );
    if (ptr)
    {
        // copy no more than the old block's payload, so growing
        // a block never reads past the end of the old one
        p = (UNIT *)( (unsigned)mem - sizeof(UNIT) );
        old_size = (p->size & ~USED) - sizeof(UNIT);
        memcpy(ptr, mem, old_size < new_size ? old_size : new_size);
        MSYS_Free( mem );
    }
    return ptr;
}

void MSYS_Init( void *heap, unsigned len, MSYS *ptr )
{
    msysNumAllocs = 0;

    len  += 3;
    len >>= 2;
    len <<= 2;

    msys.free = msys.heap = (UNIT *) heap;
    msys.free->size = msys.heap->size = len - sizeof(UNIT);
    *(unsigned *)((char *)heap + len - 4) = 0;

    if (ptr)
    {
        ptr->free = msys.free;
        ptr->heap = msys.heap;
    }
}

void MSYS_Compact( void )
{
    msys.free = compact( msys.heap, 0x7FFFFFFF );
}

void MSYS_Set( MSYS *ptr )
{
    msys.free = ptr->free;
    msys.heap = ptr->heap;
}

msys.h

Code:

#ifndef _MSYS_H_
#define _MSYS_H_

#ifdef __cplusplus
extern "C"
{
#endif /* __cplusplus */

typedef struct {
  unsigned size;
} UNIT;

typedef struct {
  UNIT* free;
  UNIT* heap;
} MSYS;

extern void MSYS_Free( void *ptr );
extern void *MSYS_Malloc( unsigned size );
extern void *MSYS_Calloc( unsigned cnt, unsigned size );
extern void *MSYS_Realloc( void *mem, unsigned size );
extern void MSYS_Init( void *heap, unsigned len, MSYS *ptr );
extern void MSYS_Compact( void );
extern void MSYS_Set( MSYS *ptr );

#ifdef __cplusplus
}
#endif /* __cplusplus */

#endif

mrkotfw · Mar 12, 2012

Chilly Willy said:
Anything I do is MIT unless otherwise stated, or based off an existing base with its own license (obviously, I can't change GPL code to something else).

I've mentioned my licensing in a few places before, but I really should get around to sticking these things in the files. MIT or the new BSD license are fine with me - I want anything I do to be as useful to as many people as possible.

Joe Fenton <jlfenton65@gmail.com> is fine for the author/contact.

Done. I will commit as soon as I actually test it out with some STL examples.

Chilly Willy said:
Which one? I mostly use the standard allocator in libc, or msys - Simple Malloc & Free Functions - by malkia@digitald.com. I made two versions of msys that are identical, but meant to be used by the separate SH2s - msys for the Master SH2, and ssys for the Slave SH2. That avoids any cache coherency and locking needed to try to share allocators between the two processors.

I'm using the one from: http://tlsf.baisoku.org/ I haven't tested it, but I'm going to make the necessary changes.

I would say that using locks would be best (that'll add in the framework for threads). Aside from that, why not make the code thread-safe by avoiding the use of global/static variables?

Chilly Willy said:
Make a third allocator from msys called vsys for the VDP1, then just alloc blocks as needed, and reinit the zone to clear everything at once. Maybe FOUR allocators from msys: one for the Master SH2, one for the Slave SH2, one for VDP1, and one for VDP2.

I've thought of altering msys to take a zone for an input argument. Then you could have any number of zones. For VDP1, allocate a block from the main vram zone, then create a new zone using that block for tables and whatnot. Then you can clear everything associated with one zone without affecting the others. Zones inside zones...

Yeah, I was thinking something similar except in a tree structure. You can have subtrees and such of command tables. The tree itself is kept in WORKRAM-H as well as all the command tables (there should be an upper bound on number of command tables in memory).

Priority and order is based on how the tree is to be traversed.

Before the tree of command tables is uploaded (that is, only the ones that have changed in WORKRAM-H, since this tree is essentially a cache of VDP1 VRAM), they're sorted properly by the LUT of transfers passed to any of the three SCU DMA levels.

Or they're sorted by using the linked list which tells VDP1 what command table to access next. Chances are, it's going to be a mixture of both.

Example:

So consider adding X command tables in WORKRAM-H at address W. Starting at offset Y in VDP1 VRAM, where the first command table is stored, I could traverse the tree and create a LUT of transfers for SCU-DMA (indirect mode):

src: W[X - 1], dst: Y[0], size: 32B

src: W[X - 2], dst: Y[1], size: 32B

src: W[0], dst: Y[2], size: 32B

src: W[1], dst: Y[3], size: 32B

src: W[4], dst: Y[4], size: 32B

src: W[5], dst: Y[5], size: 32B

src: W[6], dst: Y[6], size: 32B

And so on.

Now what if I want to update command table W[5] and delete W[6]? Update them in WORKRAM-H by writing a bit in W[6] that tells VDP1 to skip the command table. As for W[5], update whatever.

Then do another SCU-DMA transfer of only two transfers:

src: W[5], dst: Y[5], size: 32B

src: W[6], dst: Y[6], size: 32B

I'm going to have to keep track which have changed.
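One way to keep track of what changed is a dirty flag per command table, sketched below in C. The entry struct just mirrors the src/dst/size notation used above and is NOT the actual SCU indirect-mode table layout; `build_dirty_lut`, `struct xfer` and `CMDT_SIZE` are hypothetical names, and the sketch assumes slot i in W maps to slot i in Y, as in the W[5]/W[6] update example.

```c
#include <stddef.h>
#include <stdint.h>

#define CMDT_SIZE 32u   /* bytes per command table, as in the post */

/* Hypothetical LUT entry using the src/dst/size notation above;
 * not the real SCU indirect-mode table format. */
struct xfer {
    uint32_t src;
    uint32_t dst;
    uint32_t size;
};

/*
 * Append one transfer per dirty command table and clear its flag.
 * w_base: address of W[0] in WORKRAM-H
 * y_base: address of Y[0] in VDP1 VRAM
 * Returns the number of entries written to lut (at most max_entries).
 */
size_t build_dirty_lut(uint8_t *dirty, size_t count,
                       uint32_t w_base, uint32_t y_base,
                       struct xfer *lut, size_t max_entries)
{
    size_t i, n = 0;

    for (i = 0; i < count && n < max_entries; i++) {
        if (!dirty[i])
            continue;
        lut[n].src  = w_base + (uint32_t)i * CMDT_SIZE;
        lut[n].dst  = y_base + (uint32_t)i * CMDT_SIZE;
        lut[n].size = CMDT_SIZE;
        dirty[i] = 0;
        n++;
    }
    return n;
}
```

Updating W[5] and marking W[6] as skipped would set two flags, and the next call would produce exactly the two-entry list shown above.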

As for allocating memory, yeah that should be done by the standard malloc/free. Textures on the other hand could be done through garbage collection. For example, if I delete W[6] and it used a texture then decrement the ref. counter. Then put it back on the free/used list. I could use that standard allocator just for this purpose! Speed it up since the smallest we'll go is for an 8x1 4-bit texture (padded to be an 8x2 4-bit texture). And this code is in the public domain or MIT/BSD licensed?

What's difficult about texture allocation is the fact that we can allow texture sizes in the Y direction to be not in powers of 2. So we're going to waste some VDP1 VRAM by padding everything.
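The reference counting part of the idea above can be sketched in a few lines of C. This is only an illustration under my own assumptions: the slot table, its fixed size, and the names `texture_acquire`, `texture_ref` and `texture_unref` are all hypothetical, and the actual VRAM allocation is left out.

```c
#include <stddef.h>

#define MAX_TEXTURES 64   /* arbitrary upper bound for the sketch */

/* Hypothetical texture slot: where it lives in VDP1 VRAM and how many
 * command tables currently reference it. */
struct texture {
    unsigned vram_offset;
    unsigned size;
    unsigned refs;        /* 0 means the slot is free */
};

static struct texture textures[MAX_TEXTURES];

/* Claim a free slot with one reference; returns NULL when none is left. */
struct texture *texture_acquire(unsigned vram_offset, unsigned size)
{
    size_t i;

    for (i = 0; i < MAX_TEXTURES; i++) {
        if (textures[i].refs == 0) {
            textures[i].vram_offset = vram_offset;
            textures[i].size = size;
            textures[i].refs = 1;
            return &textures[i];
        }
    }
    return NULL;
}

/* Another command table started using the texture. */
void texture_ref(struct texture *t)
{
    t->refs++;
}

/* Drop one reference; returns 1 when the texture just became free. */
int texture_unref(struct texture *t)
{
    if (t->refs == 0)
        return 0;
    t->refs--;
    return t->refs == 0;
}
```

When `texture_unref` returns 1, the caller would hand the texture's VRAM back to whichever allocator manages it.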

What do you think? Do you think this is viable for 3D?

mrkotfw · Mar 12, 2012

Chilly Willy said:
Here's MSYS in case you haven't seen it before.

EDIT: This version has a new function I added - MSYS_Set() that allows you to set the zone. MSYS_Init() was altered to take a pointer to a zone struct to return the values of the initialized zone. So this MSYS allows you to make zones inside of zones using those features.

EDIT 2: I also added MSYS_Realloc() - it takes care of both special cases that can occur with realloc(), but is otherwise pretty dumb, allocating a new block, copying the data, then freeing the old block. I also renamed MSYS_Alloc to MSYS_Malloc and added MSYS_Calloc.

msys.c

Code:

...

msys.h

Code:

...

Looks like your standard K&R used/free list allocator. If things go downhill with TLSF, I'll go with this much simpler implementation.

By the way, if there is any code you want to add, just clone and I'll be sure to merge your work in!

mrkotfw · Mar 12, 2012

One more update.

I'm really itching to just write a DSP assembler in Python. All I need is to flesh out the BNF grammar and write my own lexer and parser (an LL(1) parser, because I'm lazy). Or I can just get that shit done by using a lexer/parser packaged in a nice module.

antime, if you're lurking... do you have an errata list of the errors you found in the SCU manual? Maybe I could just take a peek inside Yabause's disassembler! It also has a nice list of games using the DSP.

ExCyber · Mar 12, 2012

Piratero said:
One more update.

I'm really itching to just write a DSP assembler in Python. All I need is to flesh out the BNF grammar and write my own lexer and parser (an LL(1) parser, because I'm lazy). Or I can just get that shit done by using a lexer/parser packaged in a nice module.

Most assemblers that I've used don't seem to even bother with a proper grammar; they seem to just go line-by-line and have fairly brittle parsing of each line. A lot of them will barf if an instruction line doesn't start with enough whitespace, for example.
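For comparison, the line-by-line style can be sketched in C as below, written so that leading whitespace is optional. The `label: mnemonic operands ; comment` syntax is a generic assumption for the example, not the SCU DSP syntax, and `parse_line` and `struct asm_line` are hypothetical names.

```c
#include <ctype.h>
#include <string.h>

struct asm_line {
    char label[32];
    char mnemonic[32];
    char operands[64];
};

/* Split one source line into label, mnemonic and operands.
 * Returns 0 on success, -1 if any field is too long. */
int parse_line(const char *src, struct asm_line *out)
{
    char buf[128];
    char *p, *colon, *end;
    size_t len;

    memset(out, 0, sizeof *out);

    /* drop the comment, then trailing whitespace */
    len = strcspn(src, ";");
    if (len >= sizeof buf)
        return -1;
    memcpy(buf, src, len);
    buf[len] = '\0';
    while (len > 0 && isspace((unsigned char)buf[len - 1]))
        buf[--len] = '\0';

    /* leading whitespace is allowed but not required */
    p = buf;
    while (isspace((unsigned char)*p))
        p++;

    /* optional "label:" prefix */
    colon = strchr(p, ':');
    if (colon) {
        *colon = '\0';
        if (strlen(p) >= sizeof out->label)
            return -1;
        strcpy(out->label, p);
        p = colon + 1;
        while (isspace((unsigned char)*p))
            p++;
    }

    /* mnemonic runs up to the next whitespace */
    end = p;
    while (*end && !isspace((unsigned char)*end))
        end++;
    if ((size_t)(end - p) >= sizeof out->mnemonic)
        return -1;
    memcpy(out->mnemonic, p, (size_t)(end - p));

    /* whatever is left is the operand text */
    p = end;
    while (isspace((unsigned char)*p))
        p++;
    if (strlen(p) >= sizeof out->operands)
        return -1;
    strcpy(out->operands, p);
    return 0;
}
```

This is still the brittle approach described above (a colon inside an operand would confuse it), which is the trade-off against writing a real grammar.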

Chilly Willy · Mar 12, 2012

Piratero said:
Done. I will commit as soon as I actually test it out with some STL examples.

One thing I found from my own C++ example - do not include iostream! The linker doesn't seem able to tell what code is used and what isn't due to the binary bits at the start of console binaries (at least not on the MD and 32X). That means that whatever you include is left in its entirety in the binary at the end... which means my 8K TicTacToe ballooned to several HUNDRED K. The iostream is HUGE, mainly because it deals with text in and out.

I'm using the one from: http://tlsf.baisoku.org/ I haven't tested it, but I'm going to make the necessary changes.

Oh, that's nice! I saw that in RockBox, but the license was different. I didn't see this one.

I would say that using locks would be best (that'll add in the framework for threads). Aside from that, why not make the code thread-safe by avoiding the use of global/static variables?

With old, slow consoles (and computers), it's often best to ignore as much of that as possible. Always locking/unlocking can kill your performance, and having the extra overhead of actual threads can cut speeds by a third or worse. That said, sometimes you NEED locking between two CPUs (32X or Saturn). For example, my sound mixer for the 32X: the Master SH2 sets/changes the entries in the voice table, but the Slave goes through the table to do the mixing. Clearly, I need to lock the list from one side or the other for changes/mixing. The interface in the 32X is not designed to handle TAS atomic bus cycles, so I wound up using one of the communications registers like this:

Code:

! void SVC_Lock(int16_t id)
! Entry: r4 = id
        .global _SVC_Lock
_SVC_Lock:
        exts.w  r4,r4
        mov.l   ss_svc_state,r1
0:
        mov.w   @r1,r0
        cmp/eq  #1,r0                   /* loop until unlocked */
        bf      0b
        mov.w   r4,@r1
        mov.w   @r1,r0
        cmp/eq  r4,r0
        bf      0b                      /* race condition - we lost */
        rts
        nop

! void SVC_Unlock(void)
        .global _SVC_Unlock
_SVC_Unlock:
        mov     #1,r0
        mov.l   ss_svc_state,r1
        rts
        mov.w   r0,@r1

That worked well - the communications registers are fast and uncached, and can be read and written by both CPUs at the same time. Since they CAN both write at the same time, it's not guaranteed which CPU will actually have its data stored; hence the race condition check.

The Saturn doesn't have communications registers. I'm also not sure if any of the blocks of ram are capable of TAS - I haven't seen anything about that in any of the manuals yet. Code like above can be done on uncached ram, but would be slower if the ram is wired for burst read access.
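For readers who don't read SH2 assembly, the same write-then-verify logic looks roughly like this in C. This is only a transliteration of the routine above: the pointer parameter is my own addition (the assembly loads the address from ss_svc_state), the names are hypothetical, the id must not be 1, and it is not a substitute for real atomic primitives on hardware that has them.

```c
#include <stdint.h>

#define SVC_UNLOCKED 1   /* same "unlocked" value the assembly stores */

/* Spin until the shared word reads unlocked, claim it by writing our id,
 * then read it back; if another writer won, start over. */
void svc_lock(volatile int16_t *state, int16_t id)
{
    for (;;) {
        while (*state != SVC_UNLOCKED)
            ;                       /* loop until unlocked */
        *state = id;
        if (*state == id)
            return;                 /* our write stuck */
        /* race condition - we lost */
    }
}

void svc_unlock(volatile int16_t *state)
{
    *state = SVC_UNLOCKED;
}
```

The `volatile` qualifier stands in for the uncached access: every read and write has to reach the shared location rather than a register or cache copy.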

Yeah, I was thinking something similar except in a tree structure. You can have subtrees and such of command tables. The tree itself is kept in WORKRAM-H as well as all the command tables (there should be an upper bound on number of command tables in memory).

Priority and order is based on how the tree is to be traversed.

Before the tree of command tables is uploaded (that is, only the ones that have changed in WORKRAM-H, since this tree is essentially a cache of VDP1 VRAM), they're sorted properly by the LUT of transfers passed to any of the three SCU DMA levels.

Or they're sorted by using the linked list which tells VDP1 what command table to access next. Chances are, it's going to be a mixture of both.

If you need extra info on how to go through a list of data, a linked list or tree is better. If you don't have that, just allocating blocks is probably better. Sounds like a list/tree is what you want here, from the example.

As for allocating memory, yeah that should be done by the standard malloc/free. Textures on the other hand could be done through garbage collection. For example, if I delete W[6] and it used a texture then decrement the ref. counter. Then put it back on the free/used list. I could use that standard allocator just for this purpose! Speed it up since the smallest we'll go is for an 8x1 4-bit texture (padded to be an 8x2 4-bit texture). And this code is in the public domain or MIT/BSD licensed?

Back when I was using it on the 32X for Tremor, I hunted around until I found a post from malkia where he told someone the code can be used any way they wish.

By the way, I'm sure you're familiar with this forum?

https://mollyrocket.com/forums/viewforum.php?f=16

That's one of the best sites for PD code for things. stb_image.c is one of the most widely used pieces of PD code out.

What's difficult about texture allocation is the fact that we can allow texture sizes in the Y direction that are not powers of 2. So we're going to waste some VDP1 VRAM by padding everything.

What do you think? Do you think this is viable for 3D?

Sounds good so far. I'll go over it more thoroughly when you have more to go over.

If the scheme gets too wasteful, people can always just allocate a large block to cover all their data instead of handling it individually. Something to keep in mind - in 3D many folks put a BUNCH of different textures in the same texture block. That's because many GPUs require textures to be powers of two in both directions, meaning lots of waste for single textures in many cases... unless you pack more than one texture in the same block.
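Packing several textures into one block can be as simple as a "shelf" packer: fill a row left to right, and start a new row below the tallest texture once the current one is full. This is only a sketch of that general technique, with made-up names (`atlas`, `atlas_place`); it does not sort by height or reclaim space, and it ignores any alignment rules the target hardware might impose.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int width, height; /* size of the big texture block in pixels */
    int shelf_x;       /* next free x on the current shelf */
    int shelf_y;       /* top of the current shelf */
    int shelf_h;       /* height of the tallest texture on the shelf */
} atlas;

static void atlas_init(atlas *a, int width, int height)
{
    a->width = width;
    a->height = height;
    a->shelf_x = 0;
    a->shelf_y = 0;
    a->shelf_h = 0;
}

/* Place a w x h texture. On success, writes its top-left corner and
 * returns true; on failure the atlas is left unchanged. */
static bool atlas_place(atlas *a, int w, int h, int *out_x, int *out_y)
{
    assert(w > 0 && h > 0);
    int x = a->shelf_x, y = a->shelf_y, sh = a->shelf_h;

    if (x + w > a->width) { /* does not fit: open a new shelf below */
        y += sh;
        x = 0;
        sh = 0;
    }
    if (w > a->width || y + h > a->height)
        return false;

    *out_x = x;
    *out_y = y;
    a->shelf_x = x + w;
    a->shelf_y = y;
    a->shelf_h = (h > sh) ? h : sh;
    return true;
}
```

The whole atlas is then a single allocation, and each sub-texture is just an (x, y) offset into it.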

Chilly Willy · Mar 12, 2012

Piratero said:
Looks like your standard K&R used/free list allocator. If things go downhill with TLSF, I'll go with this much simpler implementation.

Yes, it's VERY simple - very small code base with extremely low ram overhead. It was perfect for the 32X given you need to keep both of those to a minimum. I used it on the Slave SH2 for allocations made by Tremor. Between songs, I'd just reinit the heap to free all the memory. The Tremor lowmem branch leaks ram, so the current recommendation is to use your own allocator for the Tremor allocations and reinit the heap after each song to make sure the leak doesn't propagate. If you don't, it runs out of ram on the third or fourth song depending on the allocations and the size of the heap.
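The "reinit the heap between songs" trick can be illustrated with an even simpler allocator than the K&R-style one discussed here: a bump arena that hands out memory from a fixed block and frees everything at once. This is a swapped-in simplification, not the allocator from the post, and the names (`arena`, `arena_alloc`, `arena_reset`) are invented. It assumes the backing block is at least 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base; /* start of the backing block (assumed 8-byte aligned) */
    size_t size;   /* total bytes available */
    size_t used;   /* bytes handed out so far */
} arena;

static void arena_init(arena *a, void *mem, size_t size)
{
    assert(mem != NULL);
    a->base = mem;
    a->size = size;
    a->used = 0;
}

/* Returns NULL when the request does not fit. */
static void *arena_alloc(arena *a, size_t n)
{
    size_t aligned = (n + 7u) & ~(size_t)7u; /* round up to 8 bytes */
    if (aligned < n || aligned > a->size - a->used)
        return NULL;
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

/* Drops every allocation at once, leaked or not. */
static void arena_reset(arena *a)
{
    a->used = 0;
}
```

Because individual frees are no-ops in this scheme, a leak in the decoder only costs memory until the next `arena_reset`, which is the point of reinitializing after each song. A decoder that frees and reallocates heavily during a song would still want a real free-list allocator underneath.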

By the way, if there is any code you want to add, just clone and I'll be sure to merge your work in!

I'm not as up on git in this area... too used to svn and cvs. I need to review the git manual on that.

antime · Mar 13, 2012

Piratero said:
antime, if you're lurking... do you have an errata of the errors in the SCU manual found by you? Maybe I could just take a peek inside Yabause's disassembler!

The only one I found was the alternative encoding for X-bus NOPs, which I documented in my old disassembler. Yabause's and MAME's sources are a much better reference.

Chilly Willy said:
The Saturn doesn't have communications registers. I'm also not sure if any of the blocks of ram are capable of TAS - I haven't seen anything about that in any of the manuals yet.

The only warning given in the documentation regarding atomic operations is a prohibition on using the MC68000's TAS instruction. Sega's own libraries include a SYS_TASSEM function which is described as using the TAS instruction, but I haven't checked the actual implementation. Additionally, memory locations 0x1000000-0x17FFFFF and 0x1800000-0x1FFFFFF are connected to the slave and master SH2's FRT input capture, respectively.

mrkotfw · Mar 13, 2012

ExCyber said:
Most assemblers that I've used don't seem to even bother with a proper grammar; they seem to just go line-by-line and have fairly brittle parsing of each line. A lot of them will barf if an instruction line doesn't start with enough whitespace, for example.

Yeah, I'm not going to do that. That just opens the possibility for more bugs. The grammar really shouldn't be difficult. I just need a better delimiter for when there are parallel instructions.

If I can't find a good parser, then a simple LL(1) will do, granted that my grammar has no left-recursion (shouldn't be a problem).

Chilly Willy said:
One thing I found from my own C++ example - do not include iostream! The linker doesn't seem able to tell what code is used and what isn't due to the binary bits at the start of console binaries (at least not on the MD and 32X). That means that whatever you include is left in its entirety in the binary at the end... which means my 8K TicTacToe ballooned to several HUNDRED K. The iostream is HUGE, mainly because it deals with text in and out.

Oh, that's nice! I saw that in RockBox, but the license was different. I didn't see this one.

With old, slow consoles (and computers), it's often best to ignore as much of that as possible. Always locking/unlocking can kill your performance, and having the extra overhead of actual threads can cut speeds by a third or worse. That said, sometimes you NEED locking between two CPUs (32X or Saturn). For example, my sound mixer for the 32X: the Master SH2 sets/changes the entries in the voice table, but the Slave goes through the table to do the mixing. Clearly, I need to lock the list from one side or the other for changes/mixing. The interface in the 32X is not designed to handle TAS atomic bus cycles, so I wound up using one of the communications registers like this:

Code:

...

That worked well - the communications registers are fast and uncached, and can be read and written by both CPUs at the same time. Since they CAN both write at the same time, it's not guaranteed which CPU will actually have its data stored; hence the race condition check.

The Saturn doesn't have communications registers. I'm also not sure if any of the blocks of ram are capable of TAS - I haven't seen anything about that in any of the manuals yet. Code like above can be done on uncached ram, but would be slower if the ram is wired for burst read access.

If you need extra info on how to go through a list of data, a linked list or tree is better. If you don't have that, just allocating blocks is probably better. Sounds like a list/tree is what you want here, from the example.

Back when I was using it on the 32X for Tremor, I hunted around until I found a post from malkia where he told someone the code can be used any way they wish.

By the way, I'm sure you're familiar with this forum?

https://mollyrocket.com/forums/viewforum.php?f=16

That's one of the best sites for PD code for things. stb_image.c is one of the most widely used pieces of PD code out.

Sounds good so far. I'll go over it more thoroughly when you have more to go over.

If the scheme gets too wasteful, people can always just allocate a large block to cover all their data instead of handling it individually. Something to keep in mind - in 3D many folks put a BUNCH of different textures in the same texture block. That's because many GPUs require textures to be powers of two in both directions, meaning lots of waste for single textures in many cases... unless you pack more than one texture in the same block.

...binary bits? Not sure what you mean.

Thanks for the link. I never heard of that place. And thanks for the lock code. As for textures, that's what I was thinking. A large collection of unique textures could be allocated by simply allocating a large single texture. There's the upside of not having to do more than 1 allocation (that is if all the textures you need are within that large collection of textures).

Yeah, a tree would be best since

Chilly Willy said:
Yes, it's VERY simple - very small code base with extremely low ram overhead. It was perfect for the 32X given you need to keep both of those to a minimum. I used it on the Slave SH2 for allocations made by Tremor. Between songs, I'd just reinit the heap to free all the memory. The Tremor lowmem branch leaks ram, so the current recommendation is to use your own allocator for the Tremor allocations and reinit the heap after each song to make sure the leak doesn't propagate. If you don't, it runs out of ram on the third or fourth song depending on the allocations and the size of the heap.

I'm not as up on git in this area... too used to svn and cvs. I need to review the git manual on that.

If you have Github, they have documentation on how you can set this up and track my repos. If you commit/push in your own, I can merge your work into mine and vice versa.

antime said:
The only one I found was the alternative encoding for X-bus NOPs, which I documented in my old disassembler. Yabause's and MAME's sources are a much better reference.

The only warning given in the documentation regarding atomic operations is a prohibition on using the MC68000's TAS instruction. Sega's own libraries include a SYS_TASSEM function which is described as using the TAS instruction, but I haven't checked the actual implementation. Additionally, memory locations 0x1000000-0x17FFFFF and 0x1800000-0x1FFFFFF are connected to the slave and master SH2's FRT input capture, respectively.

Got it. I'll be looking at both for reference/documentation as well as your disassembler.

mrkotfw · Mar 13, 2012

antime, I'm interested in getting your transfer tool to work. http://koti.kapsi.fi/~antime/sega/usb.html

Do you have a pin-out of what needs to be soldered onto the Saturn serial port? Images of your set up and where/what you soldered would be great.

Basically, any schematics? I'd like to incorporate this into libyaul and hopefully get a GDB-stub going as well. I'm not particularly fond of the AR cartridge. I can't seem to find the original thread (if there ever was one) about your transfer tool.

antime · Mar 13, 2012

Looking at the bottom of the Saturn mainboard, the link port's pinout is

Code:

    Vcc
Midi RxD
    Midi TxD
Master TxD
    Master RxD
Master SCK
    GND
Slave TxD
    Slave RxD
Slave SCK
    GND

I just soldered wires directly to the corresponding TxD/RxD/SCK pins of a DLP-2232M module. Unfortunately the FT2232 IC uses incompatible pinouts in asynchronous and synchronous serial modes, so you have to decide which one you want to use on channel A. I used synchronous mode because of the faster transfer speed (~1.2Mbit/s vs. ~224Kbit/s), but async is a lot easier to work with.

Chilly Willy · Mar 13, 2012

Piratero said:
...binary bits? Not sure what you mean.

The rom header, mainly. The 32X has two headers - the normal MD rom header, and the 32X header for the SH2s. It also has an exception jump table that replaces the regular exception vectors.

Thanks for the link. I never heard of that place. And thanks for the lock code. As for textures, that's what I was thinking. A large collection of unique textures could be allocated by simply allocating a large single texture. There's the upside of not having to do more than 1 allocation (that is if all the textures you need are within that large collection of textures).

If you need some general-purpose jpg/png code, stb_image.c is really good. Your alternative is libjpeg and libpng, both of which are rather complicated to use - they don't have a single entry point that returns an image; you have to read in the image one line at a time. I haven't used his TTF code yet, but it looks better/easier than the corresponding TTF library.

If you have Github, they have documentation on how you can set this up and track my repos. If you commit/push in your own, I can merge your work into mine and vice versa.

I'll look into that.
