32X variable bitrate audio proof-of-concept

Could SH2 decode a format similar to MP3? Maybe. This was my experiment:

1) Use DCT II to convert a block of samples to the frequency domain.
2) Use quantization and psychoacoustic model to prune the data. (this part is not even done properly)
3) Use entropy coding to store coefficients in a compressed data stream. (room for improvement here)

The result is 22 kHz mono sound at an average of 93,000 bits/second. The 32X can decode this on one CPU, using an unrolled loop of MAC.W instructions for the DCT III.
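For anyone following along, here is a rough C sketch of what the decode-side DCT III boils down to: one table lookup and one 16-bit multiply-add per coefficient per output sample. It is not the code from the demo. The 8-point block size, the Q15 cosine table, and the names `cos_q15` and `dct3_block` are all assumptions made for illustration; the real thing works on 128-sample blocks with the inner loop unrolled into MAC.W instructions.

```c
#include <stdint.h>

#define N 8   /* illustration only; the post above uses 128-sample blocks */

/* cos(m*pi/16) in Q15 for m = 0..8; other angles follow by symmetry. */
static const int16_t quarter_cos[9] = {
    32767, 32138, 30274, 27246, 23170, 18205, 12540, 6393, 0
};

/* Q15 value of cos(m*pi/16) for any non-negative m (hypothetical helper). */
static int16_t cos_q15(unsigned m)
{
    m &= 31u;                    /* the cosine repeats every 32 steps */
    if (m > 16u) m = 32u - m;    /* cos(2*pi - x) = cos(x) */
    if (m > 8u) return (int16_t)-quarter_cos[16u - m];  /* cos(pi - x) = -cos(x) */
    return quarter_cos[m];
}

/*
 * Naive O(N^2) DCT-III:
 *   out[n] = (2/N) * (X[0]/2 + sum over k>=1 of X[k] * cos((2n+1)*k*pi/(2N)))
 * Assumes the coefficients are small enough that the 32-bit accumulator
 * cannot overflow.
 */
void dct3_block(const int16_t coef[N], int16_t out[N])
{
    for (int n = 0; n < N; n++) {
        int32_t acc = (int32_t)coef[0] * 16384;      /* X[0] * 0.5 in Q15 */
        for (int k = 1; k < N; k++)                  /* one multiply-add per term */
            acc += (int32_t)coef[k] * cos_q15((unsigned)((2 * n + 1) * k));
        out[n] = (int16_t)(acc / (1 << 17));         /* drop Q15, then scale by 2/N = 1/4 */
    }
}
```

Feeding it a block with only the DC coefficient set to 800 gives a flat block of 100s, which is a quick sanity check on the scaling.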

 
Very nice to see someone experimenting with the 32X. That doc you include... it's VERY outdated... even wrong in places. I'd update your docs and tools if I were you.

Anywho, here's an experiment I did in 32X audio using Tremor to decode ogg:


I also did a test using the Siren audio codec, but with the patent restrictions on Siren, I'm not sure I can publicly post the code. That did mono and joint stereo at 32kbps. It's a rather nice codec, but the patents limit its potential in homebrew. Maybe when the patents expire...
 
I am shocked that mediafire is a) still around and b) actually let me download a file without JS enabled...

How much CPU load do you figure for decoding the Vorbis? For my program I estimated about 14MHz to do the DCT III on blocks of 128 samples (172 blocks/second). I have only glanced at the Vorbis spec so I'm not sure what their block/window/frame size is, but as it uses MDCT I would expect it spends a similarly large proportion of its time doing multiply-adds.
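The 14MHz estimate can be sanity-checked with a bit of arithmetic. The sketch below assumes a naive N x N transform (N multiply-adds for each of N output samples) and takes "22KHz" to mean 22050 Hz, which is what 172 blocks/second of 128 samples works out to. The function names are made up for the example.

```c
/* Multiply-adds per second for a naive N x N transform.
 * (sample_rate / block_size) blocks per second times block_size^2
 * multiply-adds per block simplifies to sample_rate * block_size. */
long macs_per_second(long sample_rate, long block_size)
{
    return sample_rate * block_size;
}

/* CPU cycles available per multiply-add, in hundredths of a cycle,
 * for a given cycle budget in Hz. */
long cycles_per_mac_x100(long cpu_hz, long sample_rate, long block_size)
{
    return cpu_hz * 100 / macs_per_second(sample_rate, block_size);
}
```

With 22050 Hz and 128-sample blocks this comes to about 2.8 million multiply-adds per second, so a 14MHz budget leaves roughly 5 cycles for each one, loads and loop overhead included.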
 
I didn't optimize Tremor with much assembly, so there's room for improvement. It handles mono at 24 to 64 kbps, but it would really need optimizing to get stereo going on it. SirenAC was better in terms of CPU load, even without optimizing. I'd like to see what you get with your assembly DCT plugged into Tremor, if you can figure that out.

One thing about the branch I used - I used the low-memory branch of tremor, which has a leak in the memory handling somewhere. I didn't bother to track it down; instead, I just reinitialize the heap used for tremor memory allocation each time through. In general, without good assembly language improvements to parts like the DCT, most DCT based codecs are going to start maxing out at around 64 kbps. Emulators would go higher, but then the emulators have always exaggerated the speed of the 32X somewhat. They don't emulate some of the things that slow real hardware down, like cache line fill, bus speeds, bus collisions, etc.
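The heap-reset workaround is easy to picture as a bump allocator whose "free" does nothing and whose reset throws everything away at once, so a leak cannot accumulate. This is only a sketch of the idea, not the allocator from the Tremor port; the arena size and the `arena_*` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (64u * 1024u)   /* hypothetical size, not from the post */

static uint64_t arena_words[ARENA_SIZE / 8];   /* uint64_t keeps it 8-byte aligned */
static size_t arena_used;

/* Forget every allocation at once; leaked blocks vanish with the rest. */
void arena_reset(void)
{
    arena_used = 0;
}

/* Hand out the next chunk, rounded up to 8 bytes; NULL when the arena is full. */
void *arena_alloc(size_t bytes)
{
    size_t aligned = (bytes + 7u) & ~(size_t)7u;
    if (aligned < bytes || aligned > ARENA_SIZE - arena_used)
        return NULL;
    void *p = (uint8_t *)arena_words + arena_used;
    arena_used += aligned;
    return p;
}

/* Individual frees are ignored; memory only comes back through arena_reset(). */
void arena_free(void *p)
{
    (void)p;
}
```

Calling `arena_reset()` at the top of each pass through the decoder gives the "reinitialize each time through" behaviour, at the cost of not being able to keep any allocation alive across a reset.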

BTW, I used to use FileDen, but then when they shut down, I had to move a lot of stuff over to MediaFire. I'm as surprised as you that they're still around. If you want some of my examples, just PM me and I can get you some arcs.
 
Looking into it a bit more, it seems that I'm late to the party as I didn't even know about O(N log N) algorithms for computing the DCT.

That's okay, I knew even less than that about DCT algorithms. I took one look at the C routine and noped right out of trying to make an SH2 assembly routine. Funny enough, I'm quite skilled at Walsh-Hadamard transforms. I keep going back and looking into a codec based on that for old consoles like the 32X/Saturn.
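Part of the appeal of Walsh-Hadamard on these machines is that the fast version needs only additions and subtractions, no multiplies and no lookup table. A minimal in-place sketch, assuming a power-of-two block size and the unnormalized transform (the name `fwht` is just for this example):

```c
#include <stdint.h>

/* In-place fast Walsh-Hadamard transform, natural (Hadamard) ordering.
 * size must be a power of two. The result is unnormalized: applying the
 * transform twice returns the input multiplied by size. */
void fwht(int32_t *data, int size)
{
    for (int half = 1; half < size; half *= 2) {
        for (int base = 0; base < size; base += 2 * half) {
            for (int i = base; i < base + half; i++) {
                int32_t a = data[i];
                int32_t b = data[i + half];
                data[i] = a + b;          /* butterfly: sum */
                data[i + half] = a - b;   /* butterfly: difference */
            }
        }
    }
}
```

Since the inverse is the same routine plus a divide by the block size (a shift for power-of-two sizes), a decoder gets by with one code path.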
 

Finally got around to trying the demo. It sounds really good. Not really noisy at all. One thing I'd suggest: use word DMA targeting the MONO PWM register. That way it plays in mono rather than just on the right side.
 
You might also want to try optimizing your code to exploit the peculiarities of the SH-2's pipeline that allow you to do 2 multiplies in 5 cycles instead of 6. The SH-2 has an odd pipeline where one multiply costs 3 cycles, but if you start a second multiply it only costs 2 clocks more.

 
On Tremor having room for improvement: one obvious optimization would be removing all malloc/free and memcpy calls. In my tests those have proven to be _extremely_ slow on real hardware. Unfortunately that would not be easy, since the read callback relies on you copying stream data into the provided buffer. Changing that would require hacking the ogg library interface.
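To show where the copy comes from: the read callback in `ov_callbacks` has the same shape as `fread`, so the library hands you a buffer and expects you to fill it. A memory-backed callback for data sitting in ROM might look like the sketch below. The `rom_stream` struct and `rom_read` name are hypothetical, and the block deliberately avoids including the Tremor headers so it stands alone.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cursor over an Ogg stream stored in ROM. */
typedef struct {
    const unsigned char *data;
    size_t size;
    size_t pos;
} rom_stream;

/* fread-style callback: copy up to nmemb items of `size` bytes into ptr
 * and return the number of whole items copied (0 at end of stream). */
size_t rom_read(void *ptr, size_t size, size_t nmemb, void *datasource)
{
    rom_stream *s = (rom_stream *)datasource;
    size_t left = s->size - s->pos;
    size_t want;

    if (size == 0)
        return 0;
    want = size * nmemb;
    if (want > left)
        want = left - left % size;      /* only whole items */
    memcpy(ptr, s->data + s->pos, want);  /* the copy the interface forces on us */
    s->pos += want;
    return want / size;
}
```

The data is already addressable in ROM, so the memcpy is pure overhead. Getting rid of it means letting the decoder read through a pointer into the source instead, which is the interface change mentioned above.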
 
This paper looked like it would be easier to read than C code, but in the end I still wasn't able to replicate it, so I guess I am sticking with the blue-collar algorithm :) Knocking the block size down from 128 samples to 64 frees up some CPU time to allow for stereo. Interleaving the processing of the second channel with the first can reuse some LUT data while it is still cached instead of hitting SDRAM again, which is good too. Then I started to add Huffman coding. Now the CPU is pushed right to the limit, with a few pops and skips happening in Fusion. Bitrate is still high but quality sounds good.
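The LUT-sharing idea can be sketched in C like this: each table entry is fetched once and used for both channels before moving on. How the interleaving is actually arranged in the demo is not stated, so the per-coefficient interleave, the row-major Q15 table layout, and the function name here are all assumptions.

```c
#include <stdint.h>

/*
 * Inverse transform for two channels at once.
 * lut is row-major: lut[n * size + k] is the Q15 weight of coefficient k
 * in output sample n. Assumes inputs are small enough that the 32-bit
 * accumulators cannot overflow.
 */
void idct_stereo_interleaved(const int16_t *lut, int size,
                             const int16_t *coef_l, const int16_t *coef_r,
                             int16_t *out_l, int16_t *out_r)
{
    for (int n = 0; n < size; n++) {
        const int16_t *row = &lut[n * size];
        int32_t acc_l = 0;
        int32_t acc_r = 0;
        for (int k = 0; k < size; k++) {
            int16_t w = row[k];               /* fetched once, used twice */
            acc_l += (int32_t)coef_l[k] * w;
            acc_r += (int32_t)coef_r[k] * w;
        }
        out_l[n] = (int16_t)(acc_l / 32768);  /* back from Q15 */
        out_r[n] = (int16_t)(acc_r / 32768);
    }
}
```

As a side note on the block size: with the naive transform the multiply-add count per second is the sample rate times the block size, so two channels at 64 samples cost about the same as one channel at 128.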
 

Attachments

  • SND32H.ZIP
    1.7 MB