I'm using the CPU-DMAC with a 1-byte stride, a fixed destination address, and cycle-steal mode (burst mode is not supported?). I'm seeing terrible performance: when I transfer 4096 bytes in 64-byte chunks, the PC doesn't receive all 4096 bytes. Before starting each 64-byte CPU-DMAC transfer, I poll the TXE bit. I noticed that all 4096 bytes arrive when I add a wait of 100 iterations. Transferring a single byte per DMA call (4096 DMA calls) works flawlessly, as does writing byte by byte.

I don't really know what to look for. Is this just not possible? Is there a known amount of time to wait per transfer? Does this have to do with the SCU bus registers? Should I just stick to writing byte by byte?
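For reference, the chunked loop described above looks roughly like the C sketch below. It only restates the procedure from the question and is not a fix. The hook names (`usb_hooks`, `txe_ready`, `dma_start`, `dma_wait`) and the host-side `fake_*` stubs are hypothetical. No Saturn register addresses are assumed. Placing the 100-iteration wait after each chunk is also an assumption, since the question doesn't say where it goes.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE 64u /* bytes per CPU-DMAC transfer, as in the question */

/* Hypothetical hardware hooks. On real hardware these would read the TXE
   status bit and program the CPU-DMAC (1-byte units, incrementing source,
   fixed destination, cycle-steal). Here they are left abstract. */
typedef struct {
    int  (*txe_ready)(void *ctx);                               /* nonzero when TXE says data may be written */
    void (*dma_start)(void *ctx, const uint8_t *src, size_t n); /* start one DMA transfer of n bytes */
    void (*dma_wait)(void *ctx);                                /* block until the transfer has ended */
    void *ctx;
} usb_hooks;

/* Send len bytes in CHUNK_SIZE pieces, polling TXE once per chunk.
   settle_iters is the empirical delay loop (e.g. 100) run after each chunk. */
static void send_chunked(const usb_hooks *hw, const uint8_t *buf, size_t len,
                         unsigned settle_iters)
{
    size_t off = 0;
    while (off < len) {
        size_t n = (len - off < CHUNK_SIZE) ? len - off : CHUNK_SIZE;
        while (!hw->txe_ready(hw->ctx)) {
            /* spin until TXE reports that data may be written */
        }
        hw->dma_start(hw->ctx, buf + off, n);
        hw->dma_wait(hw->ctx);
        for (volatile unsigned i = 0; i < settle_iters; i++) {
            /* busy-wait; volatile keeps the compiler from removing the loop */
        }
        off += n;
    }
}

/* Hypothetical host-side stubs so the loop can be exercised on a PC:
   each "DMA" chunk is copied into a buffer and the calls are counted. */
typedef struct {
    uint8_t  out[4096];
    size_t   received;
    unsigned dma_calls;
} fake_port;

static int fake_txe_ready(void *ctx) { (void)ctx; return 1; }

static void fake_dma_start(void *ctx, const uint8_t *src, size_t n)
{
    fake_port *p = ctx;
    for (size_t i = 0; i < n && p->received < sizeof p->out; i++)
        p->out[p->received++] = src[i];
    p->dma_calls++;
}

static void fake_dma_wait(void *ctx) { (void)ctx; }
```

With the stubs, `send_chunked(&hw, data, 4096, 100)` issues 64 DMA calls of 64 bytes each. On the console, the stubs would be replaced by code that reads TXE and programs the DMAC channel.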