So, just how costly is that "movb @(b+94)(r10), @(b+94)(r10)" instruction?
14+8+8 = 30 clocks, 6 bytes
We can do better than that.
mov @(b+94)(r10), r0 : 14+8 = 22 clocks, 4 bytes
That's about 25% faster and 30% smaller, with the added cost of a temp register. That's not bad.
I took a look at the machine description file, and I really need to go through it line by line. I've seen a few bugs in uncommon branches. Also, that file is a mess since I've tried so many different approaches. Additionally, the 32-bit instructions have not been exercised much and could stand to be looked at again.
Also, look at "ai" emitters for byte quantities. I'm seeing stuff like ">FFE9 * 256", this just looks wrong.