I got 32-bit multiply working, but it's awfully big. Since we only have an instruction for 16-bit multiply, we need to expand the math by splitting each 32-bit value into 16-bit halves, with K = 2^16:
R*G = (R0*K+R1)*(G0*K+G1) = R0*G0*K*K + R0*G1*K + R1*G0*K + R1*G1
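At least we can omit the R0*G0 term: it is scaled by K*K = 2^32, so it lands entirely above the low 32 bits of the result. For the same reason, only the low 16 bits of the two cross terms matter. We will need a 32-bit temp value T stored in registers, and a 16-bit temp value H which could be stored in memory. That leaves us with this code: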
mov r1, r5 # T0 = R0
mpy r4, r5 # T = R0 * G1
mov r6, r0 # H = T1
mov r2, r5 # T0 = R1
mpy r3, r5 # T = R1 * G0
mov r2, r1 # R0 = R1
mpy r4, r1 # R = R1 * G1
a r6, r1 # R0 += T1 (low word of R1 * G0)
a r0, r1 # R0 += H (low word of R0 * G1)
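As a sanity check on the math, the sequence above can be modeled in C. This is only a sketch: the names mpy16, mul32 and mul32_selfcheck are made up for illustration, and mpy16 assumes the hardware multiply is an unsigned 16 x 16 -> 32-bit operation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of the hardware multiply: unsigned 16 x 16 -> 32 bits. */
static uint32_t mpy16(uint16_t a, uint16_t b) {
    return (uint32_t)a * (uint32_t)b;
}

/* 32-bit multiply from three 16-bit multiplies, with K = 2^16.
 * R = R0*K + R1, G = G0*K + G1. The R0*G0 term is dropped because
 * it is scaled by K*K = 2^32 and cannot affect the low 32 bits. */
static uint32_t mul32(uint32_t r, uint32_t g) {
    uint16_t r0 = (uint16_t)(r >> 16), r1 = (uint16_t)r;
    uint16_t g0 = (uint16_t)(g >> 16), g1 = (uint16_t)g;

    uint16_t h  = (uint16_t)mpy16(r0, g1); /* H  = low word of R0 * G1 */
    uint16_t t1 = (uint16_t)mpy16(r1, g0); /* T1 = low word of R1 * G0 */
    uint32_t p  = mpy16(r1, g1);           /* R  = R1 * G1 (full 32 bits) */

    uint16_t hi = (uint16_t)(p >> 16);
    hi = (uint16_t)(hi + t1);              /* R0 += T1, wraps at 16 bits */
    hi = (uint16_t)(hi + h);               /* R0 += H,  wraps at 16 bits */

    return ((uint32_t)hi << 16) | (p & 0xFFFFu);
}

/* Compare against the compiler's native 32-bit multiply on a few values. */
static void mul32_selfcheck(void) {
    assert(mul32(3u, 5u) == 15u);
    assert(mul32(0x00010000u, 0x00010000u) == 0u);
    assert(mul32(0x12345678u, 0x9ABCDEF0u)
           == (uint32_t)(0x12345678u * 0x9ABCDEF0u));
}
```

Calling mul32_selfcheck() from a test harness exercises the wrap-around cases. The two 16-bit additions into hi correspond to the two final add instructions in the listing, which is why both of them target the high word.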
The assembly sequence works fine, but it could be 30 bytes of code if the G and H terms are in memory. Not too bad if we only have one multiply per program, but for the compiler, I have to assume that won't be common.
So, it looks like 32-bit shifts, multiply, divide and modulo will be exiled to a library.
At long last, it looks like it's time for a release. I'm off to do cleanup.