The first output file looks pretty good. The memory-to-memory copies I was worried weren't being generated are all over the place now, so something I did must have fixed that problem. Go figure. Maybe the weird register setup I had before discouraged that kind of optimization.
I see there's a need for a byte compare immediate instruction. I removed those peepholes since they caused problems, but it looks like I need to find a way to put something like them back.
An extra line is being emitted for "jump less than or equal". This won't affect the assembled output, but it looks bad.
Also, I should align the operands to start in the same column. Right now there's a single space between opcode and operands, which makes the listing harder to follow.
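For the column alignment, something like the following would do it. This is only a sketch: format_insn and OPCODE_WIDTH are hypothetical names, the width of 8 is an assumption, and in the compiler itself the change would more likely go into the instruction output templates than into a helper like this.

```c
#include <stdio.h>

/* Assumed column width for the opcode field (hypothetical value). */
#define OPCODE_WIDTH 8

/* Hypothetical helper: write one assembly line into buf with the opcode
   left-justified and padded, so operands always start in the same column. */
int format_insn(char *buf, size_t size, const char *opcode, const char *operands)
{
    /* "%-*s" pads the opcode with trailing spaces up to OPCODE_WIDTH. */
    return snprintf(buf, size, "\t%-*s%s", OPCODE_WIDTH, opcode, operands);
}
```

An opcode as long as or longer than OPCODE_WIDTH would run straight into its operands, so the width has to exceed the longest mnemonic.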
I should also try to find a better form for zero compares. I'm seeing lines like "movb @(b+94)(r10), @(b+94)(r10)", which look mighty costly.
Amazingly, there's floating point in the 2K_chess code. There's a single constant buried in that mess.
Byte constants should use either hex or unsigned decimal. Currently we see signed decimal.
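Printing a byte constant without the sign is mostly a matter of masking before formatting. A minimal sketch, assuming a hypothetical helper format_byte_const and a "0x" hex prefix (the prefix the assembler actually wants is an assumption here):

```c
#include <stdio.h>

/* Hypothetical helper: format a byte constant as two-digit unsigned hex.
   Masking with 0xff drops any sign extension, so -1 becomes 0xff. */
int format_byte_const(char *buf, size_t size, int value)
{
    unsigned int byte = (unsigned int)value & 0xffu;
    return snprintf(buf, size, "0x%02x", byte);
}
```

Swapping the format string for "%u" would give unsigned decimal instead.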
Gratuitous readelf output (1404 lines of assembly):
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        00000000 000034 000dda 00  AX  0   0  2
  [ 2] .rela.text        RELA            00000000 001e0c 000dbc 0c      6   1  4
  [ 3] .data             PROGBITS        00000000 000e0e 000041 00  WA  0   0  2
  [ 4] .bss              NOBITS          00000000 000e4f 000cb2 00  WA  0   0  1
  [ 5] .shstrtab         STRTAB          00000000 000e4f 000031 00      0   0  1
  [ 6] .symtab           SYMTAB          00000000 000fc0 000b10 10      7 146  4
  [ 7] .strtab           STRTAB          00000000 001ad0 00033a 00      0   0  1
Assuming everything is correct (seems to be), and before addressing the comments above, that's about 3% smaller code than I had before (and about 4% fewer lines).
old: 1463 lines, 3668 .text bytes
new: 1404 lines, 3546 .text bytes
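The old/new comparison can be checked with a small calculation. A sketch, with percent_saved as a hypothetical helper name:

```c
/* Hypothetical helper: percentage reduction going from old_size to new_size.
   Works for either unit, .text bytes or lines of assembly. */
double percent_saved(unsigned int old_size, unsigned int new_size)
{
    return 100.0 * ((double)old_size - (double)new_size) / (double)old_size;
}
```

Calling percent_saved(3668, 3546) gives the .text reduction and percent_saved(1463, 1404) the line-count reduction. The new .text figure also matches the readelf size field, since 0xdda is 3546.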