Monday, June 27, 2011

I've fixed the 16-bit multiply instructions. For some reason, when using a multiply instructions with registers as both operands, they get swapped. So instead of "A=A*B", I see "A=B*A". This is annoying, since if I constantly use one argument for all configurations in the assembly output, I could end up with "A=A*A" in some cases. Fortunately, this is easily handled. At some point, I need to find out where the swap comes from, but I'm OK with leaving this as it is for now.

Fun fact: GCC really, really doesn't want to use multiply. If one of the operands is constant, it'll prefer permutations of shifts and adds to calculate the result, even if this would be much more expensive for the TMS9900.

There's true oddness for the divide instructions. The instructions I had before work just fine, but GCC is selecting the opposite function than the one called for. For instance, if I write "A/B", I get back "A%B", and vice versa. The wierd thing is that this doesn't seem to be a bug in my code, but GCC is knowingly doing this for some reason. Research awaits.

OK, I'm losing my mind apparently. The DIV and MOD instructions are working just fine. Moving on...

That completes the list of known problems in the coverage tests. I may have missed some unsigned variants, so I'll need to double-check that. There might be a few more 32-bit instructions I could do too, but I think I'm good right now.

Saturday, June 25, 2011

It looks like stack space is not being allocated for local variables when using -O0. This sounds familiar, so I'll need to go through the previous notes and see where things stand.

32-bit negate has been fixed, moving on.

The subtract opcode seems to be working wierd. The carry flag is being set opposite to what the carry-in would be. Odd, but not a big deal. Hopefully this is not an emulation bug, because I have no way of verifying this on real hardware right now.

I've also added optimizations for adding special constant values (-2,-1,0,1,2) in each of the words. Works fine.

Wednesday, June 22, 2011

I need to test 32-bit subtract constant to see if it's correct. Seems OK during code inspection.

I also need to look at multiply, it doesn't seem to be handling type conversion and register allocation correctly.

Of course, divide will likely be broken too...

32-bit negate also looks shady.

int-to-long conversion: (r1 -> r2, r3)
mov r1, r3 4+14 2
seto r2 4+10 2
jlt $+4 4+(10..8) 2
clr r2 (0..4+10) 2
Total: (46..58) 8

Looks costly, let's try this:
mov r1, r3 4+14 2
mov r1, r2 4+14 2
sar r2, 15 4+12+2*15 2
Total: 82 6

Ugh, no. I'm sticking to what I had before.

So, after all that testing, I'm left with four problems (32-bit subtract, 32-bit negate, multiply, divide). That seems doable.

Tuesday, June 21, 2011

Coverage testing continues. I'm down to the add instructions, and I've decided to remove "c *r1+, *r1+" for "r1+=4". The math for this is below.

c *r1+, *r1+ clocks: 14+8+8=30 bytes: 2
ai r1, 4 clocks: 14+4 =18 bytes: 4

So the "c" form is half as big, but takes almost twice as long. I'm thinking now that speed is preferrable to size in the general case. For space-constrained code, this still shouldn't matter much, since +4 isn't likely to be used very often. So out it goes. I've left that code commented out in the MD file just in case I change my mind.

Here's some more unexpected stuff. I made a function which was just "return(memval++);", where memval was stored in memory. That resulted in this code:
mov @memval, r1 14+8 4
mov r1, r2 14 2
inc r2 10 2
mov r2, @memval 14+8 2
b *r11 8 2
Total: 76 clocks 12 bytes

That should have been:
inc @memval 10+8 4
mov @memval, r1 14+8 4
b *r11 8 2
Total: 48 clocks 10 bytes

Actually, now that I think about it, that's correct. The C code returns the current value of "memval", then increments it. By changing the C code to "return(++memval);", returning the incremented value, I get this:
mov @memval, r1 14+8 4
inc r1 10 2
mov r1, @memval 14+8 4
b *r11 8 2
Total: 62 clocks 12 bytes

It's a little closer to the expected code above, but still clunky. I can see why this code came out though. GCC is trying to minimize the number of bus transactions, and keep as much work in the registers as possible. This is not as important for the TMS9900, and we end up with suboptimal code.

I might be able to fix this by tweaking weights in the H file, but I can do that later.

I'm skipping the DIV instructions for now, I don't want to get sucked into that mess right now. For that matter, I'm skipping all 32-bit instructions too. I'll come back when 8 and 16 bit code is correct.

Down to the sign and zero extend instructions now. Remember the SB trick to clear the upper byte? That might be handy now.
srl r0, 8 12+2*8 2
Total: 28 2

swpb r0 10 2
sb r0, r0 14 2
Total: 24 4

Maybe not. I guess I'll leave this alone.

The shift-and-cast peepholes are broken after the recent GCC changes. So this is a good time to reevaluate the left shift and cast code. I have two options:
srl r0, N+8 12+2*8+2*N 2
Total: 28+2*N 2

srl r0, N 12+2*N 2
swpb r0 10 2
Total: 22+2*N 4

I guess I'm sticking with a single instruction then. There is a break in the pattern for a left shift of one:
a r1, r1 14 2
swpb r1 10 2

This is 24 cycles, compared to 30 cycles for a single instruction, which doesn't look too bad except that there is an additional 4 cycles imposed for reading an instruction from memory. The revised numbers would then be 32 and 34, a lot closer, but still slightly faster. I'll leave that alone for now.

Next up: optimized byte initializers.

Monday, June 20, 2011

I'm glad I'm doing these coverage tests, since I've just found another problem with shift operations. Performing "(char)X<<=2" results in this sequence:

sra r1, 8 14+2*8 cycles 2 bytes
sla r1, 2 14+2*2 cycles 2 bytes
swpb r1 10 cycles 2 bytes
Total: 54 cycles, 6 bytes

Ick.
This has been fixed to match the unsigned shift code:

andi r1, >FF00 14+4 cycles 4 bytes
sla r1, 2 14+2*2 cycles 2 bytes
Total: 36 cycles, 6 bytes

Much better.

There's also a problem with AND, for some reason "a&b" is being converted to:
inv r1
szc r1, r1

The value stored in r2 is lost. It turns out that this instruction is generated by define_split. It converts a single AND to a NOT, AND_NOT sequence. Unfortunately, there was a typo in the extra step. Two of the three intermediate expressions were equivalent. Of course, those were the two passed on to define_insn, resulting in the lost register expression. All fixed now.
I'm going through the machine description file, making sure the emitted code is correct. I've yet to act on the earlier notes.

Right now, I'm looking at left shifted byte quantities. The problem here is that I can't just use a single left shift instruction because junk will just be shifted into the low order bits. It looks like I have two options:

Form 1: Mask off lower bits in word mode, then shift:
andi R1, >FF00 14+4 4 bytes
sla R1, N 12+2N 2 bytes
Total: 30+2N 6 bytes

Form 2: Swap bytes, shift, then restore
swpb R0 10 2 bytes
sla R1, N 12+2N 2 bytes
swpb R0 10 2 bytes
Total: 32+2N 6 bytes

Form 1 is slightly faster, and is more correct. The condition flags are left in the proper state. Form 2 would result in random condition flags, since SWPB does not modify flags, and SLA would set the flags based on the random contents of the unused byte. So I'll be using form one for this stuff.

Here's a neat idea I had while looking at the opcodes. I can clear the upper byte by SBing at register with itself. This saves two clocks and two bytes:
sb R0, R0 14 cycles, 2 bytes
andi R0, >00FF 14+2 cycles, 4 bytes

I can't use this right now, but I added the idea here so I wouldn't forget.

Wednesday, June 15, 2011

So, just how costly is that "movb @(b+94)(r10), @(b+94)(r10)" instruction?
14+8+8 = 30 clocks, 6 bytes

We can do better than that.

mov @(b+94)(r10), r0 : 14+8 = 22 clocks, 4 bytes

That's about 25% faster and 30% smaller, with the added cost of a temp register. That's not bad.

I took a look at the machine description file, and I really need to go through it line by line. I've seen a few bugs in uncommon branches. Also, that file is a mess since I've tried so many different approaches. Additionally, the 32-bit instructions have not been exercised much and could stand to be looked at again.

Also, look at "ai" emitters for byte quantities. I'm seeing stuff like ">FFE9 * 256", this just looks wrong.
The first output file looks pretty good. The memory-to-memory copies I was worried about are all over the place now, so something I did must have fixed that problem. Go figure. Maybe the wierd register setup I had before discouraged that kind of optimization.

I see there's a need for a byte compare immediate instruction. I removed those peepholes since they caused problems, but it looks like I need to find a way to put something like them back.

An extra line is being emitted for "jump less than or equal". This won't affect the assembly, but looks bad.

Also, I should align the operands to start in the same column. Right now there's a single space between opcode and operands, which makes following things more difficult.

I should also try to find a better form for zero compares. I'm seeing lines like "movb @(b+94)(r10), @(b+94)(r10)" which looks mighty costly.

Amazingly, there's floating point in the 2K_chess code. There's a single constant buried in that mess.

Byte constants should use either hex or unsigned decimal. Currently we see signed decimal.

Gratuitous readelf output:

1404 lines of assembly
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 00000000 000034 000dda 00 AX 0 0 2
[ 2] .rela.text RELA 00000000 001e0c 000dbc 0c 6 1 4
[ 3] .data PROGBITS 00000000 000e0e 000041 00 WA 0 0 2
[ 4] .bss NOBITS 00000000 000e4f 000cb2 00 WA 0 0 1
[ 5] .shstrtab STRTAB 00000000 000e4f 000031 00 0 0 1
[ 6] .symtab SYMTAB 00000000 000fc0 000b10 10 7 146 4
[ 7] .strtab STRTAB 00000000 001ad0 00033a 00 0 0 1

Assuming everything is correct (seems to be), and before addressing the comments above, that's about 5% smaller code than I had before.

old: 1463 lines, 3668 .text bytes
new: 1404 lines, 3546 .text bytes

Tuesday, June 14, 2011

I've fixed the GO_IF_LEGITIMATE_ADDRESS macro to handle more complicated addresses, and removed some fake register instructions. Memory handling now looks much better.

I've confirmed that the @(1+2)R1 forms are only shown with R10, so these are offsets into structures stored on the stack, and would not be combined. This is because the two constants are distinct as far as GCC is concerned (base + offset). Knowing that, I can update the final address printing routine to combine these constants. Symbol offsets will have to be wrapped in parentheses.

Saturday, June 11, 2011

The "unrecognizable insn" problem is in fact caused by the memory constraints. This address construct slips by earlier checks, and it's during instruction recognition that things go bad. At this time, the recognition tree is walked and the last step is to confirm that the address form is valid. The GO_IF_LEGITIMATE_ADDRESS macro does not recognise this form, and so we see the "unrecognizable insn" error.

For some reason, constant memory expressions are not combined, and forms like @(1+2)(R1) or @(a+1)(R1) are seen late in the compilation process, causing mayhem.

Friday, June 3, 2011

I compiled a section of 2k_chess to exercise the new code and see where things stand. Things went really well except for this:

unrecognizable insn:
(insn 1253 482 483 85 2k_chess.c:111 (set (reg:QI 2 r2)
(mem/s/j:QI (plus:HI (mem/c:HI (plus:HI (reg/f:HI 10 r10)
(const_int 64 [0x40])) [4 %sfp+64 S2 A16])
(symbol_ref:HI ("b") )) [0 b S1 A8])) -1 (nil))

This instruction should result in "movb @b+64(R10), R2".

This should be handled by the code I have now, but for some reason it's not. Maybe I've goofed on the memory constraints and they are too narrow, disallowing "symbol+offset" forms. I guess I've got some research to do.