Monday, June 27, 2011

I've fixed the 16-bit multiply instructions. For some reason, when using a multiply instructions with registers as both operands, they get swapped. So instead of "A=A*B", I see "A=B*A". This is annoying, since if I constantly use one argument for all configurations in the assembly output, I could end up with "A=A*A" in some cases. Fortunately, this is easily handled. At some point, I need to find out where the swap comes from, but I'm OK with leaving this as it is for now.

Fun fact: GCC really, really doesn't want to use multiply. If one of the operands is constant, it'll prefer permutations of shifts and adds to calculate the result, even if this would be much more expensive for the TMS9900.

There's true oddness for the divide instructions. The instructions I had before work just fine, but GCC is selecting the opposite function than the one called for. For instance, if I write "A/B", I get back "A%B", and vice versa. The wierd thing is that this doesn't seem to be a bug in my code, but GCC is knowingly doing this for some reason. Research awaits.

OK, I'm losing my mind apparently. The DIV and MOD instructions are working just fine. Moving on...

That completes the list of known problems in the coverage tests. I may have missed some unsigned variants, so I'll need to double-check that. There might be a few more 32-bit instructions I could do too, but I think I'm good right now.

Saturday, June 25, 2011

It looks like stack space is not being allocated for local variables when using -O0. This sounds familiar, so I'll need to go through the previous notes and see where things stand.

32-bit negate has been fixed, moving on.

The subtract opcode seems to be working wierd. The carry flag is being set opposite to what the carry-in would be. Odd, but not a big deal. Hopefully this is not an emulation bug, because I have no way of verifying this on real hardware right now.

I've also added optimizations for adding special constant values (-2,-1,0,1,2) in each of the words. Works fine.

Wednesday, June 22, 2011

I need to test 32-bit subtract constant to see if it's correct. Seems OK during code inspection.

I also need to look at multiply, it doesn't seem to be handling type conversion and register allocation correctly.

Of course, divide will likely be broken too...

32-bit negate also looks shady.

int-to-long conversion: (r1 -> r2, r3)
mov r1, r3 4+14 2
seto r2 4+10 2
jlt $+4 4+(10..8) 2
clr r2 (0..4+10) 2
Total: (46..58) 8

Looks costly, let's try this:
mov r1, r3 4+14 2
mov r1, r2 4+14 2
sar r2, 15 4+12+2*15 2
Total: 82 6

Ugh, no. I'm sticking to what I had before.

So, after all that testing, I'm left with four problems (32-bit subtract, 32-bit negate, multiply, divide). That seems doable.

Tuesday, June 21, 2011

Coverage testing continues. I'm down to the add instructions, and I've decided to remove "c *r1+, *r1+" for "r1+=4". The math for this is below.

c *r1+, *r1+ clocks: 14+8+8=30 bytes: 2
ai r1, 4 clocks: 14+4 =18 bytes: 4

So the "c" form is half as big, but takes almost twice as long. I'm thinking now that speed is preferrable to size in the general case. For space-constrained code, this still shouldn't matter much, since +4 isn't likely to be used very often. So out it goes. I've left that code commented out in the MD file just in case I change my mind.

Here's some more unexpected stuff. I made a function which was just "return(memval++);", where memval was stored in memory. That resulted in this code:
mov @memval, r1 14+8 4
mov r1, r2 14 2
inc r2 10 2
mov r2, @memval 14+8 2
b *r11 8 2
Total: 76 clocks 12 bytes

That should have been:
inc @memval 10+8 4
mov @memval, r1 14+8 4
b *r11 8 2
Total: 48 clocks 10 bytes

Actually, now that I think about it, that's correct. The C code returns the current value of "memval", then increments it. By changing the C code to "return(++memval);", returning the incremented value, I get this:
mov @memval, r1 14+8 4
inc r1 10 2
mov r1, @memval 14+8 4
b *r11 8 2
Total: 62 clocks 12 bytes

It's a little closer to the expected code above, but still clunky. I can see why this code came out though. GCC is trying to minimize the number of bus transactions, and keep as much work in the registers as possible. This is not as important for the TMS9900, and we end up with suboptimal code.

I might be able to fix this by tweaking weights in the H file, but I can do that later.

I'm skipping the DIV instructions for now, I don't want to get sucked into that mess right now. For that matter, I'm skipping all 32-bit instructions too. I'll come back when 8 and 16 bit code is correct.

Down to the sign and zero extend instructions now. Remember the SB trick to clear the upper byte? That might be handy now.
srl r0, 8 12+2*8 2
Total: 28 2

swpb r0 10 2
sb r0, r0 14 2
Total: 24 4

Maybe not. I guess I'll leave this alone.

The shift-and-cast peepholes are broken after the recent GCC changes. So this is a good time to reevaluate the left shift and cast code. I have two options:
srl r0, N+8 12+2*8+2*N 2
Total: 28+2*N 2

srl r0, N 12+2*N 2
swpb r0 10 2
Total: 22+2*N 4

I guess I'm sticking with a single instruction then. There is a break in the pattern for a left shift of one:
a r1, r1 14 2
swpb r1 10 2

This is 24 cycles, compared to 30 cycles for a single instruction, which doesn't look too bad except that there is an additional 4 cycles imposed for reading an instruction from memory. The revised numbers would then be 32 and 34, a lot closer, but still slightly faster. I'll leave that alone for now.

Next up: optimized byte initializers.

Monday, June 20, 2011

I'm glad I'm doing these coverage tests, since I've just found another problem with shift operations. Performing "(char)X<<=2" results in this sequence:

sra r1, 8 14+2*8 cycles 2 bytes
sla r1, 2 14+2*2 cycles 2 bytes
swpb r1 10 cycles 2 bytes
Total: 54 cycles, 6 bytes

Ick.
This has been fixed to match the unsigned shift code:

andi r1, >FF00 14+4 cycles 4 bytes
sla r1, 2 14+2*2 cycles 2 bytes
Total: 36 cycles, 6 bytes

Much better.

There's also a problem with AND, for some reason "a&b" is being converted to:
inv r1
szc r1, r1

The value stored in r2 is lost. It turns out that this instruction is generated by define_split. It converts a single AND to a NOT, AND_NOT sequence. Unfortunately, there was a typo in the extra step. Two of the three intermediate expressions were equivalent. Of course, those were the two passed on to define_insn, resulting in the lost register expression. All fixed now.
I'm going through the machine description file, making sure the emitted code is correct. I've yet to act on the earlier notes.

Right now, I'm looking at left shifted byte quantities. The problem here is that I can't just use a single left shift instruction because junk will just be shifted into the low order bits. It looks like I have two options:

Form 1: Mask off lower bits in word mode, then shift:
andi R1, >FF00 14+4 4 bytes
sla R1, N 12+2N 2 bytes
Total: 30+2N 6 bytes

Form 2: Swap bytes, shift, then restore
swpb R0 10 2 bytes
sla R1, N 12+2N 2 bytes
swpb R0 10 2 bytes
Total: 32+2N 6 bytes

Form 1 is slightly faster, and is more correct. The condition flags are left in the proper state. Form 2 would result in random condition flags, since SWPB does not modify flags, and SLA would set the flags based on the random contents of the unused byte. So I'll be using form one for this stuff.

Here's a neat idea I had while looking at the opcodes. I can clear the upper byte by SBing at register with itself. This saves two clocks and two bytes:
sb R0, R0 14 cycles, 2 bytes
andi R0, >00FF 14+2 cycles, 4 bytes

I can't use this right now, but I added the idea here so I wouldn't forget.

Wednesday, June 15, 2011

So, just how costly is that "movb @(b+94)(r10), @(b+94)(r10)" instruction?
14+8+8 = 30 clocks, 6 bytes

We can do better than that.

mov @(b+94)(r10), r0 : 14+8 = 22 clocks, 4 bytes

That's about 25% faster and 30% smaller, with the added cost of a temp register. That's not bad.

I took a look at the machine description file, and I really need to go through it line by line. I've seen a few bugs in uncommon branches. Also, that file is a mess since I've tried so many different approaches. Additionally, the 32-bit instructions have not been exercised much and could stand to be looked at again.

Also, look at "ai" emitters for byte quantities. I'm seeing stuff like ">FFE9 * 256", this just looks wrong.