Coverage testing continues. I'm down to the add instructions, and I've decided to remove the "c *r1+, *r1+" form that was being emitted for "r1+=4" and use "ai r1, 4" instead. The math for this is below.
c *r1+, *r1+    clocks: 14+8+8 = 30    bytes: 2
ai r1, 4        clocks: 14+4   = 18    bytes: 4
So the "c" form is half as big, but takes about 1.7 times as long (30 clocks versus 18). I'm thinking now that speed is preferable to size in the general case. For space-constrained code, this still shouldn't matter much, since +4 isn't likely to be used very often. So out it goes. I've left that code commented out in the MD file just in case I change my mind.
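To see why the compare form works as an add at all, here is a minimal C sketch of the two instructions' effect on r1. It assumes the usual TMS9900 behavior that a word-sized post-increment bumps the register by 2, and it ignores the status flags that the compare also sets. The function and constant names are hypothetical, chosen only for this illustration.

```c
#include <stdint.h>

/* Model of "c *r1+, *r1+": a word compare whose two operands both use
   post-increment on r1. Each word access bumps r1 by 2. The compare result
   only lands in the status flags, which this sketch does not model. */
static uint16_t add4_via_compare(uint16_t r1)
{
    r1 = (uint16_t)(r1 + 2);  /* source operand *r1+ */
    r1 = (uint16_t)(r1 + 2);  /* destination operand *r1+ */
    return r1;
}

/* Model of "ai r1, 4": add the immediate value 4 to r1. */
static uint16_t add4_via_ai(uint16_t r1)
{
    return (uint16_t)(r1 + 4);
}

/* Clock counts exactly as quoted in the post. */
enum {
    CLOCKS_COMPARE_FORM = 14 + 8 + 8,  /* 30 clocks, 2 bytes */
    CLOCKS_AI_FORM      = 14 + 4       /* 18 clocks, 4 bytes */
};
```

Both functions return the same register value for any input, including wraparound at the top of the 16-bit range, so the only differences left are the clock count, the size, and the clobbered flags.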
Here's some more unexpected stuff. I made a function which was just "return(memval++);", where memval was stored in memory. That resulted in this code:
mov @memval, r1    14+8    4
mov r1, r2         14      2
inc r2             10      2
mov r2, @memval    14+8    4
b *r11             8       2
Total: 76 clocks, 14 bytes
That should have been:
inc @memval        10+8    4
mov @memval, r1    14+8    4
b *r11             8       2
Total: 48 clocks, 10 bytes
Actually, now that I think about it, the generated code is correct. The C code returns the current value of "memval", then increments it, so the old value has to be kept in r1 while the incremented copy is stored. By changing the C code to "return(++memval);", returning the incremented value, I get this:
mov @memval, r1    14+8    4
inc r1             10      2
mov r1, @memval    14+8    4
b *r11             8       2
Total: 62 clocks, 12 bytes
It's a little closer to the expected code above, but still clunky. I can see why this code came out though. GCC is trying to minimize the number of bus transactions, and keep as much work in the registers as possible. This is not as important for the TMS9900, and we end up with suboptimal code.
I might be able to fix this by tweaking weights in the H file, but I can do that later.
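For reference, the two C functions being compared look roughly like this. The post only gives the return statements, so the function names, the `int` type, and the initial value of `memval` are assumptions made for the sketch.

```c
/* Hypothetical stand-in for the memory-resident variable from the post. */
int memval = 0;

/* Post-increment: returns the value memval held before the increment,
   so the compiler must keep the old value alive while storing the new one. */
int return_post_increment(void)
{
    return memval++;
}

/* Pre-increment: returns the value memval holds after the increment,
   so the returned value and the stored value are the same. */
int return_pre_increment(void)
{
    return ++memval;
}
```

Starting from `memval == 5`, the first function returns 5 and leaves 6 in memory, while the second would then return 7 and leave 7 in memory. Only the second form could legally compile to an "inc @memval" followed by a load.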
I'm skipping the DIV instructions for now; I don't want to get sucked into that mess yet. For that matter, I'm skipping all 32-bit instructions too. I'll come back when the 8- and 16-bit code is correct.
Down to the sign and zero extend instructions now. Remember the SB trick to clear the upper byte? That might be handy now.
srl r0, 8     12+2*8    2
Total:        28        2

swpb r0       10        2
sb r0, r0     14        2
Total:        24        4
Maybe not. I guess I'll leave this alone.
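The two sequences above do produce the same register value, which a small C model can confirm. This is only a sketch: it assumes that a byte operation on a register acts on the register's high byte, it ignores the status flags, and the helper names are made up for the illustration.

```c
#include <stdint.h>

/* srl r0, 8: logical shift right by eight bits. */
static uint16_t srl8(uint16_t r0)
{
    return (uint16_t)(r0 >> 8);
}

/* swpb r0: exchange the high and low bytes of the register. */
static uint16_t swpb(uint16_t r0)
{
    return (uint16_t)((r0 << 8) | (r0 >> 8));
}

/* sb r0, r0: subtract the register's high byte from itself. The high byte
   becomes zero and the low byte is left untouched. */
static uint16_t sb_self(uint16_t r0)
{
    return (uint16_t)(r0 & 0x00FF);
}

/* Returns 1 if both sequences agree for every possible 16-bit input. */
static int sequences_agree(void)
{
    uint32_t v;
    for (v = 0; v <= 0xFFFF; v++) {
        uint16_t r0 = (uint16_t)v;
        if (srl8(r0) != sb_self(swpb(r0)))
            return 0;
    }
    return 1;
}
```

Since the results match, the choice comes down to the clock and byte counts listed above.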
The shift-and-cast peepholes are broken after the recent GCC changes. So this is a good time to reevaluate the left shift and cast code. I have two options:
srl r0, N+8   12+2*8+2*N    2
Total:        28+2*N        2

srl r0, N     12+2*N        2
swpb r0       10            2
Total:        22+2*N        4
I guess I'm sticking with a single instruction then. There is a break in the pattern for a left shift of one:
a r1, r1      14    2
swpb r1       10    2
This is 24 cycles, compared to 30 cycles for the single-instruction form with N=1. That doesn't look too bad, except that each instruction costs an additional 4 cycles to read from memory. The revised numbers would then be 32 for the two-instruction form and 34 for the single instruction: a lot closer, but the two-instruction form is still slightly faster. I'll leave that alone for now.
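The arithmetic for the three forms can be written out as a short C sketch. The clock formulas and the 4-cycle per-instruction fetch cost are taken from the figures in this post; the function names and the `with_fetch` switch are hypothetical.

```c
/* Per-instruction cost of reading the instruction from memory,
   using the 4-cycle figure quoted in the post. */
#define FETCH_PENALTY 4

/* Single shift by N+8: 12 + 2*8 + 2*N clocks, one instruction. */
static int single_shift_clocks(int n, int with_fetch)
{
    return 12 + 2 * 8 + 2 * n + (with_fetch ? FETCH_PENALTY : 0);
}

/* Shift by N followed by swpb: (12 + 2*N) + 10 clocks, two instructions. */
static int shift_swpb_clocks(int n, int with_fetch)
{
    return 12 + 2 * n + 10 + (with_fetch ? 2 * FETCH_PENALTY : 0);
}

/* "a r1, r1" followed by "swpb r1": 14 + 10 clocks, two instructions.
   This special case only applies to a shift of one. */
static int add_swpb_clocks(int with_fetch)
{
    return 14 + 10 + (with_fetch ? 2 * FETCH_PENALTY : 0);
}
```

For a shift of one, `single_shift_clocks(1, 1)` gives 34 and `add_swpb_clocks(1)` gives 32, matching the revised numbers above.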
Next up: optimized byte initializers.