I found this sequence which is being used to clear a byte of memory:
clr r2 # clocks: 10 bytes: 2
movb r2, *r1 # clocks: 4+14+6 = 24 bytes: 2
total: 34 clocks, 4 bytes
For indexed memory locations:
clr r2 # clocks: 10 bytes: 2
movb r2, @1(r1) # clocks: 4+14+8 = 26 bytes: 4
total: 36 clocks, 6 bytes
I can do better than that using subtract instructions:
sb *r1, *r1 # clocks: 4+14+6+6 = 30 bytes: 2
total: 30 clocks, 2 bytes
sb @1(r1), @1(r1) # clocks: 4+14+8+8 = 34 bytes: 6
total: 34 clocks, 6 bytes
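The first form is about 12% faster (30 clocks instead of 34) and 2 bytes smaller; the second is about 6% faster (34 clocks instead of 36) at the same size. This isn't a huge improvement, but it's still better. This is probably best added as a peephole optimization.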
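As a quick check on those percentages, here is a minimal C sketch of the arithmetic. The function name percent_saved is made up for illustration; the clock counts are the totals from the listings above.

```c
/* Hypothetical helper: percentage of clocks saved when a sequence
 * costing `before` clocks is replaced by one costing `after` clocks. */
double percent_saved(int before, int after)
{
    return 100.0 * (double)(before - after) / (double)before;
}

/* Totals taken from the listings above (not measured here). */
enum {
    CLR_MOVB_DIRECT_CLOCKS  = 34,  /* clr + movb r2, *r1      */
    SB_DIRECT_CLOCKS        = 30,  /* sb *r1, *r1             */
    CLR_MOVB_INDEXED_CLOCKS = 36,  /* clr + movb r2, @1(r1)   */
    SB_INDEXED_CLOCKS       = 34   /* sb @1(r1), @1(r1)       */
};
```

Calling percent_saved(CLR_MOVB_DIRECT_CLOCKS, SB_DIRECT_CLOCKS) gives a little under 12, and the indexed pair gives a little under 6.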
And now it's in there.
There's also this sequence, which I'm not happy about. It's the result of:
unsigned char a = (((unsigned char)val & (char)0x0F) + (char)'0');
mov r2, r5
swpb r5
srl r5, 8
andi r5, >F
ai r5, >30
swpb r5
I think this would be better:
mov r2, r5
andi r5, >F
ai r5, >30
swpb r5
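To convince myself the shorter sequence really is equivalent, here is a rough C model of both listings. It is only a sketch: the function names are made up, registers are modelled as plain 16-bit integers, and it assumes the byte result is expected in the high byte of r5 and that '0' is ASCII 0x30.

```c
#include <stdint.h>

/* SWPB: swap the two bytes of a 16-bit register (model only). */
static uint16_t swpb(uint16_t r)
{
    return (uint16_t)((r << 8) | (r >> 8));
}

/* Model of the generated sequence: swpb, srl 8, andi >F, ai >30, swpb. */
uint16_t digit_long(uint16_t r2)
{
    uint16_t r5 = r2;               /* mov  r2, r5  */
    r5 = swpb(r5);                  /* swpb r5      */
    r5 = (uint16_t)(r5 >> 8);       /* srl  r5, 8   */
    r5 &= 0x000F;                   /* andi r5, >F  */
    r5 = (uint16_t)(r5 + 0x0030);   /* ai   r5, >30 */
    return swpb(r5);                /* swpb r5      */
}

/* Model of the proposed sequence: andi >F, ai >30, swpb. */
uint16_t digit_short(uint16_t r2)
{
    uint16_t r5 = r2;               /* mov  r2, r5  */
    r5 &= 0x000F;                   /* andi r5, >F  */
    r5 = (uint16_t)(r5 + 0x0030);   /* ai   r5, >30 */
    return swpb(r5);                /* swpb r5      */
}

/* Returns 1 if both models agree with the C expression for every input. */
int digit_sequences_match(void)
{
    for (uint32_t v = 0; v <= 0xFFFF; v++) {
        unsigned char expected =
            (unsigned char)(((unsigned char)v & 0x0F) + '0');
        uint16_t want = (uint16_t)(expected << 8);  /* byte in high half */
        if (digit_long((uint16_t)v) != want ||
            digit_short((uint16_t)v) != want)
            return 0;
    }
    return 1;
}
```

The check is exhaustive over all 65536 register values, so under those assumptions dropping the swpb/srl pair changes nothing.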
So I've added an optimization for (int)X = (unsigned char)((int)X). This replaces:
mov r2, r5 # clocks: 14
swpb r5 # clocks: 10
srl r5, 8 # clocks: 12+16
# total=52 clocks
with:
mov r2, r5 # clocks: 14
andi r5, >00FF # clocks: 14+4
# total=32 clocks
This is about 1.6 times as fast (52 clocks down to 32), and in the case where no MOV is needed it is more than twice as fast (38 clocks down to 18). This makes me happy again.
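The same kind of C model works for this replacement. Again a sketch with made-up names: it assumes the shift is a logical one (srl, as in the earlier listing), so the result is a zero-extended low byte, and it simply re-adds the clock counts quoted in the comments above.

```c
#include <stdint.h>

/* Model of the old sequence: swpb, then a logical shift right by 8. */
uint16_t zext_old(uint16_t r2)
{
    uint16_t r5 = r2;                            /* mov  r2, r5 */
    r5 = (uint16_t)((r5 << 8) | (r5 >> 8));      /* swpb r5     */
    r5 = (uint16_t)(r5 >> 8);                    /* srl  r5, 8  */
    return r5;
}

/* Model of the new sequence: mask off the high byte. */
uint16_t zext_new(uint16_t r2)
{
    uint16_t r5 = r2;                            /* mov  r2, r5    */
    r5 &= 0x00FF;                                /* andi r5, >00FF */
    return r5;
}

/* Returns 1 if the two models agree for every 16-bit input. */
int zext_sequences_match(void)
{
    for (uint32_t v = 0; v <= 0xFFFF; v++) {
        if (zext_old((uint16_t)v) != zext_new((uint16_t)v))
            return 0;
    }
    return 1;
}

/* Clock totals, summed from the per-instruction figures in the post. */
enum {
    ZEXT_OLD_CLOCKS = 14 + 10 + 12 + 16,
    ZEXT_NEW_CLOCKS = 14 + 14 + 4
};
```

Note that the equivalence would not hold for an arithmetic shift, which would sign-extend the byte instead of zero-extending it.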