For the past couple of days I've been fighting with the compiler to try to get a working MULHI3 pattern. This is a 16-bit multiply with a 16-bit result. I already have a working multiply pattern, but it's awfully fiddly, and it's more complicated than I think it should be. In addition to all that, Lucien has a bug which shows up because of the missing MULHI3 pattern.
To make this work, I've been playing with using "define_expand" instead of "define_insn" in the machine description file. These are pretty similar, but I can have more control over temporary variables by using "define_expand". This allows more of the work I was doing in the existing multiply pattern to be pushed out into RTL expressions. At that point, GCC can automatically handle the special cases I was previously using brute force to handle. Most importantly, I won't have to be concernend about adding a bunch of superfluous MOVs or XORs to make everything work. GCC will just conveniently pick registers to make everything work without all that junk.
Monday, April 30, 2012
Saturday, April 21, 2012
Lucien on AtariAge has found a lot of problems in the compiler. Time to get back into bugfix mode.
First off there's a problem with stack construction. Apparently I was double counting the saved register space in the stack frame. The local variables are allocated from the start of the frame. I was confused and thought that the frame was required to be at the bottom of the stack, and so I made space for the saved registers, which are stored below the local variables.
This caused all frame local variables to be misaddressed, corrupting some saved registers. Lucien provided a fix, which was mostly right, but miscalculated the number of saved registers in some optimization modes. I think that's the last time the stack code needs to be changed.
There was another problem in the assembler, I thought i was being clever in allowing either TI-style single quotes or C-style double quotes for string constants. That code worked fine for C-style strings, but always closed the string when TI-style strings were used. This is a problem since GCC was configured to always use TI-style strings for the emitted assembly. This means there was no way to use single quotes in compiled code.
This was fixed by rewriting a good chunk of the string handling code in GAS. Now if a TI-style quote is used ('like this'), no escape sequences will be recognized except for the sequence '' which is converted to one single quote character. If C-style quotes are used ("like this"), all standard escape sequences are recognized. This allows assembly authors to chose the style they prefer while remaining compatible with the TI conventions. GCC always uses TI-style strings, so users will see their expected output.
I think there's another five or so GCC bugs in Lucien's latest game. I'll start going through it and see what I can do.
First off there's a problem with stack construction. Apparently I was double counting the saved register space in the stack frame. The local variables are allocated from the start of the frame. I was confused and thought that the frame was required to be at the bottom of the stack, and so I made space for the saved registers, which are stored below the local variables.
This caused all frame local variables to be misaddressed, corrupting some saved registers. Lucien provided a fix, which was mostly right, but miscalculated the number of saved registers in some optimization modes. I think that's the last time the stack code needs to be changed.
There was another problem in the assembler, I thought i was being clever in allowing either TI-style single quotes or C-style double quotes for string constants. That code worked fine for C-style strings, but always closed the string when TI-style strings were used. This is a problem since GCC was configured to always use TI-style strings for the emitted assembly. This means there was no way to use single quotes in compiled code.
This was fixed by rewriting a good chunk of the string handling code in GAS. Now if a TI-style quote is used ('like this'), no escape sequences will be recognized except for the sequence '' which is converted to one single quote character. If C-style quotes are used ("like this"), all standard escape sequences are recognized. This allows assembly authors to chose the style they prefer while remaining compatible with the TI conventions. GCC always uses TI-style strings, so users will see their expected output.
I think there's another five or so GCC bugs in Lucien's latest game. I'll start going through it and see what I can do.
Monday, April 9, 2012
Since there were no more problems, I've gone back to printf. To make any more progress, I need to get 32-bit division working. I mentioned this approach earlier, but after I went over those notes, I found some problems.
So just to be sure, I derived everything from scratch. The notes here are the highlights, since I don't want to reproduce the three pages I needed for the proof here.
The idea is to transform the fraction to ensure a 16-bit denominator. This allows the use of the DIV instruction, and skips the more commonly-seen but much slower shift-and-subtract method usually used for this kind of math.
Let's start at the beginning. Assume positive numerator and denominator.
num
----- = Q
den
The denominator is 32-bit, so for non-zero results, the numerator must be 32-bits as well. The "n" constant is 0x10000.
a*n+b
------- = Q
c*n+d
Multiply by (1/c)/(1/c), rename to P to acknowledge the truncation error
(a*n+b)/c
----------- = P
n+(d/c)
Multiply by (1/2)/(1/2) to keep 16-bit denominator.
((a*n+b)/2)/c
----------------- = P
(n/2)+((d/2)/c)
Find the remainder to correct for truncation error. Given the requirements on the inputs, P will be no more than 16 bits wide.
    
(a*n+b) - P(c*n+d) = R
The bulk of the proof was in determining that this was the only correction needed over the entire domain of valid inputs:
If R<0 then decrement P by one
Here's the resulting C code. It's been verified on a PC using Monte Carlo methods over the valid domain. I just need to convert to TMS9900 assembly and count some clocks.
long div(unsigned long num, unsigned long den)
{
unsigned short c=den>>16;
unsigned short d=den&0xFFFF;
unsigned short v=0x8000+(d/2)/c;
unsigned long u=(num/2)/c;
unsigned long p = u/v;
long r = num-p*den;
if(r<0)
{r+=den; p--;}
return(p);
}
This requires one 32x16 division, one 32x32 multiply, a 32-bit add and a 32-bit subtract, All other operations can be made using 16-bit values.
Not too shabby.
So just to be sure, I derived everything from scratch. The notes here are the highlights, since I don't want to reproduce the three pages I needed for the proof here.
The idea is to transform the fraction to ensure a 16-bit denominator. This allows the use of the DIV instruction, and skips the more commonly-seen but much slower shift-and-subtract method usually used for this kind of math.
Let's start at the beginning. Assume positive numerator and denominator.
num
----- = Q
den
The denominator is 32-bit, so for non-zero results, the numerator must be 32-bits as well. The "n" constant is 0x10000.
a*n+b
------- = Q
c*n+d
Multiply by (1/c)/(1/c), rename to P to acknowledge the truncation error
(a*n+b)/c
----------- = P
n+(d/c)
Multiply by (1/2)/(1/2) to keep 16-bit denominator.
((a*n+b)/2)/c
----------------- = P
(n/2)+((d/2)/c)
Find the remainder to correct for truncation error. Given the requirements on the inputs, P will be no more than 16 bits wide.
(a*n+b) - P(c*n+d) = R
The bulk of the proof was in determining that this was the only correction needed over the entire domain of valid inputs:
If R<0 then decrement P by one
Here's the resulting C code. It's been verified on a PC using Monte Carlo methods over the valid domain. I just need to convert to TMS9900 assembly and count some clocks.
long div(unsigned long num, unsigned long den)
{
unsigned short c=den>>16;
unsigned short d=den&0xFFFF;
unsigned short v=0x8000+(d/2)/c;
unsigned long u=(num/2)/c;
unsigned long p = u/v;
long r = num-p*den;
if(r<0)
{r+=den; p--;}
return(p);
}
This requires one 32x16 division, one 32x32 multiply, a 32-bit add and a 32-bit subtract, All other operations can be made using 16-bit values.
Not too shabby.
Sunday, April 1, 2012
I found this sequence which is being used to clear a byte of memory:
clr r2 # clocks: 10 bytes: 2
movb r2, *r1 # clocks: 4+14+6 = 24 bytes: 2
total: 34 clocks, 4 bytes
For indexed memory locations:
clr r2 # clocks: 10 bytes: 2
movb r2, @1(r1) # clocks: 4+14+8= 26 Bytes:4
total: 36 clocks, 6 bytes
I can do better than that using subract instructions:
sb *r1, *r1 # clocks: 4+14+6+6 = 30 bytes: 2
total: 30 clocks, 2 bytes
sb @1(r1), @1(r1) # clocks: 4+14+8+8 = 34 bytes: 6
total: 34 clocks, 6 bytes
The first form is about 12% faster, the second is about 6% faster. This isn't a huge improvement, but it's still better. This is probably best added as a peephole.
And now it's in there.
There's also this sequence, which I'm not happy about. It's the result of:
unsigned char a = (((unsigned char)val & (char)0x0F) + (char)'0');
mov r2, r5
swpb r5
srl r5, 8
andi r5, >F
ai r5, >30
swpb r5
I think this would be better:
mov r2, r5
andi r5, >F
ai r5, >30
swpb r5
So I've added an optimization for (int)X = (unsigned char)((int)X). This replaces:
mov r2, r5 # clocks:14
swpb r5 # clocks:10
sra r5, 8 # clocks:12+16
# total=52 clocks
with:
mov r2, r5 # clocks: 14
andi r5, >00FF # clocks: 14+4
# total=32 clocks
This is nearly twice as fast, and in the case where no MOV is needed, even faster. This makes me happy again.
clr r2 # clocks: 10 bytes: 2
movb r2, *r1 # clocks: 4+14+6 = 24 bytes: 2
total: 34 clocks, 4 bytes
For indexed memory locations:
clr r2 # clocks: 10 bytes: 2
movb r2, @1(r1) # clocks: 4+14+8= 26 Bytes:4
total: 36 clocks, 6 bytes
I can do better than that using subract instructions:
sb *r1, *r1 # clocks: 4+14+6+6 = 30 bytes: 2
total: 30 clocks, 2 bytes
sb @1(r1), @1(r1) # clocks: 4+14+8+8 = 34 bytes: 6
total: 34 clocks, 6 bytes
The first form is about 12% faster, the second is about 6% faster. This isn't a huge improvement, but it's still better. This is probably best added as a peephole.
And now it's in there.
There's also this sequence, which I'm not happy about. It's the result of:
unsigned char a = (((unsigned char)val & (char)0x0F) + (char)'0');
mov r2, r5
swpb r5
srl r5, 8
andi r5, >F
ai r5, >30
swpb r5
I think this would be better:
mov r2, r5
andi r5, >F
ai r5, >30
swpb r5
So I've added an optimization for (int)X = (unsigned char)((int)X). This replaces:
mov r2, r5 # clocks:14
swpb r5 # clocks:10
sra r5, 8 # clocks:12+16
# total=52 clocks
with:
mov r2, r5 # clocks: 14
andi r5, >00FF # clocks: 14+4
# total=32 clocks
This is nearly twice as fast, and in the case where no MOV is needed, even faster. This makes me happy again.
Subscribe to:
Comments (Atom)
