I tried to compare the function and inline methods to see which is the most efficient. Unfortunately, the variables which must be considered are not easy to predict (number of and ratio between division and modulus calls, number of registers to save around function calls).
With much hand waving and shady math, I determined that if there are five or less signed division calls, the inline method wins out.
Ultimately, it doesn't matter. I am going to implement this in an inline form, and if it turns out that calling out to a function to perform division is more efficient, the user can write their code to do that. The other reason is that there is no good way for the user to somehow inline division if the function method is worse.
Wednesday, August 27, 2014
Tuesday, August 26, 2014
Apparently, division has problems too. Here are the failed scenarios:
X/10 -> Results in -1 for all negative X
X/(-10) -> Results in -1 for all negative X, 0 for all positive X
N/X -> Internal compiler error for all N
The compiler error looks like it will be a pain to track down, so I'm looking at the incorrect math first.
Here's a simple function I'm using for testing, assume a=-100, b=10:
# int divide(int a, int b) {return a/b;}
divide: # R1=0xFF9B R2=0xA
mov r1, r4 # R4=0xFF9B
seto r3 # R3=0xFFFF
jlt $+>4
clr r3
div r2, r3 # [R3,R4]/R2 = 0xFFFFFF9B/0xA
mov r3, r1 # R3 holds quotient
b *r11
If DIV does signed division, we should see a result of 0xFFF5 (-10)
If DIV only does unsigned division, we should have a result of 0x998F. For some reason we see 0xFFFF, weird.
Nope, not weird, I just missed a really important note in the data sheet: "... if the divisor is less that the data contained in Rd ... Rd and Rd+1 are unchanged." I think this is the second time I've really messed up division. Anyway, DIV apparently only does unsigned division, so I need to think about this a bit.
div_unsigned: # return A/B, R1=A, R2=B
clr r0 # Clear high word of [R0,R1]
div r2, r0 # R0 = [R0,R1]/R2
mov r0, r1 # Move result to return position
b *r11 # Return result
Probably the best solution here is to find the sign of the quotient, convert both operands to their absolute value, divide, then set the sign of the result. No other approach I've looked at would be smaller or faster than this.
div_signed: # return A/B, R1=A, R2=B
mov r1, r3
xor r2, r3 # Calculate sign of result
abs r1 # Make arguments positive
abs r2
clr r0 # Clear high word
div r2, r0 # R0 = [R0,R1]/R2
inv r3
jlt posval
neg r0 # Negate result if necessary
posval: mov r0, r1 # Move result to return position
b *r11 # Return result
22 bytes
OK, what about modulo? As of the C 1999 and C++ 2011 standards, the sign of the modulo must match that of the dividend.
Examples:
103% 10 = 3
103%-10 = 3
-103% 10 = -3
-103%-10 = -3
For The unsigned case, we can easily modify the division function by returning the modulo stored in R1. For the signed case, it gets trickier. We need to save the sign of the dividend somewhere and use it after the calculation is complete.
mod_unsigned: # return A%B, R1=A, R2=B
clr r0 # Clear high word of [R0,R1]
div r2, r0 # R1 = [R0,R1]%R2, return value already in position
b *r11 # Return result
mod_signed: # return A/B, R1=A, R2=B
mov r1, r3 # Save unmodified dividend
abs r1 # Make arguments positive
abs r2
clr r0 # Clear high word
div r2, r0 # R1 = [R0,R1]%R2
inv r3
jlt posval
neg r1 # Negate result if needed, already in position
posval: b *r11 # Return result
18 bytes
Any way to do a signed divmod, combining these two? Without lots of extra work?
divmod_signed: # return A/B and A%B, R1=A, R2=B
mov r1, r3
xor r2, r3 # Calculate sign of quotient
mov r1, r4 # Save unmodified dividend
abs r1 # Make arguments positive
abs r2
clr r0 # Clear high word
div r2, r0 # R0 = [R0,R1]/R2
inv r3
jlt posdiv
neg r0 # Negate quotient if needed
posdiv: inv r4
jlt posmod
neg r1 # Negate modulus if needed
posmod: mov r0, r1 # Move result to return position
b *r11 # Return result
30 bytes, 5 registers
This is really bulky and ugly, but ultimately better than two separate calls. About 25% smaller and maybe 75% faster. It might make sense to move this into a library call because if there is a lot of math, the code could get really big. The overhead for the function call seems fairly small too. need to do the math and get real numbers though.
X/10 -> Results in -1 for all negative X
X/(-10) -> Results in -1 for all negative X, 0 for all positive X
N/X -> Internal compiler error for all N
The compiler error looks like it will be a pain to track down, so I'm looking at the incorrect math first.
Here's a simple function I'm using for testing, assume a=-100, b=10:
# int divide(int a, int b) {return a/b;}
divide: # R1=0xFF9B R2=0xA
mov r1, r4 # R4=0xFF9B
seto r3 # R3=0xFFFF
jlt $+>4
clr r3
div r2, r3 # [R3,R4]/R2 = 0xFFFFFF9B/0xA
mov r3, r1 # R3 holds quotient
b *r11
If DIV does signed division, we should see a result of 0xFFF5 (-10)
If DIV only does unsigned division, we should have a result of 0x998F. For some reason we see 0xFFFF, weird.
Nope, not weird, I just missed a really important note in the data sheet: "... if the divisor is less that the data contained in Rd ... Rd and Rd+1 are unchanged." I think this is the second time I've really messed up division. Anyway, DIV apparently only does unsigned division, so I need to think about this a bit.
div_unsigned: # return A/B, R1=A, R2=B
clr r0 # Clear high word of [R0,R1]
div r2, r0 # R0 = [R0,R1]/R2
mov r0, r1 # Move result to return position
b *r11 # Return result
Probably the best solution here is to find the sign of the quotient, convert both operands to their absolute value, divide, then set the sign of the result. No other approach I've looked at would be smaller or faster than this.
div_signed: # return A/B, R1=A, R2=B
mov r1, r3
xor r2, r3 # Calculate sign of result
abs r1 # Make arguments positive
abs r2
clr r0 # Clear high word
div r2, r0 # R0 = [R0,R1]/R2
inv r3
jlt posval
neg r0 # Negate result if necessary
posval: mov r0, r1 # Move result to return position
b *r11 # Return result
22 bytes
OK, what about modulo? As of the C 1999 and C++ 2011 standards, the sign of the modulo must match that of the dividend.
Examples:
103% 10 = 3
103%-10 = 3
-103% 10 = -3
-103%-10 = -3
For The unsigned case, we can easily modify the division function by returning the modulo stored in R1. For the signed case, it gets trickier. We need to save the sign of the dividend somewhere and use it after the calculation is complete.
mod_unsigned: # return A%B, R1=A, R2=B
clr r0 # Clear high word of [R0,R1]
div r2, r0 # R1 = [R0,R1]%R2, return value already in position
b *r11 # Return result
mod_signed: # return A/B, R1=A, R2=B
mov r1, r3 # Save unmodified dividend
abs r1 # Make arguments positive
abs r2
clr r0 # Clear high word
div r2, r0 # R1 = [R0,R1]%R2
inv r3
jlt posval
neg r1 # Negate result if needed, already in position
posval: b *r11 # Return result
18 bytes
Any way to do a signed divmod, combining these two? Without lots of extra work?
divmod_signed: # return A/B and A%B, R1=A, R2=B
mov r1, r3
xor r2, r3 # Calculate sign of quotient
mov r1, r4 # Save unmodified dividend
abs r1 # Make arguments positive
abs r2
clr r0 # Clear high word
div r2, r0 # R0 = [R0,R1]/R2
inv r3
jlt posdiv
neg r0 # Negate quotient if needed
posdiv: inv r4
jlt posmod
neg r1 # Negate modulus if needed
posmod: mov r0, r1 # Move result to return position
b *r11 # Return result
30 bytes, 5 registers
This is really bulky and ugly, but ultimately better than two separate calls. About 25% smaller and maybe 75% faster. It might make sense to move this into a library call because if there is a lot of math, the code could get really big. The overhead for the function call seems fairly small too. need to do the math and get real numbers though.
Monday, August 25, 2014
I think I found a problem in the shift instructions.
I made a simple function to emit a variable shift instruction (EX: sla r12, r0). I was looking at the edge cases, and realized that if the variable shift value is zero, the instruction will actually shift by 16 bits. The C standard requires that in this case, no shift is to be performed. This is a problem.
I took a look at the shift instructions for Arm and x86, and they shift from 0 to N-1 bits for a N bit value. The TI instructions instead shift from 1 to N bits. Hmm..
So what I need to do is insert a check every time that a shift of this type is called for. Something like this:
# Left shift. Shift count in R0, value in R1
andi r0, >000F # Mask shift value, check for zero count
jeq $+4 # If zero, jump over shift
sla r1, r0 # Shift by non-zero bits
I'm not really happy about this, it adds 6 bytes for every use and lots of clocks. Unfortunately, I don't have a choice if I want to emit correct code.
After testing, this is now working properly.
I made a simple function to emit a variable shift instruction (EX: sla r12, r0). I was looking at the edge cases, and realized that if the variable shift value is zero, the instruction will actually shift by 16 bits. The C standard requires that in this case, no shift is to be performed. This is a problem.
I took a look at the shift instructions for Arm and x86, and they shift from 0 to N-1 bits for a N bit value. The TI instructions instead shift from 1 to N bits. Hmm..
So what I need to do is insert a check every time that a shift of this type is called for. Something like this:
# Left shift. Shift count in R0, value in R1
andi r0, >000F # Mask shift value, check for zero count
jeq $+4 # If zero, jump over shift
sla r1, r0 # Shift by non-zero bits
I'm not really happy about this, it adds 6 bytes for every use and lots of clocks. Unfortunately, I don't have a choice if I want to emit correct code.
After testing, this is now working properly.
Sunday, August 24, 2014
Hey, check it out. I got correct code finally:
Input C code:
unsigned char Round;
#define VDPWD *((volatile char*)0x8C00)
void modulo()
{
VDPWD = '0' + Round % 10;
}
Output assembly:
modulo:
movb @Round, r1
srl r1, 8
mov r1, r2
clr r1
li r3, >A div r3, r1
mov r2, r1
swpb r1 # instruction is right where it's supposed to be
ai r1, >3000
movb r1, @>8C00
b *r11
OK, I'll admit it. At first that doesn't seem too exciting. The thing is that I've been looking for a way to cleanly handle this case for a long, long time. I walked away in frustration because I got stuck on this problem. I just need to test this a bit more and we can call this done.
I was looking through older development notes to see if there was anything else left semi-implemented. I found a note to see if there can be any way to remove two jumps in sequence like jlt/jmp
Here are all the conditional jumps and the logic they use:
Conditional flags:
L - Logical greater than
A - Arithmetic greater than
Q - equal
Arithmetic:
< ~A & ~Q jlt
> A & ~Q jgt
== Q jeq
>= A | Q jgt,jeq \__ Ick, go away.
<= ~A | Q jlt,jeq /
!= ~Q jne
Logical:
< ~L & ~Q jl
> L & ~Q jh
== Q jeq
>= L | Q jhe
<= ~L | Q jle
!= ~Q jne
Unfortunately, there is no way to completely implement >= and <= in one instruction for all valid imputs, but if the range of the test values were known, we could occasionally replace the jgt,jeq sequence with single jhe instruction.
EX: For byte compares, if A=[0..127] and B=[0..127], then these sequences are equivalent:
jlt ADDR --> jle ADDR
jeq ADDR
jgt ADDR --> jhe ADDR
jeq ADDR-->
Unfortunately, here is no way for the compiler to recognize this situation. If the user knows that their range matches this, they can take advantage of the slightly faster code by doing an unsigned compare. Something like this works:
int fast_lt_eq(int a, int b)
{
return ((unsigned int)a <= (unsigned int)b);
}
What else...
Way back in June, TheMole on Atariage reported that they were unable to compile a function whose return value was assigned in an asm block and implicitly typecast as the retun value. No problems now.
So at this point, I think I'm up to date on error reports. These are the last items before I move on to new stuff:
1) Test 32-bit shift by variable, I'm not sure I checked that
2) Test assembler for Editor/Assembler adherance for caps and digits for registers, comments, register defines, etc.
3) Check signed vs. unsigned division and multiplication
Input C code:
unsigned char Round;
#define VDPWD *((volatile char*)0x8C00)
void modulo()
{
VDPWD = '0' + Round % 10;
}
Output assembly:
modulo:
movb @Round, r1
srl r1, 8
mov r1, r2
clr r1
li r3, >A div r3, r1
mov r2, r1
swpb r1 # instruction is right where it's supposed to be
ai r1, >3000
movb r1, @>8C00
b *r11
OK, I'll admit it. At first that doesn't seem too exciting. The thing is that I've been looking for a way to cleanly handle this case for a long, long time. I walked away in frustration because I got stuck on this problem. I just need to test this a bit more and we can call this done.
I was looking through older development notes to see if there was anything else left semi-implemented. I found a note to see if there can be any way to remove two jumps in sequence like jlt/jmp
Here are all the conditional jumps and the logic they use:
Conditional flags:
L - Logical greater than
A - Arithmetic greater than
Q - equal
Arithmetic:
< ~A & ~Q jlt
> A & ~Q jgt
== Q jeq
>= A | Q jgt,jeq \__ Ick, go away.
<= ~A | Q jlt,jeq /
!= ~Q jne
Logical:
< ~L & ~Q jl
> L & ~Q jh
== Q jeq
>= L | Q jhe
<= ~L | Q jle
!= ~Q jne
Unfortunately, there is no way to completely implement >= and <= in one instruction for all valid imputs, but if the range of the test values were known, we could occasionally replace the jgt,jeq sequence with single jhe instruction.
EX: For byte compares, if A=[0..127] and B=[0..127], then these sequences are equivalent:
jlt ADDR --> jle ADDR
jeq ADDR
jgt ADDR --> jhe ADDR
jeq ADDR-->
Unfortunately, here is no way for the compiler to recognize this situation. If the user knows that their range matches this, they can take advantage of the slightly faster code by doing an unsigned compare. Something like this works:
int fast_lt_eq(int a, int b)
{
return ((unsigned int)a <= (unsigned int)b);
}
What else...
Way back in June, TheMole on Atariage reported that they were unable to compile a function whose return value was assigned in an asm block and implicitly typecast as the retun value. No problems now.
So at this point, I think I'm up to date on error reports. These are the last items before I move on to new stuff:
1) Test 32-bit shift by variable, I'm not sure I checked that
2) Test assembler for Editor/Assembler adherance for caps and digits for registers, comments, register defines, etc.
3) Check signed vs. unsigned division and multiplication
Friday, August 22, 2014
No more procrastinating, this won't get any easier.
During the UNSPEC_SUBREG experiment I found the places in reload where these problem expressions are identified. If a word-byte conversion is detected, a swpb instruction is performed on the offending register. The embedded subreg will be converted to a word-sized register by the existing code, and we should be OK.
Here is an example instruction before modification:
(insn 9 8 10 2 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
(plus:QI (subreg:QI (reg:HI 25) 1)
(const_int 48 [0x30]))) 59 {addqi3} (expr_list:REG_DEAD (reg:HI 25)
(nil)))
And after modification:
(insn 22 8 9 2 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
(subreg:QI (reg:HI 25) 1)) {movqi3}
(insn 9 8 10 2 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
(plus:HI (reg:HI 25)
(const_int 48 [0x30]))) 59 {addqi3} (expr_list:REG_DEAD (reg:HI 25)
(nil)))
Unfortunately, I need to tweak this a bit more:
div r3, r1
mov r2, r1
swpb r1 # <-- br="" finally="" here="" instruction="" is="" this=""> swpb r1 # \_ These are extras added due to multiple passes, ick
swpb r1 # /
ai r1, >3000
movb r1, @>8C00
I also need to make sure that I add code that peoperly handles all cases with embedded subregs.
So we have a few conditions to handle:
Subreg as operation destination:
(operation) (subreg R1) (reg) ->
(operation) (R1) (reg)
swpb R1 # Move value into byte position
Subreg as operation source:
(operation) (reg) (subreg) ->
swpb R1 # Prepare R1 for next operation
(operation) (R1) (reg)
swpb R1 # If needed, restore R1 for later use
The first one will probably never happen, due to how the instructions are processed, but the second one is definitely causing problems. It seems to only appear when a word-to-byte conversion is needed, and the value is stored in a register and the register will not be needed after that instruction. Even though that sounds like a fairly rare circumstance, it seems to happen more often than I would like.
-->
During the UNSPEC_SUBREG experiment I found the places in reload where these problem expressions are identified. If a word-byte conversion is detected, a swpb instruction is performed on the offending register. The embedded subreg will be converted to a word-sized register by the existing code, and we should be OK.
Here is an example instruction before modification:
(insn 9 8 10 2 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
(plus:QI (subreg:QI (reg:HI 25) 1)
(const_int 48 [0x30]))) 59 {addqi3} (expr_list:REG_DEAD (reg:HI 25)
(nil)))
And after modification:
(insn 22 8 9 2 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
(subreg:QI (reg:HI 25) 1)) {movqi3}
(insn 9 8 10 2 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
(plus:HI (reg:HI 25)
(const_int 48 [0x30]))) 59 {addqi3} (expr_list:REG_DEAD (reg:HI 25)
(nil)))
Unfortunately, I need to tweak this a bit more:
div r3, r1
mov r2, r1
swpb r1 # <-- br="" finally="" here="" instruction="" is="" this=""> swpb r1 # \_ These are extras added due to multiple passes, ick
swpb r1 # /
ai r1, >3000
movb r1, @>8C00
I also need to make sure that I add code that peoperly handles all cases with embedded subregs.
So we have a few conditions to handle:
Subreg as operation destination:
(operation) (subreg R1) (reg) ->
(operation) (R1) (reg)
swpb R1 # Move value into byte position
Subreg as operation source:
(operation) (reg) (subreg) ->
swpb R1 # Prepare R1 for next operation
(operation) (R1) (reg)
swpb R1 # If needed, restore R1 for later use
The first one will probably never happen, due to how the instructions are processed, but the second one is definitely causing problems. It seems to only appear when a word-to-byte conversion is needed, and the value is stored in a register and the register will not be needed after that instruction. Even though that sounds like a fairly rare circumstance, it seems to happen more often than I would like.
-->
Wednesday, August 20, 2014
There's a few problems here. The obvious one is that the subreg expression is being thrown away. The not-so obvious one is that once that is fixed, we end up with an invalid instruction:
insn does not satisfy its constraints:
(insn 9 8 11 2 tursi5.c:6 (set (reg:QI 1 r1 [orig:21 D.1197 ] [21])
(plus:QI (subreg:QI (reg:HI 1 r1) 1)
(const_int 48 [0x30]))) 59 {addqi3} (nil))
What we need to do is split this into two expressions:
(set (reg:QI 1 r1)
(subreg:QI (reg:HI 1 r1) 1))
(set (reg:QI 1 r1)
(plus:QI 1 r1)
(const_int 48 [0x30]))) 59 {addqi3} (nil))
No problem, right? Well, maybe.
We also need to factor in the live-ness of the register. If in this case we had R2 as the output, we might have to find a temp register to not modify R1. Alternately, we might have to swap R1, do some work and swap it back. Terrible.
On top of that, we might have to implement instructions with every possible variation of arguments. This will quadruple the machine description and make a testing nightmare.
One way out of this particular problem is to make sure that there is a defined mode for every operation. Hopefully, that will minimize the number of these problems. It's just a hunch, but it's the best idea I've got.
insn does not satisfy its constraints:
(insn 9 8 11 2 tursi5.c:6 (set (reg:QI 1 r1 [orig:21 D.1197 ] [21])
(plus:QI (subreg:QI (reg:HI 1 r1) 1)
(const_int 48 [0x30]))) 59 {addqi3} (nil))
What we need to do is split this into two expressions:
(set (reg:QI 1 r1)
(subreg:QI (reg:HI 1 r1) 1))
(set (reg:QI 1 r1)
(plus:QI 1 r1)
(const_int 48 [0x30]))) 59 {addqi3} (nil))
No problem, right? Well, maybe.
We also need to factor in the live-ness of the register. If in this case we had R2 as the output, we might have to find a temp register to not modify R1. Alternately, we might have to swap R1, do some work and swap it back. Terrible.
On top of that, we might have to implement instructions with every possible variation of arguments. This will quadruple the machine description and make a testing nightmare.
One way out of this particular problem is to make sure that there is a defined mode for every operation. Hopefully, that will minimize the number of these problems. It's just a hunch, but it's the best idea I've got.
Monday, August 18, 2014
Even though I've been slacking off on the TI front, I've still been following the goings on at AtariAge. A common issue people are dealing with is that there a lot of error-prone steps to build GCC, causing some people to give up in frustration. This makes me sad.
In order to make things easier, I've made a script which bundles the patch files and does the download, patch and build sequence. This should make it much easier for people to get involved and to use any additional patches that come out.
In order to make things easier, I've made a script which bundles the patch files and does the download, patch and build sequence. This should make it much easier for people to get involved and to use any additional patches that come out.
Sunday, August 10, 2014
Apparently it's been a really long time since I've done TI stuff, but I haven't forgotten about it.
I was looking into the internal function "subst_reloads". This code modifies instructions to do register replacement and expression simplifications determined elsewhere. I thinking that I could use it to figure out how to restructure expressions involving word-byte conversion.
No dice. That function only replaces individual operands, typically replacing memory locations with hard registers. To do this properly, I'll need to muck around with the rtx records directly and manually restructure instructions. Needless to say, I'm not really excited to do this. Unfortunately, I need to find a solution here and all other attempts have failed.
I was looking into the internal function "subst_reloads". This code modifies instructions to do register replacement and expression simplifications determined elsewhere. I thinking that I could use it to figure out how to restructure expressions involving word-byte conversion.
No dice. That function only replaces individual operands, typically replacing memory locations with hard registers. To do this properly, I'll need to muck around with the rtx records directly and manually restructure instructions. Needless to say, I'm not really excited to do this. Unfortunately, I need to find a solution here and all other attempts have failed.
Subscribe to:
Posts (Atom)