Wednesday, August 27, 2014

I tried to compare the function and inline methods to see which is the most efficient. Unfortunately, the variables which must be considered are not easy to predict (number of and ratio between division and modulus calls, number of registers to save around function calls).

With much hand waving and shady math, I determined that if there are five or fewer signed division calls, the inline method wins out.

Ultimately, it doesn't matter. I am going to implement this in inline form, and if calling out to a function for division turns out to be more efficient, the user can write their code that way. The reverse doesn't work: there is no good way for the user to inline division if the function method turns out to be worse.

Tuesday, August 26, 2014

Apparently, division has problems too. Here are the failed scenarios:

X/10    -> Results in -1 for all negative X
X/(-10) -> Results in -1 for all negative X, 0 for all positive X
N/X     -> Internal compiler error for all N

The compiler error looks like it will be a pain to track down, so I'm looking at the incorrect math first.

Here's a simple function I'm using for testing, assume a=-100, b=10:

# int divide(int a, int b) {return a/b;}
divide:               # R1=0xFF9B R2=0xA
        mov  r1, r4   # R4=0xFF9B
        seto r3       # R3=0xFFFF
        jlt  $+>4
        clr  r3      
        div  r2, r3   # [R3,R4]/R2 = 0xFFFFFF9B/0xA
        mov  r3, r1   # R3 holds quotient
        b    *r11

If DIV does signed division, we should see a result of 0xFFF5 (-10)
If DIV only does unsigned division, we should have a result of 0x998F. For some reason we see 0xFFFF, weird.

Nope, not weird, I just missed a really important note in the data sheet: "... if the divisor is less than the data contained in Rd ... Rd and Rd+1 are unchanged." I think this is the second time I've really messed up division. Anyway, DIV apparently only does unsigned division, so I need to think about this a bit.
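As a sanity check on the numbers above, here's the same arithmetic done in plain C, using nothing but the language's own unsigned operators (helper names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* The full 32/16 unsigned divide from the example: 0xFFFFFF9B / 0xA.
   The true quotient doesn't fit in 16 bits, which is exactly the
   "divisor less than the high word" overflow case where DIV leaves
   Rd and Rd+1 unchanged -- hence the 0xFFFF we observed in R3. */
uint32_t full_quotient(void) { return 0xFFFFFF9Bu / 0xAu; }

/* Low 16 bits of the quotient: the 0x998F we expected to see. */
uint16_t low_word(uint32_t x) { return (uint16_t)(x & 0xFFFFu); }
```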

div_unsigned:         # return A/B, R1=A, R2=B
        clr  r0       # Clear high word of [R0,R1]
        div  r2, r0   # R0 = [R0,R1]/R2
        mov  r0, r1   # Move result to return position
        b    *r11     # Return result

Probably the best solution here is to find the sign of the quotient, convert both operands to their absolute value, divide, then set the sign of the result. No other approach I've looked at would be smaller or faster than this.

div_signed:           # return A/B, R1=A, R2=B
        mov r1, r3
        xor r2, r3    # Calculate sign of result
        abs r1        # Make arguments positive
        abs r2
        clr r0        # Clear high word
        div r2, r0    # R0 = [R0,R1]/R2
        inv r3
        jlt posval
        neg r0        # Negate result if necessary
posval: mov r0, r1    # Move result to return position
        b   *r11      # Return result
  22 bytes
       
OK, what about modulo? As of the C 1999 and C++ 2011 standards, the sign of the modulo must match that of the dividend.

Examples:
 103% 10 =  3
 103%-10 =  3
-103% 10 = -3
-103%-10 = -3
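These follow from truncating division; a trivial check using C's own operator (helper name is mine):

```c
#include <assert.h>

/* C99 division truncates toward zero, so the remainder takes the
   sign of the dividend. This just evaluates the language's own
   % operator for the examples above. */
int c_mod(int a, int b) { return a % b; }
```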

For the unsigned case, we can easily modify the division function by returning the modulo stored in R1. For the signed case, it gets trickier. We need to save the sign of the dividend somewhere and use it after the calculation is complete.

mod_unsigned:         # return A%B, R1=A, R2=B
        clr  r0       # Clear high word of [R0,R1]
        div  r2, r0   # R1 = [R0,R1]%R2, return value already in position
        b    *r11     # Return result

mod_signed:           # return A%B, R1=A, R2=B
        mov r1, r3    # Save unmodified dividend
        abs r1        # Make arguments positive
        abs r2
        clr r0        # Clear high word
        div r2, r0    # R1 = [R0,R1]%R2
        inv r3
        jlt posval
        neg r1        # Negate result if needed, already in position
posval: b   *r11      # Return result
  18 bytes

Any way to do a signed divmod, combining these two? Without lots of extra work?

divmod_signed:        # return A/B and A%B, R1=A, R2=B
        mov r1, r3
        xor r2, r3    # Calculate sign of quotient
        mov r1, r4    # Save unmodified dividend
        abs r1        # Make arguments positive
        abs r2
        clr r0        # Clear high word
        div r2, r0    # R0 = [R0,R1]/R2
        inv r3
        jlt posdiv
        neg r0        # Negate quotient if needed
posdiv: inv r4
        jlt posmod
        neg r1        # Negate modulus if needed
posmod: mov r1, r2    # Remainder to second return position
        mov r0, r1    # Quotient to return position
        b   *r11      # Return results
  32 bytes, 5 registers

This is really bulky and ugly, but ultimately better than two separate calls: about 25% smaller and maybe 75% faster. It might make sense to move this into a library call, because if there is a lot of math the code could get really big. The overhead for the function call seems fairly small too. I need to do the math and get real numbers, though.
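For reference, here's a C model of what the assembly above is doing (names are mine; it ignores the INT_MIN edge case, much as ABS does on the hardware):

```c
#include <assert.h>

/* One unsigned divide plus sign fixups: the quotient's sign is the
   XOR of the operand signs, and the remainder's sign follows the
   dividend. */
typedef struct { int quot; int rem; } divmod_t;

divmod_t divmod_signed(int a, int b) {
    unsigned int ua = (unsigned int)(a < 0 ? -a : a);  /* abs r1 */
    unsigned int ub = (unsigned int)(b < 0 ? -b : b);  /* abs r2 */
    divmod_t r;
    r.quot = (int)(ua / ub);
    r.rem  = (int)(ua % ub);
    if ((a ^ b) < 0) r.quot = -r.quot;  /* sign from xor r2, r3 */
    if (a < 0)       r.rem  = -r.rem;   /* sign of saved dividend */
    return r;
}
```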

Monday, August 25, 2014

I think I found a problem in the shift instructions.

I made a simple function to emit a variable shift instruction (EX: sla r12, r0). I was looking at the edge cases, and realized that if the variable shift value is zero, the instruction will actually shift by 16 bits. The C standard requires that in this case, no shift is to be performed. This is a problem.

I took a look at the shift instructions for ARM and x86; they shift from 0 to N-1 bits for an N-bit value. The TI instructions instead shift from 1 to N bits. Hmm..

So what I need to do is insert a check every time that a shift of this type is called for. Something like this:

  # Left shift. Shift count in R0, value in R1
  andi r0, >000F  # Mask shift value, check for zero count
  jeq $+4         # If zero, jump over shift
  sla r1, r0      # Shift by non-zero bits

I'm not really happy about this, it adds 6 bytes for every use and lots of clocks. Unfortunately, I don't have a choice if I want to emit correct code.
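In C terms, the emitted sequence behaves like this sketch (function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the guard above: mask the count to 0..15, and skip the
   shift entirely when the masked count is zero, since SLA treats a
   count register of 0 as a shift by 16. */
uint16_t shift_left(uint16_t value, uint16_t count) {
    count &= 0x000F;                    /* andi r0, >000F */
    if (count == 0)                     /* jeq  $+4       */
        return value;
    return (uint16_t)(value << count);  /* sla  r1, r0    */
}
```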

After testing, this is now working properly.

Sunday, August 24, 2014

Hey, check it out. I got correct code finally:

Input C code:

unsigned char Round;
#define VDPWD *((volatile char*)0x8C00)
void modulo()
{
  VDPWD = '0' + Round % 10;
}

Output assembly:

modulo:
        movb @Round, r1
        srl  r1, 8
        mov  r1, r2
        clr  r1
        li   r3, >A
        div  r3, r1
        mov  r2, r1
        swpb r1  # instruction is right where it's supposed to be
        ai   r1, >3000
        movb r1, @>8C00
        b    *r11

OK, I'll admit it. At first that doesn't seem too exciting. The thing is that I've been looking for a way to cleanly handle this case for a long, long time. I walked away in frustration because I got stuck on this problem. I just need to test this a bit more and we can call this done.

I was looking through older development notes to see if there was anything else left semi-implemented. I found a note to see if there is any way to remove two jumps in sequence, like jlt/jmp.

Here are all the conditional jumps and the logic they use:

Conditional flags:
  L - Logical greater than
  A - Arithmetic greater than
  Q - equal

Arithmetic:
<   ~A & ~Q  jlt
>    A & ~Q  jgt
==   Q       jeq
>=   A | Q   jgt,jeq  \__ Ick, go away.
<=  ~A | Q   jlt,jeq  /
!=  ~Q       jne

Logical:
<   ~L & ~Q  jl
>    L & ~Q  jh
==   Q       jeq
>=   L | Q   jhe
<=  ~L | Q   jle
!=  ~Q       jne

Unfortunately, there is no way to completely implement >= and <= in one instruction for all valid inputs, but if the range of the test values were known, we could occasionally replace the jgt,jeq sequence with a single jhe instruction.

EX: For byte compares, if A=[0..127] and B=[0..127], then these sequences are equivalent:

jlt ADDR  --> jle ADDR
jeq ADDR

jgt ADDR  --> jhe ADDR
jeq ADDR

Unfortunately, there is no way for the compiler to recognize this situation. If the user knows that their values fall in this range, they can take advantage of the slightly faster code by doing an unsigned compare. Something like this works:

int fast_lt_eq(int a, int b)
{
  return ((unsigned int)a <= (unsigned int)b);
}
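And a quick brute-force check of the range claim, in plain C (function name is mine):

```c
#include <assert.h>

/* Verify that for operands in [0..127], signed and unsigned compares
   always agree, which is what makes the jle/jhe substitution safe. */
int compares_agree(void) {
    for (int a = 0; a <= 127; a++)
        for (int b = 0; b <= 127; b++) {
            if ((a <= b) != ((unsigned int)a <= (unsigned int)b)) return 0;
            if ((a >= b) != ((unsigned int)a >= (unsigned int)b)) return 0;
        }
    return 1;
}
```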

What else...

Way back in June, TheMole on AtariAge reported that they were unable to compile a function whose return value was assigned in an asm block and implicitly typecast as the return value. No problems now.

So at this point, I think I'm up to date on error reports. These are the last items before I move on to new stuff:

1) Test 32-bit shift by variable, I'm not sure I checked that
2) Test assembler for Editor/Assembler adherence for caps and digits for registers, comments, register defines, etc.
3) Check signed vs. unsigned division and multiplication

Friday, August 22, 2014

No more procrastinating, this won't get any easier.

During the UNSPEC_SUBREG experiment I found the places in reload where these problem expressions are identified. If a word-byte conversion is detected, a swpb instruction is performed on the offending register. The embedded subreg will be converted to a word-sized register by the existing code, and we should be OK.

Here is an example instruction before modification:

(insn 9 8 10 2 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
        (plus:QI (subreg:QI (reg:HI 25) 1)
            (const_int 48 [0x30]))) 59 {addqi3} (expr_list:REG_DEAD (reg:HI 25)
        (nil)))

And after modification:

(insn 22 8 9 2 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
        (subreg:QI (reg:HI 25) 1)) {movqi3}
(insn 9 8 10 2 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
        (plus:HI (reg:HI 25)
            (const_int 48 [0x30]))) 59 {addqi3} (expr_list:REG_DEAD (reg:HI 25)
        (nil)))

Unfortunately, I need to tweak this a bit more:

        div  r3, r1
        mov  r2, r1
        swpb r1  # <-- This instruction is finally here
        swpb r1  # \_ These are extras added due to multiple passes, ick
        swpb r1  # /
        ai   r1, >3000
        movb r1, @>8C00

I also need to make sure that I add code that properly handles all cases with embedded subregs.

So we have a few conditions to handle:

  Subreg as operation destination:
  (operation) (subreg R1) (reg) ->
    (operation) (R1) (reg)
    swpb R1 # Move value into byte position

  Subreg as operation source:
  (operation) (reg) (subreg) ->
    swpb R1 # Prepare R1 for next operation
    (operation) (R1) (reg)
    swpb R1 # If needed, restore R1 for later use

The first one will probably never happen, due to how the instructions are processed, but the second one is definitely causing problems. It seems to appear only when a word-to-byte conversion is needed, the value is stored in a register, and the register is not needed after that instruction. Even though that sounds like a fairly rare circumstance, it happens more often than I would like.

Wednesday, August 20, 2014

There are a few problems here. The obvious one is that the subreg expression is being thrown away. The not-so-obvious one is that once that is fixed, we end up with an invalid instruction:

insn does not satisfy its constraints:
(insn 9 8 11 2 tursi5.c:6 (set (reg:QI 1 r1 [orig:21 D.1197 ] [21])
        (plus:QI (subreg:QI (reg:HI 1 r1) 1)
            (const_int 48 [0x30]))) 59 {addqi3} (nil))

What we need to do is split this into two expressions:
(set (reg:QI 1 r1)
        (subreg:QI (reg:HI 1 r1) 1))
(set (reg:QI 1 r1)
        (plus:QI (reg:QI 1 r1)
            (const_int 48 [0x30])))

No problem, right? Well, maybe.

We also need to factor in the liveness of the register. If in this case we had R2 as the output, we might have to find a temp register to avoid modifying R1. Alternately, we might have to swap R1, do some work, and swap it back. Terrible.

On top of that, we might have to implement instructions with every possible variation of arguments. That would quadruple the machine description and create a testing nightmare.

One way out of this particular problem is to make sure that there is a defined mode for every operation. Hopefully, that will minimize the number of these problems. It's just a hunch, but it's the best idea I've got.

Monday, August 18, 2014

Even though I've been slacking off on the TI front, I've still been following the goings-on at AtariAge. A common issue is that there are a lot of error-prone steps to build GCC, causing some people to give up in frustration. This makes me sad.

In order to make things easier, I've made a script which bundles the patch files and does the download, patch and build sequence. This should make it much easier for people to get involved and to use any additional patches that come out.

Sunday, August 10, 2014

Apparently it's been a really long time since I've done TI stuff, but I haven't forgotten about it.

I was looking into the internal function "subst_reloads". This code modifies instructions to apply the register replacements and expression simplifications determined elsewhere. I was thinking I could use it to figure out how to restructure expressions involving word-byte conversion.

No dice. That function only replaces individual operands, typically replacing memory locations with hard registers. To do this properly, I'll need to muck around with the rtx records directly and manually restructure instructions. Needless to say, I'm not really excited to do this. Unfortunately, I need to find a solution here and all other attempts have failed.