I tried to compare the function and inline methods to see which is more efficient. Unfortunately, the variables that must be considered are hard to predict: the number of division and modulus calls, the ratio between them, and the number of registers that need to be saved around function calls.

With much hand waving and shady math, I determined that if there are five or fewer signed division calls, the inline method wins out.

Ultimately, it doesn't matter. I'm going to implement this in an inline form; if calling out to a function for division turns out to be more efficient, the user can write their code to do that. The reverse doesn't hold: if the function method were worse, there would be no good way for the user to inline the division themselves.

## Wednesday, August 27, 2014

## Tuesday, August 26, 2014

Testing division turned up three problems:

X/10 -> results in -1 for all negative X

X/(-10) -> results in -1 for all negative X, 0 for all positive X

N/X -> internal compiler error for all N

The compiler error looks like it will be a pain to track down, so I'm looking at the incorrect math first.

Here's a simple function I'm using for testing; assume a=-100, b=10:

```
# int divide(int a, int b) { return a/b; }
divide:              # R1=0xFF9C, R2=0xA
        mov  r1, r4  # R4=0xFF9C
        seto r3      # R3=0xFFFF
        jlt  $+>4    # Dividend negative: keep the sign extension
        clr  r3
        div  r2, r3  # [R3,R4]/R2 = 0xFFFFFF9C/0xA
        mov  r3, r1  # R3 holds quotient
        b    *r11
```

If DIV does signed division, we should see a result of 0xFFF6 (-10)

If DIV only does unsigned division, we should have a result of 0x998F. For some reason we see 0xFFFF, weird.

Nope, not weird, I just missed a really important note in the data sheet: "... if the divisor is less than the data contained in Rd ... Rd and Rd+1 are unchanged." I think this is the second time I've really messed up division. Anyway, DIV apparently only does unsigned division, so I need to think about this a bit.
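To pin down the behavior, here's a C model of what DIV seems to do, based on my reading of the data sheet (a sketch only; the names reg_pair and div_model are made up for illustration):

```c
#include <stdint.h>

/* Sketch of TMS9900 DIV semantics: divide the 32-bit unsigned value in
   [Rd,Rd+1] by a 16-bit unsigned divisor. If the divisor is not greater
   than the high word, the 16-bit quotient would overflow, so Rd and
   Rd+1 are left unchanged -- the case I just tripped over. */
typedef struct { uint16_t hi, lo; } reg_pair;   /* [Rd, Rd+1] */

int div_model(reg_pair *rd, uint16_t divisor)
{
    if (divisor <= rd->hi)      /* quotient would not fit in 16 bits */
        return 1;               /* overflow: registers unchanged */
    uint32_t dividend = ((uint32_t)rd->hi << 16) | rd->lo;
    rd->hi = (uint16_t)(dividend / divisor);   /* quotient  -> Rd   */
    rd->lo = (uint16_t)(dividend % divisor);   /* remainder -> Rd+1 */
    return 0;
}
```

With [R3,R4] = 0xFFFFFF9C and R2 = 0xA, the divisor (0xA) is smaller than the high word (0xFFFF), so nothing happens and R3 keeps its 0xFFFF.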

The unsigned case needs no extra work:

```
div_unsigned:        # return A/B, R1=A, R2=B
        clr  r0      # Clear high word of [R0,R1]
        div  r2, r0  # R0 = [R0,R1]/R2
        mov  r0, r1  # Move result to return position
        b    *r11    # Return result
```

Probably the best solution here is to find the sign of the quotient, convert both operands to their absolute value, divide, then set the sign of the result. No other approach I've looked at would be smaller or faster than this.
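In C terms the plan looks like this (a model to check the assembly against, not generated code; div_signed_model is a made-up name, and edge cases like INT16_MIN/-1 are glossed over):

```c
#include <stdint.h>

/* The quotient is negative exactly when the operand signs differ
   (the assembly reads this from the sign bit of a^b). Divide the
   absolute values with the unsigned DIV, then patch the sign in. */
int16_t div_signed_model(int16_t a, int16_t b)
{
    int negative = (a < 0) != (b < 0);             /* signs differ? */
    uint16_t ua = (a < 0) ? -(uint16_t)a : (uint16_t)a;
    uint16_t ub = (b < 0) ? -(uint16_t)b : (uint16_t)b;
    uint16_t q  = ua / ub;                         /* unsigned divide */
    return negative ? -(int16_t)q : (int16_t)q;
}
```

The assembly version: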

```
div_signed:          # return A/B, R1=A, R2=B
        mov  r1, r3
        xor  r2, r3  # Calculate sign of result
        abs  r1      # Make arguments positive
        abs  r2
        clr  r0      # Clear high word
        div  r2, r0  # R0 = [R0,R1]/R2
        inv  r3
        jlt  posval
        neg  r0      # Negate result if necessary
posval: mov  r0, r1  # Move result to return position
        b    *r11    # Return result
```

22 bytes

OK, what about modulo? As of the C99 and C++11 standards, the sign of the modulo must match that of the dividend, as the examples and the quick check below show.

Examples:

```
 103 %  10 =  3
 103 % -10 =  3
-103 %  10 = -3
-103 % -10 = -3
```
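These all follow from the identity the standards pin down for truncating division: (a/b)*b + a%b == a. A quick throwaway check on the host machine (nothing TMS9900-specific):

```c
#include <stdio.h>

/* With division truncating toward zero, (a/b)*b + a%b == a forces the
   remainder to carry the sign of the dividend. */
int main(void)
{
    int pairs[][2] = { {103, 10}, {103, -10}, {-103, 10}, {-103, -10} };
    for (int i = 0; i < 4; i++) {
        int a = pairs[i][0], b = pairs[i][1];
        printf("%4d %% %3d = %3d   identity holds: %d\n",
               a, b, a % b, (a / b) * b + a % b == a);
    }
    return 0;
}
```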

For the unsigned case, we can easily modify the division function: the modulo is already sitting in R1, so we just return it. For the signed case it gets trickier: we need to save the sign of the dividend somewhere and apply it after the calculation is complete.

```
mod_unsigned:        # return A%B, R1=A, R2=B
        clr  r0      # Clear high word of [R0,R1]
        div  r2, r0  # R1 = [R0,R1]%R2, return value already in position
        b    *r11    # Return result
```

```
mod_signed:          # return A%B, R1=A, R2=B
        mov  r1, r3  # Save unmodified dividend
        abs  r1      # Make arguments positive
        abs  r2
        clr  r0      # Clear high word
        div  r2, r0  # R1 = [R0,R1]%R2
        inv  r3
        jlt  posval
        neg  r1      # Negate result if needed, already in position
posval: b    *r11    # Return result
```

18 bytes

Is there any way to do a signed divmod by combining these two, without a lot of extra work?

```
divmod_signed:       # return A/B and A%B, R1=A, R2=B
        mov  r1, r3
        xor  r2, r3  # Calculate sign of quotient
        mov  r1, r4  # Save unmodified dividend
        abs  r1      # Make arguments positive
        abs  r2
        clr  r0      # Clear high word
        div  r2, r0  # R0 = [R0,R1]/R2, R1 = [R0,R1]%R2
        inv  r3
        jlt  posdiv
        neg  r0      # Negate quotient if needed
posdiv: inv  r4
        jlt  posmod
        neg  r1      # Negate modulus if needed
posmod: mov  r0, r1  # Move result to return position
        b    *r11    # Return result
```

30 bytes, 5 registers

This is really bulky and ugly, but ultimately better than two separate calls: about 25% smaller and maybe 75% faster. It might still make sense to move this into a library call, since with a lot of math the inline code could get really big, and the function-call overhead seems fairly small. I need to do the math and get real numbers, though.
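For reference, here's the combined logic as a C model to test against (a sketch; divmod16 and the function name are made up):

```c
#include <stdint.h>

/* One unsigned divide; the quotient takes the sign of a^b and the
   remainder takes the sign of the dividend, as in divmod_signed. */
typedef struct { int16_t quot, rem; } divmod16;

divmod16 divmod_signed_model(int16_t a, int16_t b)
{
    uint16_t ua = (a < 0) ? -(uint16_t)a : (uint16_t)a;
    uint16_t ub = (b < 0) ? -(uint16_t)b : (uint16_t)b;
    divmod16 r = { (int16_t)(ua / ub), (int16_t)(ua % ub) };
    if ((a < 0) != (b < 0)) r.quot = -r.quot;  /* quotient sign: a^b */
    if (a < 0)              r.rem  = -r.rem;   /* remainder sign: a  */
    return r;
}
```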

## Monday, August 25, 2014

I made a simple function to emit a variable shift instruction (EX: sla r12, r0). While looking at the edge cases, I realized that if the variable shift count is zero, the instruction actually shifts by 16 bits. The C standard requires that no shift be performed in that case. This is a problem.

I took a look at the shift instructions for ARM and x86, and they shift from 0 to N-1 bits for an N-bit value. The TI instructions instead shift from 1 to N bits. Hmm..

So what I need to do is insert a check every time a shift of this type is called for. Something like this:

```
# Left shift. Shift count in R0, value in R1
        andi r0, >000F  # Mask shift count, check for zero
        jeq  $+4        # If zero, jump over the shift
        sla  r1, r0     # Shift by the non-zero count
```

I'm not really happy about this: it adds 6 bytes and quite a few clocks to every use. Unfortunately, I don't have a choice if I want to emit correct code.
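For what it's worth, the emitted check behaves like this C sketch (semantics only; shift_left_var is a made-up name, not compiler code):

```c
#include <stdint.h>

/* Mask the count to 0..15, then skip the shift when the masked count
   is zero, since the hardware treats a count of 0 as a shift of 16. */
uint16_t shift_left_var(uint16_t value, uint16_t count)
{
    count &= 0x000F;        /* andi r0, >000F */
    if (count == 0)         /* jeq  $+4       */
        return value;
    return (uint16_t)(value << count);   /* sla r1, r0 */
}
```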

After testing, this is now working properly.

## Sunday, August 24, 2014

Input C code:

```c
unsigned char Round;

#define VDPWD *((volatile char*)0x8C00)

void modulo()
{
    VDPWD = '0' + Round % 10;
}
```

Output assembly:

```
modulo:
        movb @Round, r1
        srl  r1, 8
        mov  r1, r2
        clr  r1
        li   r3, >A
        div  r3, r1
        mov  r2, r1
        swpb r1          # Instruction is right where it's supposed to be
        ai   r1, >3000
        movb r1, @>8C00
        b    *r11
```

OK, I'll admit it. At first that doesn't seem too exciting. The thing is that I've been looking for a way to cleanly handle this case for a long, long time. I walked away in frustration because I got stuck on this problem. I just need to test this a bit more and we can call this done.

I was looking through older development notes to see if there was anything else left semi-implemented. I found a note to check whether there is any way to remove two jumps in sequence, like jlt/jmp.

Here are all the conditional jumps and the logic they use:

```
Conditional flags:
  L - Logical greater than
  A - Arithmetic greater than
  Q - Equal

Arithmetic:
  <    ~A & ~Q    jlt
  >     A & ~Q    jgt
  ==         Q    jeq
  >=     A | Q    jgt,jeq  \__ Ick, go away.
  <=    ~A | Q    jlt,jeq  /
  !=        ~Q    jne

Logical:
  <    ~L & ~Q    jl
  >     L & ~Q    jh
  ==         Q    jeq
  >=     L | Q    jhe
  <=    ~L | Q    jle
  !=        ~Q    jne
```

Unfortunately, there is no way to completely implement >= and <= in one instruction for all valid inputs, but if the range of the test values were known, we could occasionally replace the jgt,jeq sequence with a single jhe instruction (and jlt,jeq with jle).

EX: For byte compares, if A=[0..127] and B=[0..127], then these sequences are equivalent:

```
jlt ADDR --> jle ADDR
jeq ADDR

jgt ADDR --> jhe ADDR
jeq ADDR
```

Unfortunately, there is no way for the compiler to recognize this situation. If the user knows that their values fall in this range, they can take advantage of the slightly faster code by doing an unsigned compare. Something like this works:

```c
int fast_lt_eq(int a, int b)
{
    return ((unsigned int)a <= (unsigned int)b);
}
```

What else...

Way back in June, TheMole on Atariage reported that they were unable to compile a function whose return value was assigned in an asm block and implicitly typecast as the return value. No problems now.

So at this point, I think I'm up to date on error reports. These are the last items before I move on to new stuff:

1) Test 32-bit shift by a variable count; I'm not sure I checked that

2) Test the assembler's adherence to Editor/Assembler conventions for caps and digits in registers, comments, register defines, etc.

3) Check signed vs. unsigned division and multiplication

## Friday, August 22, 2014

During the UNSPEC_SUBREG experiment I found the places in reload where these problem expressions are identified. If a word-byte conversion is detected, a swpb instruction is performed on the offending register. The embedded subreg will be converted to a word-sized register by the existing code, and we should be OK.

Here is an example instruction before modification:

```
(insn 9 8 10 2 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
        (plus:QI (subreg:QI (reg:HI 25) 1)
            (const_int 48 [0x30]))) 59 {addqi3} (expr_list:REG_DEAD (reg:HI 25)
        (nil)))
```

And after modification:

```
(insn 22 8 9 2 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
        (subreg:QI (reg:HI 25) 1)) {movqi3}

(insn 9 8 10 2 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
        (plus:HI (reg:HI 25)
            (const_int 48 [0x30]))) 59 {addqi3} (expr_list:REG_DEAD (reg:HI 25)
        (nil)))
```

Unfortunately, I need to tweak this a bit more:

```
        div  r3, r1
        mov  r2, r1
        swpb r1          # <-- This instruction is finally here
        swpb r1          # \_ These are extras added due to multiple passes, ick
        swpb r1          # /
        ai   r1, >3000
        movb r1, @>8C00
```

I also need to make sure that I add code that properly handles all cases with embedded subregs.

So we have a few conditions to handle:

Subreg as operation destination:

```
(operation) (subreg R1) (reg)
->
(operation) (R1) (reg)
swpb R1        # Move value into byte position
```

Subreg as operation source:

```
(operation) (reg) (subreg)
->
swpb R1        # Prepare R1 for next operation
(operation) (R1) (reg)
swpb R1        # If needed, restore R1 for later use
```

The first one will probably never happen, due to how the instructions are processed, but the second one is definitely causing problems. It seems to appear only when a word-to-byte conversion is needed, the value is stored in a register, and that register is not needed after the instruction. Even though that sounds like a fairly rare circumstance, it happens more often than I would like.

## Wednesday, August 20, 2014

insn does not satisfy its constraints:

```
(insn 9 8 11 2 tursi5.c:6 (set (reg:QI 1 r1 [orig:21 D.1197 ] [21])
        (plus:QI (subreg:QI (reg:HI 1 r1) 1)
            (const_int 48 [0x30]))) 59 {addqi3} (nil))
```

What we need to do is split this into two expressions:

```
(set (reg:QI 1 r1)
        (subreg:QI (reg:HI 1 r1) 1))

(set (reg:QI 1 r1)
        (plus:QI (reg:QI 1 r1)
            (const_int 48 [0x30]))) 59 {addqi3} (nil))
```

No problem, right? Well, maybe.

We also need to factor in the liveness of the register. If in this case we had R2 as the output, we might have to find a temp register to avoid modifying R1. Alternatively, we might have to swap R1, do some work, and swap it back. Terrible.

On top of that, we might have to implement instructions with every possible variation of arguments. This would quadruple the machine description and make testing a nightmare.

One way out of this particular problem is to make sure that there is a defined mode for every operation. Hopefully, that will minimize the number of these problems. It's just a hunch, but it's the best idea I've got.

## Monday, August 18, 2014

In order to make things easier, I've made a script which bundles the patch files and does the download, patch and build sequence. This should make it much easier for people to get involved and to use any additional patches that come out.

## Sunday, August 10, 2014

I was looking into the internal function "subst_reloads". This code modifies instructions to perform the register replacements and expression simplifications determined elsewhere. I was thinking I could use it to figure out how to restructure expressions involving word-byte conversion.

No dice. That function only replaces individual operands, typically replacing memory locations with hard registers. To do this properly, I'll need to muck around with the rtx records directly and manually restructure instructions. Needless to say, I'm not really excited to do this. Unfortunately, I need to find a solution here and all other attempts have failed.
