Friday, August 26, 2011

Here's an example of using the wrong byte:

movb r9, @>8C02 -> should be copying low byte
mov r9, r2
ori r2, >4000
movb r2, @>8C02

I'm not proud of the fix I made, but here it is:

There is a function named "reload_inner_reg_of_subreg" which accepted a subreg expression on the assumption that the byte is stored in the low byte of a register. I changed that function to force a reload in that case. So now we see code like this:

mov r9, r3
mov r3, r2 <-- Unnecessary MOV
swpb r2
movb r2, @>8C02
mov r9, r2
ori r2, >4000
movb r2, @>8C02

This will work, but there is an unnecessary "mov r3, r2" instruction. What's going on here is that we are reallocating the register that the subreg refers to (in this example, R3 is allocated instead of using R9). During the register reallocation process, we find that we need to do extra work to get the byte value, which causes the "swpb" instruction to be emitted.
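To make the byte-position problem concrete, here is a minimal C model of the two instructions involved. The function names are made up for illustration; the only facts assumed are that MOVB with a register source stores the register's high byte, and that SWPB exchanges the two bytes of a register.

```c
#include <stdint.h>

/* Hypothetical model of "movb Rn, @addr": the byte written to memory
   is the HIGH byte of the source register. */
static uint8_t movb_from_reg(uint16_t reg)
{
    return (uint8_t)(reg >> 8);
}

/* Model of "swpb Rn": exchange the high and low bytes. */
static uint16_t swpb(uint16_t reg)
{
    return (uint16_t)((reg << 8) | (reg >> 8));
}

/* Storing the LOW byte of a word therefore needs a swap first,
   which is what the reloaded sequence above does. */
static uint8_t store_low_byte(uint16_t reg)
{
    return movb_from_reg(swpb(reg));
}
```

With a register holding >1234, a bare MOVB writes >12, while the swapped sequence writes >34.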

I think I can find a better way to fix this.

Wednesday, August 24, 2011

I was looking at equates in the assembler, and noticed that symbols used before they are defined end up with swapped bytes.

Example:
inc @addr
addr equ >1234
inc @addr

Ends up being assembled to:
inc @>3412
inc @>1234

Since the equate value is not known at the time the first reference is encountered, an internal fixup record is created for later evaluation. Unfortunately, md_apply_fix() used the wrong endianness when resolving these internal fixups. This was most likely an oversight when adapting code written for other targets to the TMS9900.
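Here is a small C sketch of why the wrong byte order produces exactly the >3412 seen above. This is not the actual md_apply_fix() code; all of the helper names are hypothetical, and the only assumption is that the TMS9900 fetches words most significant byte first.

```c
#include <stdint.h>

/* Store a 16-bit value most significant byte first (big-endian),
   the order the TMS9900 expects. */
static void put16_big_endian(uint8_t *buf, uint16_t value)
{
    buf[0] = (uint8_t)(value >> 8);
    buf[1] = (uint8_t)(value & 0xFF);
}

/* Store a 16-bit value least significant byte first (little-endian). */
static void put16_little_endian(uint8_t *buf, uint16_t value)
{
    buf[0] = (uint8_t)(value & 0xFF);
    buf[1] = (uint8_t)(value >> 8);
}

/* Read two bytes the way the CPU would fetch a word. */
static uint16_t fetch_word(const uint8_t *buf)
{
    return (uint16_t)((buf[0] << 8) | buf[1]);
}

/* Hypothetical: what the CPU sees when a fixup is written little-endian. */
static uint16_t seen_after_wrong_fixup(uint16_t value)
{
    uint8_t buf[2];
    put16_little_endian(buf, value);
    return fetch_word(buf);
}

/* Hypothetical: what the CPU sees when a fixup is written big-endian. */
static uint16_t seen_after_right_fixup(uint16_t value)
{
    uint8_t buf[2];
    put16_big_endian(buf, value);
    return fetch_word(buf);
}
```

Feeding >1234 through the little-endian path gives >3412, matching the bad output for the forward reference.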

I also ran across do_org, which can assign a current address to assembled code. This allows skipping over memory in the code section. This is tempting to use for AORG, but I don't think that will work by itself. I need to think about this some more.

I've also added a new constant type to allow SBO, SBZ and TB to work properly.

Tuesday, August 23, 2011

I've got my V9T9 disk tool almost done. I'm missing a few FIB fields when working with data files, but this should be easy to fix later.

I got a bug report that STST is causing problems in the assembler. The problem with STST is that I accidentally configured GAS to look for two arguments, when the instruction only takes one. If someone attempts to use the instruction properly, the assembler complains and emits an error. This is an easy one-line fix.

It also looks like the CRU instructions need some attention. SBO, SBZ and TB are set to use a constant in the same way as JMP, which doesn't seem right.
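For reference, here is a rough C sketch of why the two operand kinds should not share one constant type. It assumes (from my reading of the TMS9900 instruction formats, not from the GAS source) that both put a signed 8-bit field in the low byte of the opcode word, that a jump field is a word offset relative to the address after the instruction, and that a CRU field is the bit displacement taken literally. The function names and the opcode bases used in the notes (>1000 for JMP, >1D00 for SBO) are stated from memory.

```c
#include <stdint.h>

/* Jump: the field is the distance in WORDS from the address following
   the instruction to the target. */
static uint16_t encode_jump(uint16_t opcode, uint16_t insn_addr, uint16_t target)
{
    int disp = ((int)target - ((int)insn_addr + 2)) / 2;
    return (uint16_t)(opcode | (uint8_t)disp);
}

/* CRU single-bit instruction: the field is the signed bit displacement
   itself, with no PC adjustment and no scaling. */
static uint16_t encode_cru(uint16_t opcode, int8_t bit_disp)
{
    return (uint16_t)(opcode | (uint8_t)bit_disp);
}
```

Under these assumptions, a JMP at >0100 targeting >0110 encodes as >1007, while "sbo 5" encodes as >1D05. Treating the CRU operand like a jump target would subtract the current address and halve the result, which is clearly not what we want.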

It's late, so that will have to wait for tomorrow.

There was also a report that late EQUs were not being evaluated properly if referenced before the assignment.

Also, there's a typecast bug I have yet to look at.

Tuesday, August 16, 2011

I was looking at 32-bit division, and I was trying to find a way to take advantage of the 16-bit DIV instruction. Here's what I came up with:

X: numerator
Y: denominator
Q: ratio
N: 16-bit radix (2^16)

X/Y = Q
X = A*N+B
Y = C*N+D
N = 1<<16

Replace terms:
(A*N+B)/(C*N+D) = Q

B will be lost due to significant figures
Multiply denominator by ((1/C)/(1/C)), this eliminates 32-bit division:
(A*N)/((N+D/C)*C) = Q

Multiply numerator by ((1/(N+D/C))/(1/(N+D/C))):
((A*N)/(N+D/C))/C = Q

Multiply numerator by ((1/2)/(1/2)), this ensures all partial terms fit into a 16-bit quantity:
((A*N/2)/((N+D/C)/2))/C = Q

Decompose into partial terms:
V1 = D/C
V2 = N/2+V1/2
V3 = (A*N/2)/V2
P = V3/C

Because integer truncation stacks up across these steps, there will be rounding errors. Testing over the range of valid inputs shows that the result is accurate to ±1. Another step is required to fix this approximation.
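The partial terms can be written out in C to experiment with on a PC. This is only a sketch of the uncorrected estimate P, before any fix-up step. The function name is made up, and it assumes unsigned operands and a non-zero high word C in the denominator (Y >= 65536), since the method divides by C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: estimate X/Y using only steps that map onto
   16-bit DIV operations (32-bit dividend, 16-bit divisor). */
static uint32_t div32_estimate(uint32_t x, uint32_t y)
{
    const uint32_t half_n = 0x8000u;  /* N/2, with N = 1<<16 */
    uint32_t a = x >> 16;             /* A: high word of X (B is dropped) */
    uint32_t c = y >> 16;             /* C: high word of Y */
    uint32_t d = y & 0xFFFFu;         /* D: low word of Y */
    uint32_t v1, v2, v3;

    assert(c != 0);                   /* precondition of this sketch */

    v1 = d / c;                       /* V1 = D/C */
    v2 = half_n + v1 / 2;             /* V2 = N/2 + V1/2, fits in 16 bits */
    v3 = (a * half_n) / v2;           /* V3 = (A*N/2)/V2 */
    return v3 / c;                    /* P  = V3/C */
}
```

For example, X = 0x12345678 and Y = 0x00031234 give V1 = 1553, V2 = 33544, V3 = 4552 and P = 1517, which happens to match the exact quotient for this pair of inputs.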

Account for the approximation (Z is error due to truncation):
Q = P+Z

Replace Q with earlier equation
P+Z = A*N/(C*N+D)

Multiply both sides by (C*N+D):
A*N = (P*C*N+P*D)+(Z*C*N+Z*D)

Divide both sides by N, solve for Z.
Terms involving D are lost due to significant figures
A - P*C = Z*C

If Z<=0, any truncation error is covered by rounding error
If Z>0, the estimate "P" is one greater than the true result

So:
if(A > P*C), P := P-1

Untested assembly below. X is passed in [r1,r2], Y is passed in [r3,r4], and the result is returned in [r1,r2]. This assumes unsigned operands.

# Cycles
mov r4, r5 # 14 : Copy D to temp register
clr r4 # 10 : Clear high word, prepare for division

div r3, r4 # 124 : R4 = D/C {V1}

srl r4, 1 # 14
ai r4, >8000 # 18 : R4 = N/2+V1/2 {V2}

mov r1, r5 # 14 : Save unmodified A for later

mov r1, r2 # 14
src r2, 1 # 14
andi r2, >8000 # 18
srl r1, 1 # 14 : [R1,R2] = A*N/2
div r4, r1 # 124 : R1 = (A*N/2)/V2 {V3}

mov r1, r2 # 14
clr r1 # 10
div r3, r1 # 124 : R1 = V3/C {P}

mpy r1, r3 # 52 : [R3,R4] = P*C
c r5, r4 # 14 : Compare A to P*C
jle +2 # 8 :
dec r1 # 10 :

mov r1, r2 # 14 : Move result into proper registers
clr r1 # 10 : [R1,R2] = P

total 634 clocks, 44 bytes (722 clocks including instruction loads)

This compares well to an earlier method using shifts and subtracts. That method uses a maximum of 7394 clocks and 50 bytes. (10082 clocks including instruction loads)

Assuming this all works out, this approach is roughly fourteen times faster (counting instruction loads) with a smaller footprint. I think I have a winner.

Monday, August 8, 2011

OK, it's patch time again.

Here are the changes in this release:

Fixed a bug with byte initializers, which handled negative values incorrectly
Fixed a multiply bug, it was using the wrong registers
Changed frame pointer from R8 to R9. Frame was being lost
Fixed byte reads from memory, which were assumed to be copied into the register's LSB
Fixed a problem with AND improperly modifying temp values
Fixed a bug where R11 was not saved if used as a temp register
Modified output to use hex values for all constants

I've also packaged up the ELF to EA5 converter and an example program made to run as an EA5 image.

The next thing on my list is to update all the documentation. Everything I've posted so far is still valid, but there are probably holes where some subjects need more information.

I also need to put together a library for the missing 32-bit functions (multiply, divide, modulus, shift). These functions are already written and tested for the most part, so releasing them should be quick and easy.

Finally, I need to make my V9T9 disk tool ready for public consumption. It currently works, and the disk images it creates were used to test the EA5 converter, but it's super hacky at the moment. Once I spruce it up a bit and turn it into a useful tool, I can send it out the door.

Sunday, August 7, 2011

I was just about to deliver some patches, but remembered that I wanted to confirm that R11 and the fake PC register were working as expected. The PC works just fine, but R11 was not being saved properly if it was used as a temp register in a leaf function.

I found this out by writing a function that returns the sum of 16 ints: six arguments plus ten volatile locals stored on the stack. Making the locals volatile prevented the optimizer from removing any of these values, ensuring that all registers would be used. Well, all except the stack pointer, that's off-limits for obvious reasons. The C code and assembly output are shown below.

So, with that fixed and out of the way, I can finish the release.

int regtest(int a1, int a2, int a3, int a4, int a5, int a6)
{
volatile int a7, a8, a9, a10, a11, a12, a13, a14, a15, a16;
return (a1+a2+a3+a4+a5+a6+a7+a8+a9+a10+a11+a12+a13+a14+a15+a16);
}


regtest
# R1 - R6 are used by the function arguments,
# making this test slightly smaller

# Allocate 30 bytes on the stack
ai r10, >FFE2

# Save non-volatile registers to the stack
mov r10, r0
mov r11, *r0+
mov r9, *r0+
mov r13, *r0+
mov r14, *r0+
mov r15, *r0

# Copy our junk data from the stack to registers
mov *r10, r15
mov @>2(r10), r6
mov @>4(r10), r14
mov @>6(r10), r13
mov @>8(r10), r9
mov @>A(r10), r11
mov @>C(r10), r0
mov @>E(r10), r12
mov @>10(r10), r8
mov @>12(r10), r7

# Add everything up
a r15, r6
a r14, r6
a r13, r6
a r9, r6
a r11, r6
a r0, r6
a r12, r6
a r8, r6
a r7, r6
a r1, r6
a r2, r6
a r3, r6
a r4, r6
a r5, r6
mov r6, r1

# As expected, the 16th value didn't make it into a register
a @>1C(r10), r1

# Restore non-volatile registers from the stack
# This also frees 8 bytes of stack space
mov *r10+, r11
mov *r10+, r9
mov *r10+, r13
mov *r10+, r14
mov *r10, r15

# Free the rest of the allocated stack space (22 + 8 = 30)
ai r10, >16
b *r11

Saturday, August 6, 2011

Well, it turns out that I can't use define_split for AND, since GCC requires exactly two RTL expressions after the split. Unfortunately, I need three split expressions for AND. What ends up happening in that case is that the entire split pattern is ignored, and errors are emitted during compilation.

What I ended up doing was to use the ANDHI3 pattern for these AND expressions, and use unnamed patterns for ANDI and AND with an inverted operand. As a named pattern, I can demand scratch registers. The unnamed patterns are only invoked if they match exactly, so they can't use scratches. This seems to work pretty well. Here's the output for "int and_mem_mem() {return(memb_a & memb_b);}":

and_mem_mem
movb @memb_b, r1
movb @memb_a, r2
inv r2
szcb r2, r1
sra r1, 8
b *r11

The instruction order is slightly different than I expected, but this is perfectly fine. The word AND forms have been modified to match the changes I made to the byte forms.

With this out of the way, I can make a new set of patches and update the documentation. It looks like people are getting interested in using the compiler for their own projects, and I need to make sure all the information they might need is correct and available.

Thursday, August 4, 2011

I've found the cause of the omitted type casts. It turns out that I missed a branch in the instruction combination step which specifically checks for extension of values stored in memory. The existing code did the usual job of assuming that typecasts from byte to word can be removed without consequence. Of course this is a tragic mistake for the TMS9900.

I need to find a way to allocate a scratch register during split to convert this:

and_mem_mem
movb @memb_a, r1
inv r1
movb @memb_b, r2
szcb r1, r2
movb r2, r1
sra r1, 8
b *r11

to this:
and_mem_mem
movb @memb_a, r2
inv r2
movb @memb_b, r1
szcb r2, r1 <--- eliminates MOV instruction
sra r1, 8
b *r11

Other stuff to do:
Check PC as fixed function again, might be different now.
Check R11 used as data register in non-leaf functions

Tuesday, August 2, 2011

Over on the AtariAge forums, Lucien2 is finding all sorts of neat stuff that's broken.

Here's the latest:
int test()
{
return( (*(char*)0x837C) & 31);
}

This gets converted to:
test
movb @>837C, r1
andi r1, >1F
b *r11

GCC is falsely assuming that the byte value is stored in the low-order byte of R1. It's optimizing out the typecast, resulting in bad code.
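A small C model shows what goes wrong. The function names are hypothetical; the model assumes that "movb @addr, r1" places the byte in the high byte of R1 and leaves the low byte untouched, and that ANDI works on the full 16-bit register. The shift is modeled as a logical shift; the compiler output uses SRA to sign-extend, but with a mask of 31 the result is the same.

```c
#include <stdint.h>

/* Hypothetical model of "movb @addr, Rn": the byte lands in the HIGH
   byte of the register; the low byte keeps whatever was there. */
static uint16_t movb_to_reg(uint16_t old_reg, uint8_t value)
{
    return (uint16_t)((value << 8) | (old_reg & 0x00FF));
}

/* What the bad sequence computes: movb, then andi >1F.
   The mask hits the stale low byte, not the loaded value. */
static uint16_t buggy_load_and_mask(uint16_t old_reg, uint8_t value)
{
    return (uint16_t)(movb_to_reg(old_reg, value) & 0x001F);
}

/* With the byte-to-word extension kept: movb, shift right 8, then andi. */
static uint16_t fixed_load_and_mask(uint16_t old_reg, uint8_t value)
{
    uint16_t r = movb_to_reg(old_reg, value);
    r = (uint16_t)(r >> 8);           /* move the byte down to the low byte */
    return (uint16_t)(r & 0x001F);
}
```

If R1 previously held >00AB and the byte at >837C is >7F, the bad sequence returns >0B (from the stale low byte) instead of the expected >1F.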

I've changed all the constant address code to use hex values instead of decimal. It looks much better now.

In order to unify output, I should convert all shift-by-eights to use hex shift counts.

Another odd thing:
int test()
{
return( (*(char*)0x837C) & (*(char*)0x820C));
}

Gets converted to this:
test
inv @>837C
movb @>820C, r1
szcb @>837C, r1

Technically, we do need to invert before using SZCB, but we shouldn't invert memory without a good reason. That invert should be done in a register.
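The reason for the invert can be sketched in C. The model assumes that SZCB clears, in the destination byte, every bit that is set in the source byte (dst = dst AND NOT src), and that INV is a plain one's complement. The function names are made up for illustration.

```c
#include <stdint.h>

/* Model of INV on a byte value held in a register copy. */
static uint8_t inv8(uint8_t v)
{
    return (uint8_t)~v;
}

/* Model of "szcb src, dst": clear in dst the bits that are set in src. */
static uint8_t szcb(uint8_t src, uint8_t dst)
{
    return (uint8_t)(dst & (uint8_t)~src);
}

/* a AND b built from INV + SZCB. The invert is applied to a copy,
   so the original operand in memory is left alone. */
static uint8_t and_via_szcb(uint8_t a, uint8_t b)
{
    return szcb(inv8(a), b);
}
```

Since clearing the bits of NOT a leaves exactly the bits of a, szcb(inv8(a), b) equals a & b, and nothing in memory has to change to get there.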