Monday, December 31, 2012

I've been working for an hour or less most days since the last update, and it didn't make sense to write an update for each day. So here's a summary of what's been happening.

I had originally intended 32-bit multiplies to be implemented in libgcc, but instead GCC is trying to cobble something together using 16-bit multiplies. That results in unexpectedly wrong results when signed multiplication is used. The problem here is that MPY does only unsigned multiplication, but the way the MULHI3 pattern is written, GCC assumes that signed multiplication is used. Unfortunately, I could find no way to rewrite this instruction to result in nice-looking code.

The only way I could resolve the situation was to implement a MULSI3 instruction for inline 32-bit multiplies. There was an implementation commented out in the machine description file, but I was able to reduce it to nine instructions and two temporary registers. I'm not happy about this, but I can't think of a better solution.

The implementation of the TMS9900 libgcc source code has been totally refactored. Initially, I had an individual file for each libgcc function. There were quite a lot of them, and I would see default code unexpectedly linked into the test code. The problem there was that libgcc expects all assembly optimizations to be stored in a single file, pointed to the LIB1ASMSRC symbol in the makefile. Additionally, the LIB1ASMFUNCS symbol needs to contain a list of the implemented functions so the default ones will not be linked into the library. This was annoying to track down, since I couldn't find documentation for any of this, and had to rely on code used by other processors. All the libgcc functions have been tested, and work as expected.

I looked at the floating point emulation code, but it is mostly a mess of precompiler macros, and fails to build for 16-bit architectures. I think I got an error like "Here's a nickel kid, buy yourself a real processor..." This did not make me happy. My options now are to either refactor the existing code to support 16-bit machines, or write a new library from scratch. If I'm honest the second option seems more interesting.

After looking though the GCC documentation, I may have found a way to get rid of the fake PC register using UNSPEC instructions. These are unspecified machine-specific operations that might be used to handle returns or maybe even byte-to-word conversions. I would need to do some major testing for that one though.

Thursday, December 13, 2012

All functions are now implemented, now it's time for validation testing. I've already checked out the divide and modulus functions, since I knew that they would tricky.

I've started testing 32-bit multiplies, and I keep seeing results consistent with unsigned multiply, regardless of the type of operation called for. I'm also seeing that the __mulsi3 function in libgcc is not being used. It's been long enough that I've forgotten if I've implemented a 32-bit multiply recipe in the machine description file. I better check that out.

Sunday, November 25, 2012

I spent the last couple of days working on libgcc, and got all missing functions done except the division and modulus stuff.

Once this is done, I need to review everything and look for some bugs to fix. I know there's a problem with register counts for the function prologue and epilogoue, so that should be interesting.

Wednesday, November 21, 2012

So for the past few days I've been working on libgcc, making sure the compiler covers all instructions up to 32-bit operations.

Missing operations:

Count leading zero bits
__ctzsi2,__ctzhi2, __ctzqi2

Count trailing zero bits
__clzsi2,__clzhi2, __clzqi2

Find index of least significant bit
__ffssi2,__ffshi2, __ffsqi2

Return one if an even number of bits set
__paritysi2, __parityhi2, __parityqi2

Return number of set bits
__popcountsi2, __popcounthi2, __popcountqi

Signed division of 32-bit values

Unsigned division of 32-bit values

Calculate modulus of 32-bit values

Calculate unsigned modulus of 32-bit values

Do both division and modulus calculations

Do both unsigned division and modulus calculations

Multiply 32-bit values

For now, the trapped arithmetic instructions will be implemented using the default code. These functions call "abort" when there is an overflow condition, and are only needed in rare cases. The TMS9900 updates an overflow flag which we can use for this, but we can do that work later.

The other routines in libgcc are for floating-point and fixed-point math, odds are they are too big to really use. But again, I can fix that later.

Wednesday, November 7, 2012

I was trying to build libgcc to make the missing functions like __udivsi3 mentioned earlier. Unfortunately, there's a bug in GCC which causes the libgcc build to fail:

eric@compaq:~/dev/tios/toolchain/gcc-4.4.0/libgcc$ make
Makefile:143: ../.././gcc/libgcc.mvars: No such file or directory
make: *** No rule to make target `../.././gcc/libgcc.mvars'.  Stop.

After looking into this a bit, it tuns out my build directions were lacking. The GNU people always do their builds from a seperate directory, and say that any problems arising from building in the source directory will not be fixed.

So, rather than fight the world on this, I'm changing the build instructions. The following are to be done from the top level of the GCC source directory.

  $ mkdir build
  $ cd build
  $ ../configure --prefix /home/eric/dev/tios/toolchain --target=tms9900 --enable-languages=c
  $ make all-gcc
  $ make install
  $ mkdir libgcc/build
  $ cd libgcc/build
  $ ../../../libgcc/configure --prefix /home/eric/dev/tios/toolchain/ --host=tms9900
  $ make
  $ make install

Sunday, November 4, 2012

After being super busy at work for the last few months, I finally got some time to work on TI stuff again. I was in the middle of something earlier, but heck if I can remember what that was.

One of the people on AtariAge wanted to know if a FAT library would compile OK for the TI. Sounds like a good question, let's find out.

The first pass failed due to missing the libc headers for TI. For now, I'm using the i386 headers. As long as I don't try to link anything, it should be fine.

The second pass failed too, there is an invalid instruction being generated. This was super tedious to track down.

    inv  r2
    neg  >4    * invalid, only registers my be used
    inc  r2

(insn 418 417 784 fat_access.c:167 (set (reg:SI 2 r2 [230])
        (neg:SI (reg:SI 2 r2 [230]))) 68 {negsi2} (nil))

Once the instruction was isolated, the problem was obvious. I was overrunning an RTX array in the NEGSI2 instruction and the junk found there showed up in the final output. By using a properly-sized array for this instruction, all is well.

  [Nr] Name              Type            Size(Hex)  Size(Dec)
  [ 1] .text             PROGBITS        00005462 = 21602
  [ 3] .data             PROGBITS        00000030 = 48
  [ 4] .bss              NOBITS          00000CCA = 3274

Of course, this is missing a lot of other stuff:

So, yes, this library is compilable for the TI, but needs about 24KB for everything. Probably impractical.

I've seen this a few places in the output. It would be easy to optimize this, but this may be a rarely-used pattern, and not worth special attention.
        mov  r2, *r1
        mov  r2, @>2(r1)

This could better be written as:
    mov  r2, *r1+
    mov  r2, *r1

As an unrelated note, I thought for a minute that using the X instruction might be useful for the function prologue and epilogue, since we're iterating though registers. Unfortunately, it's not. At most, there are four instructions for either 'logue, and a lot more would be required if X were used. Never mind.

Sunday, July 15, 2012

Binutils 2.19.1 patch 1.4

Changes this version:

Fixed bug prohibiting the use of single quotes in a string Strings my be in either TI-style 'stuff' or C-style "stuff" TI-style strings follow E/A text rules C-style strings may include standard escape codes "example\n"


GCC 4.4.0 patch 1.6

Changes this version:

Fixed comparison against +-1 and +-2, were broken in 1.5
Prevented incorrect use of fake PC register for real work
Improved AND operations to reduce setup instructions
Fixed long-to-char conversions
Fixed post-increment pointers which live on the stack
Added optimization for setting byte quantities to zero
Added optimization for (int)X = (unsigned char)((int)X)
Removed double-counting saved registers on the stack
Reduced overhead needed for multiply instructions
Fixed bug causing structres to be loaded into registers
Structures used as function arguments now passed by reference
Fixed more bugs causing bad int-to-char conversions


Saturday, July 14, 2012

I was looking for another instance where int-to-char conversion was done incorrectly. Once again, GCC was assuming that byte quantities stored in registers are stored in the least significant byte. After much poking around, I found the problem in find_reloads. It was in a portion of code which was a default handler of reload conditions.

I had earlier fixed code in this function which handled most cases of subreg substitution, but I believe since the source and destination registers were the same in this case, execution skipped down to the default handler.

In any case, I added a check on that path to prevent modification of the subreg expression when int-to-char and char-to-int converstons are called for. These cases are properly handled in the machine description, and the correct instructions are issued as expected.

Unfortunately, once find_reloads was fixed, there was still a problem. The instruction was being broken in a different way, and looked something like this:

(insn 72 71 74 3 lucien2_8.c:39 (set (reg:QI 7 r7 [orig:73 D.1232 ] [73])
        (reg:QI 7 r7 [+1 ])) 71 {movqi} (nil))

This effectively returns us to the situation we just fixed, that the byte quantity was assumed to be in the least significant byte of a register.

After tracing execution and checking the evolution of the instruction, I found where the subreg expression was converted to the "+1" form above. This was done in alter_subreg.

The code there was intended to handle special cases which escaped processing by simplify_subreg. Since I've disabled a lot of sode which would remove subreg expressions, these default handlers are getting more use.

At this point, I need to go through my earlier notes, but I think it's time for another release.

Monday, May 21, 2012

I've noticed a few places where we have a sequence where a register is loaded with a constant, used for some operation, then discarded. Like this:
  li r1, 27
  mpy r1, r4

I wonder if it would be better to put these constants in the data section, and do something like this:
  mpy @const_27, r4

Lets check it out!

li r1, 27  - 4 bytes, 12+4+4=20 clocks
mpy r1, r4 - 2 byte, 52+4=56 clocks
total: 6 bytes, 76 clocks

.data const_27: data 27 - 2 bytes, 0 clocks
mpy @const_27, r4 - 2 bytes, 52+4+8=64 clocks
total: 4 bytes, 64 clocks

Hmm, pretty good... 33% smaller, 15% faster. I wasn't expecting that. This could be better if the constant is used in other places too. That would further reduce the effective bytes used for each instruction. (Average size is between 4 and 2 bytes per operation. Asymptotic to 2.) This is also more drastic for quicker instructions (like movb: 20+14+4+8=46 clocks vs. 26 clocks, 43% faster).

I should look at ways to optimize memset and memcpy sequences, there's got to be a way to take advantage of that.

Saturday, May 19, 2012

I fixed the problem with the use_regs error. The TARGET_PASS_BY_REFERENCE macro was not changed from the default value. This macro is evaluated to determine if function parameters which are to be passed by value should instead be silently copied to the stack and then that copy be passed by reference. This is the problem I saw earlier, the structure was being passed by value, but there were not enough registers to hold the bytes of that value.

The default action of TARGET_PASS_BY_REFERENCE is to never silently pass by reference (which causes the problem seen here). I've overridden this to instead use pass-by-reference for data types larger than four bytes or for aggregate data types (structures).

Saturday, May 12, 2012

I got some free time again to look at GCC stuff, so I'm going through the source code for Nyog Sothep looking at the problems there.

Right now I'm looking at a problem resulting in "internal compiler error: in use_regs, at expr.c:2245"

This is the result of an assert requiring that all registers be valid. This assert is within expr.c::use_regs.

The call_regs function is being called to validate the usage of a record requiring 19 registers, starting at R1. This sounds like an entire structure is being loaded into registers, and not just the element of that structure we are interested in.

This is the operation which is causing problems: has_item(investigators[i],RETURN_SPELL)

All of "investigators[i]" is being passed by value in registers instead of passed by reference using one register. Since we don't have enough registers to hold the entire structure, we get a fatal error.

Thursday, May 3, 2012

Well, I finally got multiply working using define_expand, but it's worse than what I had before.

This approach allocates two registers for the multiply result, and doesn't seem to allow for reuse of one of the input registers. The registers seem to be allocated at the beginning of the instruction, and any dead registers are marked as available for reuse after the instruction. This results in more used registers and an extra MOV instruction in most cases. The MOV is sued to copy one operand into the newly allocated 32-bit registers. This is no good, so I'll just back this out and pretend it never happened.

The good news is that this experience has given me more tools to use for the compiler. That's been handy for MULHI3, which is defined using define_expand, and results in nice clean code.

Monday, April 30, 2012

For the past couple of days I've been fighting with the compiler to try to get a working MULHI3 pattern. This is a 16-bit multiply with a 16-bit result. I already have a working multiply pattern, but it's awfully fiddly, and it's more complicated than I think it should be. In addition to all that, Lucien has a bug which shows up because of the missing MULHI3 pattern.

To make this work, I've been playing with using "define_expand" instead of "define_insn" in the machine description file. These are pretty similar, but I can have more control over temporary variables by using "define_expand". This allows more of the work I was doing in the existing multiply pattern to be pushed out into RTL expressions. At that point, GCC can automatically handle the special cases I was previously using brute force to handle. Most importantly, I won't have to be concernend about adding a bunch of superfluous MOVs or XORs to make everything work. GCC will just conveniently pick registers to make everything work without all that junk.

Saturday, April 21, 2012

Lucien on AtariAge has found a lot of problems in the compiler. Time to get back into bugfix mode.

First off there's a problem with stack construction. Apparently I was double counting the saved register space in the stack frame. The local variables are allocated from the start of the frame. I was confused and thought that the frame was required to be at the bottom of the stack, and so I made space for the saved registers, which are stored below the local variables.

This caused all frame local variables to be misaddressed, corrupting some saved registers. Lucien provided a fix, which was mostly right, but miscalculated the number of saved registers in some optimization modes. I think that's the last time the stack code needs to be changed.

There was another problem in the assembler, I thought i was being clever in allowing either TI-style single quotes or C-style double quotes for string constants. That code worked fine for C-style strings, but always closed the string when TI-style strings were used. This is a problem since GCC was configured to always use TI-style strings for the emitted assembly. This means there was no way to use single quotes in compiled code.

This was fixed by rewriting a good chunk of the string handling code in GAS. Now if a TI-style quote is used ('like this'), no escape sequences will be recognized except for the sequence '' which is converted to one single quote character. If C-style quotes are used ("like this"), all standard escape sequences are recognized. This allows assembly authors to chose the style they prefer while remaining compatible with the TI conventions. GCC always uses TI-style strings, so users will see their expected output.

I think there's another five or so GCC bugs in Lucien's latest game. I'll start going through it and see what I can do.

Monday, April 9, 2012

Since there were no more problems, I've gone back to printf. To make any more progress, I need to get 32-bit division working. I mentioned this approach earlier, but after I went over those notes, I found some problems.

So just to be sure, I derived everything from scratch. The notes here are the highlights, since I don't want to reproduce the three pages I needed for the proof here.

The idea is to transform the fraction to ensure a 16-bit denominator. This allows the use of the DIV instruction, and skips the more commonly-seen but much slower shift-and-subtract method usually used for this kind of math.

Let's start at the beginning. Assume positive numerator and denominator.

----- = Q

The denominator is 32-bit, so for non-zero results, the numerator must be 32-bits as well. The "n" constant is 0x10000.

------- = Q

Multiply by (1/c)/(1/c), rename to P to acknowledge the truncation error

----------- = P

Multiply by (1/2)/(1/2) to keep 16-bit denominator.

----------------- = P

Find the remainder to correct for truncation error. Given the requirements on the inputs, P will be no more than 16 bits wide.

(a*n+b) - P(c*n+d) = R

The bulk of the proof was in determining that this was the only correction needed over the entire domain of valid inputs:

If R<0 then decrement P by one

Here's the resulting C code. It's been verified on a PC using Monte Carlo methods over the valid domain. I just need to convert to TMS9900 assembly and count some clocks.

long div(unsigned long num, unsigned long den)
unsigned short c=den>>16;
unsigned short d=den&0xFFFF;
unsigned short v=0x8000+(d/2)/c;
unsigned long u=(num/2)/c;
unsigned long p = u/v;
long r = num-p*den;
{r+=den; p--;}

This requires one 32x16 division, one 32x32 multiply, a 32-bit add and a 32-bit subtract, All other operations can be made using 16-bit values.

Not too shabby.

Sunday, April 1, 2012

I found this sequence which is being used to clear a byte of memory:

clr r2 # clocks: 10 bytes: 2
movb r2, *r1 # clocks: 4+14+6 = 24 bytes: 2
total: 34 clocks, 4 bytes

For indexed memory locations:

clr r2 # clocks: 10 bytes: 2
movb r2, @1(r1) # clocks: 4+14+8= 26 Bytes:4
total: 36 clocks, 6 bytes

I can do better than that using subract instructions:

sb *r1, *r1 # clocks: 4+14+6+6 = 30 bytes: 2
total: 30 clocks, 2 bytes

sb @1(r1), @1(r1) # clocks: 4+14+8+8 = 34 bytes: 6
total: 34 clocks, 6 bytes

The first form is about 12% faster, the second is about 6% faster. This isn't a huge improvement, but it's still better. This is probably best added as a peephole.

And now it's in there.

There's also this sequence, which I'm not happy about. It's the result of:

unsigned char a = (((unsigned char)val & (char)0x0F) + (char)'0');

mov r2, r5
swpb r5
srl r5, 8
andi r5, >F
ai r5, >30
swpb r5

I think this would be better:

mov r2, r5
andi r5, >F
ai r5, >30
swpb r5

So I've added an optimization for (int)X = (unsigned char)((int)X). This replaces:
mov r2, r5 # clocks:14
swpb r5 # clocks:10
sra r5, 8 # clocks:12+16
# total=52 clocks

mov r2, r5 # clocks: 14
andi r5, >00FF # clocks: 14+4
# total=32 clocks

This is nearly twice as fast, and in the case where no MOV is needed, even faster. This makes me happy again.

Friday, March 30, 2012

I've been looking through the tms9900.c file, here are some random notes for later:

The SP currently points at the bottommost used register, but every other machine I know points at the topmost free address, I should probably chnage that. Shouldn't be a big deal, but may break backwards-compatability.

I fixed the df_ref_record crash. The problem was not respecting the difference between strict and non-strict validation in GO_IF_LEGITIMATE_ADDRESS. When in non-strict mode, all pseudoregisters are treated as if they were real registers. In strict mode, they are treated as memory locations.

Since this distinction was not made, the value (r10+18) was treated as a register, accepted by GO_IF_LEGITIMATE_ADDRESS, and things blew up later. Now, that construct is seen as invalid, and GCC refactors the expression to find an equivalent valid form.

Back to printf...

Wednesday, March 28, 2012

I recently upgraded my laptop to Ubuntu 11.10, and this has made a mess of the porting effort. Since I needed to relink against new libraries, Binutils and GCC need to be rebuilt. This shouldn't be a problem, but for some reason, the -Werror flag (treat all warnings as errors) is now respected, where it seemed to be ignored before. None of the warnings are in my code, so I don't feel bad disabling warnings.

So I've got to update the build directions for Binutils.

$ cd {path_to_original_files}
$ patch -p1 < patchfile

$ ./configure --target tms9900 --prefix /home/eric/dev/tios/toolchain --disable-build-warnings
$ make all
$ make install

$ ./configure --prefix /home/eric/dev/tios/toolchain --target=tms9900 --enable-languages=c
$ make all-gcc
$ make install

Sunday, March 25, 2012

I Found this during testing:

printf.c: In function ‘printf’:
printf.c:239: internal compiler error: in df_ref_record, at df-scan.c:2803

This was the faulting expression:
(mem/c:HI (plus:HI (reg/f:HI 10 r10)
(const_int 18 [0x12])) [6 %sfp+10 S2 A16])

Since this expression doesn't show up in the debug output, I need to backtrace to find out what happened here

df_ref_record (cl=DF_REF_REGULAR)
df_uses_record (in the POST_INC case)

RTL of offending instruction:
(set (reg/v:QI 2 r2 [orig:49 c ] [49])
(mem:QI (post_inc:HI (mem/c:HI (plus:HI (reg/f:HI 10 r10)
(const_int 18 [0x12])) [6 %sfp+10 S2 A16])) [0 S1 A8]))

It looks like this came from this instruction, seen in 168r.asmcons

(insn 39 168 41 5 printf.c:152 (set (reg/v:QI 49 [ c ])
(mem:QI (post_inc:HI (reg/v/f:HI 46 [ p.44 ])) [0 S1 A8])) 71 {movqi} (expr_list:REG_INC (reg/v/f:HI 46 [ p.44 ])

In a more readable form:
movb 18(r10+), r2

It took me a while to figure this out, but GCC is not trying to increment R10, but the expression (R10+18), which rightly fails.

The pseudoregisgter R46 is "p" (a pointer into the current string) which is used later, as is R10, which is the stack pointer. This means that the post-increment expression needs to be removed somehow.

I imagine if I used the "register" keyword, that would probably fix this problem, but this would just show up again sometime later.

I need to take a closer look at the PDP11 config to see how they dealt with the problem.

Saturday, March 24, 2012

I played around with the compiler trying to reduce the test program output and found a few things.

There is a note attached to the add instruction that the R3 regisgter in HI mode is unused after that instruction, so we could skip the "jnc, inc" sequence.

The check commonly used for the other machines, "find_reg_note(insn, REG_UNUSED, operands[0])", will not work here. There are two reasons for this. The first is that the operand is in SI mode, but the note is for HI mode. The second is that the function only checks for pointer equality, and does not compare the RTX expressions to which the pointers refer. I'd have to make a new find_reg_note function to do note checking.

If I just change the order of typecasting, all this goes away:

char test(long a) {return((char)a+'0');}


mov r2, r1
swpb r1
ai r1, >3000
b *r11

This is the ideal sequence I was looking for originally, so another best practice seems to be to to late promotion and early demotion of datatypes. That seems to result in better code.

So for now, I'll put compiler changes on hold, and just go back to printf.

Friday, March 23, 2012

I'm tracing backwards through the call tree to find where this problem shows up. Call tree is shown below:


Found it. The problem is in the change I made earlier in simplify-rtx. That change was to handle int-to-char conversions, but it doesn't handle long-to-char properly. It was finding the correct register to use after conversion, but the wrong mode, this was described earlier during initial investigation.

The fix was to first convert long-to-int to find the correct register, then int-to-char to complete the conversion.

Finally, I get this code, which is longer than strictly necessary

clr r3 # Clear upper word of temp long
mov r2, r4 # Copy lower word of input to temp
ai r4, >30 # Add '0' to temp
jnc $+>4 # Add carry bit to upper word of temp
inc r3
mov r4, r1 # Copy low word of temp into output register
swpb r1 # Move result into position for return
b *r11

A better result would be more this like:

ai r2, >30 # Add '0' to low word of input
mov r2, r1 # Copy result into output register
swpb r1 # Move result into position for return
b *r11

Thursday, March 22, 2012

Here's a simplified version of the instructions up to step 182r.peephole2 for the "long" type.

inputs: r1,r2=val, r3=a_base (unused)
insn 24: r3=0 \_ Set temp 32-bit value to zero
insn 25: r4=r3 /
insn 28: r4=r2 <- Copy low word into temp value
insn 12: r4&=0x0F <- Apply mask
insn 13: r3(32-bit)+=48 <- '0'=48 <- Adding constant to low word
insn 14: r1=(char)r4
insn 15: number_buf[0]=r1
insn 33: return

GCC is claiming that R3 is unused (this is true), and so any manipulation of R3 can be safely removed. GCC also seems to be expanding this to R4, as it's part of the temporary 32-bit value defined in insn 24 and 25.

That results in these instructions being deleted in step 182:

insn 13
insn 12
insn 28
insn 25
insn 24

We are left with R4, with the note that it's in SImode (32-bit value), somehow the notion that R4 is the low word is lost, and R5 is selected for the low word of the value. This results in the final code:

mov r5, r1
swpb r1
movb r1, @number_buf
b *r11

Similar behavior was seen with this code as well:
char test(long a) {return(a+'0');}

mov r5, r1
swpb r1
b *r11

Testing will continue with this example, since it's much shorter.

I found that the problem happens during register allocation. In all prior steps, the RTX looks good, but at 172r.ira, the wrong register is chosen, which makes a mess of everything from that point on.

insn 3: (HI)((SI)r24)[0] = r1
insn 4: (HI)((SI)r24)[1] = r2
insn 9: (SI)r26=(SI)r24+48
insn 10: (QI)r27=(QI)((SI)r26)[3]
insn 16: (QI)r1=(QI)r27

insn 3: r3=r1
insn 4: r4=r2
insn 9: (SI)r3+=48
insn 10: (QI)r1=(SI)4 <- this is wrong, should be (SI)r3

Wednesday, March 21, 2012

Ok, this is wierd. I was working on a hex printing routine, and was seeing some strange behavior, so I've isolated it below:

void print_hex(long val, char a_base)
char *p=number_buf;
unsigned char a = ((val & 0x0F) + '0');
*p++ = a;

This results in this assembly:

mov r5, r1
swpb r1
movb r1, @number_buf
b *r11

If "val" is defined as an "int" all is well:

li r2, >F
mov r2, r2 <-- Wierd, but OK I guess
inv r2
szc r2, r1
ai r1, >30
swpb r1
movb r1, @number_buf
b *r11

I'm totally confused.

The wierd MOV instruction was from the "andhi3" pattern. The idea was to move data into the correct temporary register for the SZC instruction. This move was inserted every time SZC is used, even if it would be pointless, as in the example above.

I've reworked the word and byte "and" patterns to emit optimal code. As a side-effect, some of the gyrations needed to handle SZC have been reduced. Here's the new code for this function:

andi r1, >F
ai r1, >30
swpb r1
movb r1, @number_buf
b *r11

Much better. Now back to bizarro-ness.

The problem shows up in step 182r.peephole2, but I'm not sure why these instructions were deleted. I see that there are notes like "DCE: Deleting insn 25". DCE stands for Dead Code Elimination, but the code in question is not dead. Or at least I don't think it is.

This is the first deleted instruction:

(insn 13 12 14 2 printhex.c:17 (set (reg:SI 3 r3 [27])
(plus:SI (reg:SI 3 r3 [26])
(const_int 48 [0x30]))) 56 {addsi3} (expr_list:REG_UNUSED (reg:SI 3 r3 [27])

This would be equivalent to the "ai r1, >30" instruction in the working int case. For some reason, this code is marked as unused, and removing this instruction causes all prior code to be marked and deleted as well.

Now I just need to find out why GCC think this is unused.
More time spent working on printf. I found a case where the fake PC register snuck into the allocator by GCC attempting to hold the low part of a 32-bit value starting in R15.

This was fixed by changing the HARD_REGNO_MODE_OK macro to disallow 32-bit values from being stored in R15. This ensure that only 16-bit values can only be used in the fake PC register, and we already ensure that only happens as part of a function epilogue.

Tuesday, March 20, 2012

Lucien found more problems.

The first is a result of breaking up his main.c file into two pieces. I don't have internet access right now, so I can't work on this. The other problem was this code omitting the test for KEY's value:

#define KEY (*(volatile char*)0x8375)
int test1(int rpt){
if(KEY!=0xFF && !rpt) return(0);
else return(1);

This was just an unexpected enforcement of type limits. The proposed check against 0xFF is beyond the type limits of KEY and so need never be checked, since it would always return false.

Fixed output:

movb @>8375, r2
seto r3
cb r2, r3
jeq L7
clr r2
mov r1, r3
jne L8
mov r2, r1
b *r11
jmp L9
li r2, >1
mov r2, r1
jmp L3
li r1, >1
b *r11

Smaller code size results from this however:

#define KEY (*(volatile char*)0x8375)
int test2(int rpt){
if(KEY==-1 || rpt==0) return(1);
else return(0);

mov r1, r2
movb @>8375, r1
seto r3
cb r1, r3
jeq L6
clr r1
mov r2, r3
jeq L6
b *r11
jmp L8
li r1, >1
b *r11

The first form has three exit points(KEY==-1, KEY!=-1&&rpt==0, else clause), and has three code fragments for setting the return values. The second form only has two exits (KEY==-1||rpt==0, else clause), and so has less code. I'll generalize this and assume that AND conditions generate more code that their equvalent OR tests. I should do more tests to confirm that, but I feel pretty good about that assumption.

In the process of investigating this, I saw that the recent fixes for subract broke a peephole optimization for comparisons against small constant values (+-2 and +-1). Fixed now.

I tought I might be able to get faster code for some byte compare conditions, but I was wrong. The idea was to use SWPB to convert to a word value and then use INC-like instructions to set the flags. This is a bad idea. More cycles would be wasted in the setup to properly set the upper byte than would be saved by not using CB. Poop.

The best I can do is to use CI with cleverly-chosen constants for inequality compares (>, >=, <, <=), and MOVB for compares with zero. All of which I'm already doing.

Thursday, March 15, 2012

Posted on AtariAge:

Well, after a really long time without any signs of life from this project, it's patch time.

A big "thank you" goes out to Lucien. A lot of the updates here are a direct result of the effort he put into making Rush Hour. He did a great job wading through all the brokenness to make a functional game. Now it's time for everyone to benefit from that work.

New Binutils fixes in this release:

STST was incorrectly looking for two arguments
SBO, SBZ and TB incorrectly using constants
EQU'ed symbols sometimes replaced using wrong endianness

GCC fixes:

Fixed several word-to-byte conversion errors
Fixed "unrecognizable instruction" for zero comparison operations
Made optimizations for most comparison operations
Improved correctness of condition flag handling
Switch statements now work properly
Fixed divison and modulus, operands were used in wrong order
Fixed subtract, operands were occasionally used in wrong order
Fixed stack frame corruption when local variables are in use
Added optimizations for forms like (int Y)=((int)(char X))<
The patch and build procedures are the same as always. Development notes are on my blog for those who are interested.

Things are shaping up pretty well so far. (Yes it is taking forever, sorry about that.) I don't see any obvious holes to fill, or optimizations yet to do. At this point, I just need to exercize the compiler with larger programs and increase test coverage. If anyone finds a problem, or sees an area where improvements can be made, please let me know.

I'm continuing to work on related projects (disk management tool, libc library, documentation). There's still lots to do, so these updates will keep coming

Monday, March 12, 2012

I've been going through the compiled code looking for more opprotunities for optimizations. I found this sequence, which looks interesting:

(int y) = ((int)(charx))<<8
sra r2, 8 12+2*8+4=32
sla r2, >8 12+2*8+4=32
total: 64 clocks, 4 bytes

This could be replaced with:
andi r2, >FF00 14+8+4=26
total: 26 clocks, 4 bytes

more generally:
(int y) = ((int)(charx))<< N
sra r2, 8 12+2*8+4 = 32
sla r2, N 12+2*N+4 = 16+2*N
total: 48+2N clocks, 4 bytes

sla r2, N-8 12+2*N-2*8+4 = 2*N
andi r2, 0xFFFF< total: 26+2N clocks, 6 bytes

This looks pretty darn good, about 33% faster on average.
Truth table below:

N Original pattern Result Optimization
- ----------------- ----------------- ----
0 01234567.xxxxxxxx -> xxxxxxxx.01234567 swpb
1 01234567.xxxxxxxx -> xxxxxxx0.1234567x >>7
2 01234567.xxxxxxxx -> xxxxxx01.234567xx >>6
3 01234567.xxxxxxxx -> xxxxx012.34567xxx >>5
4 01234567.xxxxxxxx -> xxxx0123.4567xxxx >>4
5 01234567.xxxxxxxx -> xxx01234.567xxxxx >>3
6 01234567.xxxxxxxx -> xx012345.67xxxxxx >>2
7 01234567.xxxxxxxx -> x0123456.7xxxxxxx >>1
8 01234567.xxxxxxxx -> 01234567.xxxxxxxx nop
9 01234567.xxxxxxxx -> 1234567x.xxxxxxxx <<1
A 01234567.xxxxxxxx -> 234567xx.xxxxxxxx <<2
B 01234567.xxxxxxxx -> 34567xxx.xxxxxxxx <<3
C 01234567.xxxxxxxx -> 4567xxxx.xxxxxxxx <<4
D 01234567.xxxxxxxx -> 567xxxxx.xxxxxxxx <<5
E 01234567.xxxxxxxx -> 67xxxxxx.xxxxxxxx <<6
F 01234567.xxxxxxxx -> 7xxxxxxx.xxxxxxxx <<7

Sunday, March 11, 2012

I don't know what happened, but that error doesn't show up anymore. I didn't make any compiler changes, so that bug could happen again. Unfortunately (for bug hunting) there is a pass in GCC to shorten branches. That pass automatically reorganizes the code to make optimally shorter jumps. That means forcing a condition where this bug can occur is tricky.

I ran a bunch of tests to try to recreate this problem, but came up short. I might have to just wait until a repeatable test case is found. That makes me nervous, but I'm not sure what else to do.

Friday, March 9, 2012

I've decided to see how far I can get without installing new software.

The problem seems to always be related to stuff in show_score. I've changed that function to just do this:

void show_score() {
char s[7];

if(score) {

ai r10, >FFF2 allocate 14 bytes
mov r10, r0
mov r11, *r0+ sp[0]=r11
mov r9, *r0+ sp[2]=r9
mov r13, *r0+ sp[4]=r13
mov @score, r1 r1=score
jeq L193 if score==0 goto L193
li r9, itoa r9=itoa
mov r10, r13 r13=sp
inct r13
mov r13, r2 r2=&sp[2] (should be &sp[6], r9, r13 lost)
bl *r9
mov @score, r1
mov r13, r2
bl *r9
mov *r10+, r11
mov *r10+, r9
mov *r10+, r13
ai r10, >8
b *r11

It looks like the location of the stack variables is being calculated wrongly. Time to dig into that.

The problem is that the stack code replaced the local frame with the stack pointer plus offset, the offset was incorrectly calculated. The size of he saved registers were not taken into consideration. Now that's taken care of, things are much better.

I'm a terrible liar. Building from scratch resulted in this mess:

main.o: In function `show_score':
(.text+0x9e5): relocation truncated to fit: R_TMS9900_PC8 against `.text'
make: *** [all] Error 1

000009d0 :
9d0: 02 2a ff f4 ai r10, >FFF4
9d4: c6 8b mov r11, *r10
9d6: ca 89 00 02 mov r9, @>0002(r10)
9da: c0 60 00 28 mov @>0028, r1
9de: 13 00 jeq 0
9e0: d1 6a 00 04 movb @>0004(r10), r5
9e4: 13 00 jeq 0
9e6: c2 4a mov r10, r9
9e8: 02 29 00 04 ai r9, >0004
9ec: c0 49 mov r9, r1

Here's that relocation:

000009e5 00000101 R_TMS9900_PC8 00000000 .text + ae6

This results in a displacement of 0x101, which won't fit in the JEQ instruction. This is disappointing. I'll have to find some way to distinguish short and long jumps and stick that in the compiler.

check out s390_shorten_branches for possible ideas also look for "branch" and "range" in the other machine config files.

Thursday, March 8, 2012

I've gone through the Rush Hour code, and all the awkward workarounds have been backed out except for a problem with this line:


For some reason this is being converted to:


If strlen is defined as an external funciton, or a variable is used instead of a constant, this works properly. So, lets follow the development of this code...

from 128r.expand:
(insn 23 22 24 7 emw_test.c:5 (set (reg:HI 23 [ prephitmp.56 ])
(minus:HI (const_int 32 [0x20])
(reg:HI 26))) -1 (nil))

(insn 23 22 24 6 emw_test.c:7 (set (reg:HI 23 [ prephitmp.56 ])
(minus:HI (const_int 32 [0x20])
(reg:HI 26))) 62 {subhi3} (nil))

(insn 23 22 24 7 emw_test.c:7 (set (reg:HI 23 [ prephitmp.56 ])
(minus:HI (const_int 32 [0x20])
(reg:HI 26))) 62 {subhi3} (expr_list:REG_DEAD (reg:HI 26)

(insn 23 22 24 6 emw_test.c:7 (set (reg:HI 23 [ prephitmp.56 ])
(minus:HI (const_int 32 [0x20])
(reg:HI 23 [ prephitmp.56 ]))) 62 {subhi3} (nil))

This is correct so far, but something changes in step 172 during register allocation.

from 172r.ira:
(insn 23 22 24 emw_test.c:5 (set (reg:HI 2 r2 [orig:23 prephitmp.56 ] [23])
(minus:HI (reg:HI 2 r2 [orig:23 prephitmp.56 ] [23])
(const_int 32 [0x20]))) 62 {subhi3} (nil))

This same pattern shows up even in this simple example:

int sub_seven_word_r(int a) {return(7-a);}

So it turns out I had a typo in the machine description. I mistakenly told GCC that subtraction is commutative, and that operands can be swapped if that will result in better code. However once I fixed that, I started seeing this for the simple examples:

li r2, >7
s r1, r2
mov r2, r1
b *r11

li r2, >100
sb r1, r2
movb r2, r1
b *r11

li r1, >100
sb @memb_a, r1
movb r1, @memb_a
b *r11

mov @memw_b, r1
s @memw_a, r1
mov r1, @memw_a
b *r11

This could be made better:

addi r1, -7
neg r1
b *r11

li r2, >100
sb r2, r1
neg r1
b *r11

li r2, >100
sb r2, @memb_a
neg @memb_a
b *r11

s @memw_b, @memw_a
neg @memw_a
b *r11

So at this point, all of Lucien's issues have been addressed, and Rush Hour has had all the workarounds replaced with correct code. Well, all except for the VDP code which was the first thing I looked at. I have no idea what's going on here, maybe it's a stack corruption problem.

The problem seen when the VDP workarounds are backed out is that after moving the first piece, any keypress which would move the cursor causes the program to restart. I'll probably need to install Classic99 to examine this stack issue.

Tuesday, March 6, 2012

I fixed a problem with division and modulus. Apparently I shouldn't be allowed to read data sheets anymore. The operands were swapped, and the wrong byte of the return value was returned as a result (A/B retuned B%A, A%B returned B/A).

What makes me sick is that I specifically went over all this earlier, and made a point of spending lots of time to make this right. I failed.

At least it seems to work now.

Monday, March 5, 2012

I took a detour for a minute and decided to make what I thought was an easy change and add cases to the switch statement in the input loop. Now I've got problems. Backing out vdp copy code changes for now.

To start with, there were a lot of errors like "(.text+0x15dc): undefined reference to `.L388'". That was fixed by changing the case label format in the compiler. The common unnamned label looks like ".L192", but the "." symbol is not supported in the TI assembler. Earlier, I changed to the "L192" format, but I missed this usage. The mismatch resulted in all the link errors.

That done, all key checks in the switch no longer work. I need to analyse the jump table to see what the heck is going on here.

switch labels:
data L386 - default
data L387 - 'S', 8,83, 0x08,0x53
data L388 - 'D', 9,68, 0x09,0x44
data L389 - 'X',10,88, 0x0A,0x58
data L390 - 'E',11,69, 0x0B,0x45
data L391 - ' ',13,32, 0x0D,0x20

jump calculation:
movb r9, r1
ai r1, >F800 <-- R1-=8, (min case=8)
ci r1, >50FF <-- check for max case (0x58-8=0x50)
jle JMP_13
b @L386 <-- beyond case extent, use default
srl r1, 8 <-- convert to int, 0<=R1<=0x50
a r1, r1 <-- convert to byte offset
b @L392(r1) <-- jump to case

Jump table:
index 0,75 -> L387 -> S
index 1,60 -> L388 -> D
index 2,80 -> L389 -> X
index 3,61 -> L390 -> E
index 5,24 -> L391 -> space
all else -> L386

After looking at this code for a long time, I think I know what's going on.

It's the instruction "b @L392(r1)". I looked through the E/A manual and the 9900 data sheets, and as far as I can tell, the next instruction exectued wil be the one at L392+r1. This seems obvious, but I'm used to working with machines which would instead load the value at this address, then jump to the loaded address. I need to change that to:

mov @L392(r1), r1
b *r1

That means more digging into GCC. I'd also like to find the code which decides when to use a jump table. This example seems like it would be smaller with nested if's

Well, the switch vs. if code was not found, but the correct branch code has been implemented, and works just fine.

Sunday, March 4, 2012

I got ahold of version three of Rush Hour, which contains all Lucien's fixes, and is considered the final code.

A disk image was made of the compiled, unmodified code. Seems to work just fine. Now I can start removing his workarounds and restore the code he originally meant to write.

First step, removing the unneeded "mov r1, r1" from vdp_copy_from_sys.
Works, but the cursor is corrupted after moving a piece. Hmm.

Dump of original code:

00000000 :
0: c0 41 mov r1, r1
2: c0 41 mov r1, r1
4: c0 41 mov r1, r1
6: c0 41 mov r1, r1
8: a0 c2 a r2, r3
a: c1 01 mov r1, r4
c: 06 c4 swpb r4
e: d8 04 8c 02 movb r4, @>8C02
12: 02 61 40 00 ori r1, >4000
16: d8 01 8c 02 movb r1, @>8C02
1a: 80 c2 c r2, r3
1c: 14 00 jhe 0

This seems to be inlined all over the place, so I need to look closely at all references.

Friday, March 2, 2012

I've found the problem in my disk tool. The three-byte cluster record in the FIB block was being improperly constructed. This caused the program loader to get nonsense sectors, resulting in a crash.

I confirmed this with a known-good disk image. The files it contained were extracted, and a new image made with those files. That new image ran without problems.

I spent some more time doing cleanup and comments, and exercised the options a bit more to gain confidence in the results. The disk tool still needs some work though. For instance, data file handling is incomplete, handling disk-full conditions is iffy at best, and using the tool at the command line is awkward, with lots of easy-to-forget switches.
Since I've totally forgotten what I was doing, I've decided to go over Rush Hour once again, looking for evidence of my shame.

Everything compiles without error with the latest changes, but I need to execute the new image to truly confirm this.

That means I need to make a disk image. Hopefully that tool was left in good shape.

Unfortunately, no. The resulting disk will not load in E/A, and crashes Disk Manager. That's bad.

Tuesday, February 28, 2012

The past few months have been super busy, and this is the first time I've had a chance to stop and really work on this code since September.

When last I was here, I was looking at a bad subreg of a symbol_ref expression. I had also backed out all the front end changes for some reason. All that code's going back.

The fix for the bad subreg is to defer action until the location of that symbol is known. The gen_lowpart_general function is called in a lot of compilation steps, so later on when we know if the symbol data is stored in a register or memory, we can determine the proper expression to use.