Wednesday, January 16, 2013

GCC 4.4.0 patch 1.7

Changes this version:

Libgcc now built for TMS9900
Implemented optimized assembly functions in libgcc for:
  count leading zero bits
  count trailing zero bits
  find index of least significant set bit
  count set bits
  calculate parity bit
  signed and unsigned 32-bit division and modulus
Fixed 32-bit multiplies, was only doing unsigned multiply
Fixed 32-bit negation, was emitting invalid NEG instruction
Removed fake PC register (Yay!)
New build instructions to make libgcc
Fixed function prologue and epilogue, was saving R11 unnecessarily
Optimized function epilogue, saving a few cycles
Enforced correct use of R11 register, was causing randomly broken code

Download:
gcc-4.4.0-tms9900-1.7-patch.tar.gz

Sunday, January 13, 2013

So, I'm back to debugging what went wrong with Rush Hour. It turns out that in the show_wall function, R11 is being used without being saved first. When the function returns, execution goes off into the woods and we're hosed.

000007f0 :
     7f0:       02 01 46 00     li r1, >4600
     7f4:       d8 01 8c 02     movb r1, @>8C02
     7f8:       02 0b 40 b0     li r11, >40B0
# R11 used before being saved
     7fc:       d8 0b 8c 02     movb r11, @>8C02
     800:      06 cb               swpb r11
     802:      d8 0b 8c 00     movb r11, @>8C00
     806:      02 01 47 00     li r1, >4700

The problem seems to be that for some reason df_regs_ever_live_p() does not reflect the usage of R11 so we don't know that we should save it.

The sequence which uses R11 was added as part of a peephole replacement, and all other registers allocated this way are not marked as used either. Apparently, GCC tries to allocate an unused register in preference to reusing one previously allocated. Interesting.

I started looking into the peephole handling code, but realized that R11 was being allocated due being designated as volatile, and all other volatile register were already used. After thinking about this a bit, I can fix this much more easily by just changing R11 to be preserved across function calls.

Ugh, still have problems. So now GCC thinks R11 is preserved, and tries to put values there to be preserved across a function call. Obviously, those values are destroyed after the return. So R11 has instead been marked as a special-purpose register and will not be considered for allocation. I don't like this, since R11 could legitimately be used if only we knew when to preserve it. At least everything works now.

I also looked into the reported stack size problem, where stack values were overwriting memory used to save register values. The problem there was that there were several places in tms9900.c which calculated which registers were to be saved, and they did not all agree. This discrepancy resulted in space for R11 being included for the prologue and epilogue, but not included for the stack variables. The register value stored in the overlap region got destroyed as a result.

The work that was done for optimizing the prologue and epilogue also included merging all the work to determine which registers to save into one function. This improved code clarity, and removed the ambiguity which lead to this problem.

As a side note, there is a libiberty library in binutils, but it's not especially useful. In the bundled README file, it is stated that this is a collection of random useful functions. It also states that these routines are highly system-dependant. It's also filled with a bunch of routines which don't seem to be very useful on a target system. I guess I still need to finish my libc implementation.

So at this point, all the bug reports have been addressed, and all my milestone goals have been reached. I suppose this is a good time to put together another release.

Saturday, January 12, 2013

I was thinking about alloc() lately, so I spent some time investigating how I could pull it off. The short answer is: I can't. The longer answer is: I shouldn't try.

Alloca is fabled for having buggy implementations with lots of tricky edge cases which are not handled well. Beyond this, there's the fact that we're tight on stack space as it is. Opening that space up for potential abuse is just asking for trouble.

I suppose I could come up with a scheme to pull it off anyway, but the stack frame I'm currently using is not friendly for such a thing. I'd have to either make an alternate stack frame for functions which use alloca or simply use another one altogether. Since there are lingering problems with malformed stacks, I don't think it would be wise to make things any more complicated than they already are. At least for now.

That being said, I was looking at the prologue and  epilogue code currently in use. There are opportunities for optimization I should look at. The basic idea is that we increment the stack pointer as we restore registers. If local stack is used, there is a final adjustment to correct for that.

There is a four cycle cost for each of those increments. The current code attempts to fold the final increment into the adjustment for the final stack frame adjustment, but only actually does this in rare circumstances.

There was also a pointless "ai r10, 0" instruction emitted in the prologue if no stack usage was used.

That's all easy to fix, but I'm wondering if it's better to not do the folding for small stack sizes. Let's consider the case where one register is saved, and we have two bytes for local usage.

Case 1: Fold increment into stack adjustment
mov *r10, r11  # 4+14+4=22 clocks
ai r10, 4           # 4+14+4=22 clocks
Total: 44 clocks, 6 bytes

Case 2: Preserve increment
mov *r10+, r11  # 4+14+8=26 clocks
inct r10               # 4+10=14 clocks
Total: 40 clocks, 4 bytes

OK, that's a pretty clear win for not folding. I'll get on that.

Thursday, January 10, 2013

More disappointment. I found the disassembly code in MESS and isolated it in a test program. It seems to work just fine. This really isn't a suprise, since that code's been around for a long time, and disassembly is pretty straightforward for any machine.

So, the question remains: what the heck is going on here? Haven't a clue.

Rather then invest time debugging MESS, I'll just write that off and continue using code and disassembly analysis for my work.

Tuesday, January 8, 2013

So I was trying to run a test build of an older version of Rush Hour and decided it was time to use the debug facilities of MESS. So far, I've done all my TI work without using any debuggers and I thought it might be nice to cheat a little. Sadly, MESS is acting acting a bit weird.

I set some breakpoints at the start of the cartridge, and they didn't take. After fighting with that for a while, I tried running a simple test cart I was using earlier. That cart eventually hits a "jmp $" instruction and stops. I figured I could at least confirm that the cart is being loaded into the right location.

I think I found the problem. The debugger seems to be misinterpreting the instructions in the disassembly window. Maybe this is causing the breakpoints to be ignored since MESS thinks the address is in the middle of a multiword instruction. Below I've compared the same chunk of code as seen in the debugger window and the objdump disassembly.

From MESS (incorrect):
60C2: 10FF                     socb *r0, *r12+
60C4: 064A C68B           szc r6, @>8bc6(r8)
60C8: 06A0                     a r6, r0
60CA: 6768 C0C1 0702  s @>c1c0(r7), @>0207(r1)

From objdump (correct):
60c2: 10ff              jmp -2
60c4: 064a            dect r10
60c6: c68b            mov r11, *r10
60c8: 06a0 6768   bl @>6768
60cc: c0c1            mov r1, r3

So it looks like I should take a look at the MESS sources and see what's going on here. Most likely, there's something misconfigured somewhere.

Sunday, January 6, 2013

After looking at avr-libc, I've decided that it won't work either. There are a lot of assembly modules in there, as well as several modules which are hardware-dependant, all of which would have to be removed for the library to be of any use.

I could make a new library using all the relevant parts of avr-libc, but it uses a modified BSD license, which I would have to maintain for any extensions I made later. I don't have a problem with the BSD license, and it's probably the right one to use, but I don't want to deal with legalisms at this point.

Months earlier, I started working on my own libc implementation. At this point, it looks like my best bet is to just complete that project and call it a day.

By the way, there is some nice documentation on the libc functions at http://nongnu.org/avr-libc/user-manual/modules.html. That could be very handy for coverage checks.

Saturday, January 5, 2013

Over the last few days I've been trying to get a libc library working for the TMS9900.

I started with libiberty, but got stuck trying to compile regexp.c. There were a lot of missing header files, and I wasn't able to find functional replacements. I gave up on it because even if I got past regexp, there were a lot of OS-related functions which would need to be removed. These are functions like file IO, syscalls, threading and timezones. Actually, after all that was removed, there wouldn't be much left.

So next I tried uclibc. This is a popular replacement for the standard glibc library, and is often used for embedded applications. There are a lot of configuration options, and many categories of functions can be simply disabled by using its configuration menu. This seemed like uclibc was perfect for our needs. I got as far as adding the TMS9900 as a valid target and running some test builds. Again I found problems. All of uclibc assumes 32-bit words. It also assumes that certain classes of algorithms will compile to small and fast code. Based of the strategies used, it seems to be aimed primarily at the ARM family of processors. I know from past experience that the ARM instruction set encourages heavy use of 32-bit shifts and integer operations, discourages branches and nearly all instructions execute at the same rate. Memory is plentiful and fast, so large lookup tables are encouraged. ARM instructions have three operands: two inputs and an output, so equvalent code  for the TMS9900 would, if nothing else, add a lot of MOV instructions. This model is a really bad match for the TMS9900, where all 32-bit operations are costly, shifts are slow, branches are fast, memory is tiny and slow, and there is a lot of variation in instruction speed. Ultimately, the resulting code would be huge and slow, even after the word size issues were resolved. I gave up and moved on.

After some research, I came across avr-libc. This is is a libcc replacement for the Atmel family of processors. These are low power 8 and 16 bit processors intended solely for embedded applications. They are most commonly found in Arduino development boards, and typically do not use an operating system. The instruction set is limited, and shares many aspects with the TMS9900. There is also a lot of documentation for the library and best practice guides. This seems encouraging.

Wednesday, January 2, 2013

I was reading some GCC bug reports and I found a more correct way to build libgcc:

  $ mkdir build
  $ cd build
  $ ../configure --prefix /home/eric/dev/tios/toolchain --target=tms9900 --enable-languages=c
  $ make all-gcc all-target-libgcc
  $ make install

This is way easier and a lot more convenient than the directions I had before. I think that as long as I'm doing libraries, I'll try libiberty and see what happens.
So now I've excised all the fake PC stuff, and I feel much better about the situation.

I've been looking through the GCC internal documentation looking for additional features I may have missed which would help. Unfortunately, I didn't notice anything I haven't tried before. There were a few macros that caught my eye, but sadly are of no use for this project.

SUBREG_PROMOTED_UNSIGNED_P
SUBREG_PROMOTED_UNSIGNED_SET
SUBREG_PROMOTED_VAR
These macros defines how byte values are converted to word values when stored in registers, which sound like it would be perfect for the crazy byte representation the TMS9900 uses. Unfortunately, this isn't useful. These macros can only select between signed and unsigned extension. Regardless of the selection made, GCC assumes that the byte value can be retrieved from the least significant 8 bits of the register. This is no good.

HARD_REGNO_MODE_OK
MODES_TIEABLE_P
CANNOT_CHANGE_MODE_CLASS
These macros deal with which registers are used to store values, and how values can be converted between data types. Again, this sounds great for dealing with byte values. Again, they are not useful. GCC only uses these for values larger than a word. In our case, values larger than 16 bits. GCC reasonably assumes that any register which can be used to store word values can be used to store values smaller than a word. It also assumes that byte values are always stored in the least significant bits of any register used.

So, having gone through all the documentation again, the only way to have a proper GCC port is to follow the path we ultimately took: make changes to the internals of the compiler and force mode to changes be preserved until instruction generation. This is fine, but there are a lot of places where these mode changes can happen. We'll likely be dealing with faulty byte-to-word conversions for a long time as user code explores the edge cases. Ick.

Nothing left to do now but track down the reported bugs.

Tuesday, January 1, 2013

I'm about halfway through removing the fake PC register. Right now, there's a new "*rt" instruction using the UNSPEC framework to emit "b *r11".

The UNSPEC functions are intended to handle instructions which are otherwise undescribable using normal RTX expressions. This is perfect for the TMS9900 return sequence, since we can't use the normal return instruction. The reason for that is if a return instruction is defined, we cannot apply a function epilogue since that is implemented in the place where a normal return instruction would live. Also, we need a specific label for any type of explicit jump, which won't work for a return.

The fake PC register has been a worrying point for a long time, and we've been burned by it before. Removing it will eliminate all possibility for further unexpected messes.

All this is great, but now I have to remove all the hackery I used earlier to insert the fake PC register. That involves register definitions, location of the first pseudoregister, a special case for the MOV instruction as well as other special cases in the register allocation directives.

That's quite a bit of work, and it's late. We'll try that tomorrow.