Thursday, September 26, 2013

Moving on, there is a missing typecast

unsigned char Round;
#define VDPWD *((volatile char*)0x8C00)
void modulo()
{
  VDPWD = '0' + Round % 10;
}

This compiles to:

modulo
    movb @Round, r1    * get the variable into R1 MSB
    srl  r1, 8         * move to LSB
    mov  r1, r2        * copy to R2
    clr  r1            * zero R1 (32-bit value is now >000000xx)
    li   r3, >A        * load R3 with 10 (divisor)
    div  r3, r1        * do the division (R1 has dividend, R2 has remainder)
    mov  r2, r1        * copy R2 to R1 - however, remainder was 16 bit! We are about to treat it like 8-bit.
* Missing "swpb r1" here
    ai   r1, >3000     * add '0' as an 8-bit byte value - this is wrong <--- br="">    movb r1, @>8C00    * move the result to the video chip

Looking through the RTL, we find this sequence in 131r.initvals:

(insn 9 8 10 3 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
        (plus:QI (subreg:QI (reg:HI 25) 1)
            (const_int 48 [0x30]))) -1 (nil))

In assembly, this would be:

swpb r1  * <-- ah="" br="" ha="" instruction="" is="" missing="" our="" this="">ai r1, >3000

Somewhere in step 172, register allocation, the type conversion gets lost

(insn 9 8 11 2 tursi5.c:6 (set (reg:QI 1 r1 [orig:21 D.1197 ] [21])
        (plus:QI (reg:QI 1 r1 [orig:21 D.1197 ] [21])
            (const_int 48 [0x30]))) 59 {addqi3} (nil))

In assembly, this would just be:

ai r1, >3000

Poop, we've found yet another case where GCC assumes that no effort is required to switch between 8-bit and 16-bit values. In order to fix this, we're going to have to dig deep into the guts.

Wednesday, September 25, 2013

I've been busy getting up to speed with a new job lately, but I finally got some time to do some TI stuff.

While I was away, Tursi found a few bug in GCC that need to be addressed. These range from simple fixes to potentially long and painful sessions tracking down where a bytewise operation gets broken. Most of these were submitted by Tursi on the AtariAge forum, so thank you for finding these.

Let's get cracking!

First up is an easy fix for jeq/jlt and jeq/jgt pairs. There are no instructions to express "jump if arithmetically greater/lesser or equal to" like there are for logical comparisons. Those would be JLE and JHE. Instead, we must chain two conditional jumps together to get the same result. The conditional flags are not affected by any of the jump instructions, so we won't get weird behavior by doing this. I made an effort to encourage GCC to not use these combinations when possible, but sometimes there's just no avoiding the situation.

For simplicity's sake, I'll only refer to the jeq/jlt pair here. The jeq/jgt pair has the same issue, just with a different test.

I originally had arbitrarily chosen to test for equality first then for a lesser value. Everything worked, so I moved on to other issues and didn't think about it. However Tursi recommended doing the lesser test first. The idea here is that the less-then test will match a larger set of possible conditions than the equality test, so by doing that one first, the second test may not be needed and performance is increased.

Next up, we're missing an ANDI instruction in the compilation of this snippet:

extern int byte2hex[];
void fast_hex_out(int x) {
        unsigned char z=(x&0xff);
        unsigned int dat = byte2hex[z];
        VDPWD = dat>>8;
        VDPWD = dat&0xff;
}

This compiles to:

fast_hex_out
* Should be "andi r1, >00FF" here
        a    r1, r1
        mov  @byte2hex(r1), r2
        mov  r2, r3
        li   r1, >8C00
        movb r3, *r1
        swpb r2
        movb r2, *r1
        b    *r11

So, after digging into the RTL, we see that at one point, GCC intended to put that instruction there, but it got stripped out.

From step 182r.peephole2:
(insn 38 37 10 2 tursi4.c:15 (and (reg:HI 1 r1 [30])
        (const_int 255 [0xff])) -1 (nil))

Wait a second... THIS INSTRUCTION DOES NOT SET ANYTHING! IT'S TOTALLY USELESS! Of course it's being deleted.
This is the result of a bad peephole. GCC would never do anything this dumb on its own. Backing up a bit, we find the pre-peephole instructions:

From 172r.ira:
(insn 7 4 9 2 tursi4.c:15 (set (reg:QI 1 r1 [orig:27 x+1 ] [27])
        (subreg:QI (reg:HI 1 r1) 1)) 73 {movqi} (nil))

(insn 9 7 10 2 tursi4.c:15 (set (reg:HI 1 r1 [30])
        (zero_extend:HI (reg:QI 1 r1 [orig:27 x+1 ] [27]))) 75 {zero_extendqihi2} (nil))

This translates to:
   (char)r1 = (char)((int)r1)
   (int)r1 = (int)((char)r1)

With all this in mind, I found the broken peephole optimization. When I was looking at it, the problem was obvious. I feel really silly right now.

Thursday, May 2, 2013

GCC 4.4.0 patch 1.8

Changes this version:

Fixed R11 restoration in epilogue being dropped by DCE
Added support for named sections
Removed support for zeroing memory bytes, was buggy with some devices

Download:
gcc-4.4.0-tms9900-1.8-patch.tar.gz

Binutils 2.19.1 patch 1.5

Changes this version:

Added more informative syntax error messages
Fixed values like ">6000" in strings being mangled
Confirm support for named sections

Download:
binutils-2.19.1-tms9900-1.5-patch.tar.gz

Sunday, April 28, 2013

I was looking into the problem Tursi reported where the link register is not properly restored. The issue here is that a few earlier improvements are combining to cause this. An earlier change was to merge modifications of the stack pointer in an effort to improve performance. Another one was to make R11 be a volatile register, whose value will be destroyed after a function call. The problem here is that since the example program uses stack space, but only needs to restore R11, an instruction like "mov *r10, r11" was issued. The dead code elimination pass saw this as a useless instruction, since R11 seemed to have no further use, and the stack pointer in R10 was not changed.

Here's what I'm talking about:

          ...
        bl   @puts
        mov  *r10, r11  # Instruction deemed unnecessary and deleted by DCE pass
        ai   r10, >6
        b    *r11

The fix here was to modify the invocation of the UNSPEC_RETURN instruction to use R11 as an input. This causes the dead code elimination pass to see that "mov *r10, r11" serves a useful purpose and is not to be removed. One bug down...

In other news, I've confirmed that the -ffunction-sections and -fdata-sections options work, as well as the section attribute. All this confirms that named section suppport is working as intended. The elf2card and elf2ea5 tools won't do anything with them, but I assume that linker files will be used to combine them into the standard .text .data and .bss sections.

The final bug is clearing a memory byte. Tursi is working on hardware which is uses write-only registers. The problem for him is that code like:
    *(volatile char*)0x9c02 = 0

results in an instruction like:
    sb @>9c02, @9c02

If the destination lies in RAM, this is fine. But write-only hardware could potentially return different values for each of the two memory fetches. In this case, it would be more appropriate to use something like this:
    clr r0
    movb r0, @>9c02

This is the same number of code bytes (6), and only a few cycles slower. Not too bad. Fixing this is really easy too. There was a mode for that instruction to specifically permit directly setting memory to zero. By removing that permission, everything works as intended.

Sunday, April 14, 2013

I've been away from TI stuff for a long time, but I can some work done today.

On AtariAge, Tursi noted that strings like ">6000" were mysteriously being converted to "$6000". That was due to fairly dumb text replacement code in GAS. Since GAS doesn't understand hex constants like >6000, and ">" is treated as a logical operator, I needed to transform the incoming text before it gets parsed for assembly. Unfortunately, I didn't make sure these constants in text strings were not affected.

So that's now fixed.

I started looking into named sections, and it turns out that I don't really need to do any work here. GAS already has support for ".section" and the GCC support just had to be uncommented in the config file. Handy!

Friday, February 15, 2013

OK, the changes to GAS parsing is done. I think this should make it easier to find problems. So here's a test file with lots of errors:

    eric@compaq:~/dev/tios/src/gas_test$ cat gastest.asm
        mov r1
        mov r1+,r2
        mov r1,r2+
        inv r1, r2
        inv r

And here's the result from the new parsing code:

    eric@compaq:~/dev/tios/src/gas_test$ tms9900-gas gastest.asm
    gastest.asm: Assembler messages:
    gastest.asm:1: Error: unexpected end of line
    gastest.asm:2: Error: unexpected character at "+,r2"
    gastest.asm:3: Error: unexpected character at "+"
    gastest.asm:4: Error: unexpected character at ",r2"
    Invalid register starting at "r"
    gastest.asm:5: Error: missing argument

Could this be better? Maybe, but it's a lot better than before. Time to get back to floating point code.

Saturday, February 2, 2013

I've been slacking off keeping this up to date. Lately I've been working on floating point support for GCC. I've got float-to-int and int-to-float working, and I'm trying to get addition working.

While working on that I found more annoying error messages in GAS. This line:

    mov r10+, r11

gives this error:

    test.asm:381: Error: missing comma separator

This is not helpful, especially since there is a comma in the expected location. What's really wrong is the missing "*" character. Unfortunately, We cannot make GAS smart enough to recognize that. We could look for junk after the last expected character, but we can't do that either since the TI format allows for junk at the end of the line. We could look for white space and do something smart, but GAS squeezes that out before our code is called.

Unfortunately, the best we can do is flag the error with as much information as we can. This is more descriptive and shows what the real problem is.

    test.asm:381: Error: unexpected character at "+,r11"

Another problem I have, but cannot fix without violating TI comment rules, is if we get lines like this:

    mov r11, r10+
    inv r1, r2, r3

The trailing invalid characters are ignored, no error will be emitted, and the assembled instructions will actually be this:

    mov r11, r10
    inv r1

This is totally unexpected behaviour, and confusion and anger may follow.

Wednesday, January 16, 2013

GCC 4.4.0 patch 1.7

Changes this version:

Libgcc now built for TMS9900
Implemented optimized assembly functions in libgcc for:
  count leading zero bits
  count trailing zero bits
  find index of least significant set bit
  count set bits
  calculate parity bit
  signed and unsigned 32-bit division and modulus
Fixed 32-bit multiplies, was only doing unsigned multiply
Fixed 32-bit negation, was emitting invalid NEG instruction
Removed fake PC register (Yay!)
New build instructions to make libgcc
Fixed function prologue and epilogue, was saving R11 unnecessarily
Optimized function epilogue, saving a few cycles
Enforced correct use of R11 register, was causing randomly broken code

Download:
gcc-4.4.0-tms9900-1.7-patch.tar.gz

Sunday, January 13, 2013

So, I'm back to debugging what went wrong with Rush Hour. It turns out that in the show_wall function, R11 is being used without being saved first. When the function returns, execution goes off into the woods and we're hosed.

000007f0 :
     7f0:       02 01 46 00     li r1, >4600
     7f4:       d8 01 8c 02     movb r1, @>8C02
     7f8:       02 0b 40 b0     li r11, >40B0
# R11 used before being saved
     7fc:       d8 0b 8c 02     movb r11, @>8C02
     800:      06 cb               swpb r11
     802:      d8 0b 8c 00     movb r11, @>8C00
     806:      02 01 47 00     li r1, >4700

The problem seems to be that for some reason df_regs_ever_live_p() does not reflect the usage of R11 so we don't know that we should save it.

The sequence which uses R11 was added as part of a peephole replacement, and all other registers allocated this way are not marked as used either. Apparently, GCC tries to allocate an unused register in preference to reusing one previously allocated. Interesting.

I started looking into the peephole handling code, but realized that R11 was being allocated due being designated as volatile, and all other volatile register were already used. After thinking about this a bit, I can fix this much more easily by just changing R11 to be preserved across function calls.

Ugh, still have problems. So now GCC thinks R11 is preserved, and tries to put values there to be preserved across a function call. Obviously, those values are destroyed after the return. So R11 has instead been marked as a special-purpose register and will not be considered for allocation. I don't like this, since R11 could legitimately be used if only we knew when to preserve it. At least everything works now.

I also looked into the reported stack size problem, where stack values were overwriting memory used to save register values. The problem there was that there were several places in tms9900.c which calculated which registers were to be saved, and they did not all agree. This discrepancy resulted in space for R11 being included for the prologue and epilogue, but not included for the stack variables. The register value stored in the overlap region got destroyed as a result.

The work that was done for optimizing the prologue and epilogue also included merging all the work to determine which registers to save into one function. This improved code clarity, and removed the ambiguity which lead to this problem.

As a side note, there is a libiberty library in binutils, but it's not especially useful. In the bundled README file, it is stated that this is a collection of random useful functions. It also states that these routines are highly system-dependant. It's also filled with a bunch of routines which don't seem to be very useful on a target system. I guess I still need to finish my libc implementation.

So at this point, all the bug reports have been addressed, and all my milestone goals have been reached. I suppose this is a good time to put together another release.

Saturday, January 12, 2013

I was thinking about alloc() lately, so I spent some time investigating how I could pull it off. The short answer is: I can't. The longer answer is: I shouldn't try.

Alloca is fabled for having buggy implementations with lots of tricky edge cases which are not handled well. Beyond this, there's the fact that we're tight on stack space as it is. Opening that space up for potential abuse is just asking for trouble.

I suppose I could come up with a scheme to pull it off anyway, but the stack frame I'm currently using is not friendly for such a thing. I'd have to either make an alternate stack frame for functions which use alloca or simply use another one altogether. Since there are lingering problems with malformed stacks, I don't think it would be wise to make things any more complicated than they already are. At least for now.

That being said, I was looking at the prologue and  epilogue code currently in use. There are opportunities for optimization I should look at. The basic idea is that we increment the stack pointer as we restore registers. If local stack is used, there is a final adjustment to correct for that.

There is a four cycle cost for each of those increments. The current code attempts to fold the final increment into the adjustment for the final stack frame adjustment, but only actually does this in rare circumstances.

There was also a pointless "ai r10, 0" instruction emitted in the prologue if no stack usage was used.

That's all easy to fix, but I'm wondering if it's better to not do the folding for small stack sizes. Let's consider the case where one register is saved, and we have two bytes for local usage.

Case 1: Fold increment into stack adjustment
mov *r10, r11  # 4+14+4=22 clocks
ai r10, 4           # 4+14+4=22 clocks
Total: 44 clocks, 6 bytes

Case 2: Preserve increment
mov *r10+, r11  # 4+14+8=26 clocks
inct r10               # 4+10=14 clocks
Total: 40 clocks, 4 bytes

OK, that's a pretty clear win for not folding. I'll get on that.

Thursday, January 10, 2013

More disappointment. I found the disassembly code in MESS and isolated it in a test program. It seems to work just fine. This really isn't a suprise, since that code's been around for a long time, and disassembly is pretty straightforward for any machine.

So, the question remains: what the heck is going on here? Haven't a clue.

Rather then invest time debugging MESS, I'll just write that off and continue using code and disassembly analysis for my work.

Tuesday, January 8, 2013

So I was trying to run a test build of an older version of Rush Hour and decided it was time to use the debug facilities of MESS. So far, I've done all my TI work without using any debuggers and I thought it might be nice to cheat a little. Sadly, MESS is acting acting a bit weird.

I set some breakpoints at the start of the cartridge, and they didn't take. After fighting with that for a while, I tried running a simple test cart I was using earlier. That cart eventually hits a "jmp $" instruction and stops. I figured I could at least confirm that the cart is being loaded into the right location.

I think I found the problem. The debugger seems to be misinterpreting the instructions in the disassembly window. Maybe this is causing the breakpoints to be ignored since MESS thinks the address is in the middle of a multiword instruction. Below I've compared the same chunk of code as seen in the debugger window and the objdump disassembly.

From MESS (incorrect):
60C2: 10FF                     socb *r0, *r12+
60C4: 064A C68B           szc r6, @>8bc6(r8)
60C8: 06A0                     a r6, r0
60CA: 6768 C0C1 0702  s @>c1c0(r7), @>0207(r1)

From objdump (correct):
60c2: 10ff              jmp -2
60c4: 064a            dect r10
60c6: c68b            mov r11, *r10
60c8: 06a0 6768   bl @>6768
60cc: c0c1            mov r1, r3

So it looks like I should take a look at the MESS sources and see what's going on here. Most likely, there's something misconfigured somewhere.

Sunday, January 6, 2013

After looking at avr-libc, I've decided that it won't work either. There are a lot of assembly modules in there, as well as several modules which are hardware-dependant, all of which would have to be removed for the library to be of any use.

I could make a new library using all the relevant parts of avr-libc, but it uses a modified BSD license, which I would have to maintain for any extensions I made later. I don't have a problem with the BSD license, and it's probably the right one to use, but I don't want to deal with legalisms at this point.

Months earlier, I started working on my own libc implementation. At this point, it looks like my best bet is to just complete that project and call it a day.

By the way, there is some nice documentation on the libc functions at http://nongnu.org/avr-libc/user-manual/modules.html. That could be very handy for coverage checks.

Saturday, January 5, 2013

Over the last few days I've been trying to get a libc library working for the TMS9900.

I started with libiberty, but got stuck trying to compile regexp.c. There were a lot of missing header files, and I wasn't able to find functional replacements. I gave up on it because even if I got past regexp, there were a lot of OS-related functions which would need to be removed. These are functions like file IO, syscalls, threading and timezones. Actually, after all that was removed, there wouldn't be much left.

So next I tried uclibc. This is a popular replacement for the standard glibc library, and is often used for embedded applications. There are a lot of configuration options, and many categories of functions can be simply disabled by using its configuration menu. This seemed like uclibc was perfect for our needs. I got as far as adding the TMS9900 as a valid target and running some test builds. Again I found problems. All of uclibc assumes 32-bit words. It also assumes that certain classes of algorithms will compile to small and fast code. Based of the strategies used, it seems to be aimed primarily at the ARM family of processors. I know from past experience that the ARM instruction set encourages heavy use of 32-bit shifts and integer operations, discourages branches and nearly all instructions execute at the same rate. Memory is plentiful and fast, so large lookup tables are encouraged. ARM instructions have three operands: two inputs and an output, so equvalent code  for the TMS9900 would, if nothing else, add a lot of MOV instructions. This model is a really bad match for the TMS9900, where all 32-bit operations are costly, shifts are slow, branches are fast, memory is tiny and slow, and there is a lot of variation in instruction speed. Ultimately, the resulting code would be huge and slow, even after the word size issues were resolved. I gave up and moved on.

After some research, I came across avr-libc. This is is a libcc replacement for the Atmel family of processors. These are low power 8 and 16 bit processors intended solely for embedded applications. They are most commonly found in Arduino development boards, and typically do not use an operating system. The instruction set is limited, and shares many aspects with the TMS9900. There is also a lot of documentation for the library and best practice guides. This seems encouraging.

Wednesday, January 2, 2013

I was reading some GCC bug reports and I found a more correct way to build libgcc:

  $ mkdir build
  $ cd build
  $ ../configure --prefix /home/eric/dev/tios/toolchain --target=tms9900 --enable-languages=c
  $ make all-gcc all-target-libgcc
  $ make install

This is way easier and a lot more convenient than the directions I had before. I think that as long as I'm doing libraries, I'll try libiberty and see what happens.
So now I've excised all the fake PC stuff, and I feel much better about the situation.

I've been looking through the GCC internal documentation looking for additional features I may have missed which would help. Unfortunately, I didn't notice anything I haven't tried before. There were a few macros that caught my eye, but sadly are of no use for this project.

SUBREG_PROMOTED_UNSIGNED_P
SUBREG_PROMOTED_UNSIGNED_SET
SUBREG_PROMOTED_VAR
These macros defines how byte values are converted to word values when stored in registers, which sound like it would be perfect for the crazy byte representation the TMS9900 uses. Unfortunately, this isn't useful. These macros can only select between signed and unsigned extension. Regardless of the selection made, GCC assumes that the byte value can be retrieved from the least significant 8 bits of the register. This is no good.

HARD_REGNO_MODE_OK
MODES_TIEABLE_P
CANNOT_CHANGE_MODE_CLASS
These macros deal with which registers are used to store values, and how values can be converted between data types. Again, this sounds great for dealing with byte values. Again, they are not useful. GCC only uses these for values larger than a word. In our case, values larger than 16 bits. GCC reasonably assumes that any register which can be used to store word values can be used to store values smaller than a word. It also assumes that byte values are always stored in the least significant bits of any register used.

So, having gone through all the documentation again, the only way to have a proper GCC port is to follow the path we ultimately took: make changes to the internals of the compiler and force mode to changes be preserved until instruction generation. This is fine, but there are a lot of places where these mode changes can happen. We'll likely be dealing with faulty byte-to-word conversions for a long time as user code explores the edge cases. Ick.

Nothing left to do now but track down the reported bugs.

Tuesday, January 1, 2013

I'm about halfway through removing the fake PC register. Right now, there's a new "*rt" instruction using the UNSPEC framework to emit "b *r11".

The UNSPEC functions are intended to handle instructions which are otherwise undescribable using normal RTX expressions. This is perfect for the TMS9900 return sequence, since we can't use the normal return instruction. The reason for that is if a return instruction is defined, we cannot apply a function epilogue since that is implemented in the place where a normal return instruction would live. Also, we need a specific label for any type of explicit jump, which won't work for a return.

The fake PC register has been a worrying point for a long time, and we've been burned by it before. Removing it will eliminate all possibility for further unexpected messes.

All this is great, but now I have to remove all the hackery I used earlier to insert the fake PC register. That involves register definitions, location of the first pseudoregister, a special case for the MOV instruction as well as other special cases in the register allocation directives.

That's quite a bit of work, and it's late. We'll try that tomorrow.