Thursday, September 26, 2013

Moving on, there is a missing typecast

unsigned char Round;
#define VDPWD *((volatile char*)0x8C00)
void modulo()
{
  VDPWD = '0' + Round % 10;
}

This compiles to:

modulo
    movb @Round, r1    * get the variable into R1 MSB
    srl  r1, 8         * move to LSB
    mov  r1, r2        * copy to R2
    clr  r1            * zero R1 (32-bit value is now >000000xx)
    li   r3, >A        * load R3 with 10 (divisor)
    div  r3, r1        * do the division (R1 has dividend, R2 has remainder)
    mov  r2, r1        * copy R2 to R1 - however, remainder was 16 bit! We are about to treat it like 8-bit.
* Missing "swpb r1" here
    ai   r1, >3000     * add '0' as an 8-bit byte value - this is wrong <--- br="">    movb r1, @>8C00    * move the result to the video chip

Looking through the RTL, we find this sequence in 131r.initvals:

(insn 9 8 10 3 tursi5.c:6 (set (reg:QI 21 [ D.1197 ])
        (plus:QI (subreg:QI (reg:HI 25) 1)
            (const_int 48 [0x30]))) -1 (nil))

In assembly, this would be:

swpb r1  * <-- ah="" br="" ha="" instruction="" is="" missing="" our="" this="">ai r1, >3000

Somewhere in step 172, register allocation, the type conversion gets lost

(insn 9 8 11 2 tursi5.c:6 (set (reg:QI 1 r1 [orig:21 D.1197 ] [21])
        (plus:QI (reg:QI 1 r1 [orig:21 D.1197 ] [21])
            (const_int 48 [0x30]))) 59 {addqi3} (nil))

In assembly, this would just be:

ai r1, >3000

Poop, we've found yet another case where GCC assumes that no effort is required to switch between 8-bit and 16-bit values. In order to fix this, we're going to have to dig deep into the guts.

Wednesday, September 25, 2013

I've been busy getting up to speed with a new job lately, but I finally got some time to do some TI stuff.

While I was away, Tursi found a few bug in GCC that need to be addressed. These range from simple fixes to potentially long and painful sessions tracking down where a bytewise operation gets broken. Most of these were submitted by Tursi on the AtariAge forum, so thank you for finding these.

Let's get cracking!

First up is an easy fix for jeq/jlt and jeq/jgt pairs. There are no instructions to express "jump if arithmetically greater/lesser or equal to" like there are for logical comparisons. Those would be JLE and JHE. Instead, we must chain two conditional jumps together to get the same result. The conditional flags are not affected by any of the jump instructions, so we won't get weird behavior by doing this. I made an effort to encourage GCC to not use these combinations when possible, but sometimes there's just no avoiding the situation.

For simplicity's sake, I'll only refer to the jeq/jlt pair here. The jeq/jgt pair has the same issue, just with a different test.

I originally had arbitrarily chosen to test for equality first then for a lesser value. Everything worked, so I moved on to other issues and didn't think about it. However Tursi recommended doing the lesser test first. The idea here is that the less-then test will match a larger set of possible conditions than the equality test, so by doing that one first, the second test may not be needed and performance is increased.

Next up, we're missing an ANDI instruction in the compilation of this snippet:

extern int byte2hex[];
void fast_hex_out(int x) {
        unsigned char z=(x&0xff);
        unsigned int dat = byte2hex[z];
        VDPWD = dat>>8;
        VDPWD = dat&0xff;
}

This compiles to:

fast_hex_out
* Should be "andi r1, >00FF" here
        a    r1, r1
        mov  @byte2hex(r1), r2
        mov  r2, r3
        li   r1, >8C00
        movb r3, *r1
        swpb r2
        movb r2, *r1
        b    *r11

So, after digging into the RTL, we see that at one point, GCC intended to put that instruction there, but it got stripped out.

From step 182r.peephole2:
(insn 38 37 10 2 tursi4.c:15 (and (reg:HI 1 r1 [30])
        (const_int 255 [0xff])) -1 (nil))

Wait a second... THIS INSTRUCTION DOES NOT SET ANYTHING! IT'S TOTALLY USELESS! Of course it's being deleted.
This is the result of a bad peephole. GCC would never do anything this dumb on its own. Backing up a bit, we find the pre-peephole instructions:

From 172r.ira:
(insn 7 4 9 2 tursi4.c:15 (set (reg:QI 1 r1 [orig:27 x+1 ] [27])
        (subreg:QI (reg:HI 1 r1) 1)) 73 {movqi} (nil))

(insn 9 7 10 2 tursi4.c:15 (set (reg:HI 1 r1 [30])
        (zero_extend:HI (reg:QI 1 r1 [orig:27 x+1 ] [27]))) 75 {zero_extendqihi2} (nil))

This translates to:
   (char)r1 = (char)((int)r1)
   (int)r1 = (int)((char)r1)

With all this in mind, I found the broken peephole optimization. When I was looking at it, the problem was obvious. I feel really silly right now.