Tuesday, November 30, 2010

Added optimizations for (char)X = (int)X >> N
I need to do the complementary left-shift forms

Sunday, November 28, 2010

Found another problem. Byte AND immediate operations were operating on the low byte. This resulted in odd behavior, and took a while to track down. All fixed now.

Friday, November 26, 2010

Now that Thanksgiving is over, I can get back to TI stuff. I'm using the LIBC project as a larger test program for GCC. Good thing too.

I found and removed a byte compare form which used a constant. I'm not sure how that got there. Copy-paste error maybe?

I've noticed that there's a lot of emitted code like this for int-to-char conversions:
mov r1, r2
mov r2, r1
swpb r1

And then R2 is never used again. I can do better than that, so an optimization has been added to handle this case. Now we get:
swpb r1

Much better. So now it's time for byte compare optimizations. I want to change:
li r2, >1200
cb r1, r2
jh LABEL
to:
ci r1, >12FF
jh LABEL

Wednesday, November 17, 2010

This is what I wrote for the patch release at AtariAge:

Well, it's patch time again.

Here's what made it into this release:

Bintils
Allow TI-style quotes ('example')
Allow two-byte character constants for immediate expressions (li r0, 'ab')
Fix a BFD Makefile bug which prevented clean compilation

GCC
Fix tms9900_output_ascii, was emitting invalid code when non-text characters were used
Divide and modulus operations now merged when possible
Fix data symbol declarations, now TI compliant
Fix "+=4" form, was missing comma in emitted code
Fix alignment of code, in some cases it was possible to misalign code by using odd-length string constants
Fix stack frame load/save differences, was using different locations between function prologue and epilogue in some cases (Thanks Tursi!)
Save return pointer at bottom of stack. This may help for later stack trace construction
Add optimizations for compare-and-branch operations with 16-bit values against -2, -1, 0, 1, and 2.

Right now I only have optimizations for equality tests with -2, -1, 1, and 2 done. To get inequality tests, I need to convince GCC to emit tests against the overflow flag. GCC has no concept of this kind of instruction, so I need to play with that a bit more.

The other weakness is the divide and modulus instructions. I haven't been able to convince GCC to use convenient registers for the source and destination. This means that in some cases, I need to insert additional MOVs which really shouldn't be necessary. More playing around required here too, I suppose.

I've addressed all the problems Tursi found earlier, plus a few others. Unfortunately, libiberty is not on that list. Since a lot of those routines are OS-specific, and since there is no POSIX-like interface for the TI, these functions are of limited use right now. In the future that might change. (hint, hint)

So here's the build procedure for everything. I've made sure these have been tested several times. There should be no problems following them.


Patching the original files:
$ cd binutils-2.19.1
$ patch -p1 < binutils-2.19.1-tms9900-1.1.patch

$ cd gcc-4.4.0
$ patch -p1 < gcc-4.4.0-tms9900-1.2.patch

Building binutils
$ ./configure --target tms9900 --prefix INSTALLDIR
$ make all
$ make install

Building GCC
$ ./configure --target=tms9900 --prefix=INSTALLDIR --enable-languages=c
$ make all-gcc
$ make install

Notice that GCC uses equals after the options, while binutils does not. Kind of annoying and easy to mix up. At this point, you will have all the GNU compilation tools ready to use for TI work. The binary format is ELF, since that stores the extra data needed by the linker and other tools. In earlier posts I've attached code to convert from ELF to TI-cart format. I've also got prototype converters for EA5 and EA3 formats too, but I haven't tested them very much.

When compiling with GCC, I recommend using the -O2 and/or -Os options to reduce the overall code size. Using the default options can result in extra wordy code with unnecessary or duplicate instructions.

There's still quite a bit of work left to do for GCC, so there will be more patches coming. I need to fill out the missing math support for 32-bit values, make sure signed multiply and divide work, and the other stuff mentioned above. I especially want to add more optimizations to the compiled output, but that will come as I get more familiar with what instruction patterns GCC likes to use.

Sunday, November 14, 2010

So I've make scripts to automate making patches. Not very fast, but that's OK for now.

I found and fixed the build problem in binutils. It turns out that there was a missing recipie for elf32-tms9900.lo in bfd/Makefile.in. Fixing this problem allows binutils to build without any problems.

For future reference:

patching:
$ cd {path_to_original_files}
$ patch -p1 < patchfile

binutils:
$ ./configure --target tms9900 --prefix /home/eric/dev/tios/toolchain/WORKSPACE/emw
$ make all
$ make install


GCC:
$ ./configure --prefix /home/eric/dev/tios/toolchain --target=tms9900 --enable-languages=c
$ make all-gcc
$ make install

Tomorrow, I'll confirm that the build tools work as advertised and release.
The INC-type comprisons have been made and tested. I looked into getting JNO working, and found a good template in the Sparc archetecture. I decided against doing anything about that right now. I need to get a patch out now. New features can wait a bit.

I decided to run a test to make sure that everything was working before making patches, and I'm glad I did. Apparently, I made changes to GAS a while ago to allow TI-style constants. In the process, I broke processing for all other types of constants. The resulting binaries were unusable, and caused crashes and resets. Thst's been fixed, and all my testing looks OK, so I'm off to make patches.

More problems. The scripts I thought I had to make patches, are not complete. Even worse, I don't remember what the missing pieces were. That means I need to start over from scratch. Poop.

I also need to keep the blog site up-to-date and advertise it in an AtariAge sig. Not really necessary, but someone might be interested.

Thursday, November 11, 2010

No, I made the same mistake agin. I can't use INC and friends for comparison due to the possible overflow. I can only use those forms for equality tests. I should be able to do the other tests if I can use the JNO instruction. I don't have the motivation to add that right now.

Wednesday, November 10, 2010

Horray! All optimizations have been implemented and tested. I've also added the quicker comparisons for -2,-1,1,2.

So the objective now is to test this with a larger program, and make sure the generated code runs properly. Should be fine, though. Once that's complete, I need to make patches and a new build procedure.

By the way, use -Os to optimize for size. Handy for the TI.

Tuesday, November 9, 2010

You know, after implementing all the optimizations listed above, I realized something: neg(b1000...) == b1000...

This means I can't use NEG for the comparison test. Poop. I guess I'll have to lose the ~1 clock bonus of NEG. On the up side, that means a LOT fewer peepholes to test (even though it was all written and tested...)

So new plan.

Use ABS for all tests with dead registers, except for A>=0 tests. I can still use INV for that one.

Monday, November 8, 2010

Although replacing MOV with INV or NEG is faster for that single instruction, what is the impact for the overall sequence? Am I just getting wrapped up in all this for no real gains? Time to double check.

I need some shorthand for the compound jumps, so here are the cycle timings for each possible exit from a compund jump:
jlt: 4+(8..10) -> 14 = 14
jeq: 4+(8..10) -> 12+14 = 26
none: 12+12 = 24

Min and max timings for some instructions:
mov A, A -> 4+14 + (1..12)*2 = 20..42
inv A -> 4+10 + (1..12) = 15..26
neg A -> 4+12 + (1..12) = 17..28
abs A -> 4+(12..14) + (1..12) = 17..30

A<0 : mov A, A; jlt : (20..42)+(12..14) = (32..56)
A<=0 : mov A, A; jlt; jeq : (20..42)+(14..26) = (34..68)
A==0 : mov A, A; jeq : (20..42)+(12..14) = (32..56)
A!=0 : mov A, A; jeq : (20..42)+(12..14) = (32..56)
A>0 : mov A, A; jeq : (20..42)+(12..14) = (32..56)
A>=0 : mov A, A; jlt; jeq : (20..42)+(14..26) = (34..68)

Proposed optimizaitons

A<0 : inv A; jgt; jeq : (15..26)+(14..26) = (29..52)
A<=0 : neg A; jgt; jeq : (17..28)+(14..26) = (31..54)
A==0 : neg A; jeq : (17..28)+(12..14) = (29..42)
A!=0 : neg A; jeq : (17..28)+(12..14) = (29..42)
A>0 : neg A; jlt : (17..28)+(12..14) = (29..42)
A>=0 : inv A; jlt : (15..26)+(12..14) = (27..40)

A<0 : abs A; jlt : (17..30)+(12..14) = (29..44)
A<0 : neg A; jgt : (17..28)+(12..14) = (29..42)

Saturday, November 6, 2010

After spellunking for months trying to get REG_DEAD notes into the compiled RTL, it turns out that they are not necessary anymore. Apparently this changes somewhere in the 3.X versions of GCC (I want to say 3.5, but I'm not sure about that. I read about this at work earlier, and I don't remember the details right now. Not really important now.)

I read a lot of posts from the GCC developers, and apparently, I shouldn't need to modify anything beyond the machine-dependant code to achieve everything I'm looking for. This is really good to know, since that should help reduce the time spent researching the GCC front end. Although, I'm kinda glad I did that work now.

So I'm going to implement the optimizations listed in September as peepholes. Should be pretty straightforward, really.

Repeating the optimization list from above:

Baseline:
mov Rx, Rx (14 cycles)

These all assume compared register will be dead
Compare to 2: dect G (10 cycles)
Compare to 1: dec G (10 cycles)
Compare to -1: inc G (10 cycles)
Compare to -2: inct G (10 cycles)

A<0 -> inv A; A>=0 (10 cycles) lt
A<=0 -> neg A; A>=0 (12 cycles) le
A==0 -> neg A; A==0 (12 cycles) eq x
A!=0 -> neg A; A!=0 (12 cycles) ne x
A>0 -> neg A; A<0 (12 cycles) gt
A>=0 -> inv A; A<0 (10 cycles) ge

lt (<)
le (<=)
eq (==)
ne (!=)
gt (>)
ge (>=)
ltu (< unsigned)
leu (<= unsigned)
gtu (> unsigned)
geu (>= unsigned)

I might not use the C pattern though..
Assume instructions are in slow mem, registers are fast

inct r1; inct r2 (4+10+1 + 4+10+1 = 30 cycles) %100
inc r1; inc r2 (4+10+1 + 4+10+1 = 30 cycles) %100

c *r1+, *r2+ (4+14 + 8+4 + 8+4 = 42 cycles) %140
cb *r1+, *r2+ (4+14 + 6+4 + 6+4 = 38 cycles) %126

So this form saves two bytes, but is about a third slower, and is difficult to induce. I think I'll pass on this.

Tuesday, November 2, 2010

That last frame problem has been solved. I had written that the frame pointer had a role in determining whether R11 should be saved. That was a mistake, one has nothing to do with the other. This was seen when optimization was off because the frame pointer is aways used at this optimization level.

I don't know if this was causing problems yet, but frame_pointer_needed and df_ever_alive() were being factored into the R11 save calculation as well. These are always set for R11 since it's the return pointer, but it only really needs to be saved if the function is not a leaf.