Insomnia Labs: 2010

Wednesday, December 29, 2010

Just for the heck of it I decided to try compilation at -O1, mostly to see what would happen. What happened was a disaster. I used the in-development printf.c file as a test, and I got this:

printf.c: In function ‘printf’:
printf.c:60: internal compiler error: in df_ref_record, at df-scan.c:2799
Please submit a full bug report,
with preprocessed source if appropriate.
See for instructions.
EMW>> df_ref_record : GET_CODE = 41

If -Os is used, there is no problem. Odd.

The code 41 line was added to help debug this mess. What I've fallen into is that a CONCAT RTX is being used where a register RTX is expected. The assertion fails, and I get the error above. This is during step 172r.ira. I think a CONCAT is a construct where a value is spread over several registers, but these are only supposed to show up during initial RTX generation. This seems a little late for concats, based on what I've read. Need more research.

Tuesday, December 14, 2010

I've revisited GAS to improve the error handling, and we now get more helpful messages. The misleading "missing comma" error now only shows up when a comma is actually missing.

The 16-bit compare recipe had the same "compare with general location" problem that I earlier found with 8-bit compares. Now fixed.

Saturday, December 11, 2010

I've found another FAKE_* register which pops up in movb instructions while using -O0. I think I've fixed it, but I need more testing. I also found this problem:

Unrecognized expression: "@4(r8),$25"
/tmp/cc5D59MT.s: Assembler messages:
/tmp/cc5D59MT.s:91: Error: bad expression
/tmp/cc5D59MT.s:91: Error: missing comma seperator

Context:
inc @6(r8)
movb r1, r1
jeq L8
ci @4(r8), >25 <-- problem line
jne L9

The problem is that CI only takes a register as argument 1.

I also need to take a look at GAS, since the error message is misleading.

Thursday, December 9, 2010

Today I got printf using a real format string. Right now it only handles "%", "x" and "X", but that's not bad. Now that the format parser is in place, I can expand it using the other formats.

For now, I think I'll ignore printf's return value. It's supposed to return the number of bytes written, which isn't too bad, I just want to keep my job simple for a while first. Right now printf only takes up about 200 bytes, which isn't great, but that's from the compiler. Hand-optimization can improve that a bit. But that's for later.
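As a rough illustration of the format parser described above, here is a minimal sketch in C that handles only "%%", "%x" and "%X" and keeps a byte count. The names (put_char, put_hex, mini_format) and the buffer-based output are assumptions made for this example, not the actual LIBC code.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical helper: store one character if it fits, always count it. */
static size_t put_char(char *out, size_t pos, size_t cap, char c)
{
    if (pos + 1 < cap)
        out[pos] = c;
    return pos + 1;
}

/* Hypothetical helper: append a value in hex without leading zeros. */
static size_t put_hex(char *out, size_t pos, size_t cap, unsigned value, int upper)
{
    const char *digits = upper ? "0123456789ABCDEF" : "0123456789abcdef";
    char tmp[sizeof(unsigned) * 2];
    size_t n = 0;

    do {                                /* collect digits, lowest first */
        tmp[n++] = digits[value & 0xF];
        value >>= 4;
    } while (value != 0);

    while (n > 0)                       /* emit them highest first */
        pos = put_char(out, pos, cap, tmp[--n]);
    return pos;
}

/* Formats into out (capacity cap) and returns the number of bytes
   produced, not counting the terminator. Unknown conversions are
   skipped without consuming an argument, which is a simplification. */
size_t mini_format(char *out, size_t cap, const char *fmt, ...)
{
    va_list ap;
    size_t pos = 0;

    assert(out != 0 && cap > 0);
    va_start(ap, fmt);
    while (*fmt != '\0') {
        char c = *fmt++;
        if (c != '%') {
            pos = put_char(out, pos, cap, c);
            continue;
        }
        c = *fmt;
        if (c == '\0')
            break;                      /* lone '%' at the end of the string */
        fmt++;
        if (c == 'x' || c == 'X')
            pos = put_hex(out, pos, cap, va_arg(ap, unsigned), c == 'X');
        else if (c == '%')
            pos = put_char(out, pos, cap, '%');
    }
    va_end(ap);
    out[pos < cap ? pos : cap - 1] = '\0';
    return pos;
}
```

Returning the running count from every helper is one way to get the byte total almost for free once the return value is no longer ignored.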

More GCC stuff:
I found a bug in subhi3, it needs to handle arbitrary constants, like "ai r0, -5". That's been fixed.

Another bug was found in char-to-int casting. I get "sra r2, 8", then several dozen empty lines, then gibberish control codes. Not good. I'll fix that tomorrow.

It also looks like bit shifting is not in the list of GCC-recognized operations that modify the condition flags. Check and fix tomorrow.

Wednesday, December 8, 2010

At this point, I can't see anything else which would make for good optimizations. So it's back to LIBC for me. I'm sure as the size of that codebase grows, more opportunities will appear.

I've got a hex printing routine written in C, and it works great. Slightly bulkier than the assembly version I wrote earlier. Ultimately, I think I'll stick with the assembly routines. The C code is more like a real-world test than production code.

I've also confirmed that the variadic va_* functions work. Handy!

I've added a byte count to the cooked string output code. This will be used in printf.

Saturday, December 4, 2010

I found another optimization that's worth doing. During initialization of local byte arrays, this sequence is emitted:

                          Bytes  Clocks
li   r2, >30 * 256          4     12+4
movb r2, @2(r10)            4     14+4
li   r5, >78 * 256          4     12+4
movb r5, @3(r10)            4     14+4
                           ---    ----
                           16      68

I can squeeze out a few bytes by doing this instead:

                          Bytes  Clocks
li   r2, >30 * 256 + >78    4     12+4
movb r2, @2(r10)            4     14+4
swpb r2                     2     10
movb r2, @3(r10)            4     14+4
                           ---    ----
                           14      62

This is about a ten percent gain in time and space, which isn't too bad. Unfortunately, I can't figure out a way to implement this now. I may have to come back to this later.
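To make the shorter sequence easier to follow, here is a small C model of it. It assumes that movb stores the high byte of the source register and that swpb exchanges the two bytes of a register; the function names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Models "swpb": exchange the two bytes of a 16-bit register. */
static uint16_t swpb(uint16_t reg)
{
    return (uint16_t)((reg << 8) | (reg >> 8));
}

/* Models the source side of "movb reg, ...": the high byte is stored. */
static uint8_t movb_source(uint16_t reg)
{
    return (uint8_t)(reg >> 8);
}

/* Hypothetical model of the shorter sequence: one li, one swpb, two movb. */
void init_two_bytes(uint8_t *dst, uint8_t first, uint8_t second)
{
    uint16_t r2;

    assert(dst != 0);
    r2 = (uint16_t)(first * 256 + second);  /* li   r2, first * 256 + second */
    dst[0] = movb_source(r2);               /* movb r2, @2(r10) */
    r2 = swpb(r2);                          /* swpb r2 */
    dst[1] = movb_source(r2);               /* movb r2, @3(r10) */
}
```

With first = >30 and second = >78 this stores the same two bytes as the original four-instruction sequence.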

The initial implementation of the right-shift-and-cast operations only dealt with shift offsets greater than eight, but I realized I can be more general.

Here's a truth table for all possible right shifts and the equivalent general shift I need, including casting.

N Original pattern Shifted pattern Result Optimization
- ----------------- ----------------- -------- ----
0 01234567.89ABCDEF -> 01234567.89ABCDEF -> 89ABCDEF X<<8 -.
1 01234567.89ABCDEF -> x0123456.789ABCDE -> 789ABCDE X<<7 |
2 01234567.89ABCDEF -> xx012345.6789ABCD -> 6789ABCD X<<6 |
3 01234567.89ABCDEF -> xxx01234.56789ABC -> 56789ABC X<<5 | X<<(8-N)
4 01234567.89ABCDEF -> xxxx0123.456789AB -> 456789AB X<<4 |
5 01234567.89ABCDEF -> xxxxx012.3456789A -> 3456789A X<<3 |
6 01234567.89ABCDEF -> xxxxxx01.23456789 -> 23456789 X<<2 |
7 01234567.89ABCDEF -> xxxxxxx0.12345678 -> 12345678 X<<1 -'
8 01234567.89ABCDEF -> xxxxxxxx.01234567 -> 01234567 nop
9 01234567.89ABCDEF -> xxxxxxxx.x0123456 -> x0123456 X>>1 -.
A 01234567.89ABCDEF -> xxxxxxxx.xx012345 -> xx012345 X>>2 |
B 01234567.89ABCDEF -> xxxxxxxx.xxx01234 -> xxx01234 X>>3 |
C 01234567.89ABCDEF -> xxxxxxxx.xxxx0123 -> xxxx0123 X>>4 | X>>(N-8)
D 01234567.89ABCDEF -> xxxxxxxx.xxxxx012 -> xxxxx012 X>>5 |
E 01234567.89ABCDEF -> xxxxxxxx.xxxxxx01 -> xxxxxx01 X>>6 |
F 01234567.89ABCDEF -> xxxxxxxx.xxxxxxx0 -> xxxxxxx0 X>>7 -'
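
The table can be checked by brute force. The sketch below assumes that a byte value lives in the high byte of a 16-bit register and treats the "x" fill bits as zero (a logical shift); general_shift and the other names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the single shift from the "Optimization" column. */
static uint16_t general_shift(uint16_t x, unsigned n)
{
    assert(n <= 15);
    if (n < 8)
        return (uint16_t)(x << (8 - n));   /* X<<(8-N) */
    if (n == 8)
        return x;                          /* nop */
    return (uint16_t)(x >> (n - 8));       /* X>>(N-8) */
}

/* Assumption: the byte result is read from the register's high byte. */
static uint8_t high_byte(uint16_t reg)
{
    return (uint8_t)(reg >> 8);
}

/* Returns 1 when the table matches (char)(X >> N) for every X and N. */
int shift_table_holds(void)
{
    uint32_t x;
    unsigned n;

    for (n = 0; n <= 15; n++)
        for (x = 0; x <= 0xFFFF; x++)
            if (high_byte(general_shift((uint16_t)x, n)) != (uint8_t)(x >> n))
                return 0;
    return 1;
}
```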

Tuesday, November 30, 2010

Added optimizations for (char)X = (int)X >> N.
I need to do the complementary left-shift forms.

Sunday, November 28, 2010

Found another problem. Byte AND immediate operations were operating on the low byte. This resulted in odd behavior, and took a while to track down. All fixed now.

Friday, November 26, 2010

Now that Thanksgiving is over, I can get back to TI stuff. I'm using the LIBC project as a larger test program for GCC. Good thing too.

I found and removed a byte compare form which used a constant. I'm not sure how that got there. Copy-paste error maybe?

I've noticed that there's a lot of emitted code like this for int-to-char conversions:
mov r1, r2
mov r2, r1
swpb r1

And then R2 is never used again. I can do better than that, so an optimization has been added to handle this case. Now we get:
swpb r1

Much better. So now it's time for byte compare optimizations. I want to change:
li r2, >1200
cb r1, r2
jh LABEL
to:
ci r1, >12FF
jh LABEL
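
Here is a quick C check of why the two forms should be interchangeable, assuming jh acts as an unsigned "greater than" and that cb compares only the high bytes. Generalizing from >12 to any byte constant k is my own extrapolation, and the function names are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Immediate for the ci form: the byte constant with a low byte of >FF. */
uint16_t widened_immediate(unsigned k)
{
    assert(k <= 0xFF);
    return (uint16_t)((k << 8) | 0xFF);
}

/* Returns 1 when "cb r1, (k<<8); jh" and "ci r1, widened; jh" agree
   for every possible value of r1. */
int compare_forms_agree(unsigned k)
{
    uint16_t imm = widened_immediate(k);
    uint32_t r1;

    for (r1 = 0; r1 <= 0xFFFF; r1++) {
        int byte_form = (r1 >> 8) > k;   /* compares high bytes only */
        int word_form = r1 > imm;        /* compares the whole word */
        if (byte_form != word_form)
            return 0;
    }
    return 1;
}
```

Filling the low byte with >FF means whatever junk sits in the low byte of r1 can never push the word compare over the threshold by itself.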

Wednesday, November 17, 2010

This is what I wrote for the patch release at AtariAge:

Well, it's patch time again.

Here's what made it into this release:

Binutils
Allow TI-style quotes ('example')
Allow two-byte character constants for immediate expressions (li r0, 'ab')
Fix a BFD Makefile bug which prevented clean compilation

GCC
Fix tms9900_output_ascii, was emitting invalid code when non-text characters were used
Divide and modulus operations now merged when possible
Fix data symbol declarations, now TI compliant
Fix "+=4" form, was missing comma in emitted code
Fix alignment of code, in some cases it was possible to misalign code by using odd-length string constants
Fix stack frame load/save differences, was using different locations between function prologue and epilogue in some cases (Thanks Tursi!)
Save return pointer at bottom of stack. This may help for later stack trace construction
Add optimizations for compare-and-branch operations with 16-bit values against -2, -1, 0, 1, and 2.

Right now I only have optimizations for equality tests with -2, -1, 1, and 2 done. To get inequality tests, I need to convince GCC to emit tests against the overflow flag. GCC has no concept of this kind of instruction, so I need to play with that a bit more.

The other weakness is the divide and modulus instructions. I haven't been able to convince GCC to use convenient registers for the source and destination. This means that in some cases, I need to insert additional MOVs which really shouldn't be necessary. More playing around required here too, I suppose.

I've addressed all the problems Tursi found earlier, plus a few others. Unfortunately, libiberty is not on that list. Since a lot of those routines are OS-specific, and since there is no POSIX-like interface for the TI, these functions are of limited use right now. In the future that might change. (hint, hint)

So here's the build procedure for everything. I've made sure these have been tested several times. There should be no problems following them.

Patching the original files:
$ cd binutils-2.19.1
$ patch -p1 < binutils-2.19.1-tms9900-1.1.patch

$ cd gcc-4.4.0
$ patch -p1 < gcc-4.4.0-tms9900-1.2.patch

Building binutils
$ ./configure --target tms9900 --prefix INSTALLDIR
$ make all
$ make install

Building GCC
$ ./configure --target=tms9900 --prefix=INSTALLDIR --enable-languages=c
$ make all-gcc
$ make install

Notice that GCC uses equals after the options, while binutils does not. Kind of annoying and easy to mix up. At this point, you will have all the GNU compilation tools ready to use for TI work. The binary format is ELF, since that stores the extra data needed by the linker and other tools. In earlier posts I've attached code to convert from ELF to TI-cart format. I also have prototype converters for EA5 and EA3 formats, but I haven't tested them very much.

When compiling with GCC, I recommend using the -O2 and/or -Os options to reduce the overall code size. Using the default options can result in extra wordy code with unnecessary or duplicate instructions.

There's still quite a bit of work left to do for GCC, so there will be more patches coming. I need to fill out the missing math support for 32-bit values, make sure signed multiply and divide work, and the other stuff mentioned above. I especially want to add more optimizations to the compiled output, but that will come as I get more familiar with what instruction patterns GCC likes to use.

Sunday, November 14, 2010

So I've made scripts to automate making patches. Not very fast, but that's OK for now.

I found and fixed the build problem in binutils. It turns out that there was a missing recipe for elf32-tms9900.lo in bfd/Makefile.in. Fixing this allows binutils to build without any problems.

For future reference:

patching:
$ cd {path_to_original_files}
$ patch -p1 < patchfile

binutils:
$ ./configure --target tms9900 --prefix /home/eric/dev/tios/toolchain/WORKSPACE/emw
$ make all
$ make install

GCC:
$ ./configure --prefix /home/eric/dev/tios/toolchain --target=tms9900 --enable-languages=c
$ make all-gcc
$ make install

Tomorrow, I'll confirm that the build tools work as advertised and release.

The INC-type comparisons have been made and tested. I looked into getting JNO working, and found a good template in the Sparc architecture. I decided against doing anything about that right now. I need to get a patch out now. New features can wait a bit.

I decided to run a test to make sure that everything was working before making patches, and I'm glad I did. Apparently, I made changes to GAS a while ago to allow TI-style constants. In the process, I broke processing for all other types of constants. The resulting binaries were unusable, and caused crashes and resets. That's been fixed, and all my testing looks OK, so I'm off to make patches.

More problems. The scripts I thought I had to make patches are not complete. Even worse, I don't remember what the missing pieces were. That means I need to start over from scratch. Poop.

I also need to keep the blog site up-to-date and advertise it in an AtariAge sig. Not really necessary, but someone might be interested.

Thursday, November 11, 2010

No, I made the same mistake again. I can't use INC and friends for comparison due to the possible overflow. I can only use those forms for equality tests. I should be able to do the other tests if I can use the JNO instruction. I don't have the motivation to add that right now.
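
The overflow problem can be seen with a small C model of "dec" on a 16-bit value. This is only a sketch: it assumes the jump after dec looks at the wrapped result compared with zero, and all of the names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a 16-bit pattern as a signed value without relying on
   implementation-defined conversions. */
static int as_signed16(uint16_t bits)
{
    return bits >= 0x8000u ? (int)bits - 65536 : (int)bits;
}

/* Models "dec": subtract one with 16-bit wraparound. */
static int dec16(int value)
{
    assert(value >= -32768 && value <= 32767);
    return as_signed16((uint16_t)((uint16_t)value - 1u));
}

/* Counts inputs where "dec; jeq" disagrees with the test x == 1. */
int equality_mismatches(void)
{
    int x, bad = 0;

    for (x = -32768; x <= 32767; x++)
        if ((dec16(x) == 0) != (x == 1))
            bad++;
    return bad;
}

/* Counts inputs where "dec; jgt" disagrees with the test x > 1. */
int ordering_mismatches(void)
{
    int x, bad = 0;

    for (x = -32768; x <= 32767; x++)
        if ((dec16(x) > 0) != (x > 1))
            bad++;
    return bad;
}
```

In this model the equality test never goes wrong, while the ordering test fails for exactly one input, -32768, which wraps around to 32767. That is the case an overflow check would have to catch.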

Wednesday, November 10, 2010

Hooray! All optimizations have been implemented and tested. I've also added the quicker comparisons for -2,-1,1,2.

So the objective now is to test this with a larger program, and make sure the generated code runs properly. Should be fine, though. Once that's complete, I need to make patches and a new build procedure.

By the way, use -Os to optimize for size. Handy for the TI.

Tuesday, November 9, 2010

You know, after implementing all the optimizations listed above, I realized something: neg(b1000...) == b1000...

This means I can't use NEG for the comparison test. Poop. I guess I'll have to lose the ~1 clock bonus of NEG. On the up side, that means a LOT fewer peepholes to test (even though it was all written and tested...)

So new plan.

Use ABS for all tests with dead registers, except for A>=0 tests. I can still use INV for that one.
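
The NEG problem, and why INV is still safe for the A>=0 test, can be checked with a few lines of C. This is a sketch under the assumption that the jump after neg or inv looks at the sign of the 16-bit result; the helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a 16-bit pattern as a signed value. */
static int as_signed16(uint16_t bits)
{
    return bits >= 0x8000u ? (int)bits - 65536 : (int)bits;
}

/* Models "neg": two's complement negate with 16-bit wraparound. */
uint16_t neg16(uint16_t bits)
{
    return (uint16_t)(0u - bits);
}

/* Models "inv": one's complement. */
uint16_t inv16(uint16_t bits)
{
    return (uint16_t)~bits;
}

/* Counts values of A where "neg A; jlt" disagrees with A > 0. */
int neg_gt_mismatches(void)
{
    uint32_t bits;
    int bad = 0;

    for (bits = 0; bits <= 0xFFFF; bits++) {
        int a = as_signed16((uint16_t)bits);
        if ((as_signed16(neg16((uint16_t)bits)) < 0) != (a > 0))
            bad++;
    }
    return bad;
}

/* Counts values of A where "inv A; jlt" disagrees with A >= 0. */
int inv_ge_mismatches(void)
{
    uint32_t bits;
    int bad = 0;

    for (bits = 0; bits <= 0xFFFF; bits++) {
        int a = as_signed16((uint16_t)bits);
        if ((as_signed16(inv16((uint16_t)bits)) < 0) != (a >= 0))
            bad++;
    }
    return bad;
}
```

In this model the only failing input for NEG is b1000..., which negates to itself, while the INV form has no failing inputs because it cannot overflow.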

Monday, November 8, 2010

Although replacing MOV with INV or NEG is faster for that single instruction, what is the impact for the overall sequence? Am I just getting wrapped up in all this for no real gains? Time to double check.

I need some shorthand for the compound jumps, so here are the cycle timings for each possible exit from a compound jump:
jlt: 4+(8..10) -> 14 = 14
jeq: 4+(8..10) -> 12+14 = 26
none: 12+12 = 24

Min and max timings for some instructions:
mov A, A -> 4+14 + (1..12)*2 = 20..42
inv A -> 4+10 + (1..12) = 15..26
neg A -> 4+12 + (1..12) = 17..28
abs A -> 4+(12..14) + (1..12) = 17..30

A<0 : mov A, A; jlt : (20..42)+(12..14) = (32..56)
A<=0 : mov A, A; jlt; jeq : (20..42)+(14..26) = (34..68)
A==0 : mov A, A; jeq : (20..42)+(12..14) = (32..56)
A!=0 : mov A, A; jeq : (20..42)+(12..14) = (32..56)
A>0 : mov A, A; jeq : (20..42)+(12..14) = (32..56)
A>=0 : mov A, A; jlt; jeq : (20..42)+(14..26) = (34..68)

Proposed optimizations

A<0 : inv A; jgt; jeq : (15..26)+(14..26) = (29..52)
A<=0 : neg A; jgt; jeq : (17..28)+(14..26) = (31..54)
A==0 : neg A; jeq : (17..28)+(12..14) = (29..42)
A!=0 : neg A; jeq : (17..28)+(12..14) = (29..42)
A>0 : neg A; jlt : (17..28)+(12..14) = (29..42)
A>=0 : inv A; jlt : (15..26)+(12..14) = (27..40)

A<0 : abs A; jlt : (17..30)+(12..14) = (29..44)
A<0 : neg A; jgt : (17..28)+(12..14) = (29..42)

Saturday, November 6, 2010

After spelunking for months trying to get REG_DEAD notes into the compiled RTL, it turns out that they are not necessary anymore. Apparently this changed somewhere in the 3.X versions of GCC (I want to say 3.5, but I'm not sure about that. I read about this at work earlier, and I don't remember the details right now. Not really important now.)

I read a lot of posts from the GCC developers, and apparently, I shouldn't need to modify anything beyond the machine-dependent code to achieve everything I'm looking for. This is really good to know, since that should help reduce the time spent researching the GCC front end. Although, I'm kinda glad I did that work now.

So I'm going to implement the optimizations listed in September as peepholes. Should be pretty straightforward, really.

Repeating the optimization list from above:

Baseline:
mov Rx, Rx (14 cycles)

These all assume compared register will be dead
Compare to 2: dect G (10 cycles)
Compare to 1: dec G (10 cycles)
Compare to -1: inc G (10 cycles)
Compare to -2: inct G (10 cycles)

A<0 -> inv A; A>=0 (10 cycles) lt
A<=0 -> neg A; A>=0 (12 cycles) le
A==0 -> neg A; A==0 (12 cycles) eq x
A!=0 -> neg A; A!=0 (12 cycles) ne x
A>0 -> neg A; A<0 (12 cycles) gt
A>=0 -> inv A; A<0 (10 cycles) ge

lt (<)
le (<=)
eq (==)
ne (!=)
gt (>)
ge (>=)
ltu (< unsigned)
leu (<= unsigned)
gtu (> unsigned)
geu (>= unsigned)

I might not use the C pattern though..
Assume instructions are in slow mem, registers are fast

inct r1; inct r2 (4+10+1 + 4+10+1 = 30 cycles) %100
inc r1; inc r2 (4+10+1 + 4+10+1 = 30 cycles) %100

c *r1+, *r2+ (4+14 + 8+4 + 8+4 = 42 cycles) %140
cb *r1+, *r2+ (4+14 + 6+4 + 6+4 = 38 cycles) %126

So this form saves two bytes, but is about a third slower, and is difficult to induce. I think I'll pass on this.

Tuesday, November 2, 2010

That last frame problem has been solved. I had written that the frame pointer had a role in determining whether R11 should be saved. That was a mistake; one has nothing to do with the other. This was seen when optimization was off because the frame pointer is always used at this optimization level.

I don't know if this was causing problems yet, but frame_pointer_needed and df_ever_alive() were being factored into the R11 save calculation as well. These are always set for R11 since it's the return pointer, but it only really needs to be saved if the function is not a leaf.

Friday, October 29, 2010

I found another problem with the stack frame stuff. Poop. I noticed with Tursi's test code that the epilogue was popping one too many words. It turns out that the epilogue code assumes that if there is any stack to save, the return pointer will be saved, and so one more word needs to be popped. This has been fixed.

Fixed the mistaken leaf-ness mentioned above. The variable current_function_is_leaf is evaluated late, but the function which drives this, leaf_function_p(), appears to be valid some time earlier.

Uggh, not another problem... When no optimization is called for, the frame pointer is used to store local values. Unfortunately, the current mapping between the frame and stack pointers is wrong, and results in bad addresses being calculated for local values. There is zero offset between stack and frame, and locals are indexed off frame. No provision is made for the saved registers on the bottom of the stack.

Thursday, October 28, 2010

I've been working on lifetime calculation for REG_DEAD notes, but I got a message from the AtariAge forums. Tursi was trying to use the compiler, and found some stack layout problems. I got a chance to look at that today. I've found three problems, and fixed two of them.

In one of the prologue forms, the location of the saved registers was mistakenly calculated to be at the top of the stack. This is the only place where that assumption was made.

In the event of a call frame being needed without saved registers, no space was being allocated for the frame registers. The epilogue was fine in this case, so the mismatch would result in a crash somewhere down the line.

The last problem is that the leaf-ness of a function seems to be calculated after tms9900_starting_frame_offset is called. This means that the frame offset calculation assumes that the link register needs to be saved, and leaves space for it. However, when the prologue is called, we know that the function is a leaf, and no space is saved for the frame, and stack corruption results. I need to find a way to check for leaf-ness earlier in the function construction. Somehow.

Monday, September 13, 2010

The optimizations would be nice, but the REG_NOTEs are not always (alright, hardly ever) present to select the optimizations above. So I need to get more invasive. I've made this test program to test the "neg; jeq" optimization.

extern int func();

/* The result of func() is only compared against zero, so "a" is dead afterwards. */
int top()
{
  int a = func();
  if(a == 0)
  {
    func();
  }
  return(0);
}

Check out the debug output files, and track down where the live register usage is tracked. I should be able to root around in the internals to properly check for valid cases.

Saturday, September 11, 2010

I've just finished the updated string constant handling in GAS. It's now TI compliant, and looks pretty good. Of course, more testing is required. So now, it's time to get the compiler optimizations in.

So here's what I've got:

These all assume compared register will be dead
Compare to 2: dect G (10 cycles)
Compare to 1: dec G (10 cycles)
Compare to 0: abs G (12/14 cycles)
Compare to -1: inc G (10 cycles)
Compare to -2: inct G (10 cycles)

I could use neg (12 cycles), and inverted comparisons.

c *r1+, *r2+ to increment two registers at the same time. Hard to make a test for this...

A<0 : inv A; jgt
A<=0: << no optimization >>
A==0: neg A; jeq
A!=0: neg A; jne
A>0 : neg A; jlt
A>=0: inv A; jlt

Monday, September 6, 2010

I've found a quick test for zero compare from AtariAge: "abs Rn". This is two clocks faster than the "mov Rn, Rn", but destroys the value. In order to use this, I'll need to check to make sure the register is no longer used. A similar test would be "inv Rn", this saves four clocks.

Another thing I've seen is to use "c *Ra+, *Rb+" to increment two registers at the same time. That could be an interesting optimization, but hard to use.

Monday, August 30, 2010

So I've finished up the string.h functions, and I'm piecing together the printf forms (again). Sadly, this will be the third time I've done this.

I've also found another problem with the string constants. Embedded quotes cause problems with the assembler. Here are some notes from the E/A manual about character constants:

'A' -> 0x41
'AB' -> 0x4142
'''D' -> 0x2744

Text strings are contained within single quotes, single quotes escaped by duplication.
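
As a sketch of how the quoting rule works, here is a small C parser for a one- or two-byte TI-style character constant. The function name and error handling are my own choices for illustration, not the GAS implementation.

```c
#include <assert.h>

/* Hypothetical parser for a TI-style character constant such as 'AB'.
   Returns 1 and stores the value on success, 0 on malformed input. */
int ti_char_constant(const char *text, unsigned *value)
{
    unsigned result = 0;
    int count = 0;
    const char *p = text;

    assert(text != 0 && value != 0);
    if (*p++ != '\'')
        return 0;                   /* must open with a quote */
    for (;;) {
        char c = *p++;
        if (c == '\0')
            return 0;               /* ran out before the closing quote */
        if (c == '\'') {
            if (*p != '\'')
                break;              /* a single quote closes the constant */
            p++;                    /* a doubled quote is one literal quote */
        }
        if (++count > 2)
            return 0;               /* at most two bytes fit in a word */
        result = (result << 8) | (unsigned char)c;
    }
    if (count == 0 || *p != '\0')
        return 0;                   /* empty constant or trailing text */
    *value = result;
    return 1;
}
```

Feeding it the three examples above gives 0x41, 0x4142 and 0x2744. The same doubled-quote check would carry over to longer text strings.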

I want to support TI-style quotes, and make the compiler use those, but allow the assembler to use TI or C style strings.

Sunday, August 22, 2010

I've been busy since the last update working on libc functions. I think I'm about 20% to 30% done with that so far. Sadly, I've found two more problems in the compiler.

The first was a missing comma in the 16-bit "+=4" optimization. That was just a dumb mistake, easily fixed.

The second was a bit trickier. When using constant strings, it was possible to cause a code misalignment, resulting in a non-working image. The problem here was that the ASM_OUTPUT_ALIGN macro, which I copied from some other architecture, was no good for the TMS9900. I needed two-byte alignments, but the macro ignored all alignments less than four. This effectively turned off all code alignment, which caused this problem. All working now. Yay!

Monday, August 16, 2010

Control code handling for screen output is complete. Now, all control codes are silently handled as part of screen_write_string. There is an easy way to add a hook later for raw vs. cooked screen mode. That would allow selective handling of control code processing. I've also managed to squeeze out 26 bytes from that function. Nothing exciting, but it makes me happy, and I suppose every byte counts.

The plan from here is to get printf finished, and move on to other stuff.

Thursday, August 12, 2010

OK, I'm officially done with DIV stuff, but I've learned a few things while working on this. I tried everything I could think of to try to get the register location to work properly on argument 1 and the outputs. No luck. Every attempt using subregs or direct register assignment resulted in working code, but separate blocks for DIV and MOD. Apparently, the optimizer is not smart enough to deal with this properly. If any modification is done to argument 1, the blocks get split again. Also, I can't find a way to specify one argument as a subreg of another argument.

So, in order to make the best of this situation, I'm using a form which does not tie output registers to argument one. This allows the optimizer to group DIV and MOD operations with common terms. On the other hand, this allows the outputs to be located in inconvenient registers. Code has been added to move the results to the registers selected by the compiler. In the worst case, the DIV and MOD results are in the opposite registers (MOD result register chosen to hold DIV, and vice versa). This case is handled by swapping values using XOR, which eliminates the need for a temp register, but is no faster than just using MOVs. In a perfect world, this work would be unnecessary, but this is better than having separate DIV and MOD blocks. With careful ordering in the C code, the compiler can be encouraged to use the correct registers, omitting all the extra MOVs.
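
For reference, the XOR swap mentioned above looks like this in C. It is a generic sketch of the technique, not the emitted TMS9900 code, and the function name is just for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Swap two 16-bit values without a temporary. */
void xor_swap(uint16_t *a, uint16_t *b)
{
    assert(a != 0 && b != 0);
    assert(a != b);   /* XORing a location with itself would zero it */
    *a ^= *b;         /* a = a ^ b */
    *b ^= *a;         /* b = b ^ (a ^ b) = original a */
    *a ^= *b;         /* a = (a ^ b) ^ original a = original b */
}
```

The swap takes three operations either way, which matches the note that it saves a temp register but is no faster than MOVs.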

So now, unless something else pops up, I can put the compiler to rest, and continue with the LIBC code.

Sunday, August 8, 2010

It's taken way, way too long, but I've finally got some optimization for the DIV instruction. There's the possibility for two extra MOVs which might be optimized away, but this is way better than separate DIV and MOD calculations. Now that I've finally got this working, I can move on to something else.

Somehow there are ".comm" directives in the output. This must be fixed.

Sunday, July 25, 2010

Wow, it's been a while since I updated this. So I released a new set of patches, with really limited response. This makes me sad.

I've started making a C library to make development easier. This will also exercise the compiler a bit more.

I've noticed an annoying aspect of GCC: the divmod form is invoked for each div and mod operation. Even though the processor can compute both values in a single instruction, GCC does not take advantage of this. I found a comment on the GCC development forums where this is a problem for many architectures.
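
For context, this is the kind of C source where the issue shows up: a quotient and a remainder taken from the same operands. The function below is a hypothetical example, not code from the library.

```c
#include <assert.h>

/* Hypothetical example: both results come from the same a and b, so a
   processor with a combined divide could produce them in one operation. */
void div_and_mod(unsigned a, unsigned b, unsigned *quotient, unsigned *remainder)
{
    assert(b != 0 && quotient != 0 && remainder != 0);
    *quotient = a / b;    /* one divmod invocation */
    *remainder = a % b;   /* a second one with the same operands */
}
```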

Unfortunately, I'm not eager to change the internals of the compiler to accommodate the TMS9900. Even if it would make for better code, it would likely break forward compatibility.

Oh well.

Tuesday, June 29, 2010

I've noticed that R11 (return pointer) is incorrectly marked as a non-volatile register. This has been fixed, and R11 is still saved and restored as it should be. I've also done a little cleanup.

I've also fixed all the bugs I can think of, so tomorrow would be a good time for another release.

Monday, June 28, 2010

All the stack stuff has now been fixed. Saved registers are pushed last onto the stack.

I've run the numbers on stack vs. nonvol reg usage. It turns out that it is better to use values in the stack if a nonvol value is used two or fewer times. For three or more occurrences, it is better to use registers. Unfortunately, there is no good way to find the usage frequency or find out if it's used in a loop. Some of this info (usage count, not loop info) is stored in GCC, but not in a useful way. So I guess I'll just have to use non-volatiles as frequently as possible.

Sunday, June 20, 2010

The AtariAge people were concerned about stack performance, and I was second-guessing myself, so I did cycle timings for seven or eight different ways of stack setup and teardown. I'm glad I did this, since I found a form which saves 2 bytes and 12 cycles per call. I also found that there are problems with the current code.

For small nvreg usage, the old values are stored at the bottom of the stack. For large usage, the old values are stored at the top of the stack. The epilogue assumes old values are at the top of the stack. The stack offsets assume old values are at the top of the stack.

The faster form has the old values at the bottom of the stack. If I keep the old smaller form that means that the epilogue and the stack offset calcs will need to be changed. Ick. And the port documentation. Poo.

Also, stack-vs-nvreg calculations show that using the stack is smaller for less than three uses per function. It's faster too for less than five uses per function. So the allocator needs fixing too. Bah.

Thursday, June 17, 2010

I've been super busy with work for the past week, so not a whole lot of TI work. I've updated the GCC port document, and annotated some examples.

During this work, I noticed a problem with functions with arguments like (long, long, int, long). The last argument, which would spill over into R7, is being lost. I'm not sure what's going on here.

The other thing I noticed is that non-volatiles were not being used to store locals, everything was going to the stack. This seems to be a register cost balancing problem. I need to do the math to see where the break-even point is for non-vol versus the stack. This could be tricky.

Sunday, June 13, 2010

I've submitted patches to AtariAge, but I screwed up the patch file. I also noticed some bugs. For example, the data types were .byte and .short instead of byte and data. There were also some incompatibilities from the TI specs for the *si patterns. I've also changed the constants to use hex values instead of decimal values, which will make things easier to read.

So here's how patches are made:

diff -rupN {original} {modified} > patchfile

And to apply this file:

cd {path_to_original_files}
patch -p1 < patchfile

I now need to write documentation for the ABI, calling convention, and stack details.

Tuesday, June 8, 2010

GCC and Binutils patches

I wanted to post links to these patches on AtariAge, but their forum rules prevent creating a new topic which would make sense for this. So, I'm doing that here.

These patches will port the GNU tools to the TMS9900 processor. They use ELF-format for the object files. I have other tools to convert the resulting ELF executable to TI cartridge and memory image (EA5) formats. I don't have a converter for the TI linkable (EA3) format yet, but I'll probably get around to it. This is not a priority for me since I'm planning to use my own loader and do not intend to link to the TI routines anytime soon.

To the patches!

gcc-4.4.0-tms9900-1.0-patch.tar.gz

binutils-2.19.1-tms9900-1.0-patch.tar.gz

Saturday, June 5, 2010

New site up and running

So after a whole lot of procrastination, I've finally put a web site together for the work I've been doing. This will mostly be for the TI code, but I'll post other stuff I'm working on from time to time.

Since older posts were written well before this blog was built, and written mostly for my own use, some explanation is called for.

For years, I've tried to come up with an interesting way to use my old TI99/4A, and eventually came up with the idea of rewriting the firmware to give more capabilities to the old hardware. I thought using some Unix-like concepts would be an interesting challenge. Things like multi-user or multi-tasking support are strangely lacking in a home computer from the early eighties.

The idea is to eventually rewrite everything: firmware, disk formats, executable formats, the works. This will probably be a very long process, and will eliminate compatibility with authentic TI software, but it should be a lot of fun.

Monday, May 31, 2010

Both V9T9 disk image tools are complete. The dump tool has been verified with known-good images, and the image composer has been verified with the TI Disk Manager cartridge. Now I need to make EA3 and EA5 conversion tools. At that point, I really have no excuse but to do some useful work. I've been taking a huge detour from working on a replacement OS for the TI. Still, these are useful tools.

More helpful notes for later: Apparently, loading files using Editor/Assembler is case-sensitive.

Load a file using a name like "DSK1.PROGRAM"

Well, that was really easy... The first draft of the EA5 converter is done and works. I'm cheating a little bit here, since I'm assuming that the entry point is at the start of the .text section. I should be searching for the location of the "_start" label, but this should work for now.

EA5 images are assumed to load at 0xA000, so modify the makefile accordingly.

Sunday, May 30, 2010

I've got a V9T9 image reader almost done. I need this in order to test the image writer I'll be writing later.

Saturday, May 22, 2010

I've completed the last of the GAS changes, so now it's back to GCC to make sure its output is TI compliant.

OK, now GCC is complete too. GCC and GAS work well together, and all the test programs I've compiled behave as expected. I'm still disappointed that I couldn't modify GAS to be smarter about what is a label to allow instructions to be in column one. I'm sure I'll get back to that at some point.

I suppose the next thing to do is try to see how the floppy works in MESS, make a disk image composer, then an ELF to EA3 and EA5 converter. Sounds simple, no?

To start with, MESS prefers V9T9 disk images, and this also seems to be the most common format seen in the wild. Seems a good place to start.

Also a good thing to know is that the "flop1" option configures the disk image to use.

sdlmess0130/src/mess/tools/imgtool/modules/ti99.c

Thursday, May 20, 2010

I made some more changes to GAS to allow TI-style labels without colons. I happen to like colons, since it makes them easier to recognize. So GAS now supports both styles, but for now, statements cannot start in column one. This is kind of hacky, and I don't like it, but this can be changed later.

Tuesday, May 18, 2010

Finished more GAS changes today. I've got missing arguments to generate an error, it works nicely now. I also made changes to treat everything after valid arguments as a comment.

Monday, May 17, 2010

I've spent the day looking at the TI tagged object file. The details are in the Editor/Assembler manual. It's a simple, but horrendously wordy format. There are smaller forms, namely the compressed object and memory image formats, but I have no details on those.

I was interested in getting an idea of how to make disk-based programs using the GNU tools, but it looks like I'll have a challenge ahead of me here. If nothing else, I need to get a copy of the E/A cartridge, although Extended Basic should work too.

So back to GAS then...

Todo list for GAS changes:
PSEG and friends in addition to ".text" [DONE]
BES directive
$ as current address [DONE]
Labels without ":"
Error if arguments are missing
Treat all after last argument as comment
Allow "*" comment symbol [DONE]

Sunday, May 16, 2010

I've just verified the toolchain with a "hello world" test. The resulting image works fine, and looks good. I still haven't fixed up the assembler for TI conventions yet, that's coming up next.

I've really got to get a website up and running so I can publish this stuff.

Saturday, May 15, 2010

I'm pretty happy with GCC for now. It could probably be made more efficient and have a few more optimizations, but all my tests show that it works well. I've just finished cleaning out most of the debug and test code, and the code tree is mostly ready for release. I still need to exercise GCC with a more ambitious project, make a C library, and make the output conform mostly to TI assembly format.

So now I need to go back to GAS and make some more changes there. Namely, the TI format stuff, as well as more error checking for missing or extra arguments. I've occasionally forgotten to add a count in shift instructions, and GAS did not catch it.

Here's a more correct build process for everything. GCC now automatically calls GAS, which is nice. I'm not happy about the separate BFD make step for GAS, but I'll see what I can do there.

GAS:
$ ./configure --prefix /home/eric/dev/tios/toolchain --target tms9900
$ cd bfd
$ make all
$ cd ..
$ make all
$ make install

GCC:
$ ./configure --prefix /home/eric/dev/tios/toolchain --target=tms9900 --enable-languages=c
$ make all-gcc
$ make install

Thursday, May 6, 2010

I've added new instructions to do byte-to-word conversion. This will hopefully end the FAKE_Rn register usages which pop up from time to time. It will also act as an optimization step, since there were a lot of copies to and from temporary registers during the course of subreg conversions. Unfortunately, these instructions conflict with the extendqihi instructions.

So, an example:

int* reg = (int*)0x8c00;
void a(int c) {*reg = (char)c;}

with extendqihi2:
mov r3, r1
swpb r1
movb r1, r2
sra r2, 8
mov r2, @reg
b *r11

without extendqihi2:
mov r3, r1
sla r1, 8
sra r1, 8
mov r1, @reg
b *r11

I think I can get better code by not using the "extend" instructions. If nothing else, it should make the .md file shorter.

Monday, May 3, 2010

I think I've got a compiler which is functional enough to do some work. The next step is to make sure the installation works and is useful. I'm realizing it's been a really long time since I last did a full GCC build, so I'm recording the steps I used for later:

GAS:
$ ./configure --target tms9900

GCC:
$ ./configure --program-prefix=tms9900 --prefix /home/eric/dev/tios/toolchain/bin --target=tms9900 --enable-languages=c
$ make all-gcc
$ make install

Thursday, April 29, 2010

Well, this is anticlimactic. It turns out that the register allocator prohibits the use of volatile registers for user-defined variables if the optimization level is one or less (see ira-conflicts.c, ira_build_conflicts). Since I've been doing my tests using "-O1" optimization... mystery solved. And it only took a few days to figure that out.

Monday, April 26, 2010

The missing epilogue problem was caused by the existence of the "return" pattern. With that pattern defined, the epilogue was not always used. So I removed the pattern, but then had a new problem: how to return to the caller at the end of the epilogue. I looked at other architectures, but found no useful pattern which would work for the TI.

Manually emitting instructions into the output stream did not work, since the outputted instructions did not appear in the right place, and would have resulted in non-functional code. I could not find a good way to make a RTX expression to use the existing branch instructions for the "b *lr" instruction.

What I ended up with was to create a fake hard PC register, and used a special form of "movhi" to emit the return instruction. This seems to work, but I'm concerned what will happen if GCC tries to use the fake PC register for actual work.

Monday, April 19, 2010

I've got the stack working properly, but my "hello world" program is being weird. It's allocating R9 unexpectedly and does not include a function epilogue. I've started looking at the debug output, but I'm out of time for now.

So for later edification, start looking at emw.c.172r.ira for register allocation. The epilogue is in emw.c.178r.pro_and_epilogue

Thursday, April 15, 2010

So, I'm looking at arguments passed on the stack, so here's some random notes.

My test function:
void a()
{
zprintf(91,92,93,94,95,96,97,98);
}

resulting assembly:
ai r10, -6
mov r11, @4(r10)

li r1, 97         --.
mov r1, *r10        | Push arguments to the stack
li r1, 98           |
mov r1, @2(r10)   --'

li r1, 91
li r2, 92
li r3, 93
li r4, 94
li r5, 95
li r6, 96
bl @zprintf
ai r10, 4
mov *r10+, r11
rt

So the stack looks like this in zprintf:
[ volatiles saved by A
[ A's frame
[ zprintf stack arguments
[ volatiles saved by zprintf
[ zprintf's frame
stack pointer

on the callee side, stack arguments are indexed as if from address zero.
need to fix sizes so ELIMINABLE_REGS works for arg-to-stack calculations

Wednesday, April 14, 2010

I've noticed that the function prologue and epilogue are needed to set up the stack and to save off the non-volatile registers. I've come up with these forms for the prologue, depending on the number of registers to save off:

Form 1:
ai sp, -regsize              cycles:14+0   bytes:4
mov reg, *sp+                       14+8         2
...
ai sp, -regsize-framesize           14+0         4

in general, bytes =8+2N: 10,12,14,16,18,20
cycles=28+22N

Form 2:
ai sp, -regsize-framesize    cycles:14+0   bytes:4
mov reg, *X(sp)                     14+8         4
...

in general, bytes =4+4N: 8,12,16,20,24,28
cycles=14+22N

So use form 1 only when we have three or more registers to save.

The epilogue is the same for both forms:

ai sp, framesize
mov reg, *sp+
...

The plan is to not use a frame pointer, or save off the stack pointer as part of a call. This saves us a ton of space over the course of a program since we save four bytes on the stack, and at least four instructions per function. The drawback of this design is that we lose the ability to use "alloca" or derive a call tree during debugging. I don't like this, but it's a good tradeoff.

I tried adding parallel CC0 checks in some instructions, but it just made a mess of the resulting code (about 2-3 times bigger, lots of redundant moves). I'll come back to this later.

I also checked for arguments passed on the stack. Looks like work needs to be done on the caller and callee side. Poop.

Tuesday, April 6, 2010

At this point, I think I've got instruction generation pretty much complete. There is still some ugliness when converting between words and bytes, but in order to get better code, I would need a lot of peepholes. I'm not prepared to put in that level of effort right now. Pretty good results can be had by writing optimization-friendly C code. For example, demote data values as early as possible, promote as late as possible. In a lot of cases, I cannot make better assembly by hand than what GCC outputs. I've been surprised by how good the output looks.

I still want to confirm that the stack is used correctly, I haven't checked that for a while. I also want to ensure that the assembler can use standard TI conventions. Also, the GCC code needs to be cleaned up, since it's currently full of debug code and commented-out experiments.

I also need to get a real blog together, since I have a full GNU toolchain working that other people may be interested in.

By the way, I've successfully compiled and tested a "hello world" C program. I'm happy with how easy it was to put together, but there is still something going on when "main" is invoked. GCC wants to add a call to "__main" at the start of "main". I'm not sure why this is.

Another thing I've been looking into is the EA5 disk format. I'd like to have a tool to convert an ELF file to either a cart or disk format as desired. Also, right now the GROM cart header must be added by special assembly code. This is OK, but I would rather have a tool to add this.

Oh, I also need to get the condition register updates added to the machine description file.

Friday, March 5, 2010

I think I figured out the word-to-byte conversion. Basically I'll be lying to GCC. Instead of 16 16-bit registers, I'm telling GCC we have 32 8-bit registers. This allows the truncate formats to work as expected. It turns out GCC doesn't know how to truncate a hard register, and apparently assumes the low order bits are always in the low byte, regardless of the mode that register is used with. That results in the wrong instructions being used.

There is another problem with optimization, though. The following program is reduced to a NOP when using -O2:

char func(int a)
{return(a);}

Looking at the debug output, it seems that the instructions have all been removed by the time the *.159r.combine file is generated. I'll need to look into this later.

I've been lurking on the AtariAge forums for a while now and recently found the Editor/Assembler manual there. I wish I'd had this earlier; it would have made things a lot easier. All the scraps of information I had to reverse-engineer or infer from a bunch of places are listed in detail and in a pretty clean format. The only drawback is that it's a scanned copy, not OCR, so no text searches. Oh well.

Monday, February 22, 2010

I've now hit the annoying part of the GCC port. I'm pretty much done with everything except word-to-byte conversions, multiply, divide and elimination of the frame pointer register. I'm afraid I might have to do a lot of trial-and-error since I haven't found a template I can use for the remaining instructions.

I'm currently working on the word-to-byte conversion. GCC wants to use a "subreg" expression to do this, and eventually uses "movqi" regardless of the other, more exact instruction matches. There is a "truncMN2" format which should be used for this, but it's not being used. Poop.

Sunday, February 14, 2010

I have most of the 16 and 8 bit operations working now. I'm missing mul, div, type conversion, move and set. But it looks pretty good.

I've started looking at 32-bit operations, and that is not going smoothly. The compiler bails at init_move_cost.

UPDATE:
That was caused by not respecting the difference between Rn, *Rn, and @x(Rn). The instruction was assuming an Rn when a *Rn was actually called for. I need to update all "o" conditions with "R" and "Q" and respect the consequences of each.

Sunday, February 7, 2010

Since I last updated the worklog, I've been working on a GCC port for the TMS9900. I made up my mind on this after I realized the ton of work required to maintain the source code if it all stays as assembly. Additionally, if I fool myself into thinking anyone else would be interested in using this thing when I'm done, C programming would be much more inviting.

So here's where I'm at right now:

I've got GCC sort of ported at the moment, the TMS9900 is a valid target, the call interface is working, as are the shift instructions. I'm currently working on the conditional jumps.

Out of curiosity, I looked at Tursi's port, and it looks like he just used the PDP11 stuff, but changed the opcode names. Works OK I suppose, but suboptimal.

I used concepts from the ARM, M68HC11 and PDP11 implementations, since I wasn't sure where to start. As a result, I've had a heck of a time with the call interface, since the pieces I used didn't play well together. Tracking down the bugs is challenging, since there is a lot going on and it's still not clear to me how all the machine-specific parts get used.

I want to have parameters passed in the registers, and am trying to keep the file "tios_abi.txt" current regarding the internals of the implementation.

Back to work...
