Monday, July 25, 2011

On the AtariAge forums, lucien2 found a problem with byte initilizers. Apparently this sequence:

li r1, >8C02
li r2, >D00
movb r2, *r1
li r1, >8C02
li r2, >8700
movb r2, *r1

was converted into this one, destroying the >0D value.

li r1, >8C02
li r2, >FF87
movb r2, *r1
swpb r2
movb r2, *r1

This was caused by not masking the values together before ORing them into the second constant. In this case the sign extended bits of >87 were stomping on >0D, resulting in a corrupt value. I'm posting the two-liner patch on the forum.

He also found another problem using -O0, which I usually avoid. The register I chose for the frame pointer, R8, is volatile. That means that it can be destroyed over a function call. This was just a dumb mistake. In order to preserve the ABI interface, I've moved the frame pointer to R9, which is preserved across function calls. The resulting code looks much safer now.
I was working on the EA5 converter today, and got it mostly working, but I've come to a design dilemma.

The conversion utility fills out a structure describing the layout of the data and bss sections. I can do the same thing in a linker script, and make the conversion simpler. However that adds complexity which is annoying, and potentially confusing for novice users.

As a related thought, I should look at the default linker script. It really should follow the requirements for EA5 files (.text at A000). I've got some thinking to do...

Saturday, July 23, 2011

This is what was posted to the AtariAge forums:

Update time!

It's about six months later than promised, but I haven't given up yet.

Most of that time has been putting in a ton of hours at work and beating on the GCC code to get byte operations working properly. What's in this release is the fourth or fifth overhaul of the port. In the end I had to rewrite core bits of how GCC relates byte and word quantities. I've kept those changes to a minimum, so ports to later versions should still work.

Here's what got changed in this patch:

Add optimization to remove redundant moves in int-to-char casts
Remove invalid CB compare immediate mode.
Add optimizations for byte immediate comparison
Added optimizations for shift and cast forms like (byte)X=(int)X>>N
Remove invalid compare immediate with memory
Improved support for subtract immediate
Fixed bug causing gibberish in assembly output
GCC now recognizes that bit shift operations set the comparison flags
Fixed bug causing bytewise AND to operate on the wrong byte
Add optimization for loading byte arrays into memory
Confirmed that variadic functions work properly.
Fixed the subtract instruction to handle constants
Fixed the CI instruction, it was allowing memory operands
Fixed a bug allowing the fake PC register to be used as a real register
Encourage memory-to-memory copies instead of mem-reg-mem copies
Added optimization to eliminate INV-INV-SZC sequences
Modify GCC's register allocation engine to handle TMS9900 byte values
Remove the 32 fake 8-bit registers. GCC now uses 16 16-bit registers
Modify memory addressing to handle forms like @LABEL+CONSTANT(Rn)
Clean up output assembly by vertically aligning operands
Clean up output by combining constant expressions
Optimize left shift byte quantities
Fixed a bug where SZC used the wrong register
Removed C instruction for "+=4" forms, AI is twice as fast
Added 32-bit negate
Fixed 32-bit subtract
Fixed a bug causing MUL to use the wrong register
Fixed a bug allowing shifts to use shift counts in the wrong register
Confirmed that inline assembly works correctly
Added optimization to convert "ANDI Rn, >00FF" to "SB Rn,Rn"
Optimize compare-with-zero instructions by using a temp register
Fixed a bug allowing *Rn and *Rn+ memory modes to be confused
Removed most warnings from the build process


There were also changes made to binutils, I hope this will be the last update for this.

More meaningful error messages from the assembler
DATA and BYTE constructs with no value did not allocate space
Fix core dump in tms9900-objdump during disassembly

The ELF conversion utility was also updated to allow crt0 to properly set memory before the C code executes. If it finds a "_init_data" label in the ELF file, it will fill out a record with all the information crt0 needs to do the initialization.

In light of all these changes, I've made a new "hello world" program with lots of comments, a Makefile and all supporting files. I've also included the compiled .o, .elf, and converted cart image. In addition, there's also a hello.s file which is the assembly output from the compiler.

I'm not sure if I mentioned this earlier, but the tms9900-as assembler will accept TI-syntax assembly files, but there are a number of additions:

Added "or", "orb" aliases for "soc" and "socb" (that's been a gotcha for a several people here)
Added "textz" directive - This appends a zero byte to the data.
textz "1234" is equivalent to "byte >31, >32, >33, >34, 0"
Added "ntext" directive - This prepends the byte count to the data.
"ntext '1234'" is equivalent to "byte 4, >31, >32, >33, >34"
Added "string" variants to all "text" directives
No length limit for label names
No limitation for constant calculations, all operations are allowed (xor, and, or, shifts, etc.)

It think thats about enough for now

I believe this is the biggest jump in usefulness yet. I've gone through and tested every instruction, and written several tests programs which did semi-interesting things from the compiler's point of view. They were, however, exceptionally dull from a user's point of view. For all the blow-by-blow details, check out my blog.

As a final test of the byte handling code, I built that chess program posted back in December. No problems were seen and no hinky-looking code was generated. In addition, it was about 5% smaller.

The build instructions are listed in post #43, and haven't changed since.

So, let me know what you think,

Friday, July 22, 2011

I thought I should write a "hello world" framework for anyone who wanted to use a working model as a starting point. That was apparently a good idea. I found a heck of a tricky bug to nail down. When performing a bytewise comparison with zero, the following instruction was generated:

mov *r2+, *r2+

This is incrementing the pointer twice. In this case, the comparison was used to find a null terminator for a string. The extra increment caused the terminator to be skipped, and the calculated length was nonsensical.

Fixed by using a temp register for the second argument. This is an optimization I was considering for a while. didn't fix 32-bit stuff yet.

This problem was also caused by not keeping a strict distinction between *Rn and *Rn+. If this isn't caught in instructions which use a repeated operand, like the one above, we will have some nasty side-effects.

Thursday, July 21, 2011

OK, I've gone through the source tree and removed all the dead files and reversed all whitespace changes to reduce the number of files which appear in the GCC patch.

I've also found a use for that SB instruction I've mentioned a few times. That will be used to replace an "ANDI Rn, >00FF" instruction. I don't think that constant will be used very often, but this was mostly done to make me feel better.

All ready for release!
OK, I've gone through the source tree and removed all the dead files and reversed all whitespace changes to reduce the number of files which appear in the GCC patch.

I've also found a use for that SB instruction I've mentioned a few times. That will be used to replace an "ANDI Rn, >00FF" instruction. I don't think that constant will be used very often, but this was mostly done to make me feel better.

All ready for release!

Monday, July 18, 2011

I got patches put together for binutils. GCC nees more cleanup. I need to remove predicates.md, and make prototypes for the .c functions to make the build process less cranky.

I was thinking about this some more, and I'm not quite ready to release GCC yet. I want to make sure a few things are done first:

Inline assembly with -G
Move all predicates into tms9900.md
Add function prototypes to remove compilation warnings
Move all test code out of the GCC tree.
Try position independent code


All done, but "-g" causes a crash:
coverage.c:304: internal compiler error: in simplify_subreg, at simplify-rtx.c:4959

Position independent code requires more substantial changes than I thought, so that won't work for now.

I think I'll release and do debug and PIC later.
I figured this was a pretty good time to fix my crt0 to do data initialization and BSS clearing. So, the crt0 was updated correctly, the elf2cart utility looks like it's OK, but I found a problem with GAS. When a DATA directive is used with no data value, no space is actually allocated.

That's bad.

But also fixed.

So at this point, initial values for variables are used as expected, and the conversion utility works like a champ.

Saturday, July 16, 2011

I got 32-bit multiply working, but it's awfully big. Since we only have instruction for 16-bit multiply, we need to expand the math.

R*G = (R0*K+R1)*(G0*K+G1) = R0*G0*K*K + R0*G1*K + R1*G0*K + R1*G1

At least we can omit the R0*G0 term since it won't fit into 32 bits. We will need a 32-bit temp value T stored in registers, and a 16-bit temp value H which could be stored in memory. That leaves us with this code:

mov r1, r5 # T0 = R0
mpy r4, r5 # T = R0 * G1
mov r6, r0 # H = T1
mov r2, r5 # T0 = R1
mpy r3, r5 # T = R1 * G0
mov r2, r1 # R0 = R1
mpy r4, r1 # R = R1 * G1
a r6, r1 # R1 += T1
a r0, r1 # R1 += H

Works fine, but this could be 30 bytes of code if the G and H terms are in memory. Not too bad if we only have one multiply per program, but for the compiler, I have to assume that won't be common.

So, it looks like 32-bit shifts, multiply, divide and modulo will be exiled to a library.

At long last, it looks like it's time for a release. I'm off to do cleanup.

Wednesday, July 13, 2011

I'm pretty confident about the 8 and 16 bit instructions, so I guess I can't put off checking the 32 bit instructions anymore. At least I can find out where the holes are, and which instructions should be moved to an external library.

to test:
shift right
shift left
logical shift right
multiply
divide
modulus

The shifts generate functional, if not efficient, code. I think I can get multiply without too much trouble, but divide and mudulus will have to be in an external library. I wrote the divide code earlier, and it requires a function of 25 instructions with lots of loops. I don't see how allowing that to be inlined would be a good idea.

Sunday, July 10, 2011

I found and fixed a stack problem in tms9900-objdump. The buffer allocated for the source address in the COC instruction was too short. Attempting to display an instruction like "coc @>1234(r5), r0" caused a stack overflow, crashing objdump. Fixed now. I hope that's the last binutils bug, but I've got a feeling more are still lurking in there.
More good news, inline assembly works just fine. I really didn't expect any problems, but it's good to know for sure.

I used 32-bit left shifts as my test for inlining, and I've improved quite a bit over the GCC generated code. That was 18 instructions, and awfully clunky. Here's the new code:

ci r0, 16
jeq shiftl_16
mov r2, r4
sla r1, r0
sla r2, r0
neg r0
srl r4, r0
soc r4, r1
ci r0, -16
jgt shiftl_end
shiftl_16:
mov r2, r1
clr r2
shiftl_end:

This is twelve instructions, and about 200 cycles. I can't see any good way to significantly reduce this sequence.

Wednesday, July 6, 2011

After my experience with word shifts, I decided to go back to byte shifts. The idea was to translate that effort and move on. Unfortunately, I noticed that GCC wants to do shifts as word quantities. I'm not sure why that is, but unfortunately, something like "(char)v>>=4" turns into:
sra r1, 8 * Convert from byte to word value
sra r1, 4 * Shift right the indicated amount
swpb r1 * Convert back to byte value

Ick. Using the existing shift-and-cast peephole optimizations I can reduce this to:
sra r1, 8 * Convert from byte to word value
sla r1, 4 * Shift right 4, shift left 8 to convert

Better, but not by much. I can add a peephole to change this sequence to a single shift, but I'll be back to this problem if the code is "(char)v = (v+1)>>4". The intervening add will defeat the peephole, bringing back the unnecessary byte-to-word and word-to-byte instructions.

From what I've seen so far, it looks like these promotions are expected behavior, and changing that would require getting into the guts of GCC again. I think I'll pass for now.

Saturday, July 2, 2011

Apparently, the constraints for the shift count register was too loose, resulting in unexpected registers being used in the IRA step, and later flagged as errors. I was seeing things like "srl r1, r2" which is super wrong.

This was fixed by rewriting all the shift instructions to use "define_expand" to copy the shift count into R0, then do the shift. Two instruction forms were then written, one which only acccepts R0 as the shift register, and then another which only accepts contants. The optimizer then eliminates the unneeded move when constant shifts are used. I'm awfully happy about how that works now.

Even though I don't have a 32-bit shift yet, GCC is happy to compose a sequence using 16-bit instructions. Unfortunately, that sequence is pretty big. "long shift_ar(long r, int n) {return(r>>n);}" gets converted to:

shift_ar
mov r3, r4
andi r4, >10
abs r4
jeq L2
mov r3, r0
mov r1, r2
sra r2, 0
sra r1, >F
b *r11
jmp L6
L2
mov r1, r4
sla r4, >1
mov r3, r0
inv r0
sla r4, 0
mov r3, r0
srl r2, 0
soc r4, r2
sra r1, 0
b *r11
L6

I think I can do better, but there are other things to do right now.