I've had a thought about these comparison patterns. If I make a pattern which only allows comparisons using registers, I can force GCC to use the efficient forms.
I'm removing the 32-bit comparisons to see what happens. Ideally, GCC will break this operation down into two comparisons. If that works, The .md file will be just a little bit smaller, which makes maintenance easier.
Here's an example with 32-bit values:
int compare_long_mem_eq_0(long *a) {return(*a == 0 ? 7: 9);}
Original output:
compare_long_reg_eq_0
mov r1, r3
mov r2, r4
ci r3, >0
jne $+>4
ci r4, >0
jeq L2
li r1, >9
b *r11
jmp L6
L2
li r1, >7
b *r11
L6
even
Results after removing "tstsi" and "cmpsi" patterns
compare_long_mem_eq_0
mov *r1, r2
soc @>2(r1), r2
mov r2, r1 <-- Not necessary, work on that later
jeq L51
li r1, >9
jmp L52
L51
li r1, >7
L52
b *r11
This is quite a bit better, and is the worst-looking output of all comparison forms. Most of the time, we can do the compare in one instruction. Typically, ORing the two words together.
Eliminating the 32-bit comparisons works really well, but causes the compiler to crash in some cases. It turns out that in the peephole optimization for comparing values to +-1 and +-2, I used a "general_operand" type, which allows memory references as well as registers. Unfortunately, I used "peephole2_reg_dead" to determine if that optimization should be used. That function explodes if called upon to check anything other than registers. If GCC tries to use one of those patterns when comparing a memory value, we will get a crash.
This was fixed by only allowing registers for the optimization pattern, and all is well. Honestly, that's what it should have been all along.
As a side effect of this work, the code that determines which instructions affect the condition flags has been rewritten from scratch. That code was originally taken from the PDP-11, and modified slightly. Now that I'm more familiar with the GCC internals, I can see where the flaws are. The new code is smaller, easier to follow as well as more correct.
Since the 32-bit comparison work went so well, I thought I'd monkey with the 16-bit comparisons. Instead of worrying about scratch registers, I'll just force all comparisons to be done in registers. Basically what will happen is that values will be copied into registers for the comparison, and later steps will realize that the move itself will set the comparison flags, resulting in clean looking output.
Sadly, that doesn't work out. Much more effort is done to set up the comparison than what I had before. Usually, there are several extra MOVs and no effort is made to use the additional optimizations. Oh well. What I have is still pretty good.
Saturday, September 10, 2011
Thursday, September 8, 2011
On the AtariAge forums, Tursi mentioned that C99 (I think) placed an EVEN after all string constants, which annoyed him. I can see why they did that, since I did the same thing in GCC. I'm not going to get to that right now, but I wanted to make a note of that so I wouldn't forget.
Currently, I'm looking into more bugs Lucien2 found while writing Rush Hour.
There was a problem with an "unrecognizable instruction" which was a nice easy fix. The problem here is that there was a pattern for comparisons against zero (X == 0), which called for a scratch register. This extra register is useful to handle cases where we need to compare against a post-incremented pointer (*r1+), or indexed addresses (@5(r1)). In these cases we can emit an instruction like "mov *r1+, r0" or "mov @5(r1), r0" which is smaller, faster, and has no side-effects.
If the need for a zero compare is made early enough, this scratch register request is ignored, and everything is fine. If this happens later, as the result of an optimization, GCC looks for an exact match. This fails, because of the scratch register request, which doesn't exist at this point.
This was fixed by making a second, unnamed pattern which compares against zero, but uses no scratch register. In this case, we emit an instruction like "mov @5(r1), @5(r1)", and disallow post-incremented pointers. This is not as efficient, but will work just fine.
Currently, I'm looking into more bugs Lucien2 found while writing Rush Hour.
There was a problem with an "unrecognizable instruction" which was a nice easy fix. The problem here is that there was a pattern for comparisons against zero (X == 0), which called for a scratch register. This extra register is useful to handle cases where we need to compare against a post-incremented pointer (*r1+), or indexed addresses (@5(r1)). In these cases we can emit an instruction like "mov *r1+, r0" or "mov @5(r1), r0" which is smaller, faster, and has no side-effects.
If the need for a zero compare is made early enough, this scratch register request is ignored, and everything is fine. If this happens later, as the result of an optimization, GCC looks for an exact match. This fails, because of the scratch register request, which doesn't exist at this point.
This was fixed by making a second, unnamed pattern which compares against zero, but uses no scratch register. In this case, we emit an instruction like "mov @5(r1), @5(r1)", and disallow post-incremented pointers. This is not as efficient, but will work just fine.
Wednesday, September 7, 2011
So I've been looking into this code, which is used to set a VDP address:
movb r9, @>8C02 <-- should be copying low byte
mov r9, r2
ori r2, >4000
movb r2, @>8C02
I turned on all debug output to dig into this sequence, and follow its development.
At some point during register alocation, this instruction is being deleted:
(insn 36 35 39 4 lucien2_5.c:18 (set (reg:QI 2 r2 [orig:40 i+1 ] [40])
(subreg:QI (reg/v:HI 9 r9 [orig:26 i ] [26]) 1)) 72 {movqi} (nil))
I'm not sure why, that looks fine to me. Instead, GCC assumes the wrong byte usage, eventually resulting in "movb r9, @>8C02".
The subreg expression seems to be removed entirely, and is replaced with an instruction like "mov r2, r2", as seen below:
Reloads for insn # 36
Reload 0: reload_in (QI) = (reg:QI 9 r9)
REAL_REGS, RELOAD_FOR_INPUT (opnum = 1), can't combine
reload_in_reg: (subreg:QI (reg/v:HI 9 r9 [orig:26 i ] [26]) 1)
reload_reg_rtx: (reg:QI 2 r2 [orig:40 i+1 ] [40])
Later this instruction is deleted, and since the subreg expression is lost, we use the wrong byte. So I need to find where this is being done.
Call tree for problem location:
ira
reload
alter_reg
df_ref_change_reg_with_loc
df_ref_change_reg_with_loc_1
reload_as_needed
subst_reloads
OK, here's the problem. The correct subreg expression needed for proper byte handling is removed in subst_reloads, but that code does not make the determination to do the removal. There is a data structure containing operations to apply for each instruction which subst_reloads duitfully applies. I need to find the code which makes the decisions. Which means I need to start over again. Ugh.
Once agin:
find_reloads
push_reload
The decision to do register substitution is done in find_reloads, but that determination is made due do a default handler. As I understand it, if a set instruction is used, but the input argument is to be reloaded, and that reload has not yet been made, the input is replaced with the output argument in an attempt to later remove this instruction. I now need to find why GCC needs to replace the subreg argument, since that looks perfectly fine to me. That decision looks to be somewhere in push_reload.
Nope. That is done in find_reloads. This is the most obtuse code I've looked at in a long time. The find_reloads function by itself is over 2000 lines of twisty logic. It's taken almost two weeks just to answer what I thought was a simple question.
It turns out that if there is a subreg expression as an instruction operand, GCC forces that expression to be reloaded, with the hope that it can be removed later. I have added a check to not do that in the case where a byte-to-word or word-to-byte expression is used. This preserves the subreg expression until instruction output, where we can emit the correct code.
So, at long last, we get good-looking code for this sequence:
mov r9, r2
swpb r2
movb r2, @>8C02
mov r9, r2
ori r2, >4000
movb r2, @>8C02
I'm cleaning out the TONS of debug output and path tracing code, and moving on to the next problem.
movb r9, @>8C02 <-- should be copying low byte
mov r9, r2
ori r2, >4000
movb r2, @>8C02
I turned on all debug output to dig into this sequence, and follow its development.
At some point during register alocation, this instruction is being deleted:
(insn 36 35 39 4 lucien2_5.c:18 (set (reg:QI 2 r2 [orig:40 i+1 ] [40])
(subreg:QI (reg/v:HI 9 r9 [orig:26 i ] [26]) 1)) 72 {movqi} (nil))
I'm not sure why, that looks fine to me. Instead, GCC assumes the wrong byte usage, eventually resulting in "movb r9, @>8C02".
The subreg expression seems to be removed entirely, and is replaced with an instruction like "mov r2, r2", as seen below:
Reloads for insn # 36
Reload 0: reload_in (QI) = (reg:QI 9 r9)
REAL_REGS, RELOAD_FOR_INPUT (opnum = 1), can't combine
reload_in_reg: (subreg:QI (reg/v:HI 9 r9 [orig:26 i ] [26]) 1)
reload_reg_rtx: (reg:QI 2 r2 [orig:40 i+1 ] [40])
Later this instruction is deleted, and since the subreg expression is lost, we use the wrong byte. So I need to find where this is being done.
Call tree for problem location:
ira
reload
alter_reg
df_ref_change_reg_with_loc
df_ref_change_reg_with_loc_1
reload_as_needed
subst_reloads
OK, here's the problem. The correct subreg expression needed for proper byte handling is removed in subst_reloads, but that code does not make the determination to do the removal. There is a data structure containing operations to apply for each instruction which subst_reloads duitfully applies. I need to find the code which makes the decisions. Which means I need to start over again. Ugh.
Once agin:
find_reloads
push_reload
The decision to do register substitution is done in find_reloads, but that determination is made due do a default handler. As I understand it, if a set instruction is used, but the input argument is to be reloaded, and that reload has not yet been made, the input is replaced with the output argument in an attempt to later remove this instruction. I now need to find why GCC needs to replace the subreg argument, since that looks perfectly fine to me. That decision looks to be somewhere in push_reload.
Nope. That is done in find_reloads. This is the most obtuse code I've looked at in a long time. The find_reloads function by itself is over 2000 lines of twisty logic. It's taken almost two weeks just to answer what I thought was a simple question.
It turns out that if there is a subreg expression as an instruction operand, GCC forces that expression to be reloaded, with the hope that it can be removed later. I have added a check to not do that in the case where a byte-to-word or word-to-byte expression is used. This preserves the subreg expression until instruction output, where we can emit the correct code.
So, at long last, we get good-looking code for this sequence:
mov r9, r2
swpb r2
movb r2, @>8C02
mov r9, r2
ori r2, >4000
movb r2, @>8C02
I'm cleaning out the TONS of debug output and path tracing code, and moving on to the next problem.
Friday, August 26, 2011
Here's an example of using the wrong byte:
movb r9, @>8C02 -> should be copying low byte
mov r9, r2
ori r2, >4000
movb r2, @>8C02
I'm not proud of the fix I made, but here it is:
There is a function named "reload_inner_reg_of_subreg" which allowed a subreg expression which assumes the byte is stored in the low byte of a register. I made a change to that function to force a reload in that case. So now we see code like this:
mov r9, r3
mov r3, r2 <-- Unnecessary MOV
swpb r2
movb r2, @>8C02
mov r9, r2
ori r2, >4000
movb r2, @>8C02
This will work, but there is an unnecessary "mov r3, r2" instruction. What's going on here is that we are reallocating the subreg subject. (In this example, allocate R3 instead of using R9). During the register reallocaction process, we find that we need to do extra work to get the byte value, which causes the "swpb" instruction to be emitted.
I think I can find a better way to fix this.
movb r9, @>8C02 -> should be copying low byte
mov r9, r2
ori r2, >4000
movb r2, @>8C02
I'm not proud of the fix I made, but here it is:
There is a function named "reload_inner_reg_of_subreg" which allowed a subreg expression which assumes the byte is stored in the low byte of a register. I made a change to that function to force a reload in that case. So now we see code like this:
mov r9, r3
mov r3, r2 <-- Unnecessary MOV
swpb r2
movb r2, @>8C02
mov r9, r2
ori r2, >4000
movb r2, @>8C02
This will work, but there is an unnecessary "mov r3, r2" instruction. What's going on here is that we are reallocating the subreg subject. (In this example, allocate R3 instead of using R9). During the register reallocaction process, we find that we need to do extra work to get the byte value, which causes the "swpb" instruction to be emitted.
I think I can find a better way to fix this.
Wednesday, August 24, 2011
I was looking at equates in the assembler, and noticed that for symbols defined before use end up with swapped bytes.
Example:
inc @equ
addr equ >1234
inc @equ
Ends up being assembled to:
inc @>3412
inc @>1234
Since the equate value is not known at the time the first reference is encountered, an internal fixup record is created for later evaluation. Unfortunately, md_apply_fix() used the wrong endianness when resolving these internal fixups. This was most likely an oversight when adapting code written for other targets to the TMS9900.
I also ran across do_org, which can assign a current address to assembled code. This allows skipping over memory in the code section. This is tempting to use for AORG, but I don't think that will work by itself. I need to think about this some more.
I've also added a new constant type to allow SBO, SBZ and TB to work properly.
Example:
inc @equ
addr equ >1234
inc @equ
Ends up being assembled to:
inc @>3412
inc @>1234
Since the equate value is not known at the time the first reference is encountered, an internal fixup record is created for later evaluation. Unfortunately, md_apply_fix() used the wrong endianness when resolving these internal fixups. This was most likely an oversight when adapting code written for other targets to the TMS9900.
I also ran across do_org, which can assign a current address to assembled code. This allows skipping over memory in the code section. This is tempting to use for AORG, but I don't think that will work by itself. I need to think about this some more.
I've also added a new constant type to allow SBO, SBZ and TB to work properly.
Tuesday, August 23, 2011
I've got my V9T9 disk tool almost done. I'm missing a few FIB fields when working with data files, but this should be easy to fix later.
I got a bug report that STST is causing problems in the assembler. The problem with STST is that I accidentally configured GAS to look for two arguments, when the instruction only takes one. If someone attempts to use the instruction properly, it will complain and emit an error. This is an easy one line fix.
It also looks like the CRU instructions need some attention. SBO SBZ and TB are set to use a constant in the same way as JMP, which doen't seem right.
It's late, so that will have to wait for tomorrow.
There was also a report that late EQU's were not being evaluated properly if referenced before the assignation.
Also, there's a typecast bug I have yet to look at.
I got a bug report that STST is causing problems in the assembler. The problem with STST is that I accidentally configured GAS to look for two arguments, when the instruction only takes one. If someone attempts to use the instruction properly, it will complain and emit an error. This is an easy one line fix.
It also looks like the CRU instructions need some attention. SBO SBZ and TB are set to use a constant in the same way as JMP, which doen't seem right.
It's late, so that will have to wait for tomorrow.
There was also a report that late EQU's were not being evaluated properly if referenced before the assignation.
Also, there's a typecast bug I have yet to look at.
Tuesday, August 16, 2011
I was looking a 32-bit division, and I was trying to find a way to take advantage of the 16-bit DIV instruction. Here's what I came up with:
X: numerator
Y: denominator
Q: ratio
N: 16-bit radix (2^16)
X/Y = Q
X = A*N+B
Y = C*N+D
N = 1<<16
Replace terms:
(A*N+B)/(C*N+D) = Q
B will be lost due to significant figures
Multiply denominator by ((1/C)/(1/C)), this eliminates 32-bit division:
(A*N)/((N+D/C)/C) = Q
Multiply numerator by ((1/(N+D/C))/(1/(N+D/C))):
((A*N)/((N+D/C))/C = Q
Multiply numerator by ((1/2)/(1/2)), this ensures all partial terms fit into a 16-bit quantity:
((A*N/2)/((N+D/C)/2))/C = Q
Decompose into partial terms:
V1 = D/C
V2 = N/2+V1/2
V3 = (A*N/2)/V2
P = V3/C
Due to the stackup of integer truncation, there will be rounding errors. Testing over the range of valid inputs shows that the result is accurate to +-1. Another step is required to fix this approximation.
Account for the approximation (Z is error due to truncation):
Q = P+Z
Replace Q with earlier equation
P+Z = A*N/(C*N+D)
Multiply both sides by (C*N+D):
A*N = (P*C*N+P*D)+(Z*C*N+Z*D)
Divide both sides by N, solve for Z.
Terms involving D are lost due to significant figures
A - P*C = Z*C
If Z<=0, any truncation error is covered by rounding error
If Z>0, the estimate "P" is one greater than the true result
So:
if(A > P*C), P := P-1
Untested assembly below. X Passed on [r1,r2], Y Passed on [r3,r4], result passed in [r1,r2], this assumes unsigned operands
# Cycles
mov r4, r5 # 14 : Copy D to temp register
clr r4 # 10 : Clear high word, prepare for division
div r3, r4 # 124 : R4 = C/D {V1}
srl r4, 1 # 14
ai r4, >1000 # 18 : R4 = N/2+V1/2 {V2}
mov r1, r5 # 14 : Save unmodified A for later
mov r1, r2 # 14
src r2, 1 # 14
ai r2, >1000 # 18
srl r1, 1 # 14 : [R1,R2] = A*N/2
div r1, r4 # 124 : R1 = (A*N/2)/V2 {V3}
mov r1, r2 # 14
clr r1 # 10
div r3, r1 # 124 : R1 = V3/C {P}
mpy r1, r3 # 52 : [R3,R4] = P*C
c r1, r5 # 14 : Compare A to P*C
jle +2 # 8 :
dec r1 # 10 :
mov r1, r2 # 14 : Move result into proper registers
clr r1 # 10 : [R1,R2] = P
total 634 clocks, 44 bytes (722 clocks including instruction loads)
This compares well to an earlier method using shifts and subtracts. That method uses a maximum of 7394 clocks and 50 bytes. (10082 clocks including instruction loads)
Assuming this all works out, this approach is fourteen times faster with a smaller footprint. I think I have a winner.
X: numerator
Y: denominator
Q: ratio
N: 16-bit radix (2^16)
X/Y = Q
X = A*N+B
Y = C*N+D
N = 1<<16
Replace terms:
(A*N+B)/(C*N+D) = Q
B will be lost due to significant figures
Multiply denominator by ((1/C)/(1/C)), this eliminates 32-bit division:
(A*N)/((N+D/C)/C) = Q
Multiply numerator by ((1/(N+D/C))/(1/(N+D/C))):
((A*N)/((N+D/C))/C = Q
Multiply numerator by ((1/2)/(1/2)), this ensures all partial terms fit into a 16-bit quantity:
((A*N/2)/((N+D/C)/2))/C = Q
Decompose into partial terms:
V1 = D/C
V2 = N/2+V1/2
V3 = (A*N/2)/V2
P = V3/C
Due to the stackup of integer truncation, there will be rounding errors. Testing over the range of valid inputs shows that the result is accurate to +-1. Another step is required to fix this approximation.
Account for the approximation (Z is error due to truncation):
Q = P+Z
Replace Q with earlier equation
P+Z = A*N/(C*N+D)
Multiply both sides by (C*N+D):
A*N = (P*C*N+P*D)+(Z*C*N+Z*D)
Divide both sides by N, solve for Z.
Terms involving D are lost due to significant figures
A - P*C = Z*C
If Z<=0, any truncation error is covered by rounding error
If Z>0, the estimate "P" is one greater than the true result
So:
if(A > P*C), P := P-1
Untested assembly below. X Passed on [r1,r2], Y Passed on [r3,r4], result passed in [r1,r2], this assumes unsigned operands
# Cycles
mov r4, r5 # 14 : Copy D to temp register
clr r4 # 10 : Clear high word, prepare for division
div r3, r4 # 124 : R4 = C/D {V1}
srl r4, 1 # 14
ai r4, >1000 # 18 : R4 = N/2+V1/2 {V2}
mov r1, r5 # 14 : Save unmodified A for later
mov r1, r2 # 14
src r2, 1 # 14
ai r2, >1000 # 18
srl r1, 1 # 14 : [R1,R2] = A*N/2
div r1, r4 # 124 : R1 = (A*N/2)/V2 {V3}
mov r1, r2 # 14
clr r1 # 10
div r3, r1 # 124 : R1 = V3/C {P}
mpy r1, r3 # 52 : [R3,R4] = P*C
c r1, r5 # 14 : Compare A to P*C
jle +2 # 8 :
dec r1 # 10 :
mov r1, r2 # 14 : Move result into proper registers
clr r1 # 10 : [R1,R2] = P
total 634 clocks, 44 bytes (722 clocks including instruction loads)
This compares well to an earlier method using shifts and subtracts. That method uses a maximum of 7394 clocks and 50 bytes. (10082 clocks including instruction loads)
Assuming this all works out, this approach is fourteen times faster with a smaller footprint. I think I have a winner.
Monday, August 8, 2011
OK, it's patch time again.
Here's the changes in this release:
Fixed bug with byte initilizers, it was handling negative values wrongly
Fixed multiply bug, it was using the wrong registers
Changed grame pointer from R8 to R9. Frame was being lost
Byte reads from memory were assumed to be copied into register's LSB.
Fixed a problem with AND improperly modifying temp values.
Fixed a bug where R11 was not saved if used as a temp register.
Modified output to use hex values for all constants
I've also packaged up the ELF to EA5 converter and an example program made to run as an EA5 image.
The next thing on my list is to update all the documentation. Everything I've posted so far is still valid, but there are probably holes where some subjects need more information.
I also need to put together a library for the missing 32- but functions (multiply, divide, modulus, shift). These functions are alredy written and tested for the most psrt, so releasing them should be quick and easy.
Finally, I need to make my V9T9 disk tool ready for public consumption. It currently works, and the disk images it creates were used to test the EA5 converter, but it's super hacky at the moment. Once I spruce it up a bit and turn it into a useful tool, I can send it out the door.
Here's the changes in this release:
Fixed bug with byte initilizers, it was handling negative values wrongly
Fixed multiply bug, it was using the wrong registers
Changed grame pointer from R8 to R9. Frame was being lost
Byte reads from memory were assumed to be copied into register's LSB.
Fixed a problem with AND improperly modifying temp values.
Fixed a bug where R11 was not saved if used as a temp register.
Modified output to use hex values for all constants
I've also packaged up the ELF to EA5 converter and an example program made to run as an EA5 image.
The next thing on my list is to update all the documentation. Everything I've posted so far is still valid, but there are probably holes where some subjects need more information.
I also need to put together a library for the missing 32- but functions (multiply, divide, modulus, shift). These functions are alredy written and tested for the most psrt, so releasing them should be quick and easy.
Finally, I need to make my V9T9 disk tool ready for public consumption. It currently works, and the disk images it creates were used to test the EA5 converter, but it's super hacky at the moment. Once I spruce it up a bit and turn it into a useful tool, I can send it out the door.
Sunday, August 7, 2011
I was just about to devliver some patches, but remembered that I wanted to confirm that R11 and the fake PC register were working as expected. The PC works just fine, but R11 was not being saved properly if it was used as a temp register in a leaf function.
I found this out by making a function with 16 volatile ints stored on that stack, and returned the sum of all of them. That prevented the optimizer from removing one of these values, ensuring that all registers would be used. Well, all except the stack pointer, that's off-limits for obvious reasons. The C code and assembly output are shown below.
So, with that fixed and out of the way, I can finish the release.
int regtest(int a1, int a2, int a3, int a4, int a5, int a6)
{
volatile a7, a8, a9, a10, a11, a12, a13, a14, a15, a16;
return (a1+a2+a3+a4+a5+a6+a7+a8+a9+a10+a11+a12+a13+a14+a15+a16);
}
regtest
# R1 - R6 are used by the function arguments,
# making this test slightly smaller
# Allocate 30 bytes on the stack
ai r10, >FFE2
# Save non-volatile registers to the stack
mov r10, r0
mov r11, *r0+
mov r9, *r0+
mov r13, *r0+
mov r14, *r0+
mov r15, *r0
# Copy our junk data from the stack to registers
mov *r10, r15
mov @>2(r10), r6
mov @>4(r10), r14
mov @>6(r10), r13
mov @>8(r10), r9
mov @>A(r10), r11
mov @>C(r10), r0
mov @>E(r10), r12
mov @>10(r10), r8
mov @>12(r10), r7
# Add everything up
a r15, r6
a r14, r6
a r13, r6
a r9, r6
a r11, r6
a r0, r6
a r12, r6
a r8, r6
a r7, r6
a r1, r6
a r2, r6
a r3, r6
a r4, r6
a r5, r6
mov r6, r1
# As expected, the 16th value didn't make it into a register
a @>1C(r10), r1
# Restore non-volatile registers from the stack
# This also frees 8 bytes of stack space
mov *r10+, r11
mov *r10+, r9
mov *r10+, r13
mov *r10+, r14
mov *r10, r15
# Free the rest of the allocated stack space (22 + 8 = 30)
ai r10, >16
b *r11
I found this out by making a function with 16 volatile ints stored on that stack, and returned the sum of all of them. That prevented the optimizer from removing one of these values, ensuring that all registers would be used. Well, all except the stack pointer, that's off-limits for obvious reasons. The C code and assembly output are shown below.
So, with that fixed and out of the way, I can finish the release.
int regtest(int a1, int a2, int a3, int a4, int a5, int a6)
{
volatile a7, a8, a9, a10, a11, a12, a13, a14, a15, a16;
return (a1+a2+a3+a4+a5+a6+a7+a8+a9+a10+a11+a12+a13+a14+a15+a16);
}
regtest
# R1 - R6 are used by the function arguments,
# making this test slightly smaller
# Allocate 30 bytes on the stack
ai r10, >FFE2
# Save non-volatile registers to the stack
mov r10, r0
mov r11, *r0+
mov r9, *r0+
mov r13, *r0+
mov r14, *r0+
mov r15, *r0
# Copy our junk data from the stack to registers
mov *r10, r15
mov @>2(r10), r6
mov @>4(r10), r14
mov @>6(r10), r13
mov @>8(r10), r9
mov @>A(r10), r11
mov @>C(r10), r0
mov @>E(r10), r12
mov @>10(r10), r8
mov @>12(r10), r7
# Add everything up
a r15, r6
a r14, r6
a r13, r6
a r9, r6
a r11, r6
a r0, r6
a r12, r6
a r8, r6
a r7, r6
a r1, r6
a r2, r6
a r3, r6
a r4, r6
a r5, r6
mov r6, r1
# As expected, the 16th value didn't make it into a register
a @>1C(r10), r1
# Restore non-volatile registers from the stack
# This also frees 8 bytes of stack space
mov *r10+, r11
mov *r10+, r9
mov *r10+, r13
mov *r10+, r14
mov *r10, r15
# Free the rest of the allocated stack space (22 + 8 = 30)
ai r10, >16
b *r11
Saturday, August 6, 2011
Well, it turns out that I can't use define_split for AND, since GCC requires exactly two RTL expressions after the split. Unfortunately, I need three split expressions for AND. What ends up happening in that case is that the entire split pattern is ignored, and errors are emitted during compilation.
What I ended up doing was to use the ANDHI3 pattern for these AND expressions, and use unnamed patterns for ANDI and AND with an inverted operand. As a named pattern, I can demand scratch registers. The unnamed patterns are only invoked if they match exactly, so they can't use scratches. This seems to work pretty well. Here's the output for "int and_mem_mem() {return(memb_a & memb_b);}":
and_mem_mem
movb @memb_b, r1
movb @memb_a, r2
inv r2
szcb r2, r1
sra r1, 8
b *r11
The instruction order is slightly different than I expected, but this is perfectly fine. The word AND forms have been modifified to match the changes I made to the byte forms.
With this out of the way, I can make a new set of patches and update the documentation. It looks like people are getting interested in using the compiler for their own projects, and I need to make sure all the information they might need is correct and available.
What I ended up doing was to use the ANDHI3 pattern for these AND expressions, and use unnamed patterns for ANDI and AND with an inverted operand. As a named pattern, I can demand scratch registers. The unnamed patterns are only invoked if they match exactly, so they can't use scratches. This seems to work pretty well. Here's the output for "int and_mem_mem() {return(memb_a & memb_b);}":
and_mem_mem
movb @memb_b, r1
movb @memb_a, r2
inv r2
szcb r2, r1
sra r1, 8
b *r11
The instruction order is slightly different than I expected, but this is perfectly fine. The word AND forms have been modifified to match the changes I made to the byte forms.
With this out of the way, I can make a new set of patches and update the documentation. It looks like people are getting interested in using the compiler for their own projects, and I need to make sure all the information they might need is correct and available.
Thursday, August 4, 2011
I've found the cause of the omitted type casts. It turns out that I missed a branch in the instruction combination step which specifically checks for extension of values stored in memory. The exiting code did the usual job of assuming that typecasts from byte to word can be removed without consequence. Of course this is a tragic mistake for the TMS9900.
I need to find a way to allocate a scratch register during split to convert this:
and_mem_mem
movb @memb_a, r1
inv r1
movb @memb_b, r2
szcb r1, r2
movb r2, r1
sra r1, 8
b *r11
to this:
and_mem_mem
movb @memb_a, r2
inv r2
movb @memb_b, r1
szcb r2, r1 <--- eliminates MOV instruction
sra r1, 8
b *r11
Other stuff to do:
Check PC as fixed function again, might be different now.
Check R11 used as data register in non-leaf functions
I need to find a way to allocate a scratch register during split to convert this:
and_mem_mem
movb @memb_a, r1
inv r1
movb @memb_b, r2
szcb r1, r2
movb r2, r1
sra r1, 8
b *r11
to this:
and_mem_mem
movb @memb_a, r2
inv r2
movb @memb_b, r1
szcb r2, r1 <--- eliminates MOV instruction
sra r1, 8
b *r11
Other stuff to do:
Check PC as fixed function again, might be different now.
Check R11 used as data register in non-leaf functions
Tuesday, August 2, 2011
Over on the AtariAge forums, Lucien2 is finding all sorts of neat stuff that's broken.
Here's the latest:
int test()
{
return( (*(char*)0x837C) & 31);
}
This gets converted to:
test
movb @>837C, r1
andi r1, >1F
b *r11
GCC is falsely assuming that the byte value is stored in the low-order byte of R1. It's optimizing out the typecast, resulting in bad code.
I've changed all the constant address code to use hex values instead of decimal. It looks much better now.
In order to unify output, I should convert all shift-by-eights to use hex shift counts.
Another odd thing:
int test()
{
return( (*(char*)0x837C) & (*(char*)0x820C));
}
Gets converted to this:
test
inv @>837C
movb @>820C, r1
szcb @>837C, r1
Technicallly, we do need to invert before using SZCB, but we shouldn't invert memory without a good reason. That invert should be done in a register.
Here's the latest:
int test()
{
return( (*(char*)0x837C) & 31);
}
This gets converted to:
test
movb @>837C, r1
andi r1, >1F
b *r11
GCC is falsely assuming that the byte value is stored in the low-order byte of R1. It's optimizing out the typecast, resulting in bad code.
I've changed all the constant address code to use hex values instead of decimal. It looks much better now.
In order to unify output, I should convert all shift-by-eights to use hex shift counts.
Another odd thing:
int test()
{
return( (*(char*)0x837C) & (*(char*)0x820C));
}
Gets converted to this:
test
inv @>837C
movb @>820C, r1
szcb @>837C, r1
Technicallly, we do need to invert before using SZCB, but we shouldn't invert memory without a good reason. That invert should be done in a register.
Monday, July 25, 2011
On the AtariAge forums, lucien2 found a problem with byte initilizers. Apparently this sequence:
li r1, >8C02
li r2, >D00
movb r2, *r1
li r1, >8C02
li r2, >8700
movb r2, *r1
was converted into this one, destroying the >0D value.
li r1, >8C02
li r2, >FF87
movb r2, *r1
swpb r2
movb r2, *r1
This was caused by not masking the values together before ORing them into the second constant. In this case the sign extended bits of >87 were stomping on >0D, resulting in a corrupt value. I'm posting the two-liner patch on the forum.
He also found another problem using -O0, which I usually avoid. The register I chose for the frame pointer, R8, is volatile. That means that it can be destroyed over a function call. This was just a dumb mistake. In order to preserve the ABI interface, I've moved the frame pointer to R9, which is preserved across function calls. The resulting code looks much safer now.
li r1, >8C02
li r2, >D00
movb r2, *r1
li r1, >8C02
li r2, >8700
movb r2, *r1
was converted into this one, destroying the >0D value.
li r1, >8C02
li r2, >FF87
movb r2, *r1
swpb r2
movb r2, *r1
This was caused by not masking the values together before ORing them into the second constant. In this case the sign extended bits of >87 were stomping on >0D, resulting in a corrupt value. I'm posting the two-liner patch on the forum.
He also found another problem using -O0, which I usually avoid. The register I chose for the frame pointer, R8, is volatile. That means that it can be destroyed over a function call. This was just a dumb mistake. In order to preserve the ABI interface, I've moved the frame pointer to R9, which is preserved across function calls. The resulting code looks much safer now.
I was working on the EA5 converter today, and got it mostly working, but I've come to a design dilemma.
The conversion utility fills out a structure describing the layout of the data and bss sections. I can do the same thing in a linker script, and make the conversion simpler. However that adds complexity which is annoying, and potentially confusing for novice users.
As a related thought, I should look at the default linker script. It really should follow the requirements for EA5 files (.text at A000). I've got some thinking to do...
The conversion utility fills out a structure describing the layout of the data and bss sections. I can do the same thing in a linker script, and make the conversion simpler. However that adds complexity which is annoying, and potentially confusing for novice users.
As a related thought, I should look at the default linker script. It really should follow the requirements for EA5 files (.text at A000). I've got some thinking to do...
Saturday, July 23, 2011
This is what was posted to the AtariAge forums:
Update time!
It's about six months later than promised, but I haven't given up yet.
Most of that time has been putting in a ton of hours at work and beating on the GCC code to get byte operations working properly. What's in this release is the fourth or fifth overhaul of the port. In the end I had to rewrite core bits of how GCC relates byte and word quantities. I've kept those changes to a minimum, so ports to later versions should still work.
Here's what got changed in this patch:
Add optimization to remove redundant moves in int-to-char casts
Remove invalid CB compare immediate mode.
Add optimizations for byte immediate comparison
Added optimizations for shift and cast forms like (byte)X=(int)X>>N
Remove invalid compare immediate with memory
Improved support for subtract immediate
Fixed bug causing gibberish in assembly output
GCC now recognizes that bit shift operations set the comparison flags
Fixed bug causing bytewise AND to operate on the wrong byte
Add optimization for loading byte arrays into memory
Confirmed that variadic functions work properly.
Fixed the subtract instruction to handle constants
Fixed the CI instruction, it was allowing memory operands
Fixed a bug allowing the fake PC register to be used as a real register
Encourage memory-to-memory copies instead of mem-reg-mem copies
Added optimization to eliminate INV-INV-SZC sequences
Modify GCC's register allocation engine to handle TMS9900 byte values
Remove the 32 fake 8-bit registers. GCC now uses 16 16-bit registers
Modify memory addressing to handle forms like @LABEL+CONSTANT(Rn)
Clean up output assembly by vertically aligning operands
Clean up output by combining constant expressions
Optimize left shift byte quantities
Fixed a bug where SZC used the wrong register
Removed C instruction for "+=4" forms, AI is twice as fast
Added 32-bit negate
Fixed 32-bit subtract
Fixed a bug causing MUL to use the wrong register
Fixed a bug allowing shifts to use shift counts in the wrong register
Confirmed that inline assembly works correctly
Added optimization to convert "ANDI Rn, >00FF" to "SB Rn,Rn"
Optimize compare-with-zero instructions by using a temp register
Fixed a bug allowing *Rn and *Rn+ memory modes to be confused
Removed most warnings from the build process
There were also changes made to binutils, I hope this will be the last update for this.
More meaningful error messages from the assembler
DATA and BYTE constructs with no value did not allocate space
Fix core dump in tms9900-objdump during disassembly
The ELF conversion utility was also updated to allow crt0 to properly set memory before the C code executes. If it finds a "_init_data" label in the ELF file, it will fill out a record with all the information crt0 needs to do the initialization.
In light of all these changes, I've made a new "hello world" program with lots of comments, a Makefile and all supporting files. I've also included the compiled .o, .elf, and converted cart image. In addition, there's also a hello.s file which is the assembly output from the compiler.
I'm not sure if I mentioned this earlier, but the tms9900-as assembler will accept TI-syntax assembly files, but there are a number of additions:
Added "or", "orb" aliases for "soc" and "socb" (that's been a gotcha for a several people here)
Added "textz" directive - This appends a zero byte to the data.
textz "1234" is equivalent to "byte >31, >32, >33, >34, 0"
Added "ntext" directive - This prepends the byte count to the data.
"ntext '1234'" is equivalent to "byte 4, >31, >32, >33, >34"
Added "string" variants to all "text" directives
No length limit for label names
No limitation for constant calculations, all operations are allowed (xor, and, or, shifts, etc.)
It think thats about enough for now
I believe this is the biggest jump in usefulness yet. I've gone through and tested every instruction, and written several tests programs which did semi-interesting things from the compiler's point of view. They were, however, exceptionally dull from a user's point of view. For all the blow-by-blow details, check out my blog.
As a final test of the byte handling code, I built that chess program posted back in December. No problems were seen and no hinky-looking code was generated. In addition, it was about 5% smaller.
The build instructions are listed in post #43, and haven't changed since.
So, let me know what you think,
Update time!
It's about six months later than promised, but I haven't given up yet.
Most of that time has been putting in a ton of hours at work and beating on the GCC code to get byte operations working properly. What's in this release is the fourth or fifth overhaul of the port. In the end I had to rewrite core bits of how GCC relates byte and word quantities. I've kept those changes to a minimum, so ports to later versions should still work.
Here's what got changed in this patch:
Add optimization to remove redundant moves in int-to-char casts
Remove invalid CB compare immediate mode.
Add optimizations for byte immediate comparison
Added optimizations for shift and cast forms like (byte)X=(int)X>>N
Remove invalid compare immediate with memory
Improved support for subtract immediate
Fixed bug causing gibberish in assembly output
GCC now recognizes that bit shift operations set the comparison flags
Fixed bug causing bytewise AND to operate on the wrong byte
Add optimization for loading byte arrays into memory
Confirmed that variadic functions work properly.
Fixed the subtract instruction to handle constants
Fixed the CI instruction, it was allowing memory operands
Fixed a bug allowing the fake PC register to be used as a real register
Encourage memory-to-memory copies instead of mem-reg-mem copies
Added optimization to eliminate INV-INV-SZC sequences
Modify GCC's register allocation engine to handle TMS9900 byte values
Remove the 32 fake 8-bit registers. GCC now uses 16 16-bit registers
Modify memory addressing to handle forms like @LABEL+CONSTANT(Rn)
Clean up output assembly by vertically aligning operands
Clean up output by combining constant expressions
Optimize left shift byte quantities
Fixed a bug where SZC used the wrong register
Removed C instruction for "+=4" forms, AI is twice as fast
Added 32-bit negate
Fixed 32-bit subtract
Fixed a bug causing MUL to use the wrong register
Fixed a bug allowing shifts to use shift counts in the wrong register
Confirmed that inline assembly works correctly
Added optimization to convert "ANDI Rn, >00FF" to "SB Rn,Rn"
Optimize compare-with-zero instructions by using a temp register
Fixed a bug allowing *Rn and *Rn+ memory modes to be confused
Removed most warnings from the build process
There were also changes made to binutils, I hope this will be the last update for this.
More meaningful error messages from the assembler
DATA and BYTE constructs with no value did not allocate space
Fix core dump in tms9900-objdump during disassembly
The ELF conversion utility was also updated to allow crt0 to properly set memory before the C code executes. If it finds a "_init_data" label in the ELF file, it will fill out a record with all the information crt0 needs to do the initialization.
In light of all these changes, I've made a new "hello world" program with lots of comments, a Makefile and all supporting files. I've also included the compiled .o, .elf, and converted cart image. In addition, there's also a hello.s file which is the assembly output from the compiler.
I'm not sure if I mentioned this earlier, but the tms9900-as assembler will accept TI-syntax assembly files, but there are a number of additions:
Added "or", "orb" aliases for "soc" and "socb" (that's been a gotcha for a several people here)
Added "textz" directive - This appends a zero byte to the data.
textz "1234" is equivalent to "byte >31, >32, >33, >34, 0"
Added "ntext" directive - This prepends the byte count to the data.
"ntext '1234'" is equivalent to "byte 4, >31, >32, >33, >34"
Added "string" variants to all "text" directives
No length limit for label names
No limitation for constant calculations, all operations are allowed (xor, and, or, shifts, etc.)
It think thats about enough for now
I believe this is the biggest jump in usefulness yet. I've gone through and tested every instruction, and written several tests programs which did semi-interesting things from the compiler's point of view. They were, however, exceptionally dull from a user's point of view. For all the blow-by-blow details, check out my blog.
As a final test of the byte handling code, I built that chess program posted back in December. No problems were seen and no hinky-looking code was generated. In addition, it was about 5% smaller.
The build instructions are listed in post #43, and haven't changed since.
So, let me know what you think,
Friday, July 22, 2011
I thought I should write a "hello world" framework for anyone who wanted to use a working model as a starting point. That was apparently a good idea. I found a heck of a tricky bug to nail down. When performing a bytewise comparison with zero, the following instruction was generated:
mov *r2+, *r2+
This is incrementing the pointer twice. In this case, the comparison was used to find a null terminator for a string. The extra increment caused the terminator to be skipped, and the calculated length was nonsensical.
Fixed by using a temp register for the second argument. This is an optimization I was considering for a while. didn't fix 32-bit stuff yet.
This problem was also caused by not keeping a strict distinction between *Rn and *Rn+. If this isn't caught in instructions which use a repeated operand, like the one above, we will have some nasty side-effects.
mov *r2+, *r2+
This is incrementing the pointer twice. In this case, the comparison was used to find a null terminator for a string. The extra increment caused the terminator to be skipped, and the calculated length was nonsensical.
Fixed by using a temp register for the second argument. This is an optimization I was considering for a while. didn't fix 32-bit stuff yet.
This problem was also caused by not keeping a strict distinction between *Rn and *Rn+. If this isn't caught in instructions which use a repeated operand, like the one above, we will have some nasty side-effects.
Thursday, July 21, 2011
OK, I've gone through the source tree and removed all the dead files and reversed all whitespace changes to reduce the number of files which appear in the GCC patch.
I've also found a use for that SB instruction I've mentioned a few times. That will be used to replace an "ANDI Rn, >00FF" instruction. I don't think that constant will be used very often, but this was mostly done to make me feel better.
All ready for release!
I've also found a use for that SB instruction I've mentioned a few times. That will be used to replace an "ANDI Rn, >00FF" instruction. I don't think that constant will be used very often, but this was mostly done to make me feel better.
All ready for release!
OK, I've gone through the source tree and removed all the dead files and reversed all whitespace changes to reduce the number of files which appear in the GCC patch.
I've also found a use for that SB instruction I've mentioned a few times. That will be used to replace an "ANDI Rn, >00FF" instruction. I don't think that constant will be used very often, but this was mostly done to make me feel better.
All ready for release!
I've also found a use for that SB instruction I've mentioned a few times. That will be used to replace an "ANDI Rn, >00FF" instruction. I don't think that constant will be used very often, but this was mostly done to make me feel better.
All ready for release!
Monday, July 18, 2011
I got patches put together for binutils. GCC nees more cleanup. I need to remove predicates.md, and make prototypes for the .c functions to make the build process less cranky.
I was thinking about this some more, and I'm not quite ready to release GCC yet. I want to make sure a few things are done first:
Inline assembly with -G
Move all predicates into tms9900.md
Add function prototypes to remove compilation warnings
Move all test code out of the GCC tree.
Try position independent code
All done, but "-g" causes a crash:
coverage.c:304: internal compiler error: in simplify_subreg, at simplify-rtx.c:4959
Position independent code requires more substantial changes than I thought, so that won't work for now.
I think I'll release and do debug and PIC later.
I was thinking about this some more, and I'm not quite ready to release GCC yet. I want to make sure a few things are done first:
Inline assembly with -G
Move all predicates into tms9900.md
Add function prototypes to remove compilation warnings
Move all test code out of the GCC tree.
Try position independent code
All done, but "-g" causes a crash:
coverage.c:304: internal compiler error: in simplify_subreg, at simplify-rtx.c:4959
Position independent code requires more substantial changes than I thought, so that won't work for now.
I think I'll release and do debug and PIC later.
I figured this was a pretty good time to fix my crt0 to do data initialization and BSS clearing. So, the crt0 was updated correctly, the elf2cart utility looks like it's OK, but I found a problem with GAS. When a DATA directive is used with no data value, no space is actually allocated.
That's bad.
But also fixed.
So at this point, initial values for variables are used as expected, and the conversion utility works like a champ.
That's bad.
But also fixed.
So at this point, initial values for variables are used as expected, and the conversion utility works like a champ.
Saturday, July 16, 2011
I got 32-bit multiply working, but it's awfully big. Since we only have instruction for 16-bit multiply, we need to expand the math.
R*G = (R0*K+R1)*(G0*K+G1) = R0*G0*K*K + R0*G1*K + R1*G0*K + R1*G1
At least we can omit the R0*G0 term since it won't fit into 32 bits. We will need a 32-bit temp value T stored in registers, and a 16-bit temp value H which could be stored in memory. That leaves us with this code:
mov r1, r5 # T0 = R0
mpy r4, r5 # T = R0 * G1
mov r6, r0 # H = T1
mov r2, r5 # T0 = R1
mpy r3, r5 # T = R1 * G0
mov r2, r1 # R0 = R1
mpy r4, r1 # R = R1 * G1
a r6, r1 # R1 += T1
a r0, r1 # R1 += H
Works fine, but this could be 30 bytes of code if the G and H terms are in memory. Not too bad if we only have one multiply per program, but for the compiler, I have to assume that won't be common.
So, it looks like 32-bit shifts, multiply, divide and modulo will be exiled to a library.
At long last, it looks like it's time for a release. I'm off to do cleanup.
R*G = (R0*K+R1)*(G0*K+G1) = R0*G0*K*K + R0*G1*K + R1*G0*K + R1*G1
At least we can omit the R0*G0 term since it won't fit into 32 bits. We will need a 32-bit temp value T stored in registers, and a 16-bit temp value H which could be stored in memory. That leaves us with this code:
mov r1, r5 # T0 = R0
mpy r4, r5 # T = R0 * G1
mov r6, r0 # H = T1
mov r2, r5 # T0 = R1
mpy r3, r5 # T = R1 * G0
mov r2, r1 # R0 = R1
mpy r4, r1 # R = R1 * G1
a r6, r1 # R1 += T1
a r0, r1 # R1 += H
Works fine, but this could be 30 bytes of code if the G and H terms are in memory. Not too bad if we only have one multiply per program, but for the compiler, I have to assume that won't be common.
So, it looks like 32-bit shifts, multiply, divide and modulo will be exiled to a library.
At long last, it looks like it's time for a release. I'm off to do cleanup.
Wednesday, July 13, 2011
I'm pretty confident about the 8 and 16 bit instructions, so I guess I can't put off checking the 32 bit instructions anymore. At least I can find out where the holes are, and which instructions should be moved to an external library.
to test:
shift right
shift left
logical shift right
multiply
divide
modulus
The shifts generate functional, if not efficient, code. I think I can get multiply without too much trouble, but divide and mudulus will have to be in an external library. I wrote the divide code earlier, and it requires a function of 25 instructions with lots of loops. I don't see how allowing that to be inlined would be a good idea.
to test:
shift right
shift left
logical shift right
multiply
divide
modulus
The shifts generate functional, if not efficient, code. I think I can get multiply without too much trouble, but divide and mudulus will have to be in an external library. I wrote the divide code earlier, and it requires a function of 25 instructions with lots of loops. I don't see how allowing that to be inlined would be a good idea.
Sunday, July 10, 2011
I found and fixed a stack problem in tms9900-objdump. The buffer allocated for the source address in the COC instruction was too short. Attempting to display an instruction like "coc @>1234(r5), r0" caused a stack overflow, crashing objdump. Fixed now. I hope that's the last binutils bug, but I've got a feeling more are still lurking in there.
More good news, inline assembly works just fine. I really didn't expect any problems, but it's good to know for sure.
I used 32-bit left shifts as my test for inlining, and I've improved quite a bit over the GCC generated code. That was 18 instructions, and awfully clunky. Here's the new code:
ci r0, 16
jeq shiftl_16
mov r2, r4
sla r1, r0
sla r2, r0
neg r0
srl r4, r0
soc r4, r1
ci r0, -16
jgt shiftl_end
shiftl_16:
mov r2, r1
clr r2
shiftl_end:
This is twelve instructions, and about 200 cycles. I can't see any good way to significantly reduce this sequence.
I used 32-bit left shifts as my test for inlining, and I've improved quite a bit over the GCC generated code. That was 18 instructions, and awfully clunky. Here's the new code:
ci r0, 16
jeq shiftl_16
mov r2, r4
sla r1, r0
sla r2, r0
neg r0
srl r4, r0
soc r4, r1
ci r0, -16
jgt shiftl_end
shiftl_16:
mov r2, r1
clr r2
shiftl_end:
This is twelve instructions, and about 200 cycles. I can't see any good way to significantly reduce this sequence.
Wednesday, July 6, 2011
After my experience with word shifts, I decided to go back to byte shifts. The idea was to translate that effort and move on. Unfortunately, I noticed that GCC wants to do shifts as word quantities. I'm not sure why that is, but unfortunately, something like "(char)v>>=4" turns into:
sra r1, 8 * Convert from byte to word value
sra r1, 4 * Shift right the indicated amount
swpb r1 * Convert back to byte value
Ick. Using the existing shift-and-cast peephole optimizations I can reduce this to:
sra r1, 8 * Convert from byte to word value
sla r1, 4 * Shift right 4, shift left 8 to convert
Better, but not by much. I can add a peephole to change this sequence to a single shift, but I'll be back to this problem if the code is "(char)v = (v+1)>>4". The intervening add will defeat the peephole, bringing back the unnecessary byte-to-word and word-to-byte instructions.
From what I've seen so far, it looks like these promotions are expected behavior, and changing that would require getting into the guts of GCC again. I think I'll pass for now.
sra r1, 8 * Convert from byte to word value
sra r1, 4 * Shift right the indicated amount
swpb r1 * Convert back to byte value
Ick. Using the existing shift-and-cast peephole optimizations I can reduce this to:
sra r1, 8 * Convert from byte to word value
sla r1, 4 * Shift right 4, shift left 8 to convert
Better, but not by much. I can add a peephole to change this sequence to a single shift, but I'll be back to this problem if the code is "(char)v = (v+1)>>4". The intervening add will defeat the peephole, bringing back the unnecessary byte-to-word and word-to-byte instructions.
From what I've seen so far, it looks like these promotions are expected behavior, and changing that would require getting into the guts of GCC again. I think I'll pass for now.
Saturday, July 2, 2011
Apparently, the constraints for the shift count register was too loose, resulting in unexpected registers being used in the IRA step, and later flagged as errors. I was seeing things like "srl r1, r2" which is super wrong.
This was fixed by rewriting all the shift instructions to use "define_expand" to copy the shift count into R0, then do the shift. Two instruction forms were then written, one which only acccepts R0 as the shift register, and then another which only accepts contants. The optimizer then eliminates the unneeded move when constant shifts are used. I'm awfully happy about how that works now.
Even though I don't have a 32-bit shift yet, GCC is happy to compose a sequence using 16-bit instructions. Unfortunately, that sequence is pretty big. "long shift_ar(long r, int n) {return(r>>n);}" gets converted to:
shift_ar
mov r3, r4
andi r4, >10
abs r4
jeq L2
mov r3, r0
mov r1, r2
sra r2, 0
sra r1, >F
b *r11
jmp L6
L2
mov r1, r4
sla r4, >1
mov r3, r0
inv r0
sla r4, 0
mov r3, r0
srl r2, 0
soc r4, r2
sra r1, 0
b *r11
L6
I think I can do better, but there are other things to do right now.
This was fixed by rewriting all the shift instructions to use "define_expand" to copy the shift count into R0, then do the shift. Two instruction forms were then written, one which only acccepts R0 as the shift register, and then another which only accepts contants. The optimizer then eliminates the unneeded move when constant shifts are used. I'm awfully happy about how that works now.
Even though I don't have a 32-bit shift yet, GCC is happy to compose a sequence using 16-bit instructions. Unfortunately, that sequence is pretty big. "long shift_ar(long r, int n) {return(r>>n);}" gets converted to:
shift_ar
mov r3, r4
andi r4, >10
abs r4
jeq L2
mov r3, r0
mov r1, r2
sra r2, 0
sra r1, >F
b *r11
jmp L6
L2
mov r1, r4
sla r4, >1
mov r3, r0
inv r0
sla r4, 0
mov r3, r0
srl r2, 0
soc r4, r2
sra r1, 0
b *r11
L6
I think I can do better, but there are other things to do right now.
Monday, June 27, 2011
I've fixed the 16-bit multiply instructions. For some reason, when using a multiply instructions with registers as both operands, they get swapped. So instead of "A=A*B", I see "A=B*A". This is annoying, since if I constantly use one argument for all configurations in the assembly output, I could end up with "A=A*A" in some cases. Fortunately, this is easily handled. At some point, I need to find out where the swap comes from, but I'm OK with leaving this as it is for now.
Fun fact: GCC really, really doesn't want to use multiply. If one of the operands is constant, it'll prefer permutations of shifts and adds to calculate the result, even if this would be much more expensive for the TMS9900.
There's true oddness for the divide instructions. The instructions I had before work just fine, but GCC is selecting the opposite function than the one called for. For instance, if I write "A/B", I get back "A%B", and vice versa. The wierd thing is that this doesn't seem to be a bug in my code, but GCC is knowingly doing this for some reason. Research awaits.
OK, I'm losing my mind apparently. The DIV and MOD instructions are working just fine. Moving on...
That completes the list of known problems in the coverage tests. I may have missed some unsigned variants, so I'll need to double-check that. There might be a few more 32-bit instructions I could do too, but I think I'm good right now.
Fun fact: GCC really, really doesn't want to use multiply. If one of the operands is constant, it'll prefer permutations of shifts and adds to calculate the result, even if this would be much more expensive for the TMS9900.
There's true oddness for the divide instructions. The instructions I had before work just fine, but GCC is selecting the opposite function than the one called for. For instance, if I write "A/B", I get back "A%B", and vice versa. The wierd thing is that this doesn't seem to be a bug in my code, but GCC is knowingly doing this for some reason. Research awaits.
OK, I'm losing my mind apparently. The DIV and MOD instructions are working just fine. Moving on...
That completes the list of known problems in the coverage tests. I may have missed some unsigned variants, so I'll need to double-check that. There might be a few more 32-bit instructions I could do too, but I think I'm good right now.
Saturday, June 25, 2011
It looks like stack space is not being allocated for local variables when using -O0. This sounds familiar, so I'll need to go through the previous notes and see where things stand.
32-bit negate has been fixed, moving on.
The subtract opcode seems to be working wierd. The carry flag is being set opposite to what the carry-in would be. Odd, but not a big deal. Hopefully this is not an emulation bug, because I have no way of verifying this on real hardware right now.
I've also added optimizations for adding special constant values (-2,-1,0,1,2) in each of the words. Works fine.
32-bit negate has been fixed, moving on.
The subtract opcode seems to be working wierd. The carry flag is being set opposite to what the carry-in would be. Odd, but not a big deal. Hopefully this is not an emulation bug, because I have no way of verifying this on real hardware right now.
I've also added optimizations for adding special constant values (-2,-1,0,1,2) in each of the words. Works fine.
Wednesday, June 22, 2011
I need to test 32-bit subtract constant to see if it's correct. Seems OK during code inspection.
I also need to look at multiply, it doesn't seem to be handling type conversion and register allocation correctly.
Of course, divide will likely be broken too...
32-bit negate also looks shady.
int-to-long conversion: (r1 -> r2, r3)
mov r1, r3 4+14 2
seto r2 4+10 2
jlt $+4 4+(10..8) 2
clr r2 (0..4+10) 2
Total: (46..58) 8
Looks costly, let's try this:
mov r1, r3 4+14 2
mov r1, r2 4+14 2
sar r2, 15 4+12+2*15 2
Total: 82 6
Ugh, no. I'm sticking to what I had before.
So, after all that testing, I'm left with four problems (32-bit subtract, 32-bit negate, multiply, divide). That seems doable.
I also need to look at multiply, it doesn't seem to be handling type conversion and register allocation correctly.
Of course, divide will likely be broken too...
32-bit negate also looks shady.
int-to-long conversion: (r1 -> r2, r3)
mov r1, r3 4+14 2
seto r2 4+10 2
jlt $+4 4+(10..8) 2
clr r2 (0..4+10) 2
Total: (46..58) 8
Looks costly, let's try this:
mov r1, r3 4+14 2
mov r1, r2 4+14 2
sar r2, 15 4+12+2*15 2
Total: 82 6
Ugh, no. I'm sticking to what I had before.
So, after all that testing, I'm left with four problems (32-bit subtract, 32-bit negate, multiply, divide). That seems doable.
Tuesday, June 21, 2011
Coverage testing continues. I'm down to the add instructions, and I've decided to remove "c *r1+, *r1+" for "r1+=4". The math for this is below.
c *r1+, *r1+ clocks: 14+8+8=30 bytes: 2
ai r1, 4 clocks: 14+4 =18 bytes: 4
So the "c" form is half as big, but takes almost twice as long. I'm thinking now that speed is preferrable to size in the general case. For space-constrained code, this still shouldn't matter much, since +4 isn't likely to be used very often. So out it goes. I've left that code commented out in the MD file just in case I change my mind.
Here's some more unexpected stuff. I made a function which was just "return(memval++);", where memval was stored in memory. That resulted in this code:
mov @memval, r1 14+8 4
mov r1, r2 14 2
inc r2 10 2
mov r2, @memval 14+8 2
b *r11 8 2
Total: 76 clocks 12 bytes
That should have been:
inc @memval 10+8 4
mov @memval, r1 14+8 4
b *r11 8 2
Total: 48 clocks 10 bytes
Actually, now that I think about it, that's correct. The C code returns the current value of "memval", then increments it. By changing the C code to "return(++memval);", returning the incremented value, I get this:
mov @memval, r1 14+8 4
inc r1 10 2
mov r1, @memval 14+8 4
b *r11 8 2
Total: 62 clocks 12 bytes
It's a little closer to the expected code above, but still clunky. I can see why this code came out though. GCC is trying to minimize the number of bus transactions, and keep as much work in the registers as possible. This is not as important for the TMS9900, and we end up with suboptimal code.
I might be able to fix this by tweaking weights in the H file, but I can do that later.
I'm skipping the DIV instructions for now, I don't want to get sucked into that mess right now. For that matter, I'm skipping all 32-bit instructions too. I'll come back when 8 and 16 bit code is correct.
Down to the sign and zero extend instructions now. Remember the SB trick to clear the upper byte? That might be handy now.
srl r0, 8 12+2*8 2
Total: 28 2
swpb r0 10 2
sb r0, r0 14 2
Total: 24 4
Maybe not. I guess I'll leave this alone.
The shift-and-cast peepholes are broken after the recent GCC changes. So this is a good time to reevaluate the left shift and cast code. I have two options:
srl r0, N+8 12+2*8+2*N 2
Total: 28+2*N 2
srl r0, N 12+2*N 2
swpb r0 10 2
Total: 22+2*N 4
I guess I'm sticking with a single instruction then. There is a break in the pattern for a left shift of one:
a r1, r1 14 2
swpb r1 10 2
This is 24 cycles, compared to 30 cycles for a single instruction, which doesn't look too bad except that there is an additional 4 cycles imposed for reading an instruction from memory. The revised numbers would then be 32 and 34, a lot closer, but still slightly faster. I'll leave that alone for now.
Next up: optimized byte initializers.
c *r1+, *r1+ clocks: 14+8+8=30 bytes: 2
ai r1, 4 clocks: 14+4 =18 bytes: 4
So the "c" form is half as big, but takes almost twice as long. I'm thinking now that speed is preferrable to size in the general case. For space-constrained code, this still shouldn't matter much, since +4 isn't likely to be used very often. So out it goes. I've left that code commented out in the MD file just in case I change my mind.
Here's some more unexpected stuff. I made a function which was just "return(memval++);", where memval was stored in memory. That resulted in this code:
mov @memval, r1 14+8 4
mov r1, r2 14 2
inc r2 10 2
mov r2, @memval 14+8 2
b *r11 8 2
Total: 76 clocks 12 bytes
That should have been:
inc @memval 10+8 4
mov @memval, r1 14+8 4
b *r11 8 2
Total: 48 clocks 10 bytes
Actually, now that I think about it, that's correct. The C code returns the current value of "memval", then increments it. By changing the C code to "return(++memval);", returning the incremented value, I get this:
mov @memval, r1 14+8 4
inc r1 10 2
mov r1, @memval 14+8 4
b *r11 8 2
Total: 62 clocks 12 bytes
It's a little closer to the expected code above, but still clunky. I can see why this code came out though. GCC is trying to minimize the number of bus transactions, and keep as much work in the registers as possible. This is not as important for the TMS9900, and we end up with suboptimal code.
I might be able to fix this by tweaking weights in the H file, but I can do that later.
I'm skipping the DIV instructions for now, I don't want to get sucked into that mess right now. For that matter, I'm skipping all 32-bit instructions too. I'll come back when 8 and 16 bit code is correct.
Down to the sign and zero extend instructions now. Remember the SB trick to clear the upper byte? That might be handy now.
srl r0, 8 12+2*8 2
Total: 28 2
swpb r0 10 2
sb r0, r0 14 2
Total: 24 4
Maybe not. I guess I'll leave this alone.
The shift-and-cast peepholes are broken after the recent GCC changes. So this is a good time to reevaluate the left shift and cast code. I have two options:
srl r0, N+8 12+2*8+2*N 2
Total: 28+2*N 2
srl r0, N 12+2*N 2
swpb r0 10 2
Total: 22+2*N 4
I guess I'm sticking with a single instruction then. There is a break in the pattern for a left shift of one:
a r1, r1 14 2
swpb r1 10 2
This is 24 cycles, compared to 30 cycles for a single instruction, which doesn't look too bad except that there is an additional 4 cycles imposed for reading an instruction from memory. The revised numbers would then be 32 and 34, a lot closer, but still slightly faster. I'll leave that alone for now.
Next up: optimized byte initializers.
Monday, June 20, 2011
I'm glad I'm doing these coverage tests, since I've just found another problem with shift operations. Performing "(char)X<<=2" results in this sequence:
sra r1, 8 14+2*8 cycles 2 bytes
sla r1, 2 14+2*2 cycles 2 bytes
swpb r1 10 cycles 2 bytes
Total: 54 cycles, 6 bytes
Ick.
This has been fixed to match the unsigned shift code:
andi r1, >FF00 14+4 cycles 4 bytes
sla r1, 2 14+2*2 cycles 2 bytes
Total: 36 cycles, 6 bytes
Much better.
There's also a problem with AND, for some reason "a&b" is being converted to:
inv r1
szc r1, r1
The value stored in r2 is lost. It turns out that this instruction is generated by define_split. It converts a single AND to a NOT, AND_NOT sequence. Unfortunately, there was a typo in the extra step. Two of the three intermediate expressions were equivalent. Of course, those were the two passed on to define_insn, resulting in the lost register expression. All fixed now.
sra r1, 8 14+2*8 cycles 2 bytes
sla r1, 2 14+2*2 cycles 2 bytes
swpb r1 10 cycles 2 bytes
Total: 54 cycles, 6 bytes
Ick.
This has been fixed to match the unsigned shift code:
andi r1, >FF00 14+4 cycles 4 bytes
sla r1, 2 14+2*2 cycles 2 bytes
Total: 36 cycles, 6 bytes
Much better.
There's also a problem with AND, for some reason "a&b" is being converted to:
inv r1
szc r1, r1
The value stored in r2 is lost. It turns out that this instruction is generated by define_split. It converts a single AND to a NOT, AND_NOT sequence. Unfortunately, there was a typo in the extra step. Two of the three intermediate expressions were equivalent. Of course, those were the two passed on to define_insn, resulting in the lost register expression. All fixed now.
I'm going through the machine description file, making sure the emitted code is correct. I've yet to act on the earlier notes.
Right now, I'm looking at left shifted byte quantities. The problem here is that I can't just use a single left shift instruction because junk will just be shifted into the low order bits. It looks like I have two options:
Form 1: Mask off lower bits in word mode, then shift:
andi R1, >FF00 14+4 4 bytes
sla R1, N 12+2N 2 bytes
Total: 30+2N 6 bytes
Form 2: Swap bytes, shift, then restore
swpb R0 10 2 bytes
sla R1, N 12+2N 2 bytes
swpb R0 10 2 bytes
Total: 32+2N 6 bytes
Form 1 is slightly faster, and is more correct. The condition flags are left in the proper state. Form 2 would result in random condition flags, since SWPB does not modify flags, and SLA would set the flags based on the random contents of the unused byte. So I'll be using form one for this stuff.
Here's a neat idea I had while looking at the opcodes. I can clear the upper byte by SBing at register with itself. This saves two clocks and two bytes:
sb R0, R0 14 cycles, 2 bytes
andi R0, >00FF 14+2 cycles, 4 bytes
I can't use this right now, but I added the idea here so I wouldn't forget.
Right now, I'm looking at left shifted byte quantities. The problem here is that I can't just use a single left shift instruction because junk will just be shifted into the low order bits. It looks like I have two options:
Form 1: Mask off lower bits in word mode, then shift:
andi R1, >FF00 14+4 4 bytes
sla R1, N 12+2N 2 bytes
Total: 30+2N 6 bytes
Form 2: Swap bytes, shift, then restore
swpb R0 10 2 bytes
sla R1, N 12+2N 2 bytes
swpb R0 10 2 bytes
Total: 32+2N 6 bytes
Form 1 is slightly faster, and is more correct. The condition flags are left in the proper state. Form 2 would result in random condition flags, since SWPB does not modify flags, and SLA would set the flags based on the random contents of the unused byte. So I'll be using form one for this stuff.
Here's a neat idea I had while looking at the opcodes. I can clear the upper byte by SBing at register with itself. This saves two clocks and two bytes:
sb R0, R0 14 cycles, 2 bytes
andi R0, >00FF 14+2 cycles, 4 bytes
I can't use this right now, but I added the idea here so I wouldn't forget.
Wednesday, June 15, 2011
So, just how costly is that "movb @(b+94)(r10), @(b+94)(r10)" instruction?
14+8+8 = 30 clocks, 6 bytes
We can do better than that.
mov @(b+94)(r10), r0 : 14+8 = 22 clocks, 4 bytes
That's about 25% faster and 30% smaller, with the added cost of a temp register. That's not bad.
I took a look at the machine description file, and I really need to go through it line by line. I've seen a few bugs in uncommon branches. Also, that file is a mess since I've tried so many different approaches. Additionally, the 32-bit instructions have not been exercised much and could stand to be looked at again.
Also, look at "ai" emitters for byte quantities. I'm seeing stuff like ">FFE9 * 256", this just looks wrong.
14+8+8 = 30 clocks, 6 bytes
We can do better than that.
mov @(b+94)(r10), r0 : 14+8 = 22 clocks, 4 bytes
That's about 25% faster and 30% smaller, with the added cost of a temp register. That's not bad.
I took a look at the machine description file, and I really need to go through it line by line. I've seen a few bugs in uncommon branches. Also, that file is a mess since I've tried so many different approaches. Additionally, the 32-bit instructions have not been exercised much and could stand to be looked at again.
Also, look at "ai" emitters for byte quantities. I'm seeing stuff like ">FFE9 * 256", this just looks wrong.
The first output file looks pretty good. The memory-to-memory copies I was worried about are all over the place now, so something I did must have fixed that problem. Go figure. Maybe the wierd register setup I had before discouraged that kind of optimization.
I see there's a need for a byte compare immediate instruction. I removed those peepholes since they caused problems, but it looks like I need to find a way to put something like them back.
An extra line is being emitted for "jump less than or equal". This won't affect the assembly, but looks bad.
Also, I should align the operands to start in the same column. Right now there's a single space between opcode and operands, which makes following things more difficult.
I should also try to find a better form for zero compares. I'm seeing lines like "movb @(b+94)(r10), @(b+94)(r10)" which looks mighty costly.
Amazingly, there's floating point in the 2K_chess code. There's a single constant buried in that mess.
Byte constants should use either hex or unsigned decimal. Currently we see signed decimal.
Gratuitous readelf output:
1404 lines of assembly
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 00000000 000034 000dda 00 AX 0 0 2
[ 2] .rela.text RELA 00000000 001e0c 000dbc 0c 6 1 4
[ 3] .data PROGBITS 00000000 000e0e 000041 00 WA 0 0 2
[ 4] .bss NOBITS 00000000 000e4f 000cb2 00 WA 0 0 1
[ 5] .shstrtab STRTAB 00000000 000e4f 000031 00 0 0 1
[ 6] .symtab SYMTAB 00000000 000fc0 000b10 10 7 146 4
[ 7] .strtab STRTAB 00000000 001ad0 00033a 00 0 0 1
Assuming everything is correct (seems to be), and before addressing the comments above, that's about 5% smaller code than I had before.
old: 1463 lines, 3668 .text bytes
new: 1404 lines, 3546 .text bytes
I see there's a need for a byte compare immediate instruction. I removed those peepholes since they caused problems, but it looks like I need to find a way to put something like them back.
An extra line is being emitted for "jump less than or equal". This won't affect the assembly, but looks bad.
Also, I should align the operands to start in the same column. Right now there's a single space between opcode and operands, which makes following things more difficult.
I should also try to find a better form for zero compares. I'm seeing lines like "movb @(b+94)(r10), @(b+94)(r10)" which looks mighty costly.
Amazingly, there's floating point in the 2K_chess code. There's a single constant buried in that mess.
Byte constants should use either hex or unsigned decimal. Currently we see signed decimal.
Gratuitous readelf output:
1404 lines of assembly
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 00000000 000034 000dda 00 AX 0 0 2
[ 2] .rela.text RELA 00000000 001e0c 000dbc 0c 6 1 4
[ 3] .data PROGBITS 00000000 000e0e 000041 00 WA 0 0 2
[ 4] .bss NOBITS 00000000 000e4f 000cb2 00 WA 0 0 1
[ 5] .shstrtab STRTAB 00000000 000e4f 000031 00 0 0 1
[ 6] .symtab SYMTAB 00000000 000fc0 000b10 10 7 146 4
[ 7] .strtab STRTAB 00000000 001ad0 00033a 00 0 0 1
Assuming everything is correct (seems to be), and before addressing the comments above, that's about 5% smaller code than I had before.
old: 1463 lines, 3668 .text bytes
new: 1404 lines, 3546 .text bytes
Tuesday, June 14, 2011
I've fixed the GO_IF_LEGITIMATE_ADDRESS macro to handle more complicated addresses, and removed some fake register instructions. Memory handling now looks much better.
I've confirmed that the @(1+2)R1 forms are only shown with R10, so these are offsets into structures stored on the stack, and would not be combined. This is because the two constants are distinct as far as GCC is concerned (base + offset). Knowing that, I can update the final address printing routine to combine these constants. Symbol offsets will have to be wrapped in parentheses.
I've confirmed that the @(1+2)R1 forms are only shown with R10, so these are offsets into structures stored on the stack, and would not be combined. This is because the two constants are distinct as far as GCC is concerned (base + offset). Knowing that, I can update the final address printing routine to combine these constants. Symbol offsets will have to be wrapped in parentheses.
Saturday, June 11, 2011
The "unrecognizable insn" problem is in fact caused by the memory constraints. This address construct slips by earlier checks, and it's during instruction recognition that things go bad. At this time, the recognition tree is walked and the last step is to confirm that the address form is valid. The GO_IF_LEGITIMATE_ADDRESS macro does not recognise this form, and so we see the "unrecognizable insn" error.
For some reason, constant memory expressions are not combined, and forms like @(1+2)(R1) or @(a+1)(R1) are seen late in the compilation process, causing mayhem.
For some reason, constant memory expressions are not combined, and forms like @(1+2)(R1) or @(a+1)(R1) are seen late in the compilation process, causing mayhem.
Friday, June 3, 2011
I compiled a section of 2k_chess to exercise the new code and see where things stand. Things went really well except for this:
unrecognizable insn:
(insn 1253 482 483 85 2k_chess.c:111 (set (reg:QI 2 r2)
(mem/s/j:QI (plus:HI (mem/c:HI (plus:HI (reg/f:HI 10 r10)
(const_int 64 [0x40])) [4 %sfp+64 S2 A16])
(symbol_ref:HI ("b"))) [0 b S1 A8])) -1 (nil))
This instruction should result in "movb @b+64(R10), R2".
This should be handled by the code I have now, but for some reason it's not. Maybe I've goofed on the memory constraints and they are too narrow, disallowing "symbol+offset" forms. I guess I've got some research to do.
unrecognizable insn:
(insn 1253 482 483 85 2k_chess.c:111 (set (reg:QI 2 r2)
(mem/s/j:QI (plus:HI (mem/c:HI (plus:HI (reg/f:HI 10 r10)
(const_int 64 [0x40])) [4 %sfp+64 S2 A16])
(symbol_ref:HI ("b")
This instruction should result in "movb @b+64(R10), R2".
This should be handled by the code I have now, but for some reason it's not. Maybe I've goofed on the memory constraints and they are too narrow, disallowing "symbol+offset" forms. I guess I've got some research to do.
Tuesday, May 31, 2011
I'm looking at the crash today. I suspect that this is during the RTL-to-instrcution conversion. I've added a new expression to the mix, so maybe that code can't handle it properly.
Yep, that's the case. GCC dies in final_scan_insn() on the subreg expression I added.
Maybe not, I traced the crash down to a debug statement I put in earlier, apparently there is no debug file for the final pass, and so the debug fprintf's cause a crash. I just keep shooting myself in the foot with this debug output.
There's still a few minor bugs, and the MOVQI pattern needs more work, but I think this has some promise. I may need to rewrite all instructions and redo all testing once MOVQI is working. Oh boy...
Yep, that's the case. GCC dies in final_scan_insn() on the subreg expression I added.
Maybe not, I traced the crash down to a debug statement I put in earlier, apparently there is no debug file for the final pass, and so the debug fprintf's cause a crash. I just keep shooting myself in the foot with this debug output.
There's still a few minor bugs, and the MOVQI pattern needs more work, but I think this has some promise. I may need to rewrite all instructions and redo all testing once MOVQI is working. Oh boy...
Saturday, May 28, 2011
I've been busy, and haven't really had time for TI stuff lately. I did have bits of time here and there to fiddle with GCC, and in that time I got the 159r.combine step working. So now I'm looking into 172r.ira.
The subreg instructions are removed somewhere in reload(), but since that's represents a huge amount of work, I've got some spelunking ahead of me.
I had a lot more trouble finding reload() than I thought. It's called everywhere, but is located in reload1.c. I'm noting that fact here to possibly save some time later.
The next step is final.c::cleanup_subreg_operands(), called from reload(). That seems to delete the subreg expressions.
simplify_rtx.c::simplify_subreg() is the ultimate cause, which is called by final.c::alter_subreg(), which also makes some changes. But it looks like those changes don't matter for me.
I modified simplify_subreg() to explicitly emit a subreg expression if a word-to-byte conversion is used. That RTL expression seems to be preserved until 201r.shorten. At that point, there's a segmentation fault in GCC. Getting closer...
The subreg instructions are removed somewhere in reload(), but since that's represents a huge amount of work, I've got some spelunking ahead of me.
I had a lot more trouble finding reload() than I thought. It's called everywhere, but is located in reload1.c. I'm noting that fact here to possibly save some time later.
The next step is final.c::cleanup_subreg_operands(), called from reload(). That seems to delete the subreg expressions.
simplify_rtx.c::simplify_subreg() is the ultimate cause, which is called by final.c::alter_subreg(), which also makes some changes. But it looks like those changes don't matter for me.
I modified simplify_subreg() to explicitly emit a subreg expression if a word-to-byte conversion is used. That RTL expression seems to be preserved until 201r.shorten. At that point, there's a segmentation fault in GCC. Getting closer...
Monday, March 28, 2011
Pretty much all attempts to get byte operations working in a general way have failed. I'm now forced to try rewriting the compiler to handle this case. I don't like this at all, since it will make forward portability a problem, but I don't see what else I can do.
All byte subregs are optmized away in 159r.combine
Check out combine.c::combine_instructions and try_combine
All byte subregs are optmized away in 159r.combine
Check out combine.c::combine_instructions and try_combine
Thursday, March 24, 2011
Sunday, March 20, 2011
OK, back to 8-bit registers. If you remember, the last problem I hit here was that I found an instruction which was failing to allocate a register. I mentioned that the problem was in the not_usable allocation mask.
To make things easier to explain, I've copied the instruction below.
2k_chess.c:123: error: unable to find a register to spill in class ‘REAL_REGS’
this is the insn:
(insn 740 739 742 126 2k_chess.c:113 (set (reg/v:QI 69 [ x ])
(subreg:QI (reg:HI 6 r3 [322]) 1)) 70 {movqi} (expr_list:REG_DEAD(reg:HI 6 r3 [322])
(nil)))
This instruction was MOVQI, which should take any register as an operand. For some reason, GCC is convinced it needs to use the REAL_REGS class for pseudoreg r322. The problem is that we're trying to allocate a two-byte value, but we are limited by the class to only using one register.
reload failure for reload 1
Reloads for insn # 740
Reload 0: REAL_REGS, RELOAD_FOR_OUTPUT (opnum = 0)
reload_out_reg: (reg/v:QI 69 [ x ])
Reload 1: reload_in (HI) = (reg:HI 6 r3 [322])
reload_out (QI) = (reg/v:QI 69 [ x ])
REAL_REGS, RELOAD_OTHER (opnum = 1)
reload_in_reg: (reg:HI 6 r3 [322])
reload_out_reg: (reg/v:QI 69 [ x ])
I suspect the problem is in allocating a register for pseudoregister 322. The subreg is causing problems. I wish GCC would let me make a generic method to handle subregs. it would make my life so much easier.
To make things easier to explain, I've copied the instruction below.
2k_chess.c:123: error: unable to find a register to spill in class ‘REAL_REGS’
this is the insn:
(insn 740 739 742 126 2k_chess.c:113 (set (reg/v:QI 69 [ x ])
(subreg:QI (reg:HI 6 r3 [322]) 1)) 70 {movqi} (expr_list:REG_DEAD(reg:HI 6 r3 [322])
(nil)))
This instruction was MOVQI, which should take any register as an operand. For some reason, GCC is convinced it needs to use the REAL_REGS class for pseudoreg r322. The problem is that we're trying to allocate a two-byte value, but we are limited by the class to only using one register.
reload failure for reload 1
Reloads for insn # 740
Reload 0: REAL_REGS, RELOAD_FOR_OUTPUT (opnum = 0)
reload_out_reg: (reg/v:QI 69 [ x ])
Reload 1: reload_in (HI) = (reg:HI 6 r3 [322])
reload_out (QI) = (reg/v:QI 69 [ x ])
REAL_REGS, RELOAD_OTHER (opnum = 1)
reload_in_reg: (reg:HI 6 r3 [322])
reload_out_reg: (reg/v:QI 69 [ x ])
I suspect the problem is in allocating a register for pseudoregister 322. The subreg is causing problems. I wish GCC would let me make a generic method to handle subregs. it would make my life so much easier.
Friday, March 18, 2011
I may have to admit defeat on the transfer register scheme. While it works great for simple programs, it makes a lot of problems for more complex ones. Also, I've seen where the set and use of the transfer register are far apart, which would prevent optimizing it away. I'd need a stateful process to clean up the output assembly. If I could do that I wouldn't need these elaborate schemes in the first place.
So I guess I'm back to using 8-bit registers again. Oh well.
So I guess I'm back to using 8-bit registers again. Oh well.
Thursday, March 17, 2011
OK, the initial modifications to allow the transfer register scheme are done, and I've verified that things are working pretty well so far. Observe.
Input:
char test1(int a)
{
return(a);
}
Output:
test1
mov r1, FAKE_TR
movb FAKE_TR, r1
b *r11
I've moved the transfer register to the head of the allocation list, so that should result in peephole-friendly output.
Unfortunately, 2k_chess.c crashes the compiler. Poop. This may be more involved than I thought.
Input:
char test1(int a)
{
return(a);
}
Output:
test1
mov r1, FAKE_TR
movb FAKE_TR, r1
b *r11
I've moved the transfer register to the head of the allocation list, so that should result in peephole-friendly output.
Unfortunately, 2k_chess.c crashes the compiler. Poop. This may be more involved than I thought.
Sunday, March 13, 2011
I've been kicking around the idea of using a single fake register to transfer byte and word data around. The idea is that real registers cannot copy bytes to word and so all conversions must use the fake register. This fake register cannot do any actual work, and we can optimize it away during the peepholes.
I don't know if this will work any better than the other attempts, but I've got a good feeling about it.
I don't know if this will work any better than the other attempts, but I've got a good feeling about it.
Saturday, March 12, 2011
After about a month and a half of 100-hour weeks, I can finally get back to a normal schedule. So that means getting back to the crash.
I'm looking at the not_usable mask during register allocation, I think that's messing me up.
It is. GCC is assuming a word-sized quantity where a byte-sized one is called for. I've got all these fake registers, and prohibit QI usage in non-fake regisers. Since GCC is confused about type and size, a register cannot be allocated.
I'm really getting frustrated about these byte-related issues.
I'm looking at the not_usable mask during register allocation, I think that's messing me up.
It is. GCC is assuming a word-sized quantity where a byte-sized one is called for. I've got all these fake registers, and prohibit QI usage in non-fake regisers. Since GCC is confused about type and size, a register cannot be allocated.
I'm really getting frustrated about these byte-related issues.
Tuesday, March 1, 2011
I found the cause of the repeated "inv" sequences seen below:
movb @M+1, r2
inv r2
inv r2
szcb r2, @L
There was no commutative form for an expression like "A=A&(^B)", so GCC added extra "inv" instructions to fit the one form it had. By adding the commutative form "A=(^B)&A", GCC was able to recognize the redundant work and eliminate it. We now have much nicer output here:
szcb @M+1, r2
movb r2, @L
Unfortunately, this could be made smaller yet:
szcb @M+1, @L
This is the same reluctance to use memory-to-memory instructions I found earlier. I suppose I could make more peepholes for this, but that would explode the pattern count. I would need an optimization matching a pattern like "Rn=FUNC(); mem=Rn" for every instruction in the system. That's a lot of peepholes. At that point, it might be better to try to find a different way to encourace memory-to-memory operations.
Adding this same logic to word-sized not-and operations resulted in an error:
2k_chess.c:122: error: unable to find a register to spill in class ‘REAL_REGS’
this is the insn:
(insn 740 739 742 126 2k_chess.c:112 (set (reg/v:QI 69 [ x ])
(subreg:QI (reg:HI 6 r3 [322]) 1)) 70 {movqi} (expr_list:REG_DEAD (reg:HI 6 r3 [322])
(nil)))
2k_chess.c:122: confused by earlier errors, bailing out
OK... I didn't see that coming. Also, R3 should be available since it would otherwise be free at the end of the instruction. What's in the area of this instruction?
(insn 739 738 740 126 2k_chess.c:112 (set (reg:HI 322)
(and:HI (not:HI (reg:HI 362 [ prephitmp.228 ]))
(reg:HI 322))) 40 {*not_andhi} (nil))
(insn 740 739 742 126 2k_chess.c:112 (set (reg/v:QI 69 [ x ])
(subreg:QI (reg:HI 322) 1)) 70 {movqi} (expr_list:REG_DEAD (reg:HI 322)
(nil)))
Hmm, not good. That's the logic I just fixed.
Note to self: I modified reload1.c to use debug output, this will cause a crash later if not removed.
movb @M+1, r2
inv r2
inv r2
szcb r2, @L
There was no commutative form for an expression like "A=A&(^B)", so GCC added extra "inv" instructions to fit the one form it had. By adding the commutative form "A=(^B)&A", GCC was able to recognize the redundant work and eliminate it. We now have much nicer output here:
szcb @M+1, r2
movb r2, @L
Unfortunately, this could be made smaller yet:
szcb @M+1, @L
This is the same reluctance to use memory-to-memory instructions I found earlier. I suppose I could make more peepholes for this, but that would explode the pattern count. I would need an optimization matching a pattern like "Rn=FUNC(); mem=Rn" for every instruction in the system. That's a lot of peepholes. At that point, it might be better to try to find a different way to encourace memory-to-memory operations.
Adding this same logic to word-sized not-and operations resulted in an error:
2k_chess.c:122: error: unable to find a register to spill in class ‘REAL_REGS’
this is the insn:
(insn 740 739 742 126 2k_chess.c:112 (set (reg/v:QI 69 [ x ])
(subreg:QI (reg:HI 6 r3 [322]) 1)) 70 {movqi} (expr_list:REG_DEAD (reg:HI 6 r3 [322])
(nil)))
2k_chess.c:122: confused by earlier errors, bailing out
OK... I didn't see that coming. Also, R3 should be available since it would otherwise be free at the end of the instruction. What's in the area of this instruction?
(insn 739 738 740 126 2k_chess.c:112 (set (reg:HI 322)
(and:HI (not:HI (reg:HI 362 [ prephitmp.228 ]))
(reg:HI 322))) 40 {*not_andhi} (nil))
(insn 740 739 742 126 2k_chess.c:112 (set (reg/v:QI 69 [ x ])
(subreg:QI (reg:HI 322) 1)) 70 {movqi} (expr_list:REG_DEAD (reg:HI 322)
(nil)))
Hmm, not good. That's the logic I just fixed.
Note to self: I modified reload1.c to use debug output, this will cause a crash later if not removed.
Monday, February 28, 2011
I've been completely occupied by JITC testing at work, and haven't looked at TI stuff in almost a month. But I've got a little time right now.
It looks like GCC favors MEM-REG-MEM copies rather then MEM-MEM copies. I've noticed in testing that M-M copies are often split up, and M-R-M copies are never combined. This happens regardless of how instrcution costs are set up, or how the C code is constructed.
That makes sense, since most modern machines prefer this kind of operation. I just need to put a peephole in to fix this. I'm getting quite a few peepholes, which is normally the sign of bandaids over a deeper problem. However, since I can't see anything actually wrong at this point, and my target is old enough to be wierd by modern standards, I'll try not to worry about that too much.
$ wc -l 2k_chess.s
Before peephole: 1462
After peephole: 1440
So I optimized out 22 instructions or 1.5% of the line count. Not very impressive, but it was easy to get.
It looks like GCC favors MEM-REG-MEM copies rather then MEM-MEM copies. I've noticed in testing that M-M copies are often split up, and M-R-M copies are never combined. This happens regardless of how instrcution costs are set up, or how the C code is constructed.
That makes sense, since most modern machines prefer this kind of operation. I just need to put a peephole in to fix this. I'm getting quite a few peepholes, which is normally the sign of bandaids over a deeper problem. However, since I can't see anything actually wrong at this point, and my target is old enough to be wierd by modern standards, I'll try not to worry about that too much.
$ wc -l 2k_chess.s
Before peephole: 1462
After peephole: 1440
So I optimized out 22 instructions or 1.5% of the line count. Not very impressive, but it was easy to get.
Wednesday, February 9, 2011
I've been away from TI stuff for a while, work and home life were pretty hectic and they take priority. I've got another month of madness coming up, but I should have some TI time available for a little bit.
That 2K chess program has a lot of goodies hiding in it. It's exposed quite a few weaknesses in the compiler, and it's small enough to be manageable if I need to isolate some bits.
Things I've found so far:
1) The fake registers show up in output code. That's bad.
inv FAKE_R0_LOW
movb r1, @73(r10)
mov @94(r10), r2
szc FAKE_R0_LOW, r2
2) Also, there's a form that looks like "copy memory byte to sign extended register" which is a convoluted mess.
movb @105(r10), r2
swpb r2
mov FAKE_R2_LOW, r2
swpb r2
sra r2, 8
3) Additionally, all memory-to-memory copies get needlessly split up into two instructions.
movb @48(r10), r7
movb r7, @5(r5)
4) There's this mess
movb r3, r3
swpb r3
sla r3, >8
mov r3, r5
5) What the heck is this supposed to be?
movb @M+1, r2
inv r2
inv r2
szcb r2, @L
6) Here's an opportunity to remove an extra "swpb". If either R1 or R2 is dead, we can reorder operations and remove an instruction.
mov r1, r2
swpb r2
movb r2, @L
swpb r2
I'll be starting out on problem three, the memory-to-memory copies. That should not be happening, and should be easy to find. (I'm sure I'll regret saying that...)
RTL of a typical example:
(insn 1755 795 796 146 2k_chess.c:117 (set (reg:QI 14 r7)
(mem/c:QI (plus:HI (reg/f:HI 20 r10)
(const_int 48 [0x30])) [4 %sfp+48 S1 A8])) 68 {movqi} (nil))
(insn 796 1755 797 146 2k_chess.c:117 (set (mem/s:QI (plus:HI (reg:HI 10 r5)
(const_int 5 [0x5])) [0.Y+0 S1 A8])
(reg:QI 14 r7)) 68 {movqi} (nil))
This gets split from a single instruction in 172.ira
From 168.asmcons:
(insn 796 795 797 139 2k_chess.c:117 (set (mem/s:QI (plus:HI (reg/v/f:HI 64 [ a ])
(const_int 5 [0x5])) [0.Y+0 S1 A8])
(reg/v:QI 66 [ Y ])) 68 {movqi} (nil))
From 172.ira:
Reloads for insn # 796
Reload 0: reload_in (HI) = (reg/v/f:HI 64 [ a ])
ALL_REGS, RELOAD_FOR_OPERAND_ADDRESS (opnum = 0)
reload_in_reg: (reg/v/f:HI 64 [ a ])
reload_reg_rtx: (reg:HI 10 r5)
Reload 1: reload_out (QI) = (mem/s:QI (plus:HI (reg/v/f:HI 64 [ a ])
(const_int 5 [0x5])) [0.Y+0 S1 A8])
NO_REGS, RELOAD_FOR_OUTPUT (opnum = 0), optional
reload_out_reg: (mem/s:QI (plus:HI (reg/v/f:HI 64 [ a ])
(const_int 5 [0x5])) [0.Y+0 S1 A8])
Reload 2: reload_in (QI) = (reg/v:QI 66 [ Y ])
REAL_REGS, RELOAD_FOR_INPUT (opnum = 1)
reload_in_reg: (reg/v:QI 66 [ Y ])
reload_reg_rtx: (reg:QI 14 r7)
-------------------
handy info for reading RTL dumps:
(insn ....
-------------------
What the heck is this supposed to be?
movb @M+1, r2
inv r2
inv r2
szcb r2, @L
From 172.ira:
(insn 1776 857 859 156 2k_chess.c:54 (set (reg:QI 4 r2 [343])
(mem/c/i:QI (const:HI (plus:HI (symbol_ref:HI ("M") [flags 0x2] l 0xb752a058 M>)
(const_int 1 [0x1]))) [2 M+1 S1 A8])) 68 {movqi} (nil))
(insn 859 1776 1304 156 2k_chess.c:54 (set (reg:QI 4 r2 [343])
(not:QI (reg:QI 4 r2 [343]))) 47 {one_cmplqi2} (nil))
(insn 1304 859 1305 156 2k_chess.c:54 (set (reg:QI 4 r2 [343])
(not:QI (reg:QI 4 r2 [343]))) 47 {one_cmplqi2} (nil))
(insn 1305 1304 1171 156 2k_chess.c:54 (set (mem/c/i:QI (symbol_ref:HI ("L") r_decl 0xb752a4d0 L>) [0 L+0 S1 A8])
(and:QI (mem/c/i:QI (symbol_ref:HI ("L")) [0 L+0
S1 A8])
(not:QI (reg:QI 4 r2 [343])))) 41 {*tms9900.md:870} (nil))
Derived from insruction in 128.expand
(insn 859 858 860 2k_chess.c:54 (set (reg:QI 343)
(not:QI (mem/c/i:QI (reg/f:HI 342) [2 M+1 S1 A8]))) -1 (nil))
(insn 860 859 0 2k_chess.c:54 (set (mem/c/i:QI (symbol_ref:HI ("L")) [0 L+0 S1 A8])
(and:QI (reg/v:QI 66 [ Y ])
(reg:QI 343))) -1 (nil))
That 2K chess program has a lot of goodies hiding in it. It's exposed quite a few weaknesses in the compiler, and it's small enough to be manageable if I need to isolate some bits.
Things I've found so far:
1) The fake registers show up in output code. That's bad.
inv FAKE_R0_LOW
movb r1, @73(r10)
mov @94(r10), r2
szc FAKE_R0_LOW, r2
2) Also, there's a form that looks like "copy memory byte to sign extended register" which is a convoluted mess.
movb @105(r10), r2
swpb r2
mov FAKE_R2_LOW, r2
swpb r2
sra r2, 8
3) Additionally, all memory-to-memory copies get needlessly split up into two instructions.
movb @48(r10), r7
movb r7, @5(r5)
4) There's this mess
movb r3, r3
swpb r3
sla r3, >8
mov r3, r5
5) What the heck is this supposed to be?
movb @M+1, r2
inv r2
inv r2
szcb r2, @L
6) Here's an opportunity to remove an extra "swpb". If either R1 or R2 is dead, we can reorder operations and remove an instruction.
mov r1, r2
swpb r2
movb r2, @L
swpb r2
I'll be starting out on problem three, the memory-to-memory copies. That should not be happening, and should be easy to find. (I'm sure I'll regret saying that...)
RTL of a typical example:
(insn 1755 795 796 146 2k_chess.c:117 (set (reg:QI 14 r7)
(mem/c:QI (plus:HI (reg/f:HI 20 r10)
(const_int 48 [0x30])) [4 %sfp+48 S1 A8])) 68 {movqi} (nil))
(insn 796 1755 797 146 2k_chess.c:117 (set (mem/s:QI (plus:HI (reg:HI 10 r5)
(const_int 5 [0x5])) [0
(reg:QI 14 r7)) 68 {movqi} (nil))
This gets split from a single instruction in 172.ira
From 168.asmcons:
(insn 796 795 797 139 2k_chess.c:117 (set (mem/s:QI (plus:HI (reg/v/f:HI 64 [ a ])
(const_int 5 [0x5])) [0
(reg/v:QI 66 [ Y ])) 68 {movqi} (nil))
From 172.ira:
Reloads for insn # 796
Reload 0: reload_in (HI) = (reg/v/f:HI 64 [ a ])
ALL_REGS, RELOAD_FOR_OPERAND_ADDRESS (opnum = 0)
reload_in_reg: (reg/v/f:HI 64 [ a ])
reload_reg_rtx: (reg:HI 10 r5)
Reload 1: reload_out (QI) = (mem/s:QI (plus:HI (reg/v/f:HI 64 [ a ])
(const_int 5 [0x5])) [0
NO_REGS, RELOAD_FOR_OUTPUT (opnum = 0), optional
reload_out_reg: (mem/s:QI (plus:HI (reg/v/f:HI 64 [ a ])
(const_int 5 [0x5])) [0
Reload 2: reload_in (QI) = (reg/v:QI 66 [ Y ])
REAL_REGS, RELOAD_FOR_INPUT (opnum = 1)
reload_in_reg: (reg/v:QI 66 [ Y ])
reload_reg_rtx: (reg:QI 14 r7)
-------------------
handy info for reading RTL dumps:
(insn
-------------------
What the heck is this supposed to be?
movb @M+1, r2
inv r2
inv r2
szcb r2, @L
From 172.ira:
(insn 1776 857 859 156 2k_chess.c:54 (set (reg:QI 4 r2 [343])
(mem/c/i:QI (const:HI (plus:HI (symbol_ref:HI ("M") [flags 0x2]
(const_int 1 [0x1]))) [2 M+1 S1 A8])) 68 {movqi} (nil))
(insn 859 1776 1304 156 2k_chess.c:54 (set (reg:QI 4 r2 [343])
(not:QI (reg:QI 4 r2 [343]))) 47 {one_cmplqi2} (nil))
(insn 1304 859 1305 156 2k_chess.c:54 (set (reg:QI 4 r2 [343])
(not:QI (reg:QI 4 r2 [343]))) 47 {one_cmplqi2} (nil))
(insn 1305 1304 1171 156 2k_chess.c:54 (set (mem/c/i:QI (symbol_ref:HI ("L")
(and:QI (mem/c/i:QI (symbol_ref:HI ("L")
S1 A8])
(not:QI (reg:QI 4 r2 [343])))) 41 {*tms9900.md:870} (nil))
Derived from insruction in 128.expand
(insn 859 858 860 2k_chess.c:54 (set (reg:QI 343)
(not:QI (mem/c/i:QI (reg/f:HI 342) [2 M+1 S1 A8]))) -1 (nil))
(insn 860 859 0 2k_chess.c:54 (set (mem/c/i:QI (symbol_ref:HI ("L")
(and:QI (reg/v:QI 66 [ Y ])
(reg:QI 343))) -1 (nil))
Friday, January 21, 2011
Posted at AtariAge:
That chess program was so insane, I needed to see what the compiler would do with it. Aside from a few unexpected fake register accesses (grr! fixed by hand), and failing to handle the giant lookup table (not suprising) it was pretty much drama-free.
The resulting code with -O2 optimizations: 1463 lines of assembly. Are there errors? Could be... But I have no motivation to compare against that C code. At first glance, it looks right.
Here's the readelf dump of the resulting object file:
eric@compaq:~/dev/tios/src/temp$ tms9900-readelf -S 2k_chess.o
There are 8 section headers, starting at offset 0xefc:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 00000000 000034 000e54 00 AX 0 0 2
[ 2] .rela.text RELA 00000000 001e70 000da4 0c 6 1 4
[ 3] .data PROGBITS 00000000 000e88 000041 00 WA 0 0 2
[ 4] .bss NOBITS 00000000 000ec9 000cb2 00 WA 0 0 1
[ 5] .shstrtab STRTAB 00000000 000ec9 000031 00 0 0 1
[ 6] .symtab SYMTAB 00000000 00103c 000b00 10 7 145 4
[ 7] .strtab STRTAB 00000000 001b3c 000333 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings)
I (info), L (link order), G (group), x (unknown)
O (extra OS processing required) o (OS specific), p (processor specific)
So that's 3668 bytes of code (.text section), and 3315 bytes of data (.data + .bss sections). Of course that data size is shy about 128 MB or so.
Still, I'm really impressed with how things are shaping up so far.
That chess program was so insane, I needed to see what the compiler would do with it. Aside from a few unexpected fake register accesses (grr! fixed by hand), and failing to handle the giant lookup table (not suprising) it was pretty much drama-free.
The resulting code with -O2 optimizations: 1463 lines of assembly. Are there errors? Could be... But I have no motivation to compare against that C code. At first glance, it looks right.
Here's the readelf dump of the resulting object file:
eric@compaq:~/dev/tios/src/temp$ tms9900-readelf -S 2k_chess.o
There are 8 section headers, starting at offset 0xefc:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 00000000 000034 000e54 00 AX 0 0 2
[ 2] .rela.text RELA 00000000 001e70 000da4 0c 6 1 4
[ 3] .data PROGBITS 00000000 000e88 000041 00 WA 0 0 2
[ 4] .bss NOBITS 00000000 000ec9 000cb2 00 WA 0 0 1
[ 5] .shstrtab STRTAB 00000000 000ec9 000031 00 0 0 1
[ 6] .symtab SYMTAB 00000000 00103c 000b00 10 7 145 4
[ 7] .strtab STRTAB 00000000 001b3c 000333 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings)
I (info), L (link order), G (group), x (unknown)
O (extra OS processing required) o (OS specific), p (processor specific)
So that's 3668 bytes of code (.text section), and 3315 bytes of data (.data + .bss sections). Of course that data size is shy about 128 MB or so.
Still, I'm really impressed with how things are shaping up so far.
Thursday, January 20, 2011
On Atariage, Astharot posted some crazy C code for a chess program, wondering if it could run on the TI. The code looks like it was from a coding contest attempting to squeeze something neat into 2K of source code. Since this looked insane, I figured it would be worth trying it aginst the compiler to see what popped out.
This initially caused compiler crashes, since I did all my testing aginst simple programs. It looks like I missed checking for more complex cases like constant memory expressions (ex: symbol+offset). There's also some lingering problems with fake registers in the 16-bit operations.
I think that since I allow any register to be used with those operations (with the caveat that all 16-bit values must start on an even register), GCC feels free to access the low byte by using the fake registers. Fixing this could be tricky. I don't think I can make a register class like I did for the 8-bit operations, since multiple registers are used for these values. This is also a problem for 32-bit values, so I don't want to use special case code to handle this. If I did that, the machine description file would at least double in size.
I'll have to think about this for a bit.
This initially caused compiler crashes, since I did all my testing aginst simple programs. It looks like I missed checking for more complex cases like constant memory expressions (ex: symbol+offset). There's also some lingering problems with fake registers in the 16-bit operations.
I think that since I allow any register to be used with those operations (with the caveat that all 16-bit values must start on an even register), GCC feels free to access the low byte by using the fake registers. Fixing this could be tricky. I don't think I can make a register class like I did for the 8-bit operations, since multiple registers are used for these values. This is also a problem for 32-bit values, so I don't want to use special case code to handle this. If I did that, the machine description file would at least double in size.
I'll have to think about this for a bit.
Sunday, January 16, 2011
After tons of frustration and research, I've finally made some good progress. I've given up on using 16-bit registers, and I'm back to fake 8-bit registers. That's the only way I can force GCC to track the size of the data value stored in a register. All attempts using 16-bit registers resulted in sadness and despair.
I've added a new register class for the real registers, and made sure that all byte operations can only use those registers. This guarantees that the fake registers will never appear in the compiled output. I've made an exception for the move byte operations. That is required for data to be copied out of the fake registers. I'm pretty happy about the whole thing.
I've also made a few optimizations. The array initilization optimizations I mentioned before have been implemented. We can now save two whole bytes of code per two bytes of array initilization. Horray!
I've also added code to potentially squeeze out an unnecessary move operation in some obscure cases. This won't really be noticable in most cases, but it makes me feel better.
Now that I'm past this nasty hurdle, I can get back to optimization and code coverage.
I've added a new register class for the real registers, and made sure that all byte operations can only use those registers. This guarantees that the fake registers will never appear in the compiled output. I've made an exception for the move byte operations. That is required for data to be copied out of the fake registers. I'm pretty happy about the whole thing.
I've also made a few optimizations. The array initilization optimizations I mentioned before have been implemented. We can now save two whole bytes of code per two bytes of array initilization. Horray!
I've also added code to potentially squeeze out an unnecessary move operation in some obscure cases. This won't really be noticable in most cases, but it makes me feel better.
Now that I'm past this nasty hurdle, I can get back to optimization and code coverage.
Sunday, January 9, 2011
So, I'm trying to get better tuncations. Let's see what GCC is doing right now.
I'm looking at this line:
unsigned char a = (val & 0x0F) + '0';
Which eventually turns into this: (R1=val, R13=a)
mov r1, r3
andi r3, >F
ai r3, >30
mov r3, @14(r10)
movb @15(r10), r13
This is the RTL for the final "mov-movb" bit of this sequence.
128r.expand
(insn 38 37 0 printf.c:13 (set (reg/v:QI 45 [ a ])
(subreg:QI (reg:HI 51) 1)) -1 (nil))
133r.vregs
(insn 38 37 39 4 printf.c:13 (set (reg/v:QI 45 [ a ])
(subreg:QI (reg:HI 51) 1)) 68 {movqi} (nil))
150r.bypass
(insn 38 37 39 3 printf.c:13 (set (reg/v:QI 45 [ a ])
(subreg:QI (reg:HI 51) 1)) 68 {movqi} (expr_list:REG_DEAD (reg:HI 51)
(nil)))
This is unchanged till 171r.subregs_of_mode_init, which makes sense I suppose. At that point this gets split into the two move instructions. Unfortunately, there is nothing obvious in here which would prevent using memory for transfers.
Just for kicks, I removed the CANNOT_CHANGE_MODE_CLASS macro, let's see what happens. The fact that the TMSW9900 stores an 8-bit quantity in the high byte will be lost, but maybe I can trace what happens.
128r.expand
(insn 38 37 0 printf.c:13 (set (reg/v:QI 45 [ a ])
(subreg:QI (reg:HI 51) 1)) -1 (nil))
168r.asmcons
(insn 38 37 39 3 printf.c:13 (set (reg/v:QI 45 [ a ])
(subreg:QI (reg:HI 51) 1)) 68 {movqi} (expr_list:REG_DEAD (reg:HI 51)
(nil)))
172r.ira:
(insn 38 37 81 3 printf.c:13 (set (reg/v:QI 4 r4 [orig:45 a ] [45])
(reg:QI 4 r4 [orig:51+1 ] [51])) 68 {movqi} (nil))
Once we get to initial register allocation, this instruction has been transformed into a NOP, since GCC thinks the byte quantity will be left in the low byte of R4.
So after a lot of testing, and trial-and-error, I think I need to admit defeat on trying to get int-to-char working properly for 16-bit hard registers. GCC cannot be convinced to handle QI mode differently for the subregs. Depending on how I set up the test code, the point at which the hard subreg gets lost changes. This implies that that assumption is build into a lot of places. I might be able to optimize away conversions through the mov-movb sequence, but I need to do testing to make sure the stack will be suitably reduced, and that GCC doesn't try to reuse the value in memory later. Remember, there are no REG_DEAD flags for memory.
It's been a long day and I'm frustrated. No more for today.
I'm looking at this line:
unsigned char a = (val & 0x0F) + '0';
Which eventually turns into this: (R1=val, R13=a)
mov r1, r3
andi r3, >F
ai r3, >30
mov r3, @14(r10)
movb @15(r10), r13
This is the RTL for the final "mov-movb" bit of this sequence.
128r.expand
(insn 38 37 0 printf.c:13 (set (reg/v:QI 45 [ a ])
(subreg:QI (reg:HI 51) 1)) -1 (nil))
133r.vregs
(insn 38 37 39 4 printf.c:13 (set (reg/v:QI 45 [ a ])
(subreg:QI (reg:HI 51) 1)) 68 {movqi} (nil))
150r.bypass
(insn 38 37 39 3 printf.c:13 (set (reg/v:QI 45 [ a ])
(subreg:QI (reg:HI 51) 1)) 68 {movqi} (expr_list:REG_DEAD (reg:HI 51)
(nil)))
This is unchanged till 171r.subregs_of_mode_init, which makes sense I suppose. At that point this gets split into the two move instructions. Unfortunately, there is nothing obvious in here which would prevent using memory for transfers.
Just for kicks, I removed the CANNOT_CHANGE_MODE_CLASS macro, let's see what happens. The fact that the TMSW9900 stores an 8-bit quantity in the high byte will be lost, but maybe I can trace what happens.
128r.expand
(insn 38 37 0 printf.c:13 (set (reg/v:QI 45 [ a ])
(subreg:QI (reg:HI 51) 1)) -1 (nil))
168r.asmcons
(insn 38 37 39 3 printf.c:13 (set (reg/v:QI 45 [ a ])
(subreg:QI (reg:HI 51) 1)) 68 {movqi} (expr_list:REG_DEAD (reg:HI 51)
(nil)))
172r.ira:
(insn 38 37 81 3 printf.c:13 (set (reg/v:QI 4 r4 [orig:45 a ] [45])
(reg:QI 4 r4 [orig:51+1 ] [51])) 68 {movqi} (nil))
Once we get to initial register allocation, this instruction has been transformed into a NOP, since GCC thinks the byte quantity will be left in the low byte of R4.
So after a lot of testing, and trial-and-error, I think I need to admit defeat on trying to get int-to-char working properly for 16-bit hard registers. GCC cannot be convinced to handle QI mode differently for the subregs. Depending on how I set up the test code, the point at which the hard subreg gets lost changes. This implies that that assumption is build into a lot of places. I might be able to optimize away conversions through the mov-movb sequence, but I need to do testing to make sure the stack will be suitably reduced, and that GCC doesn't try to reuse the value in memory later. Remember, there are no REG_DEAD flags for memory.
It's been a long day and I'm frustrated. No more for today.
Friday, January 7, 2011
Thursday, January 6, 2011
Monday, January 3, 2011
Wow, I've apparently hit the mother lode of horrible broken behavior here. Not only has this printf.c code triggered all the earlier problems, it's also caused an event I was afraid of for a while, but foolishly convinced myself would never happen.
If you remember, way back when I was working on function epilogues, I made a fake PC register and added all the support needed to use it. The fear at the time was what would happen if GCC decideded to use that fake register for something. In all my earlier testing, that never happened, and I convinced myself that things were good and moved on. Sadly, this round of testing has uncovered a case where that fake PC was being used.
I'm not allowed to mark the fake PC as a reserved due to its use in the epilogue. Doing that reservation causes an instant compiler crash. What I can do, however, is prevent GCC from putting that register in the allocation pool. That allows the fake register to be used for the epilogue, but nothing else. This is technically bad form, but given all the contortions needed so far to get the TMS9900 backend working, I'm not losing sleep over anything.
Long story short: the fake PC register can no longer be used for anything except epilogue work.
So now that that's out of the way, let's see if I can fix the fake register usage...
Here's the original C code, which converts larger values to A-F, "a_base" is either "x" or "X":
if(a > '9') a += a_base - '9' - 1;
This is compiled with -O1, which results in horrible code.
The resulting assembly ("a" is stored in the high byte of R13):
mov r13, r4 # Copy "a" to temp
mov r4, r13 # Copy temp back to a (for some reason)
swpb r13 # Copy to "a" to low byte of R13 (wrong setup for cb)
li r4, >39 * 256 # R4 = '9' in high byte
cb FAKE_R4_LOW, r4 # Insn 39. This is wrong for so many reasons...
RTL dumps of this instructions during compilation:
172r.ira: This looks pretty good
(insn 39 81 40 3 printf.c:14 (set (cc0)
(compare (reg/v:QI 26 r13 [orig:62 a ] [62])
(reg:QI 8 r4))) 7 {cmpqi} (nil))
185r.cprop_hardreg
insn 39: replaced reg 26 with 9
--- CLIP ---
(insn 39 81 40 3 printf.c:14 (set (cc0)
(compare (reg:QI 9 FAKE_R4_LOW [orig:62 a ] [62])
(reg:QI 8 r4))) 7 {cmpqi} (expr_list:REG_DEAD (reg:QI 8 r4)
(nil)))
In 185r.cprop_hardreg, the correct decision to use R13 was changed to use FAKE_R4. Why the heck did that happen?
As part of the 185 step, GCC attempts to reuse parts of old registers to try to eliminate some work. Unfortunately, I was missing a CANNOT_CHANGE_MODE_CLASS macro which would prevent invalid register use. I think it's OK now, but I'm done for today. Test tomorrow.
If you remember, way back when I was working on function epilogues, I made a fake PC register and added all the support needed to use it. The fear at the time was what would happen if GCC decideded to use that fake register for something. In all my earlier testing, that never happened, and I convinced myself that things were good and moved on. Sadly, this round of testing has uncovered a case where that fake PC was being used.
I'm not allowed to mark the fake PC as a reserved due to its use in the epilogue. Doing that reservation causes an instant compiler crash. What I can do, however, is prevent GCC from putting that register in the allocation pool. That allows the fake register to be used for the epilogue, but nothing else. This is technically bad form, but given all the contortions needed so far to get the TMS9900 backend working, I'm not losing sleep over anything.
Long story short: the fake PC register can no longer be used for anything except epilogue work.
So now that that's out of the way, let's see if I can fix the fake register usage...
Here's the original C code, which converts larger values to A-F, "a_base" is either "x" or "X":
if(a > '9') a += a_base - '9' - 1;
This is compiled with -O1, which results in horrible code.
The resulting assembly ("a" is stored in the high byte of R13):
mov r13, r4 # Copy "a" to temp
mov r4, r13 # Copy temp back to a (for some reason)
swpb r13 # Copy to "a" to low byte of R13 (wrong setup for cb)
li r4, >39 * 256 # R4 = '9' in high byte
cb FAKE_R4_LOW, r4 # Insn 39. This is wrong for so many reasons...
RTL dumps of this instructions during compilation:
172r.ira: This looks pretty good
(insn 39 81 40 3 printf.c:14 (set (cc0)
(compare (reg/v:QI 26 r13 [orig:62 a ] [62])
(reg:QI 8 r4))) 7 {cmpqi} (nil))
185r.cprop_hardreg
insn 39: replaced reg 26 with 9
--- CLIP ---
(insn 39 81 40 3 printf.c:14 (set (cc0)
(compare (reg:QI 9 FAKE_R4_LOW [orig:62 a ] [62])
(reg:QI 8 r4))) 7 {cmpqi} (expr_list:REG_DEAD (reg:QI 8 r4)
(nil)))
In 185r.cprop_hardreg, the correct decision to use R13 was changed to use FAKE_R4. Why the heck did that happen?
As part of the 185 step, GCC attempts to reuse parts of old registers to try to eliminate some work. Unfortunately, I was missing a CANNOT_CHANGE_MODE_CLASS macro which would prevent invalid register use. I think it's OK now, but I'm done for today. Test tomorrow.
Sunday, January 2, 2011
Disaster averted. This problem was caused by a bug in the GO_IF_LEGITIMATE_ADDRESS macro. The macro is supposed to check for valid addressing modes, but the problem was that there should be two variants of this macro, one for when it is not known if a pseudoreg is a real register, and one where it is known. I only had one macro, which effectively meant GCC did not care if real registers were ever used. that allowed the memory location to be used in a post-increment setting. All fixed now, back to testing.
Important note for later:
When adding code which depends on using debug options, don't be suprised when things segfault and crash if those options aren't used. I forgot to remove all the test code I put into df-scan.c, which caused mayhem and hours of digging when I started testing with the normal makefiles.
Don't do that again.
I found something disturbingly familiar. Remember the "sra r2, 8", then empty lines and gibberish? It's back. Unfortunately, I can't for the life of me remember what I did to fix that.
Problem text:
movb *r9+, r2
sra r2, 8
--- CLIP ---
mov r2, r2
Problem solved! There was a missing return in the "*movqihisx1" form. This seems to have had the same result as returning an uninitialized pointer. When GCC attempted to dereference it, all the junk was printed. I'll make sure no more of those pop up.
After more testing, I've found a C function which allows fake registers to be emitted into the output while using -O1. I'll need to look at this more closely tomorrow. I don't like the fact that old bugs seem to be popping up lately.
Important note for later:
When adding code which depends on using debug options, don't be suprised when things segfault and crash if those options aren't used. I forgot to remove all the test code I put into df-scan.c, which caused mayhem and hours of digging when I started testing with the normal makefiles.
Don't do that again.
I found something disturbingly familiar. Remember the "sra r2, 8", then empty lines and gibberish? It's back. Unfortunately, I can't for the life of me remember what I did to fix that.
Problem text:
movb *r9+, r2
sra r2, 8
--- CLIP ---
mov r2, r2
Problem solved! There was a missing return in the "*movqihisx1" form. This seems to have had the same result as returning an uninitialized pointer. When GCC attempted to dereference it, all the junk was printed. I'll make sure no more of those pop up.
After more testing, I've found a C function which allows fake registers to be emitted into the output while using -O1. I'll need to look at this more closely tomorrow. I don't like the fact that old bugs seem to be popping up lately.
Saturday, January 1, 2011
After much digging around, I found the instruction which is causing the crash:
(post_inc:HI (mem/c:HI (plus:HI (reg/f:HI 20 r10)
(const_int 12 [0xc])) [0 %sfp+12 S2 A16]))
GCC is expecting this to be a register:
(mem/c:HI (plus:HI (reg/f:HI 20 r10)
(const_int 12 [0xc])) [0 %sfp+12 S2 A16])
This is derived from the C line "p = format". Due to spilled register usage,
"p" is stored on the stack. So here's the evolution of tthat instruction before it breaks.
128r.expand: This is the initial definition of this intruction. This is just an RTL expression at this point.
(insn 30 29 0 printf.c:22 (set (reg/v/f:HI 66 [ p ])
(reg/v/f:HI 69 [ format ])) -1 (nil))
152r.dse1: At this point, an instruction form has been chosen. In this case "movhi". GCC has also determined that the value of "format" will no longer be used.
(insn 30 28 61 2 printf.c:22 (set (reg/v/f:HI 66 [ p ])
(reg/v/f:HI 69 [ format ])) 69 {movhi} (expr_list:REG_DEAD (reg/v/f:HI 69 [ format ])
(nil)))
159r.combine: Here, an invalid mode is selected. This would be something like "mov (format)+, (p)", but "format" is a memory location.
(insn 30 25 61 2 printf.c:22 (set (reg/v/f:HI 66 [ p ])
(mem/f/c/i:HI (post_inc:HI (reg/v/f:HI 60 [ ap.34 ])) [0 format+0 S2 A16])) 69 {movhi} (expr_list:REG_INC (reg/v/f:HI 60 [ ap.34 ])
(nil)))
So, somewhere in the combine step, GCC went bonkers.
(post_inc:HI (mem/c:HI (plus:HI (reg/f:HI 20 r10)
(const_int 12 [0xc])) [0 %sfp+12 S2 A16]))
GCC is expecting this to be a register:
(mem/c:HI (plus:HI (reg/f:HI 20 r10)
(const_int 12 [0xc])) [0 %sfp+12 S2 A16])
This is derived from the C line "p = format". Due to spilled register usage,
"p" is stored on the stack. So here's the evolution of tthat instruction before it breaks.
128r.expand: This is the initial definition of this intruction. This is just an RTL expression at this point.
(insn 30 29 0 printf.c:22 (set (reg/v/f:HI 66 [ p ])
(reg/v/f:HI 69 [ format ])) -1 (nil))
152r.dse1: At this point, an instruction form has been chosen. In this case "movhi". GCC has also determined that the value of "format" will no longer be used.
(insn 30 28 61 2 printf.c:22 (set (reg/v/f:HI 66 [ p ])
(reg/v/f:HI 69 [ format ])) 69 {movhi} (expr_list:REG_DEAD (reg/v/f:HI 69 [ format ])
(nil)))
159r.combine: Here, an invalid mode is selected. This would be something like "mov (format)+, (p)", but "format" is a memory location.
(insn 30 25 61 2 printf.c:22 (set (reg/v/f:HI 66 [ p ])
(mem/f/c/i:HI (post_inc:HI (reg/v/f:HI 60 [ ap.34 ])) [0 format+0 S2 A16])) 69 {movhi} (expr_list:REG_INC (reg/v/f:HI 60 [ ap.34 ])
(nil)))
So, somewhere in the combine step, GCC went bonkers.
Subscribe to:
Posts (Atom)