I've been looking through the tms9900.c file, here are some random notes for later:
The SP currently points at the bottommost used register, but every other machine I know points at the topmost free address, I should probably chnage that. Shouldn't be a big deal, but may break backwards-compatability.
I fixed the df_ref_record crash. The problem was not respecting the difference between strict and non-strict validation in GO_IF_LEGITIMATE_ADDRESS. When in non-strict mode, all pseudoregisters are treated as if they were real registers. In strict mode, they are treated as memory locations.
Since this distinction was not made, the value (r10+18) was treated as a register, accepted by GO_IF_LEGITIMATE_ADDRESS, and things blew up later. Now, that construct is seen as invalid, and GCC refactors the expression to find an equivalent valid form.
Back to printf...
Friday, March 30, 2012
Wednesday, March 28, 2012
I recently upgraded my laptop to Ubuntu 11.10, and this has made a mess of the porting effort. Since I needed to relink against new libraries, Binutils and GCC need to be rebuilt. This shouldn't be a problem, but for some reason, the -Werror flag (treat all warnings as errors) is now respected, where it seemed to be ignored before. None of the warnings are in my code, so I don't feel bad disabling warnings.
So I've got to update the build directions for Binutils.
patching:
$ cd {path_to_original_files}
$ patch -p1 < patchfile
binutils:
$ ./configure --target tms9900 --prefix /home/eric/dev/tios/toolchain --disable-build-warnings
$ make all
$ make install
GCC:
$ ./configure --prefix /home/eric/dev/tios/toolchain --target=tms9900 --enable-languages=c
$ make all-gcc
$ make install
So I've got to update the build directions for Binutils.
patching:
$ cd {path_to_original_files}
$ patch -p1 < patchfile
binutils:
$ ./configure --target tms9900 --prefix /home/eric/dev/tios/toolchain --disable-build-warnings
$ make all
$ make install
GCC:
$ ./configure --prefix /home/eric/dev/tios/toolchain --target=tms9900 --enable-languages=c
$ make all-gcc
$ make install
Sunday, March 25, 2012
I Found this during testing:
printf.c: In function ‘printf’:
printf.c:239: internal compiler error: in df_ref_record, at df-scan.c:2803
This was the faulting expression:
(mem/c:HI (plus:HI (reg/f:HI 10 r10)
(const_int 18 [0x12])) [6 %sfp+10 S2 A16])
Since this expression doesn't show up in the debug output, I need to backtrace to find out what happened here
df_ref_record (cl=DF_REF_REGULAR)
df_uses_record (in the POST_INC case)
RTL of offending instruction:
(set (reg/v:QI 2 r2 [orig:49 c ] [49])
(mem:QI (post_inc:HI (mem/c:HI (plus:HI (reg/f:HI 10 r10)
(const_int 18 [0x12])) [6 %sfp+10 S2 A16])) [0 S1 A8]))
It looks like this came from this instruction, seen in 168r.asmcons
(insn 39 168 41 5 printf.c:152 (set (reg/v:QI 49 [ c ])
(mem:QI (post_inc:HI (reg/v/f:HI 46 [ p.44 ])) [0 S1 A8])) 71 {movqi} (expr_list:REG_INC (reg/v/f:HI 46 [ p.44 ])
(nil)))
In a more readable form:
movb 18(r10+), r2
It took me a while to figure this out, but GCC is not trying to increment R10, but the expression (R10+18), which rightly fails.
The pseudoregisgter R46 is "p" (a pointer into the current string) which is used later, as is R10, which is the stack pointer. This means that the post-increment expression needs to be removed somehow.
I imagine if I used the "register" keyword, that would probably fix this problem, but this would just show up again sometime later.
I need to take a closer look at the PDP11 config to see how they dealt with the problem.
printf.c: In function ‘printf’:
printf.c:239: internal compiler error: in df_ref_record, at df-scan.c:2803
This was the faulting expression:
(mem/c:HI (plus:HI (reg/f:HI 10 r10)
(const_int 18 [0x12])) [6 %sfp+10 S2 A16])
Since this expression doesn't show up in the debug output, I need to backtrace to find out what happened here
df_ref_record (cl=DF_REF_REGULAR)
df_uses_record (in the POST_INC case)
RTL of offending instruction:
(set (reg/v:QI 2 r2 [orig:49 c ] [49])
(mem:QI (post_inc:HI (mem/c:HI (plus:HI (reg/f:HI 10 r10)
(const_int 18 [0x12])) [6 %sfp+10 S2 A16])) [0 S1 A8]))
It looks like this came from this instruction, seen in 168r.asmcons
(insn 39 168 41 5 printf.c:152 (set (reg/v:QI 49 [ c ])
(mem:QI (post_inc:HI (reg/v/f:HI 46 [ p.44 ])) [0 S1 A8])) 71 {movqi} (expr_list:REG_INC (reg/v/f:HI 46 [ p.44 ])
(nil)))
In a more readable form:
movb 18(r10+), r2
It took me a while to figure this out, but GCC is not trying to increment R10, but the expression (R10+18), which rightly fails.
The pseudoregisgter R46 is "p" (a pointer into the current string) which is used later, as is R10, which is the stack pointer. This means that the post-increment expression needs to be removed somehow.
I imagine if I used the "register" keyword, that would probably fix this problem, but this would just show up again sometime later.
I need to take a closer look at the PDP11 config to see how they dealt with the problem.
Saturday, March 24, 2012
I played around with the compiler trying to reduce the test program output and found a few things.
There is a note attached to the add instruction that the R3 regisgter in HI mode is unused after that instruction, so we could skip the "jnc, inc" sequence.
The check commonly used for the other machines, "find_reg_note(insn, REG_UNUSED, operands[0])", will not work here. There are two reasons for this. The first is that the operand is in SI mode, but the note is for HI mode. The second is that the function only checks for pointer equality, and does not compare the RTX expressions to which the pointers refer. I'd have to make a new find_reg_note function to do note checking.
If I just change the order of typecasting, all this goes away:
char test(long a) {return((char)a+'0');}
Becomes:
test
mov r2, r1
swpb r1
ai r1, >3000
b *r11
This is the ideal sequence I was looking for originally, so another best practice seems to be to to late promotion and early demotion of datatypes. That seems to result in better code.
So for now, I'll put compiler changes on hold, and just go back to printf.
There is a note attached to the add instruction that the R3 regisgter in HI mode is unused after that instruction, so we could skip the "jnc, inc" sequence.
The check commonly used for the other machines, "find_reg_note(insn, REG_UNUSED, operands[0])", will not work here. There are two reasons for this. The first is that the operand is in SI mode, but the note is for HI mode. The second is that the function only checks for pointer equality, and does not compare the RTX expressions to which the pointers refer. I'd have to make a new find_reg_note function to do note checking.
If I just change the order of typecasting, all this goes away:
char test(long a) {return((char)a+'0');}
Becomes:
test
mov r2, r1
swpb r1
ai r1, >3000
b *r11
This is the ideal sequence I was looking for originally, so another best practice seems to be to to late promotion and early demotion of datatypes. That seems to result in better code.
So for now, I'll put compiler changes on hold, and just go back to printf.
Friday, March 23, 2012
I'm tracing backwards through the call tree to find where this problem shows up. Call tree is shown below:
ira
reload
cleanup_subreg_operands
alter_subreg
simplify_subreg
Found it. The problem is in the change I made earlier in simplify-rtx. That change was to handle int-to-char conversions, but it doesn't handle long-to-char properly. It was finding the correct register to use after conversion, but the wrong mode, this was described earlier during initial investigation.
The fix was to first convert long-to-int to find the correct register, then int-to-char to complete the conversion.
Finally, I get this code, which is longer than strictly necessary
test
clr r3 # Clear upper word of temp long
mov r2, r4 # Copy lower word of input to temp
ai r4, >30 # Add '0' to temp
jnc $+>4 # Add carry bit to upper word of temp
inc r3
mov r4, r1 # Copy low word of temp into output register
swpb r1 # Move result into position for return
b *r11
A better result would be more this like:
test
ai r2, >30 # Add '0' to low word of input
mov r2, r1 # Copy result into output register
swpb r1 # Move result into position for return
b *r11
ira
reload
cleanup_subreg_operands
alter_subreg
simplify_subreg
Found it. The problem is in the change I made earlier in simplify-rtx. That change was to handle int-to-char conversions, but it doesn't handle long-to-char properly. It was finding the correct register to use after conversion, but the wrong mode, this was described earlier during initial investigation.
The fix was to first convert long-to-int to find the correct register, then int-to-char to complete the conversion.
Finally, I get this code, which is longer than strictly necessary
test
clr r3 # Clear upper word of temp long
mov r2, r4 # Copy lower word of input to temp
ai r4, >30 # Add '0' to temp
jnc $+>4 # Add carry bit to upper word of temp
inc r3
mov r4, r1 # Copy low word of temp into output register
swpb r1 # Move result into position for return
b *r11
A better result would be more this like:
test
ai r2, >30 # Add '0' to low word of input
mov r2, r1 # Copy result into output register
swpb r1 # Move result into position for return
b *r11
Thursday, March 22, 2012
Here's a simplified version of the instructions up to step 182r.peephole2 for the "long" type.
inputs: r1,r2=val, r3=a_base (unused)
insn 24: r3=0 \_ Set temp 32-bit value to zero
insn 25: r4=r3 /
insn 28: r4=r2 <- Copy low word into temp value
insn 12: r4&=0x0F <- Apply mask
insn 13: r3(32-bit)+=48 <- '0'=48 <- Adding constant to low word
insn 14: r1=(char)r4
insn 15: number_buf[0]=r1
insn 33: return
GCC is claiming that R3 is unused (this is true), and so any manipulation of R3 can be safely removed. GCC also seems to be expanding this to R4, as it's part of the temporary 32-bit value defined in insn 24 and 25.
That results in these instructions being deleted in step 182:
insn 13
insn 12
insn 28
insn 25
insn 24
We are left with R4, with the note that it's in SImode (32-bit value), somehow the notion that R4 is the low word is lost, and R5 is selected for the low word of the value. This results in the final code:
print_hex
mov r5, r1
swpb r1
movb r1, @number_buf
b *r11
Similar behavior was seen with this code as well:
char test(long a) {return(a+'0');}
test
mov r5, r1
swpb r1
b *r11
Testing will continue with this example, since it's much shorter.
I found that the problem happens during register allocation. In all prior steps, the RTX looks good, but at 172r.ira, the wrong register is chosen, which makes a mess of everything from that point on.
168r.asmcoms
insn 3: (HI)((SI)r24)[0] = r1
insn 4: (HI)((SI)r24)[1] = r2
insn 9: (SI)r26=(SI)r24+48
insn 10: (QI)r27=(QI)((SI)r26)[3]
insn 16: (QI)r1=(QI)r27
172r.ira
insn 3: r3=r1
insn 4: r4=r2
insn 9: (SI)r3+=48
insn 10: (QI)r1=(SI)4 <- this is wrong, should be (SI)r3
inputs: r1,r2=val, r3=a_base (unused)
insn 24: r3=0 \_ Set temp 32-bit value to zero
insn 25: r4=r3 /
insn 28: r4=r2 <- Copy low word into temp value
insn 12: r4&=0x0F <- Apply mask
insn 13: r3(32-bit)+=48 <- '0'=48 <- Adding constant to low word
insn 14: r1=(char)r4
insn 15: number_buf[0]=r1
insn 33: return
GCC is claiming that R3 is unused (this is true), and so any manipulation of R3 can be safely removed. GCC also seems to be expanding this to R4, as it's part of the temporary 32-bit value defined in insn 24 and 25.
That results in these instructions being deleted in step 182:
insn 13
insn 12
insn 28
insn 25
insn 24
We are left with R4, with the note that it's in SImode (32-bit value), somehow the notion that R4 is the low word is lost, and R5 is selected for the low word of the value. This results in the final code:
print_hex
mov r5, r1
swpb r1
movb r1, @number_buf
b *r11
Similar behavior was seen with this code as well:
char test(long a) {return(a+'0');}
test
mov r5, r1
swpb r1
b *r11
Testing will continue with this example, since it's much shorter.
I found that the problem happens during register allocation. In all prior steps, the RTX looks good, but at 172r.ira, the wrong register is chosen, which makes a mess of everything from that point on.
168r.asmcoms
insn 3: (HI)((SI)r24)[0] = r1
insn 4: (HI)((SI)r24)[1] = r2
insn 9: (SI)r26=(SI)r24+48
insn 10: (QI)r27=(QI)((SI)r26)[3]
insn 16: (QI)r1=(QI)r27
172r.ira
insn 3: r3=r1
insn 4: r4=r2
insn 9: (SI)r3+=48
insn 10: (QI)r1=(SI)4 <- this is wrong, should be (SI)r3
Wednesday, March 21, 2012
Ok, this is wierd. I was working on a hex printing routine, and was seeing some strange behavior, so I've isolated it below:
void print_hex(long val, char a_base)
{
char *p=number_buf;
unsigned char a = ((val & 0x0F) + '0');
*p++ = a;
}
This results in this assembly:
print_hex
mov r5, r1
swpb r1
movb r1, @number_buf
b *r11
If "val" is defined as an "int" all is well:
print_hex
li r2, >F
mov r2, r2 <-- Wierd, but OK I guess
inv r2
szc r2, r1
ai r1, >30
swpb r1
movb r1, @number_buf
b *r11
I'm totally confused.
The wierd MOV instruction was from the "andhi3" pattern. The idea was to move data into the correct temporary register for the SZC instruction. This move was inserted every time SZC is used, even if it would be pointless, as in the example above.
I've reworked the word and byte "and" patterns to emit optimal code. As a side-effect, some of the gyrations needed to handle SZC have been reduced. Here's the new code for this function:
print_hex
andi r1, >F
ai r1, >30
swpb r1
movb r1, @number_buf
b *r11
Much better. Now back to bizarro-ness.
The problem shows up in step 182r.peephole2, but I'm not sure why these instructions were deleted. I see that there are notes like "DCE: Deleting insn 25". DCE stands for Dead Code Elimination, but the code in question is not dead. Or at least I don't think it is.
This is the first deleted instruction:
(insn 13 12 14 2 printhex.c:17 (set (reg:SI 3 r3 [27])
(plus:SI (reg:SI 3 r3 [26])
(const_int 48 [0x30]))) 56 {addsi3} (expr_list:REG_UNUSED (reg:SI 3 r3 [27])
(nil)))
This would be equivalent to the "ai r1, >30" instruction in the working int case. For some reason, this code is marked as unused, and removing this instruction causes all prior code to be marked and deleted as well.
Now I just need to find out why GCC think this is unused.
void print_hex(long val, char a_base)
{
char *p=number_buf;
unsigned char a = ((val & 0x0F) + '0');
*p++ = a;
}
This results in this assembly:
print_hex
mov r5, r1
swpb r1
movb r1, @number_buf
b *r11
If "val" is defined as an "int" all is well:
print_hex
li r2, >F
mov r2, r2 <-- Wierd, but OK I guess
inv r2
szc r2, r1
ai r1, >30
swpb r1
movb r1, @number_buf
b *r11
I'm totally confused.
The wierd MOV instruction was from the "andhi3" pattern. The idea was to move data into the correct temporary register for the SZC instruction. This move was inserted every time SZC is used, even if it would be pointless, as in the example above.
I've reworked the word and byte "and" patterns to emit optimal code. As a side-effect, some of the gyrations needed to handle SZC have been reduced. Here's the new code for this function:
print_hex
andi r1, >F
ai r1, >30
swpb r1
movb r1, @number_buf
b *r11
Much better. Now back to bizarro-ness.
The problem shows up in step 182r.peephole2, but I'm not sure why these instructions were deleted. I see that there are notes like "DCE: Deleting insn 25". DCE stands for Dead Code Elimination, but the code in question is not dead. Or at least I don't think it is.
This is the first deleted instruction:
(insn 13 12 14 2 printhex.c:17 (set (reg:SI 3 r3 [27])
(plus:SI (reg:SI 3 r3 [26])
(const_int 48 [0x30]))) 56 {addsi3} (expr_list:REG_UNUSED (reg:SI 3 r3 [27])
(nil)))
This would be equivalent to the "ai r1, >30" instruction in the working int case. For some reason, this code is marked as unused, and removing this instruction causes all prior code to be marked and deleted as well.
Now I just need to find out why GCC think this is unused.
More time spent working on printf. I found a case where the fake PC register snuck into the allocator by GCC attempting to hold the low part of a 32-bit value starting in R15.
This was fixed by changing the HARD_REGNO_MODE_OK macro to disallow 32-bit values from being stored in R15. This ensure that only 16-bit values can only be used in the fake PC register, and we already ensure that only happens as part of a function epilogue.
This was fixed by changing the HARD_REGNO_MODE_OK macro to disallow 32-bit values from being stored in R15. This ensure that only 16-bit values can only be used in the fake PC register, and we already ensure that only happens as part of a function epilogue.
Tuesday, March 20, 2012
Lucien found more problems.
The first is a result of breaking up his main.c file into two pieces. I don't have internet access right now, so I can't work on this. The other problem was this code omitting the test for KEY's value:
#define KEY (*(volatile char*)0x8375)
int test1(int rpt){
if(KEY!=0xFF && !rpt) return(0);
else return(1);
}
This was just an unexpected enforcement of type limits. The proposed check against 0xFF is beyond the type limits of KEY and so need never be checked, since it would always return false.
Fixed output:
test1
movb @>8375, r2
seto r3
cb r2, r3
jeq L7
clr r2
mov r1, r3
jne L8
mov r2, r1
L3
b *r11
jmp L9
L8
li r2, >1
mov r2, r1
jmp L3
L7
li r1, >1
b *r11
Smaller code size results from this however:
#define KEY (*(volatile char*)0x8375)
int test2(int rpt){
if(KEY==-1 || rpt==0) return(1);
else return(0);
}
test2
mov r1, r2
movb @>8375, r1
seto r3
cb r1, r3
jeq L6
clr r1
mov r2, r3
jeq L6
b *r11
jmp L8
L6
li r1, >1
b *r11
The first form has three exit points(KEY==-1, KEY!=-1&&rpt==0, else clause), and has three code fragments for setting the return values. The second form only has two exits (KEY==-1||rpt==0, else clause), and so has less code. I'll generalize this and assume that AND conditions generate more code that their equvalent OR tests. I should do more tests to confirm that, but I feel pretty good about that assumption.
In the process of investigating this, I saw that the recent fixes for subract broke a peephole optimization for comparisons against small constant values (+-2 and +-1). Fixed now.
I tought I might be able to get faster code for some byte compare conditions, but I was wrong. The idea was to use SWPB to convert to a word value and then use INC-like instructions to set the flags. This is a bad idea. More cycles would be wasted in the setup to properly set the upper byte than would be saved by not using CB. Poop.
The best I can do is to use CI with cleverly-chosen constants for inequality compares (>, >=, <, <=), and MOVB for compares with zero. All of which I'm already doing.
The first is a result of breaking up his main.c file into two pieces. I don't have internet access right now, so I can't work on this. The other problem was this code omitting the test for KEY's value:
#define KEY (*(volatile char*)0x8375)
int test1(int rpt){
if(KEY!=0xFF && !rpt) return(0);
else return(1);
}
This was just an unexpected enforcement of type limits. The proposed check against 0xFF is beyond the type limits of KEY and so need never be checked, since it would always return false.
Fixed output:
test1
movb @>8375, r2
seto r3
cb r2, r3
jeq L7
clr r2
mov r1, r3
jne L8
mov r2, r1
L3
b *r11
jmp L9
L8
li r2, >1
mov r2, r1
jmp L3
L7
li r1, >1
b *r11
Smaller code size results from this however:
#define KEY (*(volatile char*)0x8375)
int test2(int rpt){
if(KEY==-1 || rpt==0) return(1);
else return(0);
}
test2
mov r1, r2
movb @>8375, r1
seto r3
cb r1, r3
jeq L6
clr r1
mov r2, r3
jeq L6
b *r11
jmp L8
L6
li r1, >1
b *r11
The first form has three exit points(KEY==-1, KEY!=-1&&rpt==0, else clause), and has three code fragments for setting the return values. The second form only has two exits (KEY==-1||rpt==0, else clause), and so has less code. I'll generalize this and assume that AND conditions generate more code that their equvalent OR tests. I should do more tests to confirm that, but I feel pretty good about that assumption.
In the process of investigating this, I saw that the recent fixes for subract broke a peephole optimization for comparisons against small constant values (+-2 and +-1). Fixed now.
I tought I might be able to get faster code for some byte compare conditions, but I was wrong. The idea was to use SWPB to convert to a word value and then use INC-like instructions to set the flags. This is a bad idea. More cycles would be wasted in the setup to properly set the upper byte than would be saved by not using CB. Poop.
The best I can do is to use CI with cleverly-chosen constants for inequality compares (>, >=, <, <=), and MOVB for compares with zero. All of which I'm already doing.
Thursday, March 15, 2012
Posted on AtariAge:
Well, after a really long time without any signs of life from this project, it's patch time.
A big "thank you" goes out to Lucien. A lot of the updates here are a direct result of the effort he put into making Rush Hour. He did a great job wading through all the brokenness to make a functional game. Now it's time for everyone to benefit from that work.
New Binutils fixes in this release:
STST was incorrectly looking for two arguments
SBO, SBZ and TB incorrectly using constants
EQU'ed symbols sometimes replaced using wrong endianness
GCC fixes:
Fixed several word-to-byte conversion errors
Fixed "unrecognizable instruction" for zero comparison operations
Made optimizations for most comparison operations
Improved correctness of condition flag handling
Switch statements now work properly
Fixed divison and modulus, operands were used in wrong order
Fixed subtract, operands were occasionally used in wrong order
Fixed stack frame corruption when local variables are in use
Added optimizations for forms like (int Y)=((int)(char X))<
The patch and build procedures are the same as always. Development notes are on my blog for those who are interested.
Things are shaping up pretty well so far. (Yes it is taking forever, sorry about that.) I don't see any obvious holes to fill, or optimizations yet to do. At this point, I just need to exercize the compiler with larger programs and increase test coverage. If anyone finds a problem, or sees an area where improvements can be made, please let me know.
I'm continuing to work on related projects (disk management tool, libc library, documentation). There's still lots to do, so these updates will keep coming
Well, after a really long time without any signs of life from this project, it's patch time.
A big "thank you" goes out to Lucien. A lot of the updates here are a direct result of the effort he put into making Rush Hour. He did a great job wading through all the brokenness to make a functional game. Now it's time for everyone to benefit from that work.
New Binutils fixes in this release:
STST was incorrectly looking for two arguments
SBO, SBZ and TB incorrectly using constants
EQU'ed symbols sometimes replaced using wrong endianness
GCC fixes:
Fixed several word-to-byte conversion errors
Fixed "unrecognizable instruction" for zero comparison operations
Made optimizations for most comparison operations
Improved correctness of condition flag handling
Switch statements now work properly
Fixed divison and modulus, operands were used in wrong order
Fixed subtract, operands were occasionally used in wrong order
Fixed stack frame corruption when local variables are in use
Added optimizations for forms like (int Y)=((int)(char X))<
The patch and build procedures are the same as always. Development notes are on my blog for those who are interested.
Things are shaping up pretty well so far. (Yes it is taking forever, sorry about that.) I don't see any obvious holes to fill, or optimizations yet to do. At this point, I just need to exercize the compiler with larger programs and increase test coverage. If anyone finds a problem, or sees an area where improvements can be made, please let me know.
I'm continuing to work on related projects (disk management tool, libc library, documentation). There's still lots to do, so these updates will keep coming
Monday, March 12, 2012
I've been going through the compiled code looking for more opprotunities for optimizations. I found this sequence, which looks interesting:
(int y) = ((int)(charx))<<8
sra r2, 8 12+2*8+4=32
sla r2, >8 12+2*8+4=32
total: 64 clocks, 4 bytes
This could be replaced with:
andi r2, >FF00 14+8+4=26
total: 26 clocks, 4 bytes
more generally:
(int y) = ((int)(charx))<< N
sra r2, 8 12+2*8+4 = 32
sla r2, N 12+2*N+4 = 16+2*N
total: 48+2N clocks, 4 bytes
sla r2, N-8 12+2*N-2*8+4 = 2*N
andi r2, 0xFFFF< total: 26+2N clocks, 6 bytes
This looks pretty darn good, about 33% faster on average.
Truth table below:
N Original pattern Result Optimization
- ----------------- ----------------- ----
0 01234567.xxxxxxxx -> xxxxxxxx.01234567 swpb
1 01234567.xxxxxxxx -> xxxxxxx0.1234567x >>7
2 01234567.xxxxxxxx -> xxxxxx01.234567xx >>6
3 01234567.xxxxxxxx -> xxxxx012.34567xxx >>5
4 01234567.xxxxxxxx -> xxxx0123.4567xxxx >>4
5 01234567.xxxxxxxx -> xxx01234.567xxxxx >>3
6 01234567.xxxxxxxx -> xx012345.67xxxxxx >>2
7 01234567.xxxxxxxx -> x0123456.7xxxxxxx >>1
8 01234567.xxxxxxxx -> 01234567.xxxxxxxx nop
9 01234567.xxxxxxxx -> 1234567x.xxxxxxxx <<1
A 01234567.xxxxxxxx -> 234567xx.xxxxxxxx <<2
B 01234567.xxxxxxxx -> 34567xxx.xxxxxxxx <<3
C 01234567.xxxxxxxx -> 4567xxxx.xxxxxxxx <<4
D 01234567.xxxxxxxx -> 567xxxxx.xxxxxxxx <<5
E 01234567.xxxxxxxx -> 67xxxxxx.xxxxxxxx <<6
F 01234567.xxxxxxxx -> 7xxxxxxx.xxxxxxxx <<7
(int y) = ((int)(charx))<<8
sra r2, 8 12+2*8+4=32
sla r2, >8 12+2*8+4=32
total: 64 clocks, 4 bytes
This could be replaced with:
andi r2, >FF00 14+8+4=26
total: 26 clocks, 4 bytes
more generally:
(int y) = ((int)(charx))<< N
sra r2, 8 12+2*8+4 = 32
sla r2, N 12+2*N+4 = 16+2*N
total: 48+2N clocks, 4 bytes
sla r2, N-8 12+2*N-2*8+4 = 2*N
andi r2, 0xFFFF<
This looks pretty darn good, about 33% faster on average.
Truth table below:
N Original pattern Result Optimization
- ----------------- ----------------- ----
0 01234567.xxxxxxxx -> xxxxxxxx.01234567 swpb
1 01234567.xxxxxxxx -> xxxxxxx0.1234567x >>7
2 01234567.xxxxxxxx -> xxxxxx01.234567xx >>6
3 01234567.xxxxxxxx -> xxxxx012.34567xxx >>5
4 01234567.xxxxxxxx -> xxxx0123.4567xxxx >>4
5 01234567.xxxxxxxx -> xxx01234.567xxxxx >>3
6 01234567.xxxxxxxx -> xx012345.67xxxxxx >>2
7 01234567.xxxxxxxx -> x0123456.7xxxxxxx >>1
8 01234567.xxxxxxxx -> 01234567.xxxxxxxx nop
9 01234567.xxxxxxxx -> 1234567x.xxxxxxxx <<1
A 01234567.xxxxxxxx -> 234567xx.xxxxxxxx <<2
B 01234567.xxxxxxxx -> 34567xxx.xxxxxxxx <<3
C 01234567.xxxxxxxx -> 4567xxxx.xxxxxxxx <<4
D 01234567.xxxxxxxx -> 567xxxxx.xxxxxxxx <<5
E 01234567.xxxxxxxx -> 67xxxxxx.xxxxxxxx <<6
F 01234567.xxxxxxxx -> 7xxxxxxx.xxxxxxxx <<7
Sunday, March 11, 2012
I don't know what happened, but that error doesn't show up anymore. I didn't make any compiler changes, so that bug could happen again. Unfortunately (for bug hunting) there is a pass in GCC to shorten branches. That pass automatically reorganizes the code to make optimally shorter jumps. That means forcing a condition where this bug can occur is tricky.
I ran a bunch of tests to try to recreate this problem, but came up short. I might have to just wait until a repeatable test case is found. That makes me nervous, but I'm not sure what else to do.
I ran a bunch of tests to try to recreate this problem, but came up short. I might have to just wait until a repeatable test case is found. That makes me nervous, but I'm not sure what else to do.
Friday, March 9, 2012
I've decided to see how far I can get without installing new software.
The problem seems to always be related to stuff in show_score. I've changed that function to just do this:
void show_score() {
char s[7];
if(score) {
itoa(score,s);
itoa(score,s);}
}
show_score
ai r10, >FFF2 allocate 14 bytes
mov r10, r0
mov r11, *r0+ sp[0]=r11
mov r9, *r0+ sp[2]=r9
mov r13, *r0+ sp[4]=r13
mov @score, r1 r1=score
jeq L193 if score==0 goto L193
li r9, itoa r9=itoa
mov r10, r13 r13=sp
inct r13
mov r13, r2 r2=&sp[2] (should be &sp[6], r9, r13 lost)
bl *r9
mov @score, r1
mov r13, r2
bl *r9
L193
mov *r10+, r11
mov *r10+, r9
mov *r10+, r13
ai r10, >8
b *r11
It looks like the location of the stack variables is being calculated wrongly. Time to dig into that.
The problem is that the stack code replaced the local frame with the stack pointer plus offset, the offset was incorrectly calculated. The size of he saved registers were not taken into consideration. Now that's taken care of, things are much better.
I'm a terrible liar. Building from scratch resulted in this mess:
main.o: In function `show_score':
(.text+0x9e5): relocation truncated to fit: R_TMS9900_PC8 against `.text'
make: *** [all] Error 1
000009d0:
9d0: 02 2a ff f4 ai r10, >FFF4
9d4: c6 8b mov r11, *r10
9d6: ca 89 00 02 mov r9, @>0002(r10)
9da: c0 60 00 28 mov @>0028, r1
9de: 13 00 jeq 0
9e0: d1 6a 00 04 movb @>0004(r10), r5
9e4: 13 00 jeq 0
9e6: c2 4a mov r10, r9
9e8: 02 29 00 04 ai r9, >0004
9ec: c0 49 mov r9, r1
Here's that relocation:
000009e5 00000101 R_TMS9900_PC8 00000000 .text + ae6
This results in a displacement of 0x101, which won't fit in the JEQ instruction. This is disappointing. I'll have to find some way to distinguish short and long jumps and stick that in the compiler.
check out s390_shorten_branches for possible ideas also look for "branch" and "range" in the other machine config files.
The problem seems to always be related to stuff in show_score. I've changed that function to just do this:
void show_score() {
char s[7];
if(score) {
itoa(score,s);
itoa(score,s);}
}
show_score
ai r10, >FFF2 allocate 14 bytes
mov r10, r0
mov r11, *r0+ sp[0]=r11
mov r9, *r0+ sp[2]=r9
mov r13, *r0+ sp[4]=r13
mov @score, r1 r1=score
jeq L193 if score==0 goto L193
li r9, itoa r9=itoa
mov r10, r13 r13=sp
inct r13
mov r13, r2 r2=&sp[2] (should be &sp[6], r9, r13 lost)
bl *r9
mov @score, r1
mov r13, r2
bl *r9
L193
mov *r10+, r11
mov *r10+, r9
mov *r10+, r13
ai r10, >8
b *r11
It looks like the location of the stack variables is being calculated wrongly. Time to dig into that.
The problem is that the stack code replaced the local frame with the stack pointer plus offset, the offset was incorrectly calculated. The size of he saved registers were not taken into consideration. Now that's taken care of, things are much better.
I'm a terrible liar. Building from scratch resulted in this mess:
main.o: In function `show_score':
(.text+0x9e5): relocation truncated to fit: R_TMS9900_PC8 against `.text'
make: *** [all] Error 1
000009d0
9d0: 02 2a ff f4 ai r10, >FFF4
9d4: c6 8b mov r11, *r10
9d6: ca 89 00 02 mov r9, @>0002(r10)
9da: c0 60 00 28 mov @>0028, r1
9de: 13 00 jeq 0
9e0: d1 6a 00 04 movb @>0004(r10), r5
9e4: 13 00 jeq 0
9e6: c2 4a mov r10, r9
9e8: 02 29 00 04 ai r9, >0004
9ec: c0 49 mov r9, r1
Here's that relocation:
000009e5 00000101 R_TMS9900_PC8 00000000 .text + ae6
This results in a displacement of 0x101, which won't fit in the JEQ instruction. This is disappointing. I'll have to find some way to distinguish short and long jumps and stick that in the compiler.
check out s390_shorten_branches for possible ideas also look for "branch" and "range" in the other machine config files.
Thursday, March 8, 2012
I've gone through the Rush Hour code, and all the awkward workarounds have been backed out except for a problem with this line:
display_at(0,32-strlen(s),s);
For some reason this is being converted to:
display_at(0,strlen(s)-32,s);
If strlen is defined as an external funciton, or a variable is used instead of a constant, this works properly. So, lets follow the development of this code...
from 128r.expand:
(insn 23 22 24 7 emw_test.c:5 (set (reg:HI 23 [ prephitmp.56 ])
(minus:HI (const_int 32 [0x20])
(reg:HI 26))) -1 (nil))
133r.vregs
(insn 23 22 24 6 emw_test.c:7 (set (reg:HI 23 [ prephitmp.56 ])
(minus:HI (const_int 32 [0x20])
(reg:HI 26))) 62 {subhi3} (nil))
140r.gcse1
(insn 23 22 24 7 emw_test.c:7 (set (reg:HI 23 [ prephitmp.56 ])
(minus:HI (const_int 32 [0x20])
(reg:HI 26))) 62 {subhi3} (expr_list:REG_DEAD (reg:HI 26)
(nil)))
162r.regmove
(insn 23 22 24 6 emw_test.c:7 (set (reg:HI 23 [ prephitmp.56 ])
(minus:HI (const_int 32 [0x20])
(reg:HI 23 [ prephitmp.56 ]))) 62 {subhi3} (nil))
This is correct so far, but something changes in step 172 during register allocation.
from 172r.ira:
(insn 23 22 24 emw_test.c:5 (set (reg:HI 2 r2 [orig:23 prephitmp.56 ] [23])
(minus:HI (reg:HI 2 r2 [orig:23 prephitmp.56 ] [23])
(const_int 32 [0x20]))) 62 {subhi3} (nil))
This same pattern shows up even in this simple example:
int sub_seven_word_r(int a) {return(7-a);}
So it turns out I had a typo in the machine description. I mistakenly told GCC that subtraction is commutative, and that operands can be swapped if that will result in better code. However once I fixed that, I started seeing this for the simple examples:
sub_word_seven_reg
li r2, >7
s r1, r2
mov r2, r1
b *r11
sub_byte_one_reg
li r2, >100
sb r1, r2
movb r2, r1
b *r11
sub_byte_one_mem
li r1, >100
sb @memb_a, r1
movb r1, @memb_a
b *r11
sub_word_mem_mem2
mov @memw_b, r1
s @memw_a, r1
mov r1, @memw_a
b *r11
This could be made better:
sub_word_seven_reg
addi r1, -7
neg r1
b *r11
sub_byte_one_reg
li r2, >100
sb r2, r1
neg r1
b *r11
sub_byte_one_mem
li r2, >100
sb r2, @memb_a
neg @memb_a
b *r11
sub_word_mem_mem2
s @memw_b, @memw_a
neg @memw_a
b *r11
So at this point, all of Lucien's issues have been addressed, and Rush Hour has had all the workarounds replaced with correct code. Well, all except for the VDP code which was the first thing I looked at. I have no idea what's going on here, maybe it's a stack corruption problem.
The problem seen when the VDP workarounds are backed out is that after moving the first piece, any keypress which would move the cursor causes the program to restart. I'll probably need to install Classic99 to examine this stack issue.
display_at(0,32-strlen(s),s);
For some reason this is being converted to:
display_at(0,strlen(s)-32,s);
If strlen is defined as an external funciton, or a variable is used instead of a constant, this works properly. So, lets follow the development of this code...
from 128r.expand:
(insn 23 22 24 7 emw_test.c:5 (set (reg:HI 23 [ prephitmp.56 ])
(minus:HI (const_int 32 [0x20])
(reg:HI 26))) -1 (nil))
133r.vregs
(insn 23 22 24 6 emw_test.c:7 (set (reg:HI 23 [ prephitmp.56 ])
(minus:HI (const_int 32 [0x20])
(reg:HI 26))) 62 {subhi3} (nil))
140r.gcse1
(insn 23 22 24 7 emw_test.c:7 (set (reg:HI 23 [ prephitmp.56 ])
(minus:HI (const_int 32 [0x20])
(reg:HI 26))) 62 {subhi3} (expr_list:REG_DEAD (reg:HI 26)
(nil)))
162r.regmove
(insn 23 22 24 6 emw_test.c:7 (set (reg:HI 23 [ prephitmp.56 ])
(minus:HI (const_int 32 [0x20])
(reg:HI 23 [ prephitmp.56 ]))) 62 {subhi3} (nil))
This is correct so far, but something changes in step 172 during register allocation.
from 172r.ira:
(insn 23 22 24 emw_test.c:5 (set (reg:HI 2 r2 [orig:23 prephitmp.56 ] [23])
(minus:HI (reg:HI 2 r2 [orig:23 prephitmp.56 ] [23])
(const_int 32 [0x20]))) 62 {subhi3} (nil))
This same pattern shows up even in this simple example:
int sub_seven_word_r(int a) {return(7-a);}
So it turns out I had a typo in the machine description. I mistakenly told GCC that subtraction is commutative, and that operands can be swapped if that will result in better code. However once I fixed that, I started seeing this for the simple examples:
sub_word_seven_reg
li r2, >7
s r1, r2
mov r2, r1
b *r11
sub_byte_one_reg
li r2, >100
sb r1, r2
movb r2, r1
b *r11
sub_byte_one_mem
li r1, >100
sb @memb_a, r1
movb r1, @memb_a
b *r11
sub_word_mem_mem2
mov @memw_b, r1
s @memw_a, r1
mov r1, @memw_a
b *r11
This could be made better:
sub_word_seven_reg
addi r1, -7
neg r1
b *r11
sub_byte_one_reg
li r2, >100
sb r2, r1
neg r1
b *r11
sub_byte_one_mem
li r2, >100
sb r2, @memb_a
neg @memb_a
b *r11
sub_word_mem_mem2
s @memw_b, @memw_a
neg @memw_a
b *r11
So at this point, all of Lucien's issues have been addressed, and Rush Hour has had all the workarounds replaced with correct code. Well, all except for the VDP code which was the first thing I looked at. I have no idea what's going on here, maybe it's a stack corruption problem.
The problem seen when the VDP workarounds are backed out is that after moving the first piece, any keypress which would move the cursor causes the program to restart. I'll probably need to install Classic99 to examine this stack issue.
Tuesday, March 6, 2012
I fixed a problem with division and modulus. Apparently I shouldn't be allowed to read data sheets anymore. The operands were swapped, and the wrong byte of the return value was returned as a result (A/B retuned B%A, A%B returned B/A).
What makes me sick is that I specifically went over all this earlier, and made a point of spending lots of time to make this right. I failed.
At least it seems to work now.
What makes me sick is that I specifically went over all this earlier, and made a point of spending lots of time to make this right. I failed.
At least it seems to work now.
Monday, March 5, 2012
I took a detour for a minute and decided to make what I thought was an easy change and add cases to the switch statement in the input loop. Now I've got problems. Backing out vdp copy code changes for now.
To start with, there were a lot of errors like "(.text+0x15dc): undefined reference to `.L388'". That was fixed by changing the case label format in the compiler. The common unnamned label looks like ".L192", but the "." symbol is not supported in the TI assembler. Earlier, I changed to the "L192" format, but I missed this usage. The mismatch resulted in all the link errors.
That done, all key checks in the switch no longer work. I need to analyse the jump table to see what the heck is going on here.
switch labels:
data L386 - default
data L387 - 'S', 8,83, 0x08,0x53
data L388 - 'D', 9,68, 0x09,0x44
data L389 - 'X',10,88, 0x0A,0x58
data L390 - 'E',11,69, 0x0B,0x45
data L391 - ' ',13,32, 0x0D,0x20
jump calculation:
movb r9, r1
ai r1, >F800 <-- R1-=8, (min case=8)
ci r1, >50FF <-- check for max case (0x58-8=0x50)
jle JMP_13
b @L386 <-- beyond case extent, use default
JMP_13
srl r1, 8 <-- convert to int, 0<=R1<=0x50
a r1, r1 <-- convert to byte offset
b @L392(r1) <-- jump to case
Jump table:
index 0,75 -> L387 -> S
index 1,60 -> L388 -> D
index 2,80 -> L389 -> X
index 3,61 -> L390 -> E
index 5,24 -> L391 -> space
all else -> L386
After looking at this code for a long time, I think I know what's going on.
It's the instruction "b @L392(r1)". I looked through the E/A manual and the 9900 data sheets, and as far as I can tell, the next instruction exectued wil be the one at L392+r1. This seems obvious, but I'm used to working with machines which would instead load the value at this address, then jump to the loaded address. I need to change that to:
mov @L392(r1), r1
b *r1
That means more digging into GCC. I'd also like to find the code which decides when to use a jump table. This example seems like it would be smaller with nested if's
Well, the switch vs. if code was not found, but the correct branch code has been implemented, and works just fine.
To start with, there were a lot of errors like "(.text+0x15dc): undefined reference to `.L388'". That was fixed by changing the case label format in the compiler. The common unnamned label looks like ".L192", but the "." symbol is not supported in the TI assembler. Earlier, I changed to the "L192" format, but I missed this usage. The mismatch resulted in all the link errors.
That done, all key checks in the switch no longer work. I need to analyse the jump table to see what the heck is going on here.
switch labels:
data L386 - default
data L387 - 'S', 8,83, 0x08,0x53
data L388 - 'D', 9,68, 0x09,0x44
data L389 - 'X',10,88, 0x0A,0x58
data L390 - 'E',11,69, 0x0B,0x45
data L391 - ' ',13,32, 0x0D,0x20
jump calculation:
movb r9, r1
ai r1, >F800 <-- R1-=8, (min case=8)
ci r1, >50FF <-- check for max case (0x58-8=0x50)
jle JMP_13
b @L386 <-- beyond case extent, use default
JMP_13
srl r1, 8 <-- convert to int, 0<=R1<=0x50
a r1, r1 <-- convert to byte offset
b @L392(r1) <-- jump to case
Jump table:
index 0,75 -> L387 -> S
index 1,60 -> L388 -> D
index 2,80 -> L389 -> X
index 3,61 -> L390 -> E
index 5,24 -> L391 -> space
all else -> L386
After looking at this code for a long time, I think I know what's going on.
It's the instruction "b @L392(r1)". I looked through the E/A manual and the 9900 data sheets, and as far as I can tell, the next instruction exectued wil be the one at L392+r1. This seems obvious, but I'm used to working with machines which would instead load the value at this address, then jump to the loaded address. I need to change that to:
mov @L392(r1), r1
b *r1
That means more digging into GCC. I'd also like to find the code which decides when to use a jump table. This example seems like it would be smaller with nested if's
Well, the switch vs. if code was not found, but the correct branch code has been implemented, and works just fine.
Sunday, March 4, 2012
I got ahold of version three of Rush Hour, which contains all Lucien's fixes, and is considered the final code.
A disk image was made of the compiled, unmodified code. Seems to work just fine. Now I can start removing his workarounds and restore the code he originally meant to write.
First step, removing the unneeded "mov r1, r1" from vdp_copy_from_sys.
Works, but the cursor is corrupted after moving a piece. Hmm.
Dump of original code:
00000000:
0: c0 41 mov r1, r1
2: c0 41 mov r1, r1
4: c0 41 mov r1, r1
6: c0 41 mov r1, r1
8: a0 c2 a r2, r3
a: c1 01 mov r1, r4
c: 06 c4 swpb r4
e: d8 04 8c 02 movb r4, @>8C02
12: 02 61 40 00 ori r1, >4000
16: d8 01 8c 02 movb r1, @>8C02
1a: 80 c2 c r2, r3
1c: 14 00 jhe 0
This seems to be inlined all over the place, so I need to look closely at all references.
A disk image was made of the compiled, unmodified code. Seems to work just fine. Now I can start removing his workarounds and restore the code he originally meant to write.
First step, removing the unneeded "mov r1, r1" from vdp_copy_from_sys.
Works, but the cursor is corrupted after moving a piece. Hmm.
Dump of original code:
00000000
0: c0 41 mov r1, r1
2: c0 41 mov r1, r1
4: c0 41 mov r1, r1
6: c0 41 mov r1, r1
8: a0 c2 a r2, r3
a: c1 01 mov r1, r4
c: 06 c4 swpb r4
e: d8 04 8c 02 movb r4, @>8C02
12: 02 61 40 00 ori r1, >4000
16: d8 01 8c 02 movb r1, @>8C02
1a: 80 c2 c r2, r3
1c: 14 00 jhe 0
This seems to be inlined all over the place, so I need to look closely at all references.
Friday, March 2, 2012
I've found the problem in my disk tool. The three-byte cluster record in the FIB block was being improperly constructed. This caused the program loader to get nonsense sectors, resulting in a crash.
I confirmed this with a known-good disk image. The files it contained were extracted, and a new image made with those files. That new image ran without problems.
I spent some more time doing cleanup and comments, and exercised the options a bit more to gain confidence in the results. The disk tool still needs some work though. For instance, data file handling is incomplete, handling disk-full conditions is iffy at best, and using the tool at the command line is awkward, with lots of easy-to-forget switches.
I confirmed this with a known-good disk image. The files it contained were extracted, and a new image made with those files. That new image ran without problems.
I spent some more time doing cleanup and comments, and exercised the options a bit more to gain confidence in the results. The disk tool still needs some work though. For instance, data file handling is incomplete, handling disk-full conditions is iffy at best, and using the tool at the command line is awkward, with lots of easy-to-forget switches.
Since I've totally forgotten what I was doing, I've decided to go over Rush Hour once again, looking for evidence of my shame.
Everything compiles without error with the latest changes, but I need to execute the new image to truly confirm this.
That means I need to make a disk image. Hopefully that tool was left in good shape.
Unfortunately, no. The resulting disk will not load in E/A, and crashes Disk Manager. That's bad.
Everything compiles without error with the latest changes, but I need to execute the new image to truly confirm this.
That means I need to make a disk image. Hopefully that tool was left in good shape.
Unfortunately, no. The resulting disk will not load in E/A, and crashes Disk Manager. That's bad.
Subscribe to:
Posts (Atom)