Saturday, September 10, 2011

I've had a thought about these comparison patterns. If I make a pattern which only allows comparisons using registers, I can force GCC to use the efficient forms.

I'm removing the 32-bit comparisons to see what happens. Ideally, GCC will break this operation down into two comparisons. If that works, The .md file will be just a little bit smaller, which makes maintenance easier.

Here's an example with 32-bit values:
int compare_long_mem_eq_0(long *a) {return(*a == 0 ? 7: 9);}

Original output:
compare_long_reg_eq_0
mov r1, r3
mov r2, r4
ci r3, >0
jne $+>4
ci r4, >0
jeq L2
li r1, >9
b *r11
jmp L6
L2
li r1, >7
b *r11
L6
even

Results after removing "tstsi" and "cmpsi" patterns
compare_long_mem_eq_0
mov *r1, r2
soc @>2(r1), r2
mov r2, r1 <-- Not necessary, work on that later
jeq L51
li r1, >9
jmp L52
L51
li r1, >7
L52
b *r11

This is quite a bit better, and is the worst-looking output of all comparison forms. Most of the time, we can do the compare in one instruction. Typically, ORing the two words together.

Eliminating the 32-bit comparisons works really well, but causes the compiler to crash in some cases. It turns out that in the peephole optimization for comparing values to +-1 and +-2, I used a "general_operand" type, which allows memory references as well as registers. Unfortunately, I used "peephole2_reg_dead" to determine if that optimization should be used. That function explodes if called upon to check anything other than registers. If GCC tries to use one of those patterns when comparing a memory value, we will get a crash.

This was fixed by only allowing registers for the optimization pattern, and all is well. Honestly, that's what it should have been all along.

As a side effect of this work, the code that determines which instructions affect the condition flags has been rewritten from scratch. That code was originally taken from the PDP-11, and modified slightly. Now that I'm more familiar with the GCC internals, I can see where the flaws are. The new code is smaller, easier to follow as well as more correct.

Since the 32-bit comparison work went so well, I thought I'd monkey with the 16-bit comparisons. Instead of worrying about scratch registers, I'll just force all comparisons to be done in registers. Basically what will happen is that values will be copied into registers for the comparison, and later steps will realize that the move itself will set the comparison flags, resulting in clean looking output.

Sadly, that doesn't work out. Much more effort is done to set up the comparison than what I had before. Usually, there are several extra MOVs and no effort is made to use the additional optimizations. Oh well. What I have is still pretty good.

Thursday, September 8, 2011

On the AtariAge forums, Tursi mentioned that C99 (I think) placed an EVEN after all string constants, which annoyed him. I can see why they did that, since I did the same thing in GCC. I'm not going to get to that right now, but I wanted to make a note of that so I wouldn't forget.

Currently, I'm looking into more bugs Lucien2 found while writing Rush Hour.

There was a problem with an "unrecognizable instruction" which was a nice easy fix. The problem here is that there was a pattern for comparisons against zero (X == 0), which called for a scratch register. This extra register is useful to handle cases where we need to compare against a post-incremented pointer (*r1+), or indexed addresses (@5(r1)). In these cases we can emit an instruction like "mov *r1+, r0" or "mov @5(r1), r0" which is smaller, faster, and has no side-effects.

If the need for a zero compare is made early enough, this scratch register request is ignored, and everything is fine. If this happens later, as the result of an optimization, GCC looks for an exact match. This fails, because of the scratch register request, which doesn't exist at this point.

This was fixed by making a second, unnamed pattern which compares against zero, but uses no scratch register. In this case, we emit an instruction like "mov @5(r1), @5(r1)", and disallow post-incremented pointers. This is not as efficient, but will work just fine.

Wednesday, September 7, 2011

So I've been looking into this code, which is used to set a VDP address:

movb r9, @>8C02 <-- should be copying low byte
mov r9, r2
ori r2, >4000
movb r2, @>8C02

I turned on all debug output to dig into this sequence, and follow its development.

At some point during register alocation, this instruction is being deleted:
(insn 36 35 39 4 lucien2_5.c:18 (set (reg:QI 2 r2 [orig:40 i+1 ] [40])
(subreg:QI (reg/v:HI 9 r9 [orig:26 i ] [26]) 1)) 72 {movqi} (nil))

I'm not sure why, that looks fine to me. Instead, GCC assumes the wrong byte usage, eventually resulting in "movb r9, @>8C02".

The subreg expression seems to be removed entirely, and is replaced with an instruction like "mov r2, r2", as seen below:

Reloads for insn # 36
Reload 0: reload_in (QI) = (reg:QI 9 r9)
REAL_REGS, RELOAD_FOR_INPUT (opnum = 1), can't combine
reload_in_reg: (subreg:QI (reg/v:HI 9 r9 [orig:26 i ] [26]) 1)
reload_reg_rtx: (reg:QI 2 r2 [orig:40 i+1 ] [40])

Later this instruction is deleted, and since the subreg expression is lost, we use the wrong byte. So I need to find where this is being done.

Call tree for problem location:
ira
reload
alter_reg
df_ref_change_reg_with_loc
df_ref_change_reg_with_loc_1
reload_as_needed
subst_reloads

OK, here's the problem. The correct subreg expression needed for proper byte handling is removed in subst_reloads, but that code does not make the determination to do the removal. There is a data structure containing operations to apply for each instruction which subst_reloads duitfully applies. I need to find the code which makes the decisions. Which means I need to start over again. Ugh.

Once agin:
find_reloads
push_reload

The decision to do register substitution is done in find_reloads, but that determination is made due do a default handler. As I understand it, if a set instruction is used, but the input argument is to be reloaded, and that reload has not yet been made, the input is replaced with the output argument in an attempt to later remove this instruction. I now need to find why GCC needs to replace the subreg argument, since that looks perfectly fine to me. That decision looks to be somewhere in push_reload.

Nope. That is done in find_reloads. This is the most obtuse code I've looked at in a long time. The find_reloads function by itself is over 2000 lines of twisty logic. It's taken almost two weeks just to answer what I thought was a simple question.

It turns out that if there is a subreg expression as an instruction operand, GCC forces that expression to be reloaded, with the hope that it can be removed later. I have added a check to not do that in the case where a byte-to-word or word-to-byte expression is used. This preserves the subreg expression until instruction output, where we can emit the correct code.

So, at long last, we get good-looking code for this sequence:

mov r9, r2
swpb r2
movb r2, @>8C02
mov r9, r2
ori r2, >4000
movb r2, @>8C02

I'm cleaning out the TONS of debug output and path tracing code, and moving on to the next problem.