Monday, December 4, 2017

I spent some more time on the decompiler. Nothing special, just fixing bugs and making little improvements here and there.

The biggest fix I made was to the instruction simplification code. This fixed some clutter when working with byte values.

When simplifying, subsets of the instruction tree are passed through filters to find ways to simplify it. Currently, I've got about 20 filters in place. These cover cases like adding zero to a value, or multiplying by 1. Basically, we replace a subset of an instruction with a simpler equvalent operation.

Here's an example:

We start with something like "var2 = ((var1 >> 8) | (var1 << 8)) >> 8". This shows up when we have an assembly sequence like "SWPB R1; MOVB R1, R2".

First, we distribute the last right shift: "var2 = ((var1 >> 16) | ((var1 << 8) >> 8)"

Then we recognize that the first right shift will always result in zero: "var2 = 0 | ((var1 << 8) >> 8)"

Next, or'ing with zero has no impact on the result: "var2 = (var1 << 8) >> 8"

Finally, we replace the nested shifts with a logical and: "var2 = var1 & 256"

As you can see, this results in an instruction which is much easier to understand and process.

Here's that same instruction in the context of a complete function. The function just writes to a VDP register.

ORI     R1, >8000   # write to register
SWPB    R1          # swap bytes in R1
MOVB    R1, @>8C02  # write low byte (value) to VDP data port
SWPB    R1          #
MOVB    R1, @>8C02  # write high byte (reg ID) to VDP data port
B       *R11        # return


Here's the decompiler's output after processing that function:

(var223 = (((R1 | 0x8000) >> 8) | ((R1 | 0x8000) << 8)))
((U8*)(0x8C02) = (var223 >> 8))
((U8*)(0x8C02) = (var223 & 255))
(goto (U16*)(R11))


At first, this doesn't seem especially impressive, but remember that the decompiler found this function without human guidance, using only the compiled binary as input.

Saturday, December 2, 2017

So I spent some more time working on the decompiler. The part I was working on was subgraph recognition. Each flow control structure has a unique sructure, and by recognizing these structures, we can identify code blocks and the control operations.

Here are several example graphs to help explain this better. Each letter represents a block of non-branching code. The arrows show the possible execution paths between these blocks.

Example 1: Do while loop

A->B->C
^--'

Execution starts at block A, proceeds to block B and then either loops back to A or proceeds to C. This graph is equivalent to the following pseudocode:

do {
  A
} while(B);
C

Example 2: While loop

.-----v
A->B->C
^--'

This is similar to the do while loop, but the loop determination is in block A. Here's the equivalent pseudocode:

while(A) {
  B
}
C

Example 3: If

A->B->C
'-----^

This is a subgraph of the while loop, so the detection logic needs to distinguish between them. Again the decision whether to execute block B is made in block A. In either case, execution proceeds to block C. Here's the equivalent pseudocode:

if(A) {
  B
}
C

Example 4: If else

   .-----v
A->B  C->D
'-----^

I think you get the idea by now. Here's the equivalent pseudocode:

if(A) {
  B
} else {
  C
}
D

There's still tons to do for the decompiler, but getting this right is one of the central problems that needed to be solved. It's pretty neat to see this working.