I spent some more time on the decompiler. Nothing special, just fixing bugs and making little improvements here and there.
The biggest fix I made was to the instruction simplification code. This fixed some clutter when working with byte values.
When simplifying, subsets of the instruction tree are passed through filters to find ways to simplify it. Currently, I've got about 20 filters in place. These cover cases like adding zero to a value, or multiplying by 1. Basically, we replace a subset of an instruction with a simpler equvalent operation.
Here's an example:
We start with something like "var2 = ((var1 >> 8) | (var1 << 8)) >> 8". This shows up when we have an assembly sequence like "SWPB R1; MOVB R1, R2".
First, we distribute the last right shift: "var2 = ((var1 >> 16) | ((var1 << 8) >> 8)"
Then we recognize that the first right shift will always result in zero: "var2 = 0 | ((var1 << 8) >> 8)"
Next, or'ing with zero has no impact on the result: "var2 = (var1 << 8) >> 8"
Finally, we replace the nested shifts with a logical and: "var2 = var1 & 256"
As you can see, this results in an instruction which is much easier to understand and process.
Here's that same instruction in the context of a complete function. The function just writes to a VDP register.
ORI R1, >8000 # write to register
SWPB R1 # swap bytes in R1
MOVB R1, @>8C02 # write low byte (value) to VDP data port
SWPB R1 #
MOVB R1, @>8C02 # write high byte (reg ID) to VDP data port
B *R11 # return
Here's the decompiler's output after processing that function:
(var223 = (((R1 | 0x8000) >> 8) | ((R1 | 0x8000) << 8)))
((U8*)(0x8C02) = (var223 >> 8))
((U8*)(0x8C02) = (var223 & 255))
(goto (U16*)(R11))
At first, this doesn't seem especially impressive, but remember that the decompiler found this function without human guidance, using only the compiled binary as input.
Monday, December 4, 2017
Saturday, December 2, 2017
So I spent some more time working on the decompiler. The part I was working on was subgraph recognition. Each flow control structure has a unique sructure, and by recognizing these structures, we can identify code blocks and the control operations.
Here are several example graphs to help explain this better. Each letter represents a block of non-branching code. The arrows show the possible execution paths between these blocks.
Example 1: Do while loop
A->B->C
^--'
Execution starts at block A, proceeds to block B and then either loops back to A or proceeds to C. This graph is equivalent to the following pseudocode:
do {
A
} while(B);
C
Example 2: While loop
.-----v
A->B->C
^--'
This is similar to the do while loop, but the loop determination is in block A. Here's the equivalent pseudocode:
while(A) {
B
}
C
Example 3: If
A->B->C
'-----^
This is a subgraph of the while loop, so the detection logic needs to distinguish between them. Again the decision whether to execute block B is made in block A. In either case, execution proceeds to block C. Here's the equivalent pseudocode:
if(A) {
B
}
C
Example 4: If else
.-----v
A->B C->D
'-----^
I think you get the idea by now. Here's the equivalent pseudocode:
if(A) {
B
} else {
C
}
D
There's still tons to do for the decompiler, but getting this right is one of the central problems that needed to be solved. It's pretty neat to see this working.
Here are several example graphs to help explain this better. Each letter represents a block of non-branching code. The arrows show the possible execution paths between these blocks.
Example 1: Do while loop
A->B->C
^--'
Execution starts at block A, proceeds to block B and then either loops back to A or proceeds to C. This graph is equivalent to the following pseudocode:
do {
A
} while(B);
C
Example 2: While loop
.-----v
A->B->C
^--'
This is similar to the do while loop, but the loop determination is in block A. Here's the equivalent pseudocode:
while(A) {
B
}
C
Example 3: If
A->B->C
'-----^
This is a subgraph of the while loop, so the detection logic needs to distinguish between them. Again the decision whether to execute block B is made in block A. In either case, execution proceeds to block C. Here's the equivalent pseudocode:
if(A) {
B
}
C
Example 4: If else
.-----v
A->B C->D
'-----^
I think you get the idea by now. Here's the equivalent pseudocode:
if(A) {
B
} else {
C
}
D
There's still tons to do for the decompiler, but getting this right is one of the central problems that needed to be solved. It's pretty neat to see this working.
Sunday, November 19, 2017
It's been a long time since I've updated this, but I've still been busy doing TI stuff.
There's really no way to reconstruct the development details, but here's a summary of where things currently stand:
GCC and Binutils:
These are pretty mature, and haven't really needed much work. I did find a minor flaw handling RTWP in the assembler, but that probably doesn't warrant a patch release on it's own.
TI Disk OS:
This is an attempt to write a replacement Linux-like OS for the TI-99/4a. I've currently got process management, time functions, wait queues and a primitive device driver layer. There's about 5000 lines for everything.
I got stalled trying to finalize a design for the file system and started working on the decompiler. I haven't done any OS work for a few months now.
Decompiler:
I started working on a general-purpose decompiler, but it's still in early devlopment phase. The idea was to make a platform-dependant disassembler which converts binary to a platform-independant representation. The tool then takes that instruction graph and does register lifetime calculation and flow control recognition to identify variables and flow control structures (if, while, for, etc.) to output high-level pseudocode.
I currently have the disassembly, variable reconstruction and most of the flow control recognition completed. There's about 9000 lines of code written so far, so it's pretty far along, but there's probably about 50 to 60 percent of the work left to do.
Other stuff:
I finished a libc implementation in support of the OS project, which has been released. I also finished a floating-point emulation library, which has also been released. The emulator was written mainly for the challenge, but is still a useful bit of code.
Today's update:
I was working on the decompiler today and was focused on flow control recognition.
The idea is that code is broken down into blocks. Within each block, execution proceeds without branches. The relationships between the blocks is preserved, resulting in a graph of possible execution paths. By analyzing the structure, the flow control operations can be determined. Once this is known, the pseudocode can be properly formatted.
The first phase is to convert binary to a general representation.. The second phase identifies chunks of executable code based on static analysis of the instruction stream. The third phase was the focus of todays work, to convert these unstructured code chunks into the execution graph.
This was done by converting branch instructions to graph nodes and non-branching chunks of code to graph edges between the nodes. Now that this has been done, I can do flow control recognition by doing subgraph isomorphism tests. Once a subgraph has been identified, that flow control type is recorded for later and that subgraph is collapsed to a single edge. By repeating these isomorphism tests, we can identify nested flow control of arbitrary complexity.
I know this has been a little short of details, so I'll try to provide more in later updates.
There's really no way to reconstruct the development details, but here's a summary of where things currently stand:
GCC and Binutils:
These are pretty mature, and haven't really needed much work. I did find a minor flaw handling RTWP in the assembler, but that probably doesn't warrant a patch release on it's own.
TI Disk OS:
This is an attempt to write a replacement Linux-like OS for the TI-99/4a. I've currently got process management, time functions, wait queues and a primitive device driver layer. There's about 5000 lines for everything.
I got stalled trying to finalize a design for the file system and started working on the decompiler. I haven't done any OS work for a few months now.
Decompiler:
I started working on a general-purpose decompiler, but it's still in early devlopment phase. The idea was to make a platform-dependant disassembler which converts binary to a platform-independant representation. The tool then takes that instruction graph and does register lifetime calculation and flow control recognition to identify variables and flow control structures (if, while, for, etc.) to output high-level pseudocode.
I currently have the disassembly, variable reconstruction and most of the flow control recognition completed. There's about 9000 lines of code written so far, so it's pretty far along, but there's probably about 50 to 60 percent of the work left to do.
Other stuff:
I finished a libc implementation in support of the OS project, which has been released. I also finished a floating-point emulation library, which has also been released. The emulator was written mainly for the challenge, but is still a useful bit of code.
Today's update:
I was working on the decompiler today and was focused on flow control recognition.
The idea is that code is broken down into blocks. Within each block, execution proceeds without branches. The relationships between the blocks is preserved, resulting in a graph of possible execution paths. By analyzing the structure, the flow control operations can be determined. Once this is known, the pseudocode can be properly formatted.
The first phase is to convert binary to a general representation.. The second phase identifies chunks of executable code based on static analysis of the instruction stream. The third phase was the focus of todays work, to convert these unstructured code chunks into the execution graph.
This was done by converting branch instructions to graph nodes and non-branching chunks of code to graph edges between the nodes. Now that this has been done, I can do flow control recognition by doing subgraph isomorphism tests. Once a subgraph has been identified, that flow control type is recorded for later and that subgraph is collapsed to a single edge. By repeating these isomorphism tests, we can identify nested flow control of arbitrary complexity.
I know this has been a little short of details, so I'll try to provide more in later updates.
Subscribe to:
Posts (Atom)