I was looking a 32-bit division, and I was trying to find a way to take advantage of the 16-bit DIV instruction. Here's what I came up with:

X: numerator

Y: denominator

Q: ratio

N: 16-bit radix (2^16)

X/Y = Q

X = A*N+B

Y = C*N+D

N = 1<<16

Replace terms:

(A*N+B)/(C*N+D) = Q

B will be lost due to significant figures

Multiply denominator by ((1/C)/(1/C)), this eliminates 32-bit division:

(A*N)/((N+D/C)/C) = Q

Multiply numerator by ((1/(N+D/C))/(1/(N+D/C))):

((A*N)/((N+D/C))/C = Q

Multiply numerator by ((1/2)/(1/2)), this ensures all partial terms fit into a 16-bit quantity:

((A*N/2)/((N+D/C)/2))/C = Q

Decompose into partial terms:

V1 = D/C

V2 = N/2+V1/2

V3 = (A*N/2)/V2

P = V3/C

Due to the stackup of integer truncation, there will be rounding errors. Testing over the range of valid inputs shows that the result is accurate to +-1. Another step is required to fix this approximation.

Account for the approximation (Z is error due to truncation):

Q = P+Z

Replace Q with earlier equation

P+Z = A*N/(C*N+D)

Multiply both sides by (C*N+D):

A*N = (P*C*N+P*D)+(Z*C*N+Z*D)

Divide both sides by N, solve for Z.

Terms involving D are lost due to significant figures

A - P*C = Z*C

If Z<=0, any truncation error is covered by rounding error

If Z>0, the estimate "P" is one greater than the true result

So:

if(A > P*C), P := P-1

Untested assembly below. X Passed on [r1,r2], Y Passed on [r3,r4], result passed in [r1,r2], this assumes unsigned operands

# Cycles

mov r4, r5 # 14 : Copy D to temp register

clr r4 # 10 : Clear high word, prepare for division

div r3, r4 # 124 : R4 = C/D {V1}

srl r4, 1 # 14

ai r4, >1000 # 18 : R4 = N/2+V1/2 {V2}

mov r1, r5 # 14 : Save unmodified A for later

mov r1, r2 # 14

src r2, 1 # 14

ai r2, >1000 # 18

srl r1, 1 # 14 : [R1,R2] = A*N/2

div r1, r4 # 124 : R1 = (A*N/2)/V2 {V3}

mov r1, r2 # 14

clr r1 # 10

div r3, r1 # 124 : R1 = V3/C {P}

mpy r1, r3 # 52 : [R3,R4] = P*C

c r1, r5 # 14 : Compare A to P*C

jle +2 # 8 :

dec r1 # 10 :

mov r1, r2 # 14 : Move result into proper registers

clr r1 # 10 : [R1,R2] = P

total 634 clocks, 44 bytes (722 clocks including instruction loads)

This compares well to an earlier method using shifts and subtracts. That method uses a maximum of 7394 clocks and 50 bytes. (10082 clocks including instruction loads)

Assuming this all works out, this approach is fourteen times faster with a smaller footprint. I think I have a winner.

Subscribe to:
Post Comments (Atom)

## No comments:

## Post a Comment