Fast division (Terje Mathisen)

Index Home About Blog
Date: Tue, 26 Jun 2007 17:56:03 +0200
From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch,comp.arch.arithmetic
Subject: Re: Algorithms for decomposing large divisions?
Message-ID: <uhp7l4-mf3.ln1@osl016lin.hda.hydro.com>

Dom Jackson wrote:
> Does anyone know of an algorithm to decompose 2N-bit integer divisions
> into 1N-bit integer divisions? I'm trying to implement, for example,
> 64-bit integer division with a 32-bit division instruction; I've spent
> a couple of hours with pencil and paper without success.
>
> I know how to implement the general case of multi-word division for
> high-precision integers, but this is inefficient for this case. The
> reason is that the sub-word that the MP integer is made up from must
> be no longer than *half* the word size that the processor has an
> integer divide instruction for.

 From your other posts, I assume that you're running on x86-64 hardware,
in which case you should be able to get a fairly fast answer using your
available resources:

(I assume both numbers are unsigned, if not fix them and adjust afterwards!)

1) Scale the divisor to fit in 64 bits, using BSR to count the number of
leading zero bits in the top half of it, then SHLD. Optionally round the
bottom bit up.

2) Do the same shift of the dividend.

3) Do two linked DIV operations to divide the (shifted) 128-bit input by
your shifted 64-bit divisor. (You can do a single DIV if the shifted
dividend also fits in 64 bits!)

4) Multiply this approximate result back by the original divisor,
subtract from the original dividend, then check the remainder.

While negative, decrement the result, and add in the divisor until it is
positive.
(If you did the round up step, then it cannot be negative!)

While larger than the divisor, increment the result and subtract the
original divisor.

So, what are the worst case runtime of this process?

Each DIV used to take about 40 cycles on 32-bit x86 cpus, I'm guessing
that it is in the same ballpark for 64-bit operands, or worst case about 70.

The initial adjustment stage needs about 10 cycles, most of which are
taken by the slow SHLD opcode.

The two DIVs would be 100-200 cycles, while the back-multiply/subtract
is about 30-40.

The final adjustment loop could be replaced by a second iteration of the
approximate division, but since the original divisor had at least 65
bits (otherwise the problem is trivial, right?), the maximum possible
size of the real division result is 64 bits, and the relative error
cannot be larger than about 2^-63 or so, right?

I.e. I suspect that the worst case would require 2 or 3 iterations, with
the average _very_ close to zero.

If this isn't fast enough for you, then I suggest using reciprocal
multiplication to replace the division steps:

Use the SSE approximate 1/x lookup opcode to generate a 12-bit (+)
accurate approximation, then one or two NR iterations to get a fp double
precision approximation which is effectively exact, modulo rounding errors.

Use this fp reciprocal to calculate a 50+ bit accurate result, then
convert this back to integer and do the same back-multiply and subtract
as in the previous approach.

Run this much faster loop twice, and you'll have your desired result,
possibly off by one.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Index Home About Blog