Date: Tue, 26 Jun 2007 17:56:03 +0200 From: Terje Mathisen <terje.mathisen@hda.hydro.com> Newsgroups: comp.arch,comp.arch.arithmetic Subject: Re: Algorithms for decomposing large divisions? Message-ID: <uhp7l4-mf3.ln1@osl016lin.hda.hydro.com> Dom Jackson wrote: > Does anyone know of an algorithm to decompose 2N-bit integer divisions > into 1N-bit integer divisions? I'm trying to implement, for example, > 64-bit integer division with a 32-bit division instruction; I've spent > a couple of hours with pencil and paper without success. > > I know how to implement the general case of multi-word division for > high-precision integers, but this is inefficient for this case. The > reason is that the sub-word that the MP integer is made up from must > be no longer than *half* the word size that the processor has an > integer divide instruction for. From your other posts, I assume that you're running on x86-64 hardware, in which case you should be able to get a fairly fast answer using your available resources: (I assume both numbers are unsigned, if not fix them and adjust afterwards!) 1) Scale the divisor to fit in 64 bits, using BSR to count the number of leading zero bits in the top half of it, then SHLD. Optionally round the bottom bit up. 2) Do the same shift of the dividend. 3) Do two linked DIV operations to divide the (shifted) 128-bit input by your shifted 64-bit divisor. (You can do a single DIV if the shifted dividend also fits in 64 bits!) 4) Multiply this approximate result back by the original divisor, subtract from the original dividend, then check the remainder. While negative, decrement the result, and add in the divisor until it is positive. (If you did the round up step, then it cannot be negative!) While larger than the divisor, increment the result and subtract the original divisor. So, what are the worst case runtime of this process? Each DIV used to take about 40 cycles on 32-bit x86 cpus, I'm guessing that it is in the same ballpark for 64-bit operands, or worst case about 70. The initial adjustment stage needs about 10 cycles, most of which are taken by the slow SHLD opcode. The two DIVs would be 100-200 cycles, while the back-multiply/subtract is about 30-40. The final adjustment loop could be replaced by a second iteration of the approximate division, but since the original divisor had at least 65 bits (otherwise the problem is trivial, right?), the maximum possible size of the real division result is 64 bits, and the relative error cannot be larger than about 2^-63 or so, right? I.e. I suspect that the worst case would require 2 or 3 iterations, with the average _very_ close to zero. If this isn't fast enough for you, then I suggest using reciprocal multiplication to replace the division steps: Use the SSE approximate 1/x lookup opcode to generate a 12-bit (+) accurate approximation, then one or two NR iterations to get a fp double precision approximation which is effectively exact, modulo rounding errors. Use this fp reciprocal to calculate a 50+ bit accurate result, then convert this back to integer and do the same back-multiply and subtract as in the previous approach. Run this much faster loop twice, and you'll have your desired result, possibly off by one. Terje -- - <Terje.Mathisen@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"

Index Home About Blog