Index Home About Blog
Date: Tue, 08 May 2007 08:55:08 +0200
From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: ISA quickie survey: min/max instructions?
Message-ID: <1gj5h4-9mg.ln1@osl016lin.hda.hydro.com>

Andrew Reilly wrote:
> Hi all,
>
> I was just wondering about the "conditional-ish" arithmetic operations:
> abs, min, max, and maybe others.  Most of these are available in SIMD
> instruction sets, but I haven't tracked down many of them in scalar
> instruction sets.  They could be handy for avoiding branches.  Is
> subtract+conditional move the accepted best way to handle these?  Might
> min and max show up (both for integer and floating point) in future scalar
> instruction sets? In particular, min and max would seem to fit the
> two-source-one-destination pattern, where conditional move doesn't, quite.

Having max/min integer opcodes will only help with that particular
idiom, while CMOVcc can be used in a lot of different situations.

On a modern, relatively deep, pipeline with good branch prediction, the
fastest way to handle min()/max() can often turn out to be the naive
CMP/JA/MOV sequence, as long as the branch pattern predicts well:

; Return MAX(eax, ebx)

   cmp eax,ebx
    ja done
   mov eax,ebx
done:

If EAX is the largest value and the branch is correctly predicted, then
the code above will have a latency of a single cycle!

With EBX the larger and correct prediction it takes two cycles.

In both cases a branch mispredict causes a 5-20 cycle hit.

The conditional version is
   cmp eax,ebx
   cmovb eax,ebx

This has a fixed latency of three cycles, so if the branch predictors
above hits 90+% or better, the average cost of the if () {} version
might actually be less than the 3-cycle CMOV-based code.

Two caveats:

a) Sometimes the worst case time is just as important as the average or
best case, then you should definitely use CMOV. This is also important
for crypto code where you definitely want the execution time to be
constant/independent of the key.

b) The branch predictor is a very limited resource (i.e. 128 entries on
a brand-new Core 2 Duo), so using it for a lot of min/max operations
might blow away some other more useful entry.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Index Home About Blog