From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: What factors influence required memory alignment?
Date: Fri, 12 Jun 2009 12:03:36 -0700 (PDT)
Message-ID: <bc0c7aa3-94fe-4c0d-85b4-1d811143f5c8@b9g2000yqm.googlegroups.com>

On Jun 12, 7:27 am, joshc <josh.cu...@gmail.com> wrote:
> I'm wondering what factors influence the alignment required when
> accessing data from memory. For example, ARMv5 processors require that
> 32-bit accesses are 32-bit aligned to get the intended result (not
> shifted). What dictates this requirement for alignment?

It was thought that creating a requirement that all items be aligned
would be a good thing and eliminate a tiny amount of delay and
hardware in the critical Agen->Cache->LoadAlign path. I designed and
built several of these and was a vocal proponent of same (circa
1980-1993-ish).

It turns out that there is no good way to enforce this kind of
requirement. Indeed, FORTRAN common blocks are notorious for
preventing such alignment on DOUBLE PRECISION operands (also COMPLEX).
Lately, graphics processing algorithms have made it necessary to
support minimally aligned data to satisfy the requirements of short
vector operations (a la SSE, MMX) where several integer or floating
point values are packed into a single register and/or memory location.
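
A minimal C sketch of the COMMON-block effect (hypothetical; GCC's
packed attribute stands in for the sequential, unpadded layout that
COMMON storage association imposes):

     #include <stdio.h>
     #include <stddef.h>

     /* Like COMMON /BLK/ I, D with INTEGER I, DOUBLE PRECISION D:
        the 4-byte integer laid down ahead of the double leaves the
        double on a 4-byte boundary. */
     struct common_blk {
         int    i;
         double d;
     } __attribute__((packed));

     int main(void) {
         printf("offset of d = %zu\n", offsetof(struct common_blk, d));
         return 0;   /* prints 4: the double is not 8-byte aligned */
     }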

The ability to trap misaligned accesses was added to various x86
processors (at the request of MS) so that MS could take a couple of
years and banish the misaligned accesses from their OS(s). This proved
to be impossible in the large.

The actual penalty in cycle time is about one 2-input multiplexer
delay compared to a normal aligned-only cache access path, plus a
second carry chain in the AGEN adder and, when a misalignment spans a
cache bank boundary, more power. I consider this a much better
solution than trapping and fixing the problem in SW.
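
In C terms, the hardware's boundary-condition handling amounts to
something like this little-endian sketch (illustrative only; the real
thing is the extra mux and carry chain above):

     #include <stdint.h>
     #include <string.h>

     uint32_t load32_any(const uint8_t *p) {
         uintptr_t addr  = (uintptr_t)p;
         uintptr_t base  = addr & ~(uintptr_t)3;
         unsigned  shift = (unsigned)(addr & 3) * 8;
         uint32_t  lo, hi;
         memcpy(&lo, (const void *)base, 4);
         if (shift == 0)
             return lo;                  /* aligned: one access */
         memcpy(&hi, (const void *)(base + 4), 4);
         return (lo >> shift) | (hi << (32 - shift));
     }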

Thus, it is my considered opinion that, in this day and age, one
should simply build the misaligned data access path through the first
level cache, and have hardware solve the boundary condition problems.

This is one thing we RISC designers got wrong.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: What factors influence required memory alignment?
Date: Sat, 13 Jun 2009 18:38:24 -0700 (PDT)
Message-ID: <300cb1ae-36e9-4181-9448-0724a6a7125d@k38g2000yqh.googlegroups.com>

Can this loop be converted to run optimally in SSE, VIS, AltiVec?

     for( i = 0; i < max; i++ )
          a[i] = a[i]+b[i]*c[i];

Answer: Only if you know that a, b, and c are {double}quadword
aligned! Or do alignment in HW.

     for( i = 0; i < max; i+=4 )
     { // one computation instruction + 2 loads and 1 store
          a[i+0] = a[i+0]+b[i+0]*c[i+0];
          a[i+1] = a[i+1]+b[i+1]*c[i+1];
          a[i+2] = a[i+2]+b[i+2]*c[i+2];
          a[i+3] = a[i+3]+b[i+3]*c[i+3];
     }

Here, you end up computing, not at the alignment of the datum, but at
the alignment useful for the short vector instructions (4 or 8 of
these datums).

I hope everyone understands that the programmer should NOT have to
worry about alignment of data (in the large) when the compiler
attempts "vectorization" on this code snippet. It matters not whether
the base datum is byte, halfword, word, doubleword, quadword, or
doublequadword, in either floating point or integer. Thus, alignment
'tolerance' enhances the ability of the compiler to "vectorize"
natural computations into the short vector formats that are prevalent
in today's architectures.
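
For the float case, an alignment-tolerant SSE version of the loop
above might look like this sketch (unaligned loads/stores; assumes
max is a multiple of 4, with a scalar loop for any remainder):

     #include <xmmintrin.h>

     void madd(float *a, const float *b, const float *c, int max) {
         for (int i = 0; i + 4 <= max; i += 4) {
             __m128 vb = _mm_loadu_ps(&b[i]);  /* MOVUPS: any alignment */
             __m128 vc = _mm_loadu_ps(&c[i]);
             __m128 va = _mm_loadu_ps(&a[i]);
             va = _mm_add_ps(va, _mm_mul_ps(vb, vc));
             _mm_storeu_ps(&a[i], va);
         }
     }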

Mitch


From: John Levine <johnl@iecc.com>
Newsgroups: comp.arch
Subject: Re: What factors influence required memory alignment?
Date: Sun, 14 Jun 2009 20:23:50 +0000 (UTC)
Message-ID: <h13m8m$25fm$1@gal.iecc.com>

>I'm wondering what factors influence the alignment required when
>accessing data from memory.

The oldest machine I know with alignment rules was the IBM System/360
back in the mid 1960s.  There were a bunch of models with very
different implementations.  All but the smallest had two or four byte
wide core memory driven from sequential microcode, which would have
required extra cycles to deal with unaligned operands, a definite
performance hit.

Late in the life of the 360 series, they invented memory caches for
the 360/85, which got rid of much of the performance hit.  In the early
1970s, the follow-on 370 series removed nearly all data alignment
rules other than a few related to interrupts.  I've read some
articles they published on their more recent Z series processors,
which say that alignment has no performance impact at all any more.

R's,
John


Date: Mon, 15 Jun 2009 16:34:40 +0200
From: Terje Mathisen <"terje.mathisen at tmsw.no">
Newsgroups: comp.arch
Subject: Re: What factors influence required memory alignment?
Message-ID: <sJCdnRQTrLGcwavXnZ2dnUVZ_tudnZ2d@giganews.com>

Anton Ertl wrote:
> MitchAlsup <MitchAlsup@aol.com> writes:
>> Can this loop be converted to run optimally in SSE, VIS, AltiVec?
>>
>>     for( i = 0; i < max; i++ )
>>          a[i] = a[i]+b[i]*c[i];
>>
>> Answer: Only if you know that a, b, and c are {double}quadword
>> aligned! Or do alignment in HW.
>>
>>     for( i = 0; i < max; i+=4 )
>>     { // one computation instruction + 2 loads and 1 store
>>          a[i+0] = a[i+0]+b[i+0]*c[i+0];
>>          a[i+1] = a[i+1]+b[i+1]*c[i+1];
>>          a[i+2] = a[i+2]+b[i+2]*c[i+2];
>>          a[i+3] = a[i+3]+b[i+3]*c[i+3];
>>     }
>>
>> Here, you end up computing, not at the alignment of the datum, but at
>> the alignment useful for the short vector instructions (4 or 8 of
>> these datums).
>
> This is a good case for vector instructions not requiring vector
> alignment.  But element alignment still can be useful for finding
> mistakes in address arithmetic code.  Of course, some people say that
> we should program in higher-level languages where these mistakes
> cannot happen.  But then all element accesses will be aligned in these
> programs anyway, and there is no need to support misaligned accesses
> to elements.
>
> Of course, if max is large, it's probably still a good idea to use
> aligned accesses in the inner loop, and have a prologue and an
> epilogue that deals with the elements in front of the first aligned
> subvector and after the last aligned subvector.

The real problem with Mitch's example is that the three vectors a,b,c
could have totally independent alignment, in which case the best you can
possibly do is to handle single elements until a[n] is register-size
aligned, then use vector shuffle instructions on pairs of registers to
extract the corresponding parts of b[] and c[].

For production code you probably want to special-case at least the
following alternatives (a dispatch sketch follows the list):

a) All three vectors 16-byte aligned.

b) All three identically aligned but not 16-byte: This is the default
situation if you allocate all three arrays with normal malloc() and
specify a size which is a multiple of 16, due to the preceding control
block.

c) Arbitrary alignment of all arrays: Use the approach I outlined above.
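
The dispatch itself is cheap; a sketch (the three kernels are
hypothetical placeholders for the cases above):

     #include <stdint.h>

     /* Hypothetical kernels for cases a), b), c). */
     void madd_aligned  (float *a, const float *b, const float *c, int n);
     void madd_peel     (float *a, const float *b, const float *c, int n);
     void madd_unaligned(float *a, const float *b, const float *c, int n);

     void madd_dispatch(float *a, const float *b, const float *c, int n) {
         uintptr_t ia = (uintptr_t)a & 15;
         uintptr_t ib = (uintptr_t)b & 15;
         uintptr_t ic = (uintptr_t)c & 15;
         if ((ia | ib | ic) == 0)
             madd_aligned(a, b, c, n);    /* case a) */
         else if (ia == ib && ib == ic)
             madd_peel(a, b, c, n);       /* case b) */
         else
             madd_unaligned(a, b, c, n);  /* case c) */
     }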

Terje

PS. If you are very careful, it is often possible to do a variably-sized
preamble using aligned load/store and a write mask.
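
The write-mask half of that trick maps onto MASKMOVDQU; a rough SSE2
sketch, storing only the first n valid bytes of a vector (note the
instruction carries a non-temporal hint):

     #include <emmintrin.h>

     static void store_first_n(char *dst, __m128i data, int n) {
         /* byte indices 0..15; bytes with index < n get stored */
         __m128i idx  = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                      7,  6,  5,  4,  3,  2, 1, 0);
         __m128i mask = _mm_cmplt_epi8(idx, _mm_set1_epi8((char)n));
         _mm_maskmoveu_si128(data, mask, dst);
     }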

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


Date: Mon, 15 Jun 2009 16:23:27 +0200
From: Terje Mathisen <"terje.mathisen at tmsw.no">
Newsgroups: comp.arch
Subject: Re: What factors influence required memory alignment?
Message-ID: <VsadnaeFR_38xKvXnZ2dnUVZ_v-dnZ2d@giganews.com>

nmm1@cam.ac.uk wrote:
> In article <vs6dnc9mAuRCbq7XnZ2dnUVZ8vmdnZ2d@giganews.com>,
>  <jgd@cix.compulink.co.uk> wrote:
>> For an open source system, this may be as easy as rebuilding everything
>> and then trying to test everything and fix all the problems. As in, not
>> easy at all. For a commercial OS distributed as binaries, it isn't a
>> practical matter.
>
> Yes, indeed.  That is why I favour the hard line.  Just as with
> arithmetic exceptions, but that is even more a lost cause :-(

Mitch has a very crucial point though, in regard to short vector
operations: These really need to accept the natural alignment of the
constituent vector units, not require vector-register-size alignment.

Without this capability (i.e., the current situation), only very
carefully written code, with complete alignment control of all inputs
and outputs, can be vectorized automatically with anything close to
theoretical performance.

For my own SSE code I always allocate all data structures with 16-byte
alignment, but that means that I cannot accept arbitrary arrays as inputs.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


Date: Mon, 15 Jun 2009 16:18:17 +0200
From: Terje Mathisen <"terje.mathisen at tmsw.no">
Newsgroups: comp.arch
Subject: Re: What factors influence required memory alignment?
Message-ID: <-5Gdnfzo8KW3xavXnZ2dnUVZ_qWdnZ2d@giganews.com>

Stephen Fuld wrote:
> MitchAlsup wrote:
>
> snip
>
>> Note to Nick and Brett: nowhere did I mention removing the potential
>> for a misalignment trap. Just put it on a control bit in some control
>> register. For the majority of software that can live within aligned
>> only memory models, set the bit to take the trap. For those that
>> simply cannot, set the bit the other way.
>
> ISTM that it would be hard to argue with Mitch's "whatever way you want
> it" approach, though I am sure someone will.  :-)

I'll argue:

I would accept such a bit _now_, on x86, but only if it was per (user)
process, and preferably per code page as well: That would allow me to
turn on trapping for just a single interesting piece of code, without
having to worry about all the remaining OS and library segments that
would trap.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"



Date: Wed, 22 Apr 2009 17:49:47 +0200
From: Terje Mathisen <"terje.mathisen at tmsw.no">
Newsgroups: comp.arch
Subject: Re: The coming death of all RISC chips.
Message-ID: <YYudnXNLpcAGoXLUnZ2dnUVZ8v2dnZ2d@giganews.com>

Anton Ertl wrote:
> jgd@cix.compulink.co.uk writes:
>> In article <2009Apr22.140724@mips.complang.tuwien.ac.at>,
>> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>
>>> OTOH my impression is that the SIMD instructions are not utilized
>>> very well by compilers and most of the SIMD usage we see nowadays
>>> is due to hand-tuned kernels
>> This matches my experience, working with short arrays of doubles with
>> SSE2. The basic SSE2 load-pair-of-doubles instruction requires that they
>> be 16-byte aligned;
>
> That's a paradoxical screwup.  The architecture accepts byte alignment
> almost everywhere, but for SSE2 it wants stuff aligned not just to the
> element size (which would be reasonable), but even to the vector size
> (which is not).

It isn't nearly that bad:

There are perfectly usable unaligned SSE load/store instructions
available; it is only if you want to use the load-op combinations that
you have to follow the same rules as the rest of the x86 architecture:

"Always align all load/store operations at the size of the operation."

When you're doing 16-byte memory ops, they really do need to be 16-byte
aligned, if you want to have close to optimal performance.
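
In intrinsics form, the distinction looks like this (illustrative):

     #include <xmmintrin.h>

     __m128 two_loads(const float *p16, const float *p_any) {
         __m128 x = _mm_load_ps(p16);    /* MOVAPS: faults unless p16
                                            is 16-byte aligned */
         __m128 y = _mm_loadu_ps(p_any); /* MOVUPS: any alignment */
         return _mm_add_ps(x, y);
     }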

>
>> AMD have added an instruction that only requires that
>> the pair be 8-byte aligned, and now we just have to wait for it to be
>> generally available.
>
> Our Xeon 3070 (aka Core 2 Duo E6700) supports the MOVDQU instruction.

They all support something like that, even if the early versions didn't
expose this for fp load/store, only integer.

However, since the registers are the same, it is legal to use integer
ops on fp data and vice versa; Intel have just warned us that future
implementations (presumably with a split register file) might stall on
these casts.

In my own code I took the easy way out, which was to use a custom
allocator for all arrays, and then make sure that it always returned
16-byte aligned blocks.
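
Such an allocator can be as simple as the following sketch (not the
actual code): over-allocate, round up, and stash the raw pointer just
below the returned block so it can be freed later:

     #include <stdlib.h>
     #include <stdint.h>

     void *malloc16(size_t size) {
         void *raw = malloc(size + 15 + sizeof(void *));
         if (raw == NULL)
             return NULL;
         uintptr_t p = ((uintptr_t)raw + sizeof(void *) + 15)
                       & ~(uintptr_t)15;
         ((void **)p)[-1] = raw;   /* saved for free16() */
         return (void *)p;
     }

     void free16(void *p) {
         if (p != NULL)
             free(((void **)p)[-1]);
     }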

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


Date: Thu, 23 Apr 2009 21:57:26 +0200
From: Terje Mathisen <"terje.mathisen at tmsw.no">
Newsgroups: comp.arch
Subject: Re: The coming death of all RISC chips.
Message-ID: <yoqdnXpnjMK7VW3UnZ2dnUVZ8hCdnZ2d@giganews.com>

Anton Ertl wrote:
> Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
>> It isn't nearly that bad:
>>
>> There are perfectly usable unaligned SSE load/store instructions
>> available; it is only if you want to use the load-op combinations that
>> you have to follow the same rules as the rest of the x86 architecture:
>>
>> "Always align all load/store operations at the size of the operation."
>
> I don't know which x86 architecture you mean, but in the non-SSE parts

I mean the x86 architecture where my code runs at (close to) optimal speed!

Using misaligned memory ops, in particular writes, has been really bad
for a long time.

>> When you're doing 16-byte memory ops, they really do need to be 16-byte
>> aligned, if you want to have close to optimal performance.
>
> But if I don't know that the data is 16-byte aligned, then I would
> expect an unaligned load-op not to be any worse than an unaligned load
> and a reg-reg op.

Only if the load-op pipeline has a fast path for the unaligned case;
otherwise you would have to depend on the microarchitecture splitting
the load-op into two micro-ops.

>> In my own code I took the easy way out, which was to use a custom
>> allocator for all arrays, and then make sure that it always returned
>> 16-byte aligned blocks.
>
> A custom allocator won't help a compiler generate vectorized code,
> because the compiler won't know about the alignment coming from the

In the Intel code I've seen, the compiler depends heavily on just this,
in the form of #pragmas that specify 16-byte alignment of arrays.
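
For example (Intel's compiler; the pragma is a promise from the
programmer, which the compiler trusts rather than checks):

     void madd(float *a, const float *b, const float *c, int n) {
     #pragma vector aligned
         for (int i = 0; i < n; i++)
             a[i] = a[i] + b[i] * c[i];
     }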

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
