Index Home About Blog
From: Maynard Handley <name99@name99.org>
Newsgroups: comp.arch
Subject: Re: Reasons for the big paradigm switch
Message-ID: <name99-E1C7C0.22474020112006@localhost>
Date: Tue, 21 Nov 2006 06:47:44 GMT

In article <i42334-0mf.ln1@osl016lin.hda.hydro.com>,
 Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:

> JJ wrote:
> > Again that damned Memory Wall, newer cpus only seem to get faster for
> > problems that have high locality of reference say video codecs
>
> Video codecs???
>
> Have you taken a look at h264 (HD-DVD/Blueray), with the CABAC coding?
>
> 1) Motion vectors can go pretty much anywhere, to any previously decoded
> image: This means huge memory buffers! (NVidias recent announcement said
> that you need a dualcore cpu and 1 GB ram just to decode HD-DVD, even
> with presumably as much hw assist as they could get out of the GPU.)
>
> 2) The actual bit-for-bit decoding is horrible, with extreme dependency
> chains, branching that by design has to be nearly impossible to predict
> (otherwise you should have compressed better, right?).

 Both true --- though truth is that there is, as you would expect, some
spatial locality in the prediction, even though in theory it can be
anywhere in any one of a dozen or so earlier frames. The problem, of
course, is do you provision for the absolute worst possible case, or do
you provision for the expected case and just resign yourself to skipping
frames and whatever else is necessary to try to keep up when things get
bad?

What is especially tragic is that the arithmetic coding didn't have to
be so retarded. The arithmetic coding in JPEG2000 is (relatively) very
CPU friendly. But the H264 guys made some minor changes that have major
effects on how it can be implemented. The standard tragedy of refusing
to believe that anyone else might have done something better than you
can, and so you should at least look at their ideas.


> The truth of the matter is that this _very_ important video codec is
> tailormade to be extremely hard to handle on a modern cpu. :-(
>
> Terje


Date: Sun, 19 Nov 2006 21:29:16 +0100
From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: HD-DVD decoding (Was: Re: Reasons for the big paradigm switch)
Message-ID: <xdSdnZzGa_M8I_3YnZ2dnUVZ8tidnZ2d@giganews.com>

Bernd Paysan wrote:
> Terje Mathisen wrote:
>
>> JJ wrote:
>>> Again that damned Memory Wall, newer cpus only seem to get faster for
>>> problems that have high locality of reference say video codecs
>> Video codecs???
>>
>> Have you taken a look at h264 (HD-DVD/Blueray), with the CABAC coding?
>>
>> 1) Motion vectors can go pretty much anywhere, to any previously decoded
>> image: This means huge memory buffers! (NVidias recent announcement said
>> that you need a dualcore cpu and 1 GB ram just to decode HD-DVD, even
>> with presumably as much hw assist as they could get out of the GPU.)
>
> I strongly doubt that anyone would make a HD-DVD or BR disk with these
> requirements. After all, the stand-alone players won't have a GB of memory
> (more 256MB, like the PS3).

A HW player will never have to fall back on multi-core decoding, where
each cpu core starts at a separate key frame. Doing this is the
obvious/easy way to handle the worst case videos in SW, but it
immediately doubles the size of the frame buffers.

In MPEG-2 any given frame could never refer to more than two reference
frames, and these were clearly marked in the video stream, so that the
decoder knew that it had to buffer them. H264 can refer to more or less
anything previously decoded, including sub-blocks previously decoded in
the current frame itself. I.e. no (easy) way to run two cores on the two
halves of a frame, except for Blueray which slices each image into
separately encoded sub-frames.

> I can decode 1920x1080 h264 trailers almost realtime with the most recent
> MPlayer on my 2GHz Athlon64 (754 chip, AGP graphics card), and AFAIK the
> transfer via AGP takes a considerable amount of time.

As Chih-Chung noted, the real problem is high bitrate streams,
particularly if you also have relatively low resolution: This means a
_lot_ of multi-bit tokens using single-bit decoding, with square law or
worse increase in decoding work. I've seen a graph showing CABAC
decoding alone becoming 60-80% of the total work. :-(

For video streams where motion vectors can handle nearly all the
encoding, there's very few residual bits needed (or available!) to
correct erroneous pixels, but as the bit rate goes higher things goes
downhill pretty rapidly.

>> 2) The actual bit-for-bit decoding is horrible, with extreme dependency
>> chains, branching that by design has to be nearly impossible to predict
>> (otherwise you should have compressed better, right?).
>
> Yes - but I'm not sure if you actually need branching to decode it. A state
> table might be worth a try, dependent loads are significantly faster than
> dependent branches.

That is indeed possible, but quite hard to do in 32-bit mode: Register
pressure becomes horrible.

>> The truth of the matter is that this _very_ important video codec is
>> tailormade to be extremely hard to handle on a modern cpu. :-(
>
> Apparently.

Yeah, sort of the opposite of AES which had effective sw implementations
as a requirement.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"


Date: Tue, 21 Nov 2006 13:57:05 +0100
From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: Reasons for the big paradigm switch
Message-ID: <2m9b34-m99.ln1@osl016lin.hda.hydro.com>

Bernd Paysan wrote:
> Chih-Chung Chang wrote:
>> (I do x86 software HD-DVD/Blu-Ray/H.264 players for a living)
>>
>> I think you will find the requirements pretty common among PC software
>> players. The reason is for users who can afford HD/BD drives now, 1GB
>> RAM is not too costly (and the RAM can be used for other applications).
>
> Well, that's certainly true ;-). But when you assume that making the PS3 BD
> drive costs about $80, wouldn't you expect cheaper PC BD drives in the near
> future (when the blue laser availability problem is solved)?
>
>> Stand-alone players need to optimize more for memory because it directly
>> affects their component cost.
>>
>> Another reason is software decoders have no guaranteed decoding time
>> for each frame, so large buffers are used for smooth out the jitter.
>
> Get a real time operating system ;-). Honestly: I wonder why the major
> operating system developers (be that Bill Gates and his horde or Linus
> Torvalds) don't understand a bit about real-time, even though their
> operating systems are used as media player, which have pretty hard
> real-time requirements.

I agree. There have been some good news about getting stuff from
RT-Linux into the main kernel afair.

>>> I can decode 1920x1080 h264 trailers almost realtime with the most recent
>>> MPlayer on my 2GHz Athlon64 (754 chip, AGP graphics card), and AFAIK the
>>> transfer via AGP takes a considerable amount of time.
>> 1920x1080 is no problem. The problem is high bitrate. Try something
>> with 20Mbps or more (The maximum of Blu-Ray is 40Mbps).

BR splits each frame into (afaik) 4 slices, which means that it is easy
to have up to four cores working in parallel, HD-DVD @ 30 Mbit is
probably the absolute worst case when you do have multiple cores.

> The trailers available on e.g. Apples web-site usually are no more than
> 10Mbps.
>
>> Yes, it is possible to do it without branch. ffmpeg has a branchless
>> version of CABAC which uses cmov. It is faster on most x86, but still
>> takes a lot of time.
>
> I'm thinking about something different than using cmovs, more in the line of
> a fast huffman decoder. If you have a static tree to decode, you can create
> a table that is indexed by state, next x bits of the input stream, and
> there you'll find the next state and a bit string to insert into the output
> stream. No (conditional) branching, no limit to decode several tokens in
> one go. The time limiting factor is the dependent load, since the next
> address depends on the state loaded in the previous go. You might want to
> fit the table in the L1 cache.

This is where you go wrong: :-(

The CA in a CABAC decoder stands for Context-Adaptive, which in this
case means that until you've completely decoded a bit and possibly
(usually!) taken a branch that depends on the value of it, do you get to
figure out which context to use for the next single-bit decoding step.

Ouch!

There are some places, esp. when working with big multi-bit values,
where the context index(which points to 7 bit value) stays constant for
a while, but even so the minimum state you have to carry around consists
of 9+9=18 bits, which makes for a pretty big lookup table.

When you bypass the CABAC update logic, which is done for particularly
long tokens, you need an extra input bit as well before you can figure
out the value of the next bit to return.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"


From: Maynard Handley <name99@name99.org>
Newsgroups: comp.arch
Subject: Re: Reasons for the big paradigm switch
Message-ID: <name99-0511E3.11401821112006@localhost>
Date: Tue, 21 Nov 2006 19:40:20 GMT

In article <57ab34-5d9.ln1@osl016lin.hda.hydro.com>,
 Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:

> Maynard Handley wrote:
> > What is especially tragic is that the arithmetic coding didn't have to
> > be so retarded. The arithmetic coding in JPEG2000 is (relatively) very
> > CPU friendly. But the H264 guys made some minor changes that have major
> > effects on how it can be implemented. The standard tragedy of refusing
> > to believe that anyone else might have done something better than you
> > can, and so you should at least look at their ideas.
>
> What did they do differently?
>
>  From banging my head at h264 I would guess that blocking together
> tokens with the same context could be a big win?
>
> Please tell!
>
> Terje

It's three years since I dealt with this stuff. If you are interested in
details, I'd recommend looking at the chapter on the mechanics of how to
decode arithmetic coding in the big JPEG2000 book.
I had more than one complaint, but the one I recall right now had to do
with fussy details of a branch. Basically the way things were defined in
JPEG2000, you could perform a test (which would determine the direction
of the branch) then perform all the other various calculations to update
the state machine, then take the branch. x86 branching is all over the
place these days, but on PPC this matched the condition-register
architecture very nicely. In CABAC that same branch direction was
defined as requiring the results of the updated state machine before you
knew the direction, thus preventing this sort of minor league
parallelism.

As I say, there were other problems (including an outright bug in the DS
that I caught), but they're all lost in the mists of time.


Date: Wed, 22 Nov 2006 10:34:59 +0100
From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: Reasons for the big paradigm switch
Message-ID: <g7id34-b93.ln1@osl016lin.hda.hydro.com>

Maynard Handley wrote:
> It's three years since I dealt with this stuff. If you are interested in
> details, I'd recommend looking at the chapter on the mechanics of how to
> decode arithmetic coding in the big JPEG2000 book.
> I had more than one complaint, but the one I recall right now had to do
> with fussy details of a branch. Basically the way things were defined in
> JPEG2000, you could perform a test (which would determine the direction
> of the branch) then perform all the other various calculations to update
> the state machine, then take the branch. x86 branching is all over the
> place these days, but on PPC this matched the condition-register
> architecture very nicely.

Even in x86 this works well: With any post-Pentium CPU the OoO core will
be able to evaluate the branch direction early, directly reducing the
latency of each evaluation. To get this benefit you only need to inline
the core cabac decoding step (something which you'll probably do
anyway), since otherwise the forced RET branch could stall the cpu a few
cycles later anyway.

I think I've worked out that the minimum latency of a single-bit cabac
decode is on the order of 10 cycles in 32-bit x86 mode:

This is the time from when you know which context to work with (which
isn't known until you've decoded the previous bit), until you've
determined what the current bit will be and used it in a branchless
manner to update the dependent variables.

> In CABAC that same branch direction was
> defined as requiring the results of the updated state machine before you
> knew the direction, thus preventing this sort of minor league
> parallelism.

Yeah, this sucks. :-(

Terje

PS. I've argued for having Altivec permute in x86 for several years now,
it seems getting Apple on board have finally caused PSHUFB to be
defined, as a not-quite-as-good alternative.

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"


Date: Thu, 22 Oct 2009 08:26:02 +0200
From: Terje Mathisen <Terje.Mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: Is it time to stop research in Computer Architecture ?
Message-ID: <fJKdndwZFJtnZ0LXnZ2dnUVZ8h-dnZ2d@lyse.net>

Andy "Krazy" Glew wrote:
> That's the whole point: you want to get as many cache misses outstanding
> as possible. MLP. Memory level parallelism.
>
> If you are serialized on the cache misses, e.g. in a linear linked list
>
> a) skip ahead to a piece of code that isn't. E.g. if you are pointer
> chasing in an inner loop, skip ahead to the next iteration of an outer
> loop. Or, to a next function.

Andy, you really owe it to yourself to take a hard look at h264 and
CABAC: In approximately the same timeframe as DES was replaced with AES,
with a stated requirement of being easy to make fast/efficient on a
PentiumPro cpu, the MPEG working groups decided that a "Context Adaptive
Binary Arithmetic Coder" was the best choice for a video codec.

CABAC requires 3 or 4 branches for every single _bit_ decoded, and the
last of these branches depends on the value of that decoded bit.

Until you've made that branch you don't even know which context to apply
when decoding the next bit!

(I have figured out workarounds (either branchless code or making them
predictable) for most of those inline branches in the bit decoder, but
that last context branch is unavoidable.)

The only possible skip ahead is a really big one: You have to locate the
next key frame and start another core/thread, but this approach is of
extremely limited value if you are in a realtime situation, i.e. video
conferencing.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


Date: Fri, 23 Oct 2009 08:28:10 +0200
From: Terje Mathisen <Terje.Mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: Is it time to stop research in Computer Architecture ?
Message-ID: <nuydnULPL-Vn0XzXnZ2dnUVZ8iKdnZ2d@lyse.net>

Andy "Krazy" Glew wrote:
> Terje Mathisen wrote:
>> Andy "Krazy" Glew wrote:
>
>> Andy, you really owe it to yourself to take a hard look at h264 and
>> CABAC: In approximately the same timeframe as DES was replaced with
>> AES, with a stated requirement of being easy to make fast/efficient on
>> a PentiumPro cpu, the MPEG working groups decided that a "Context
>> Adaptive Binary Arithmetic Coder" was the best choice for a video codec.
>>
>> CABAC requires 3 or 4 branches for every single _bit_ decoded, and the
>> last of these branches depends on the value of that decoded bit.
>>
>> Until you've made that branch you don't even know which context to
>> apply when decoding the next bit!
>>
>> (I have figured out workarounds (either branchless code or making them
>> predictable) for most of those inline branches in the bit decoder, but
>> that last context branch is unavoidable.)
>>
>> The only possible skip ahead is a really big one: You have to locate
>> the next key frame and start another core/thread, but this approach is
>> of extremely limited value if you are in a realtime situation, i.e.
>> video conferencing.
>>
>> Terje
>
>
> I have looked at CABAC, and you are right, it seems to be the branch
> equivalent of chasing down a hash chain.
>
> It is also the equivalent of good online compression (well, duh) and
> encryption, where every bit depends on all previous bits (and possibly
> some or alll future bits.
>
> And if you can't skip to the next independent chunk of work - if there
> is no independent work to skip to - you are screwed. You have to make
> the dependent stuff run faster. Or do nothing at all. You make the
> dependent stuff run faster by architectures that make sequential code
> run faster - by having faster ALUs or, if it is important enough, by
> having dedicated hardware. Is CABAC important enough?

It is almost certainly important enough that anything remotely
power-sensitive will need dedicated hw to handle at least the CABAC part.

> E.g. Terje, you're known to be a Larrabee fan. Can you vectorize CABAC?

Not at all, afaik.

> I'm not opposed to making sequentially dependent stuff run faster. I'm
> just observing that, if device limitations get in the way, there are
> lots of workloads that are not dominated bny sequentially dependent
> stuff (at the fine grain).
>
> As for CABAC, I must admit that I have some hope in algorithmic
> techniques similar to those you were recently discussing for
> parallelizing encryption.
>
> For example: divide the image up into subblocks, and run CABAC on each
> subblock in parallel. To obtain similar compression ratios you would

This is the only silver lining: Possibly due to the fact that they were
working on PS3 at the time,Sony specified that Bluray frames are all
split into 4 independent quadrants, which means that they could
trivially split the job across four of the 7 or 8 cell cores.

This also reduced the size of each subframe, in 1080i, to 256 K pixels. :-)

> have to have keyframes less frequently. Bursts at keyframes possibly
> could be avoided by skewing them.
>
> Moreover, I remain a fan of model based encoding. Although that requires
> significantly more computation, it is parallel.

OK.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From: Terje Mathisen <terje.wiig.mathisen@gmail.com>
Newsgroups: comp.arch
Subject: Re: Is it time to stop research in Computer Architecture ?
Date: Fri, 23 Oct 2009 23:59:59 -0700 (PDT)
Message-ID: <591da682-3d9f-4d9b-b598-8581c04a499d@l34g2000vba.googlegroups.com>

On Oct 23, 1:45 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
> Andy "Krazy" Glew wrote:
> > E.g. Terje, you're known to be a Larrabee fan.  Can you vectorize CABAC?
>
> Not a chance.
>
> > For example:  divide the image up into subblocks, and run CABAC on each
> > subblock in parallel.
>
> Problem is with the standard. H.264 specifies that the frame is CABAC
> encoded.

Not quite:

H.264 defines two alternate encoding schemes, of which CABAC gets the
better compression, but it is fully compliant to use the other (I
don't remember the name of it) if the encoder wants to.

However, since a decoder has to be able to handle CABAC as well, that
limits the maximum bitrate that you can support in sw.

Terje

Index Home About Blog