Index Home About Blog
From: David Chase <chase@world.std.com>
Newsgroups: comp.compilers
Subject: Re: Memory allocator performance in C/C++ compilers
Date: 25 Jan 1999 21:55:32 -0500
Keywords: storage, performance

ed mckenzie wrote:

> Am I missing something? Are there other things influencing allocator
> performance besides block size, number of blocks, and allocation
> pattern? Or is VC++'s memory allocator just slower than the other
> compilers?

Long, long ago, this paper was written comparing various memory
allocators:

AUTHOR = "Norman R. Nielsen",
TITLE = "Dynamic Memory Allocation in Computer Simulation",
JOURNAL = cacm,
YEAR = 1977,
VOLUME = 20,
NUMBER = 11,
MONTH = nov,
PAGES = {864--873}

and I have seen little better since then (Cartesian trees have been
invented since them, there's been some more work with the buddy system
since then, there's been more measurement since then).

The short answer is that you do quite well with multiple free lists,
one per range of sizes, for small objects.  For large objects (>= 4k,
that typically take enough time to initialize that you can afford to
think a little harder about their allocation) you might try something
like a buddy system allocator, or a first-fit-address-ordered free
list with coalescing on free.  Little objects can be allocated in a
Big Bag Of Pages, meaning that you know their size because of the page
that they are in.  The free list for big objects should NOT, NOT, NOT
be threaded through the objects themselves, on account of this leads
to atrocious VM behavior when you search the list, or when your
good-citizen program decides to free the 72Mb of storage that it
allocated 4k at a time.

When picking free lists sizes with multiple free lists, there is a
tradeoff in how you might waste memory.  Given that you are allocating
a minimum of one page per list size, there is a risk of allocating an
entire page for a single objects, for several different sizes.  On the
other hand, if the request is for 12 bytes, you really ought not round
it up to 64.  In practice, what I have done selected by default is
powers-of-two plus the inbetween sizes if any exist.  Thus, you get

   8, 16, 24, 32, 48, 64, 96, 128, 192, 256, etc

except that you need to keep in mind that you are placing these in
batches from a fixed-size page.  Thus, if the page size is 4096, then
the next size in the sequence is not 384, it is 408.  The reason is
that if you can only store 10 objects on a page, it should be the 10
largest possible objects, and 408 is the largest object that you can
get 10 of on a 4096-byte page.

You can also do reasonably well (based on benchmarks I did years ago
playing with this) with slightly adaptive algorithms that sample the
free-list-empty events to see if perhaps you should customize a free
list, where the adaptive algorithm performs a similar calculation; if
you find that you are allocating a great many 416-byte objects (9 per
4096-byte page) then you might consider creating a 448-byte bucket
(largest multiple of eight that goes into 4096 9 times).

Also, for small sizes, use direct table lookup to determine the free
list to use.  Thus, a typical allocation looks like:

  void * alloc(unsigned int size) {
     int which_list;
     /* convert byte size to doubleword size */
     size = (size + 7) >> 3;
     if (size < BIG) {
        struct list ** ptr_to_head, *item;
        which_list = size_to_list[size];
        ptr_to_head = lists + which_list;
        head = * ptr_to_head;
        if (head != 0) {
           * ptr_to_head = head -> next;
           return (void *) head;
        } else { /* REFILL LIST */
        }
     } else { /* BIG */
     }
  }

So, in terms of typical instruction count, there is

  prologue
  add, shift, compare, conditional-branch, indexed-load, add, load,
  compare, condititional-branch, load, store
  epilogue

However, this code is not thread-safe.  To make it thread-safe, you
either add locks in before loading "head" and storing back the new
pointer-to-head, where you use one lock per free list, or you do
something fancier.  On a Pentium Pro, I am told that a bus-locking
instruction costs you about 25 cycles, so add 50 cycles to the cost of
the sequences above.  Other processors might be able to do it more
efficiently (with either load-linked/ store-conditional or with one of
those combining-network busses to memory).

Also, if you have a library-aware compiler, you can inline some or all
of that code, depending upon whether the underlying allocator is
adaptive-by-size or not and whether the (often constant) allocation
size can be improved anyway.  In that case, you might see an inline
allocation of the form:

        ptr_to_head = lists + Constant_determined_by_size;
        head = * ptr_to_head;
        if (head != 0) {
           * ptr_to_head = head -> next;
           result is (void *) head;
        } else { call refill routine }

Again, not thread safe.

David Chase
NaturalBridge LLC


From: torek@elf.bsdi.com (Chris Torek)
Newsgroups: comp.lang.c
Subject: Re: Problems with malloc and free???
Date: 9 Apr 1999 12:00:51 -0700

In article <370d6365.0@newsvr.cyberway.com.sg> The Sorcerer
<vagre@cyberway.com.sg> writes:
>... from "Advanced C Programming by Example" by John W. Perry :
>                                            "..... and (2) the malloc() and
>free() functions are often poorly written. ...

I think this is an overstatement.  Certainly some variants of
malloc/free may be better than others, but that suggests an obvious
question: "better for what?"

Knuth is, as usual, a good place to start examining the tradeoffs
presented by different allocation systems.  One can write extremely
space-efficient allocators that run egregiously slowly (e.g., taking
exponential time).  At the opposite end of the spectrum, the 4.3BSD
"Caltech malloc" was incredibly *fast* -- any allocation or free
ran in, essentially, linear time -- but was horribly wasteful of
space.  It worked by rounding the requested size up to the nearest
power of two:

	void *malloc(size_t size_requested) {
		int size_index;
		struct free_list *p;

		if (size_requested < MIN_SIZE)
			size = MIN_SIZE;	/* for free list bookkeeping */
		/* ceil(log2()) is the slowest part of the allocator */
		size_index = ceil(log2(size_requested));
		if ((p = free_list[size_index]) == NULL)
			p = get_more_space_from_OS(size_index);
		if (p != NULL)
			free_list[size_index] = p->next;
		return p;
	}

The number of free lists is always manageable, e.g., under 32 (on
a 32-bit system) or 64 (on a 64-bit system) -- but a request for,
say, 132000 bytes turns into an allocation of 262144 bytes,
effectively "wasting" 130144 bytes.  (This code ultimately also
behaves poorly VM-cache-wise, once the free lists for small size
blocks are scattered across much of the address space.  Of course,
this applies to *any* free-list-oriented scheme, unless the lists
are kept sorted by virtual address, which requires CPU time.)

Sun included this version of the allocator in some versions of
SunOS, then later switched to a slow but space-efficient flavor of
malloc() when people kept running out of virtual memory.  I think
they have since switched to an intermediate allocator, deciding
they had veered too far in the opposite direction.

(FreeBSD and BSD/OS now use an allocator coded by Poul-Henning Kamp.)

In short, the problem here is that whoever implements malloc() must
trade off any number of unknown-in-advance requirements:  Should
malloc conserve space?  Should it trade space for speed (possibly
including arranging allocated blocks on staggered cache lines)?
Should it arrange to give space back to the OS whenever possible?
Should it be tuneable?  Note that any code that allows tuning must
occupy space itself, and will typically make everything at least
a little bit slower.
--
In-Real-Life: Chris Torek, Berkeley Software Design Inc
El Cerrito, CA	Domain:	torek@bsdi.com	+1 510 234 3167
http://claw.bsdi.com/torek/  (not always up)	I report spam to abuse@.

Index Home About Blog