Self-modifying code(John R. Mashey, John Reiser, Dennis Ritchie)

Index Home About Blog

From: mash@mash.wpd.sgi.com (John R. Mashey)
Subject: Re: self modifying code in OS's

1) Some early machines required locally self-modifying code
	for various reasons.

2) Many popular machines (S/360s under various OS's, X86 under MS/DOS, etc)
allow self-modifying code, sometimes without *any* notice to the hardware:
	S/360: (syntax may be a little off, it's been a while)

	(length-1 in R1)
	STC	R1,X+1		store length 
X:	MVC	0(0,R3),0(R4)	length came from preceding instruction
				don't even need X86-branch for warning!

3) Many OS's (UNIX, VMS, etc) generally run the code read-only,
however, there are plenty of *real* applications that:
	a) Generate code on the fly
	b) Read in pre-compiled code and call it

4) Most current RISCs, at least, from Day 1 have made it clear to
user-level code that it *must* call some function(s) when it writes
data (or reads data outside) that it now thinks is code and expects
to execute it, so that either the function directly, or by doing a system call,
can do whatever it takes to flush the relevant piece of Icache (for
example), or do whatever else it has to do.
Note, of course, that this can get fairly "exciting" in some cache-coherent
SMP systems.... and what has to be done depends a lot on the specific
cache design, which often varies among family members.  There are also
potentially weird cases if you use virtual caches and then permit
unconstrained multiple virtual-physical mappings (i.e., generating code
into a shared-memory arena).

5) In the simplest sequential implementation (fetch instruction 1 from
memory, decode it, execute it, fetch instr 2 .... etc) there is little or
no penalty for allowing the STC/MVC case above.

6) The more pipelined, and more low-level microparallelism you get,
the *less* you want to have self-modifying code, and the more it is
likely to cost you to support it. So, in particular, any of the following
need to be worried about:

instruction pre-fetch buffers
decoded instruction caches [ugh, may have to flush big chunk: suppose
modifying instruction changes one instruction into two, which is quite
possible, although strange]
deeper pipelining
superscalar issue [consider that a 4-issue machine with a 4-deep pipeline
	often could have 16 instructions in progress]
dynamic (out of-order) issue (now you can be even further out, and have
	even executed instructions, not just issued them
multi-level branch prediction

I.E. ALL THE THINGS PEOPLE ARE DOING ARE BAD FOR SELF-MODIFYING CODE...
and in particular it is really unwise to expect that it will be generally fast
to generate/modify code in small increments on the implementations you're
likely to be seeing over these next few years.  There may be specific
cases that you can get away with ... but watch out.

7) Suggestion: a *good* technique is to generate/load code in big chunks,
thus amortizing any of the flushing/synchronization overhead
(whether done by instructions at user level or by system calls, there is
still plenty of cache-line thrashing and potential cache-coherency
transactions, i.e., you may well spend 100s of cycles to change 1 word...
so don't do flushes a word at a time, or you'll be sorry...

A technique that I think works on most systems, even when you're generating
code in small pieces at a time to be executed quickly, and that tends to work
on most cache designs is:

	a) Allocate a good-sized chunk of memory as a code-arena.
	b) Do whatever has to be done to flush that chunk from the Icache(s),
	thus guaranteeing no stale instructions in the I-cache.
	c) Within the code-arena, generate each small chunk of code,
	and then you *should* be able to execute it "quickly" without
	further flushes.  (you may have to do something to make sure that
	writes-in-progress have gotten far enough into the cache hierarchy
	that the immediately following I-cache miss is fulfilled from
	the changed data.)
	d) It would be wise to start each piece of generated code on
	the maximum cache-line-size boundary you expect ...
	e) Rather than over-writing chunks of code directly, keep writing
	new ones to the end of the arena, then go back to b), & flush the
	whole thing again, then re-use the empty pieces.

It would probably be a service to the industry if different vendors would
post in this newsgroup:
	a) What they require from user-level codes.
	b) What function calls are names and what they do...
It is too much to hope that the same names are used, or do exactly the same
thing.... :-)

IRIX provides:

#include <sys/cachectl.h>

     int cacheflush (void *addr, int nbytes, int cache);

Flushes contents of indicated cache(s) for user addresses in the range
     addr to addr+nbytes-1).  The cache parameter may be one of:

     ICACHE           Flush only the instruction cache

     DCACHE           Flush only the data cache

     BCACHE           Flush both the instruction and the data cache

Also related is:

     cachectl - mark pages cacheable or uncacheable

     #include <sys/cachectl.h>
     int cachectl (void *addr, int nbytes, int op);

DESCRIPTION
     The cachectl system call allows a process to make ranges of its address
     space cacheable or uncacheable.  Initially, a process's entire address
     space is cacheable.

     The op parameter may be one of:

     CACHEABLE        Make the indicated pages cacheable

     UNCACHEABLE      Make the indicated pages uncacheable

SUMMARY: 
	1) Don't do self-modifying code.
	2) Use on-the-fly generated code carefully.
	3) Likely directions in computer design are going the other way... 

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-390-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 7U-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Dynamic Recompilation (was Re: what's self-modifying code?)
Date: 1 Sep 1995 20:44:42 GMT
Organization: Silicon Graphics, Inc.

In article <421cp3$nvl@gap.cco.caltech.edu>,
toddpw@mince.ugcs.caltech.edu (Todd P. Whitesel) writes:

|> However, disallowing self-modifying code prohibits those high-performance
|> emulators that dynamically recompile code so it can be executed directly
|> instead of interpretively simulated.

|> Or would they? Is this technology important enough to warrant architectural
|> pain and suffering? Are there "good" ways that architectures can support it?

Following assumes a long-lived general-purpose architecture, with
mtuliple implementations at the same time and over time, reprogrammable,
and with efficient upward compatibility strongly desirable.

1) If you start with a simple sequential design and no caches, you can
have "store into next instruction" work with no cost.  (On S/360s, I
once wrote code that kept some flags as branch-condition-mask fields
inside branch instructions, and just changed them from branch-always to
nops, or back.)

If you ever allowed code like this, you'll support it for a long time.
It is quite possible to support it, but it may:
	a) Restrict the range of micro-archiectures & cache hierarchies
	b) Or may have serious hardware costs.
	c) and some cases may surprise you.

Maybe folks who've designed IBM or Amdahl mainframes might comment,
since they've had high-performance designs for years that do handle
this problem.

2) But, it may have serious implications on later designs that you do.

3) Therefore, it is *really* a good idea, from day one, to provide a
user-level API for a) flushing I-cache, or b) saying that a region of
memory that was instructions or data is now over-written with new
instructions, etc ...
It may be that:
	a) The API invokes simple user-level code.
	b) The API sometimes invokes system calls.
It is a good idea if the API is something implemented with dynamic shared
libraries, not bound into the binaries, since it may be different on
different machines.  It is really important that the operating systrem be
able to find out that the user code is doing this, if neeeded.

4) Following are a few examples of cases that come up, assuming the
worst possible thing, which is expecting
	store into next instruction
to work like it would on the simplest sequential design.

a) Uniprocessor, joint cache
	This is easier than split I&D caches, of course.
	On the other hand, if the CPU is pipelined, no instruction had better
	commit any state very early, i.e., specifically, at the time you
	compute the store address, you need to check against the addresses
	of instructions in the pipeline, and be prepared to flush everything
	in the pipes.  [Likely choice: some simple, but safe check ... which
	doesn't result in a flush very often. Probably not too bad in
	a large chip-count CPU; in some single-chip designs, may be awkward
	due to layout reasons, or may not.]
	If you have an instruction queue, rather than direct I-fetch-decode,
	you have to check versus the queue.

b) Uniprocessor, split I&D cache.
	As above, except you need to make I & D caches coherent, flushing any
	cache line from the I-cache if you store into the D-cache.  Since the
	whole point of split I- and D-cache is to get more bandwidth, some
	of the simple solutions are distasteful ... but you could:
		1) Have a duplicate set of I-cache tags and check them
			in parallel with D-cache tag. (die space: ugh).
		2) Have an extra bit per D-cache line that indicates
			that the I-cache might also have this line.
			You need to set this bit right on any refill of
			either I-cache or D-cache, so it might cost you
			a cycle on refill, but less die space.

c) Multiprocessors, split I&D caches
	Here, you not only need to keep the I&D cache on-chip coherent ...
	the hardware needs to keep all of the I-caches coherent ...
	That is, all sorts of cases in accessing the data cache need to let
	the *other* I-caches about it.

On some CPU and OS designs, you might find ways to make self-modifying
code work without bothering to tell the OS, maybe.
For example, suppose you have:
	1) uniprocessor
	2) Split I&D caches, no coherency between them
	3) They are direct-mapped
	4) To give user code a chance, they are non-hashed, virtual indexed,
		virtual-tagged.
	5) D-cache is write-back.
 Now, you can:
	a) Allocate an arena into which you will generate code, as
	big as you like.
	b) Execute a linear set of nops equal to the I-cache size,
	thus assuring that nothing that is in the code arena is also present
	in the I-cache.  [Of course, this won't work if you change 3) above
	much, i.e., if it's physically-indexed with TLB in front of the cache,
	the OS may map things in different places.  If the caches aren't
	direct-mapped, simple version doesn't work either.
	c) Generate code into the arena.
	d) Make sure the newly-written code (in the D-cache) is pushed back to
	memory, by issuing 1 read/cache line to memory locations that will
	alias with the just-written cache lines and thus force them out.

Of course ... when you now move this software to the SMP that you'd likely
build with the same CPUs (coherent D-caches, incoherent I-caches) ...
your code dies horribly, with wonderfully erratic results, unless the SMP
has some directive to lock this task on exactly one CPU forever...

On the other hand, even with hardware-noncoherent I-caches, an OS has
to have some way internally to clear the I-caches for page-in of code,
and if there's a user-level API to get to that, you get something
that may be simple for a uniprocessor, and more complex for an SMP,
but still works...

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-390-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Self-modifying code (Was: Instructions in PDP-10 registers 
	(was...))
Date: 22 Jan 1998 22:04:18 GMT

In article <6a4vjk$cu1@news.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

|> > Easy to laugh about now; major computer architecture back then.

|> In the last few years, run-time code generation has been a hot topic
|> at compiler conferences. For the fastest run-time code generators,
|> putting the instructions together is a bottleneck, so they would
|> benefit from instructions for doing this. I would not laugh about
|> this.

In general, yes, but it is not clear that "instructions" alone are
the right way to deal with this problem, which is is actually one of the
more difficult computer architecture problems around...  and generally
still not exactly solved very satisfactorily (in my opinion).
The SPARC V8 spec was one of the better ones to spec the right thing without
implying too many constraints onthe hardware, but there are still
performance surprises possible.

Consider the layers of the problem:

1. User program
1.1 Wants to generate large masses of code
OR
1.2 Wants to generate/overwrite individual instructions

(Both cases legitimately occur; I recall John Reiser's slick on-the-fly
code for 68K bitblt in the Bell Labs Blit terminals in the early 1980s;
or Peter Deutsche's 68K Smalltalk code; some DBMS programs have explicitly
read in code segments and branched to them; operating systems have to
do that; JAVA; on-the-fly emulation assists; etc.
Unfortunately, optimal strategies often require that the program's intent
be communicated in ways that are very unportable, or worse, simply
don't have any APIs available.

2. API available to user-program to do this

3. User-level instructions that help

4. Operating system interfaces

5. Hardware system [uniprocessor versus SMP vs ccNUMA; I/O coherency or not;
	I-cache coherency, or not]

6. CPU and cache design [physical versus virtual, direct-mapped versus
	associative; levels; coherency; pipeline design]

Making a simple "Flush this line from an I-cache" instruction do the
right thing (under *all* the various necessary cases) is *astonishingly*
difficult... and often causes all sorts of massive wastes of effort in
order to assure flushing, or else substantial hardware that is mostly unused.

Anyway, Anton is right: nothing to laugh about :-)

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: self-modifying code (Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE)
Date: 8 Aug 1998 01:15:42 GMT

In article <35cb732d.1897603@news.teleport.com>, jreiser@teleport.com
(John Reiser) writes:

(Note: John authored one of the most wonderful pieces of on-the-fly-
generated code, the bitblt code for the original Bell Labs Blit terminal,
and the issue he raises is one fo the classic difficult problems).

|> My vote for one of the top 10 mistakes in computer architecture is
|> failing to provide explicit architectural means to turn data into
|> instructions, particularly at the "medium" scale of a dozen to several
|> hundred instructions. This is what the sort generator, or graphics
|> generator, needs to make on-the-fly "generate and execute" possible.
|> And if possible, it commonly pays off at a factor of several times
|> faster execution speed.
|>
|> For instance:  The first systems based on the MIPS R2000  had separate
|> instruction and data caches, the instruction cache did not snoop the
|> data bus for invalidates, and  there were no non-privileged
|> instructions to flush or invalidate caches or cache lines.  (There
|> were privileged instructions to do the work, but no existing API.)

Actually, the privileged case was pretty tough also: you had to
swap caches, and then do partial-word writes, then swap them back).

|> In order to be sure that newly-generated code made it into the
|> instruction cache, it was necessary to flush the instruction cache "by
|> hand" by executing instructions from 4096 consecutive  words of
|> memory, thus guaranteeing  that the 16KB direct-mapped instruction
|> cache did not contain any word of newly-generated code.  This overhead
|> was a very high price to pay, and negated much of the benefit of
|> medium-scale generate-and-execute.  Of course, the chip and systems
|> implementors viewed this as a reasonable tradeoff, but it made
|> generate-and-execute miserable.
|>
|> The Intel x86 architecture has a de facto "any taken branch" to flush
|> the instruction prefetch (and the on-chip instruction cache snoops the
|> data bus for invalidates, and external caches must be unified I+D),
|> but even this is not specified precisely.  (Does it include CALL, INT,
|> RET ?)
|>
|> What I would like to see is a non-privileged  instruction which takes
|> an address and turns the cache line containing that address into "new"
|> instructions.  The execution of the instruction should require no more
|> cycles than a cache miss.  (If a cache line is neither 4 nor 8 words
|> long, then some other way to specify the length probably will be
|> necessary.)

But, unfortunately, such instructions can have *astonishing*
implementation effects for the general case, i.e., where it has to work
across different cache designs.  Having tried to get this right for
years, I've finally come to believe that the only general hope is:
	a) Some user-level hardware help, but embedded into
	b) User-level dynamic-shared libraries that cooperate with the
	   kernel, and use the extra hardware when they can so they don't have
	   to make kernel calls.

P.S. I worried about this on R2000, but couldn't come up with anything better
that was sanely implementable in the allotted time, so it wasn't an
oversight, I just couldn't invent the right thing that fit.

Actually, this makes a nice quiz question for comp.arch:

1) Propose a user-level API that satisfies this need, and
a "reasonable" implementation, and analyze the implementation pitfalls.

2) Hints:
	- separate I & D caches are interesting
		tricks that work with unified caches don't.
	- set-associative caches are interesting, since some tricks that
	  	work with direct-mapped ones don't.
	- multiprocessors are very interesting
		tricks that work with uniprocessors don't
	- multi-threaded code may be interesting
	- keeping code efficient, but guarding against a context-switch
	  at exactly the wrong place is exciting...
	- Implementing a simple instruction that works, without horrific
	  implementation effects, is very hard.

(more later, off to dinner)

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: Dennis Ritchie <dmr@bell-labs.com>
Newsgroups: comp.arch
Subject: Re: self-modifying code (Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE)
Date: Sat, 08 Aug 1998 04:05:11 +0100

John R. Mashey quoted Reiser, and then remarked on the
"interestingness" and "excitingness" of dealing with
on-the-fly compilation in modern architectures:

> In article <35cb732d.1897603@news.teleport.com>, jreiser@teleport.com
> (John Reiser) writes:
>
> (Note: John authored one of the most wonderful pieces of on-the-fly-
> generated code, the bitblt code for the original Bell Labs Blit terminal,
> and the issue he raises is one fo the classic difficult problems).
>  --Mash

We've flirted with, embraced, and pulled away from the idea
of this cyclically over the years. (Ken Thompson's compilation
of code for recognizing regular expressions was one of
the early software patents (#3568156, 1971)) and Reiser's
bitblt was the first of several local implementations of the idea
for doing low-level graphics operations.

It is tough to make it the winner, probably because of the
mismatch between any kind of higher-level design and
the way things really work.

First, it can be hard to make it work even algorithmically.
Reiser's bitblt had some bugs that no one could quite figure
out how to fix (though this doesn't invalidate the idea,
and the end-conditions in bitblt are notorious; all current
browsers are a bit broken here).

More important, even though all remotely general architectures
need to have a way of putting code into memory and then
running it, doing this at small cost for small bits of
code can be hard in the presence of caching at various
levels.  Specifying a vaguely general API seems difficult.

Perhaps most important, it's hard to make a good marriage
between on-the-fly compilation and any kind of portability
even between this and next year's CPU chip, let alone across
platforms.  It's an altogether neat idea (cf. Java JIT compilers)
that will continue to be of use, but it is hard; beautiful
solutions on one platform today can turn out to be very
difficult to sustain on another platform.

	Dennis

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: self-modifying code (Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE)
Date: 8 Aug 1998 17:07:44 GMT

Note that the problem of self-modifying code is a classic:
	Code must do a lot of work, or much complexity must be added to
	hardware, or much hardware-dependency creeps into the code, in
	order to guarantee that an event that hardly ever happens,
	in fact, never happens.

For example, if a CPU has an overflow flag, but not overflow exception,
and you want to trap at any overflowing instruction, you'd have to
check the overflow flag explicitly, and you'd hardly ever find it set.

In article <01bdc259$755f0000$82dcb58f@aglew2>, "Andy Glew"
<glew@hf.intel.com> writes:

|> There's a flip side: you probably want to make the I-cache snoop DMAs,
|> so that new code read in from disk is coherent. Having to use I-cache
|> line flushing for newly read-in I/O data would be onerous, especially
|> since usually the lines won't be in the cache (but you have to be safe
|> anyway). Cache flushes that are aware of the size of the cache would be
|> desirable - a PALcode routine.
|> 
|> Now, you can argue that I/O snooping of the I-cache takes place at a
|> slow rate, whereas snooping of stores by other processors takes place
|> at a high rate. Is this enough to justify the assymmetry between the
|> I-cache being coherent to I/O data writes but not to CPU data writes?
|> Would writes by other CPUs be symmetric with I/O writes, or not?

Although it is certainly preferable that I/O be able to do the same
things as other CPUs (whether that is I-coherency as well as D-cache
coherency, or only D-cache coherency, or no coherency at all, which
has been done), the code-read case is *not* the problem, as there is a
fairly elegant solution available.

1) Assume no I/O coherence for code page-ins.

2) Assume multiprocessor design.

3) Observe that this event happens at disk rates, not CPU rates,
and in general, is only a small fraction of disk rates.

4) A fairly simple algorithm (in a UNIX context, anyway) can minimize the
number of actual cache-flushes for code-page-in; trickier versions
can use page-coloring if desired to avoid full I-cache flushes.
	a) Fundamental idea: at some point, flush all the I-caches.
	From then on, any page that you can be sure cannot have been
	used as code, is a candidate page for use to page-in code without
	a further i-cache-flush, and the trick is to maintain an
	adequate list of such pages with tolerable overhead.

	b) Keep a time-stamp for "last-time-global-i-flush".

	c) Whenever adding a page to a free-list, or otherwise putting
	it into any category where it cannot be executed as code without being
	notice by the OS, time-stampe the page.

	d) When you need a free page to do a page-in of code, select one
	whose time-stamp is older than the last I-cache flush, and if you
	can't find one, do another i-flush, and at the point, all of
	the following pages are now guaranteed not to be in the I-cache.
		- free-list
		- reclaim list
		- buffer-cache

At user-level, the analogous technique (which doesn't work for all of the
code-modification cases) is to allocate a large arena into which code
will be generated incrementally:
	a) Do whatever it takes to flush all of the I-caches.
	Note that it is a *bad* idea to allocate a huge arena, and then
	flush cache-line by cache line, addressing it through user virtual
	memory.  (if the arena is larger than the caches, you will do this
	redundantly...)
	b) Generate code incrementally, until you run out of space, then
	go back and flush again, so that any freed areas are now usable again.
--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

Index Home About Blog