From: Henry Spencer <henry@zoo.toronto.edu>
Newsgroups: comp.compilers
Subject: Re: Context sensitive scanner ?
Date: 28 Nov 1997 01:03:00 -0500
Keywords: lex

Mikael Pettersson  <Mikael.Pettersson@sophia.inria.fr> wrote:
>1. A simple low-level scanner generates a stream of tokens.
>2. A "transformation" scanner is a coroutine with an input stream
>   and an output stream... Push one transformation scanner for each
>   individual kind of transformation...
>This strategy was quite successful... Haven't seen anything like this
>described in the compiling literature, though.

I do not recall seeing the idea of pushing little coroutines, one per
syntactic blemish, before.  However, the idea that there might be more
than one layer of scanner is quite old.

When I was in grad school, about 20 years ago, one of my profs preferred
a model of a compiler in which the parser was preceded by two separate
phases:  the scanner (which partitioned the input text into tokens) and
the screener (which tidied up the token stream, doing things like deleting
non-significant white space and recognizing keywords).
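
As a rough sketch of that split (the token names, keyword list, and toy
input here are invented for illustration, not taken from any real
system), the screener is just a small filter over the scanner's output:

	#include <ctype.h>
	#include <stdio.h>
	#include <string.h>

	/* Invented token kinds for the sketch. */
	enum tok { T_SPACE, T_IDENT, T_KEYWORD, T_NUMBER, T_EOF };

	struct token { enum tok kind; char text[32]; };

	static const char *src = "if  x1 then 42 else y end";
	static const char *p;

	/* Layer 1, the scanner: partition the text into tokens.  It neither
	 * discards anything nor knows which identifiers are keywords. */
	static struct token next_raw_token(void)
	{
		struct token t = { T_EOF, "" };
		size_t n = 0;

		if (p == NULL)
			p = src;
		if (*p == '\0')
			return t;
		if (isspace((unsigned char)*p)) {
			while (isspace((unsigned char)*p))
				p++;
			t.kind = T_SPACE;
		} else {
			while (*p != '\0' && !isspace((unsigned char)*p)
			    && n < sizeof t.text - 1)
				t.text[n++] = *p++;
			t.kind = isdigit((unsigned char)t.text[0]) ? T_NUMBER : T_IDENT;
		}
		t.text[n] = '\0';
		return t;
	}

	/* Layer 2, the screener: tidy the stream -- drop white space and
	 * promote identifiers to keywords -- so the parser sees a clean one. */
	static struct token next_token(void)
	{
		static const char *kw[] = { "if", "then", "else", "end" };
		struct token t;
		size_t i;

		for (;;) {
			t = next_raw_token();
			if (t.kind == T_SPACE)
				continue;
			if (t.kind == T_IDENT)
				for (i = 0; i < sizeof kw / sizeof kw[0]; i++)
					if (strcmp(t.text, kw[i]) == 0)
						t.kind = T_KEYWORD;
			return t;
		}
	}

	int main(void)
	{
		struct token t;

		while ((t = next_token()).kind != T_EOF)
			printf("%s\t%s\n",
			       t.kind == T_KEYWORD ? "keyword" :
			       t.kind == T_NUMBER  ? "number"  : "ident", t.text);
		return 0;
	}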

I've also implemented several systems in which the scanner is followed by
an error-repair layer which interfaces the scanner to the (top-down)
parser, hiding syntax errors from the parser.  (One of the nice things
about a top-down parser is that it can tell you what tokens it thinks
might come next, and if what you've got isn't one of those, you can give
it what it wants and then invoke heuristics to try to resynchronize the
input with the parser... without the parser ever having to know.  The
parser always sees a syntactically correct program, and the semantic
routines intertwined with the parser never have to suddenly stop and rip
out everything they've done when the parser detects an error.)
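
A minimal sketch of such a repair layer (the token codes, canned input,
and grammar scrap are all invented here; the real systems were far more
elaborate):

	#include <stdio.h>

	/* Invented token codes for the sketch. */
	enum tok { T_ID, T_ASSIGN, T_NUMBER, T_SEMI, T_EOF };

	/* Stub scanner: a canned token stream in which the ":=" of the
	 * second statement is missing. */
	static enum tok stream[] = { T_ID, T_ASSIGN, T_NUMBER, T_SEMI,
	                             T_ID,           T_NUMBER, T_SEMI, T_EOF };
	static int pos;
	static enum tok cur;

	static void advance(void)
	{
		if (cur != T_EOF)
			cur = stream[pos++];
	}

	/* The repair layer: the parser asks for the token it needs.  If it
	 * is there, fine; if not, report the error, pretend it was present,
	 * and carry on.  The parser never learns that anything was wrong. */
	static void expect(enum tok want)
	{
		if (cur == want) {
			advance();
			return;
		}
		fprintf(stderr, "error repaired: inserted token %d\n", (int)want);
		/* Resynchronization heuristics would go here; for a simple
		 * missing token, inserting it and continuing is enough. */
	}

	/* A scrap of top-down parser built on the repair layer:
	 *	statement ::= ID ":=" NUMBER ";"                          */
	static void statement(void)
	{
		expect(T_ID);
		expect(T_ASSIGN);
		expect(T_NUMBER);
		expect(T_SEMI);
		/* semantic routines run here, seeing only "correct" input */
	}

	int main(void)
	{
		advance();			/* prime the first token */
		while (cur != T_EOF)
			statement();
		return 0;
	}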

A slight digression...  More recently, I've discovered that a split into
multiple layers can do wonders for parsing even simple-looking notations
like regular expressions.  Most RE implementations, including a couple I
wrote, just charge in and do it all in one layer... but splitting it into
two layers, with a distinct scanner underneath the parser, turns out to do
wonders for handling real RE syntaxes (which are rather messier than the
ones in the textbooks).  The extra leverage provided by the layer split
really quite surprised me; all kinds of nuisances like context-dependent
metacharacters (e.g., $ is magic only at the end of the RE) turn out to
require only a line or two of code... *if* it doesn't have to be embedded
in the middle of a parser, or *if* it can look ahead to the next token
instead of just the next character.
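
For instance, here is a made-up sketch (not the code of any of the
implementations above) of how cheap the "$" case becomes once the
parser sees tokens and can peek at the one that follows:

	#include <stdio.h>

	/* Invented token kinds an RE scanner might hand upward. */
	enum retok { RT_CHAR, RT_STAR, RT_DOLLAR, RT_END };

	struct tok { enum retok kind; int ch; };

	static const char *re = "a*$b$";	/* first $ literal, second an anchor */

	/* The scanner layer: classify the next piece of the RE as a token,
	 * either without consuming it (peek) or with (next). */
	static struct tok peek(void)
	{
		struct tok t = { RT_CHAR, *re };

		if (*re == '\0')
			t.kind = RT_END;
		else if (*re == '*')
			t.kind = RT_STAR;
		else if (*re == '$')
			t.kind = RT_DOLLAR;
		return t;
	}

	static struct tok next(void)
	{
		struct tok t = peek();

		if (t.kind != RT_END)
			re++;
		return t;
	}

	int main(void)
	{
		/* The parser layer: because it works on tokens and can look
		 * ahead one token, the context-dependence of "$" costs only
		 * a line or two. */
		struct tok t;

		while ((t = next()).kind != RT_END) {
			switch (t.kind) {
			case RT_DOLLAR:
				if (peek().kind == RT_END)
					printf("anchor at end of RE\n");
				else
					printf("literal '$'\n");
				break;
			case RT_STAR:
				printf("repeat previous item\n");
				break;
			default:
				printf("literal '%c'\n", t.ch);
				break;
			}
		}
		return 0;
	}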

Actually, I guess that wasn't a digression after all.  I think there's a
general moral to be drawn.  When you add a layer split, what you're buying
is separation of concerns:  the simple and narrow interface between the
layers hides the complexity of the previous or following layer.  That way,
code to deal with some oddity can see a simple environment, where a little
bit of code can have a lot of leverage.
--
If NT is the answer, you didn't                 |     Henry Spencer
understand the question.  -- Peter Blake        | henry@zoo.toronto.edu
--
Send compilers articles to compilers@iecc.com, meta-mail to
compilers-request@iecc.com.  Archives at http://www.iecc.com/compilers



From: Henry Spencer <henry@zoo.toronto.edu>
Newsgroups: comp.compilers
Subject: Re: Q: regarding regular grammars ...
Date: 28 Nov 1997 01:05:39 -0500
Keywords: lex

Michael Roach  <mcr@yuck.net> wrote:
>I thought regular expressions were regular grammars, is that assumption
>wrong?  And if so, could you explain the differences or point me to some
>references...

Regular expressions are regular grammars, written as "one-liners" by using
more powerful notation than the usual sort of grammar.  For example, you
can write "(A|B)C*(D|E)", or you can write:

	<phrase> ::= <modifiers> D
	<phrase> ::= <modifiers> E
	<modifiers> ::= <modifiers> C
	<modifiers> ::= <start>
	<start> ::= A
	<start> ::= B

which is a regular grammar of the strictest sort.  The two are equivalent,
although the transformation between them can be tedious.
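
To make the equivalence concrete: both forms describe the same
finite-state recognizer, which can be written down directly.  (A rough
sketch; the state names are invented, but they correspond roughly to
the grammar's nonterminals.)

	#include <stdio.h>

	/* Recognizer for (A|B)C*(D|E): START before anything is read, MODS
	 * once a <start>/<modifiers> has been derived, DONE for a complete
	 * <phrase>. */
	enum state { START, MODS, DONE, REJECTED };

	static int matches(const char *s)
	{
		enum state st = START;

		for (; *s != '\0'; s++) {
			switch (st) {
			case START:
				st = (*s == 'A' || *s == 'B') ? MODS : REJECTED;
				break;
			case MODS:
				if (*s == 'C')
					st = MODS;
				else if (*s == 'D' || *s == 'E')
					st = DONE;
				else
					st = REJECTED;
				break;
			default:	/* DONE or REJECTED: nothing may follow */
				st = REJECTED;
				break;
			}
		}
		return st == DONE;
	}

	int main(void)
	{
		const char *tests[] = { "ACCD", "BE", "AD", "CCD", "ACCDE" };
		size_t i;

		for (i = 0; i < sizeof tests / sizeof tests[0]; i++)
			printf("%s\t%s\n", tests[i],
			       matches(tests[i]) ? "matches" : "does not match");
		return 0;
	}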

(One note of caution:  "regular expressions" in computing practice have
wandered away from their theoretical roots to some degree, and sometimes
now include constructs which are not "regular" in the theoretical sense.)
--
If NT is the answer, you didn't                 |     Henry Spencer
understand the question.  -- Peter Blake        | henry@zoo.toronto.edu
--
Send compilers articles to compilers@iecc.com, meta-mail to
compilers-request@iecc.com.  Archives at http://www.iecc.com/compilers



From: Henry Spencer <henry@zoo.toronto.edu>
Newsgroups: comp.compilers
Subject: Re: Q: regarding regular grammars ...
Date: 30 Nov 1997 22:53:35 -0500
Keywords: lex

Karsten Nyblad  <karsten@tdr.dk> wrote:
>Yes, it is wrong.  Regular grammars are a superset of context free
>grammars (CFG).

By whose definition?  The definition of regular grammars that I learned,
a long time ago, is that they are CFGs in which all rules are of one of
two forms:

	N ::= T
	N ::= M T

where N and M are nonterminal symbols and T is a terminal symbol.  (It
makes no substantial difference if the second RHS is "T M"  instead.)
This is a severe subset of CFGs -- in fact, regular grammars are Chomsky's
type 3, where CFGs are type 2 and completely unrestricted grammars are
type 0.  Regular grammars are equivalent in power to classical textbook
regular expressions, which is how regular expressions got their name.

See, for example, pages 38-41 of the classic "Compiler Construction, an
Advanced Course", ed. Goos&Hartmanis, Springer-Verlag 1974.
--
Henry Spencer
henry@zoo.toronto.edu
--
Send compilers articles to compilers@iecc.com, meta-mail to
compilers-request@iecc.com.  Archives at http://www.iecc.com/compilers



From: Henry Spencer <henry@zoo.toronto.edu>
Newsgroups: comp.compilers
Subject: Re: Bottom-up versus Top-down
Date: 2 Dec 1997 12:09:26 -0500
Keywords: C, parse

Mark K. Gardner <mkgardne@cs.uiuc.edu> (or perhaps our moderator) wrote:
>[I did not witness the events first-hand, but it appears that LALR
>grew strength from languages, like C, that were difficult if not
>impossible to cast as LL grammars, while LL took root in the Pascal
>languages...

This is actually somewhat amusing, because the first C compiler in fact
used a top-down parser, and one of the early implementors of a bottom-up
C parser (Steve Johnson) complained in print that the language was designed
for top-down and was awkward to parse bottom-up!

Speaking as someone who has written a top-down parser for full ANSI C,
it's a little messy in spots, and there are a couple of places where you
need to cheat with a bit of extra lookahead, but it's neither impossible
nor immensely difficult.  Annoying, yes, but entirely feasible.
--
Henry Spencer
henry@zoo.toronto.edu
[Hey, the Ritchie C compiler wasn't entirely top down.  It did expressions
bottom up with a little operator precedence grammar. -John]


--
Send compilers articles to compilers@iecc.com, meta-mail to
compilers-request@iecc.com.  Archives at http://www.iecc.com/compilers



From: Henry Spencer <henry@zoo.toronto.edu>
Newsgroups: comp.compilers
Subject: Re: Bottom-up versus Top-down
Date: 30 Nov 1997 22:53:02 -0500
Keywords: parse

George C. Lindauer <gclind01@spd.louisville.edu> wrote:
>...Top down parsing is much easier to implement, but,
>bottom-up parsing is more efficient.  YACC and other code generators
>tend to generate the type of state tables required to do bottom-up
>parsing efficiently...

Exactly the same thing can be done for top-down parsing, although it's
less commonly seen.  There is no fundamental reason for there to be
any efficiency difference.
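
For instance, a predictive (LL(1)) parser can be driven by a table and
an explicit stack, much as a bottom-up parser is.  A toy sketch, with
an invented grammar (E -> T E',  E' -> + T E' | empty,  T -> id) and an
invented encoding:

	#include <stdio.h>

	enum sym { S_END, S_ID, S_PLUS,		/* terminals */
		   S_E, S_EP, S_T };		/* nonterminals */

	/* Right-hand sides, each terminated by S_END. */
	static const enum sym rhs_E[]   = { S_T, S_EP, S_END };
	static const enum sym rhs_EP[]  = { S_PLUS, S_T, S_EP, S_END };
	static const enum sym rhs_eps[] = { S_END };
	static const enum sym rhs_T[]   = { S_ID, S_END };

	/* parse_table[nonterminal][lookahead]: production to use, or NULL. */
	static const enum sym *parse_table[3][3] = {
		/*	    $	      id      +		*/
		/* E  */ { NULL,     rhs_E,   NULL   },
		/* E' */ { rhs_eps,  NULL,    rhs_EP },
		/* T  */ { NULL,     rhs_T,   NULL   },
	};

	static int parse(const enum sym *input)	/* input ends with S_END */
	{
		enum sym stack[64];
		int sp = 0;

		stack[sp++] = S_END;
		stack[sp++] = S_E;		/* start symbol */
		while (sp > 0) {
			enum sym top = stack[--sp];

			if (top <= S_PLUS) {		/* terminal or end marker */
				if (top != *input)
					return 0;	/* mismatch */
				if (*input != S_END)
					input++;
			} else {			/* nonterminal: use the table */
				const enum sym *prod = parse_table[top - S_E][*input];
				int n;

				if (prod == NULL)
					return 0;	/* error entry */
				for (n = 0; prod[n] != S_END; n++)
					;
				while (n > 0)		/* push the RHS in reverse */
					stack[sp++] = prod[--n];
			}
		}
		return *input == S_END;
	}

	int main(void)
	{
		enum sym ok[]  = { S_ID, S_PLUS, S_ID, S_END };
		enum sym bad[] = { S_PLUS, S_ID, S_END };

		printf("id+id: %s\n", parse(ok)  ? "accepted" : "rejected");
		printf("+id:   %s\n", parse(bad) ? "accepted" : "rejected");
		return 0;
	}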

The real dichotomy between the two is that top-down is less powerful
in terms of the grammars it can handle, but more powerful in terms of
what it can do in the way of supporting semantics etc.  The reason is
simply that top-down always knows the context of the current tokens --
what the higher-level constructs around them are -- while bottom-up
discovers this afterward.

Top-down is restricted to grammars where it is possible to determine
the context beforehand, where the nature of a construct can be
determined by inspecting its beginning.  For example, top-down with
the usual one token of lookahead has difficulty with some aspects of C
expression syntax, where "(" may introduce a parenthesized
subexpression [a + (b * c)] or a unary conversion [a + (unsigned) b],
and a top-down parser has to decide which way to go before seeing what
follows.  (The usual fix for this is to cheat slightly, adding a bit
more lookahead to see whether the next token is part of a type name or
not.)
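
A sketch of that slight cheat (the token representation and helper
names are invented here, not taken from any particular compiler):

	#include <stdio.h>
	#include <string.h>

	/* Canned token stream for the sketch: "a + (unsigned) b". */
	static const char *toks[] = { "a", "+", "(", "unsigned", ")", "b", NULL };
	static int pos;

	static const char *peek(void)    { return toks[pos]; }
	static const char *advance(void) { return toks[pos++]; }

	/* Could this token start a type name?  A real compiler also asks the
	 * symbol table about typedef names; a few keywords do for a sketch. */
	static int starts_type_name(const char *t)
	{
		static const char *kw[] = { "void", "char", "short", "int",
					    "long", "float", "double", "signed",
					    "unsigned", "struct", "union",
					    "enum", "const", "volatile" };
		size_t i;

		if (t == NULL)
			return 0;
		for (i = 0; i < sizeof kw / sizeof kw[0]; i++)
			if (strcmp(t, kw[i]) == 0)
				return 1;
		return 0;
	}

	/* The decision point: having consumed "(", one extra token of
	 * lookahead says whether this is a cast (unary conversion) or a
	 * parenthesized subexpression. */
	static void after_open_paren(void)
	{
		if (starts_type_name(peek()))
			printf("parse as a cast: ( type-name ) expression\n");
		else
			printf("parse as a parenthesized subexpression\n");
	}

	int main(void)
	{
		while (peek() != NULL)
			if (strcmp(advance(), "(") == 0)
				after_open_paren();
		return 0;
	}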

On the other hand, because top-down knows the context at all times, it
can exploit that information.  For example, the operand of the C
"sizeof" operator can be an expression, but it is not evaluated --
only the type of its result matters -- and a top-down parser can
switch off code generation while parsing the operand.  Bottom-up
parsers often have to postpone work, building intermediate data
structures to remember information that upper levels may want to use,
because they don't know the context well enough to do anything with it
immediately.  (Of course, this may not look like a big penalty if the
architecture of a multi-pass compiler dictates that those structures
be built anyway.)
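
A rough sketch of the sizeof case (invented names; a real compiler does
this with rather more machinery): the parser carries an "emit code?"
flag and clears it around the operand.

	#include <stdio.h>

	static int emitting = 1;	/* generate code?  on by default */

	static void emit(const char *insn)
	{
		if (emitting)
			printf("\t%s\n", insn);
	}

	/* Stand-in for the full expression parser; here it just "generates"
	 * one instruction for its operand. */
	static void parse_operand(void)
	{
		emit("LOAD x");
	}

	static void parse_sizeof(void)
	{
		int saved = emitting;

		emitting = 0;		/* operand is parsed, not evaluated */
		parse_operand();	/* type information still gathered here */
		emitting = saved;
		emit("PUSH sizeof(operand)");
	}

	int main(void)
	{
		parse_operand();	/* ordinary context: the LOAD is emitted */
		parse_sizeof();		/* sizeof context: the LOAD is suppressed */
		return 0;
	}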

My own feeling, for what it's worth, is that top-down parsers are
generally underrated, but that the two types are good at different
things.  Bottom-up's greater parsing power makes it the method of
choice for doing experimental work -- e.g., tinkering with a new
notation -- or for dealing with pre-existing languages with difficult
syntax.  Top-down's limited power increases the investment needed to
make it work, but gives more leverage for semantics once it does, so
it can be a better tool for "production" work with a stable and
well-behaved input language.  And in real life, the choice can be
dictated one way or the other by available tools: bottom-up tools are
more common, but if circumstances dictate doing without any
parser-generation tools at all, top-down is much easier to hand-code.
--
Henry Spencer
henry@zoo.toronto.edu
--
Send compilers articles to compilers@iecc.com, meta-mail to
compilers-request@iecc.com.  Archives at http://www.iecc.com/compilers




From: Henry Spencer <henry@zoo.toronto.edu>
Newsgroups: comp.compilers,comp.theory
Subject: Re: Translation between high level languages
Date: 30 Nov 1997 22:54:08 -0500
Keywords: translator

Our moderator wrote:
>[The issues in translating to a high level language aren't all that different
>from translating to assembler, except that people care a lot more about the
>appearance and structure of the output code...

In fact, there are two distinct classes of such translators, based on
whether the translator attempts to produce human-maintainable code, or
doesn't.  It's worth making a distinction between *compiling* to
another language and *translating* to it, where the latter implies (as
in the translation of human languages) that the result is readable
without reference to the original.

Simply compiling from one language into another is sometimes not too
problematic, as witness f2c, which does quite a good job of turning
Fortran 77 into C.  (I've used f2c for a few production applications.)
However, a straightforward approach to this typically does not make
the output code readable and maintainable enough that you could throw
away the original source... which is often what people want to do.

Full and proper translation, with maintainable output, is a very hard
problem; I'm told that existing tools aimed at doing this typically
are interactive and rely heavily on human assistance.
--
Henry Spencer
henry@zoo.toronto.edu
--
Send compilers articles to compilers@iecc.com, meta-mail to
compilers-request@iecc.com.  Archives at http://www.iecc.com/compilers

Subject: Re: if then else shift/reduce Syndrome
From: Henry Spencer <henry@zoo.toronto.edu>
Date: Feb 27 1996
Keywords: yacc, parse
Newsgroups: comp.compilers

Michael Meissner <meissner@cygnus.com> writes:
>...if you don't have features like %left, %right and/or
>%nonassoc, you need to specify separate rules for each level of
>precedence.  Thus a simple statement:
>	a = 1;
>Can generate 13 or more reductions in C...

Only if the grammar is naively written.  Although it is significantly
more work, it *is* possible to write a grammar that is optimized for
the common cases, and doesn't do the massive recursive plunge for
simple statements.  (My terminology here reflects my life-long
preference for recursive-descent parsers and their table-driven
equivalents, but I believe the same thing can be done for bottom-up
parsing.)

The way to do it is to short-circuit the common cases, and invoke the
full machinery only if (for example) that `1' is followed by an
operator.  This does make things messy -- you've got to do some
backing and filling to cope with that unwelcome operator -- but if
you're patient and careful, it can be done.  The resulting grammar is
not as clean and elegant as the one in the language spec, especially
for a language with many precedence levels, but it parses real
programs much more rapidly.
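
One way to render the trick (an invented sketch, not the SP/k code):
parse a bare primary first, peek at what follows, and fall into the
general expression machinery only if an operator is actually there.

	#include <ctype.h>
	#include <stdio.h>
	#include <stdlib.h>

	static char input[] = "1";	/* try "1 + 2*3" as well */
	static char *p = input;

	static int parse_primary(void)
	{
		int v;

		while (isspace((unsigned char)*p))
			p++;
		v = (int)strtol(p, &p, 10);
		while (isspace((unsigned char)*p))
			p++;		/* leave p at the next operator, if any */
		return v;
	}

	static int prec(char op)	/* binding power; 0 = not an operator */
	{
		if (op == '+' || op == '-')
			return 1;
		if (op == '*' || op == '/')
			return 2;
		return 0;
	}

	/* The full machinery (precedence climbing), entered only on demand. */
	static int parse_binary(int lhs, int minprec)
	{
		while (prec(*p) >= minprec) {
			char op = *p++;
			int rhs = parse_primary();

			while (prec(*p) > prec(op))
				rhs = parse_binary(rhs, prec(op) + 1);
			switch (op) {
			case '+': lhs += rhs; break;
			case '-': lhs -= rhs; break;
			case '*': lhs *= rhs; break;
			default:  lhs /= rhs; break;
			}
		}
		return lhs;
	}

	/* The common case short-circuited: a bare "1" never touches the
	 * precedence machinery at all. */
	static int parse_expression(void)
	{
		int v = parse_primary();

		if (prec(*p) == 0)	/* no operator follows: done */
			return v;
		return parse_binary(v, 1);
	}

	int main(void)
	{
		printf("%d\n", parse_expression());
		return 0;
	}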

This trick has actually been around as folklore for some time.  I
don't recall ever seeing it published, although admittedly I'm more
than slightly behind on my reading.  I first ran into it in internal
documentation of the SP/k project at the University of Toronto in 1977.
--
Henry Spencer, henry@zoo.toronto.edu
--
Send compilers articles to compilers@iecc.com,
meta-mail to compilers-request@iecc.com.
