Last Update Time-stamp: "97/06/30 13:50:05 umrigar"
This is a brief intuitive introduction to recursive descent parsing. Any compiler text should provide more details. A elementary introduction to grammars and language analysis is also available.
Given a grammar, consider how one could write a parser for it. One possible approach would proceed as follows:
We know that all strings in the language must be derived from the start
nonterminal S. Hence if we can write a procedure S()
which matches all strings derived from S, then we are done.
But how do we do this?
Let's generalize the above question somewhat. Instead of merely posing
the problem of writing a procedure which matches strings derived from the
specific start nonterminal S, let's look at the more general
problem of writing a procedure which matches strings derived from any
nonterminal N; i.e., for each nonterminal N, we would
like to construct a parsing procedure N()
which matches all
strings derived from N.
If a nonterminal N has only a single grammar rule, then its parsing procedure is easy to write. To match the rule we know that we need to match each grammar symbol on its right-hand side (RHS). There are two possibilities:
Hence a parsing function N()
for nonterminal N
will simply contain logic to pick a rule based on the current lookahead,
followed by code to match each individual rule. Its structure will look
something like the following:
N() { if (lookahead can start first rule for N) { match rule 1 for N } else if (lookahead can start second rule for N) { match rule 2 for N } ... ... else if (lookahead can start n'th rule for N) { match rule n for N else { error(); } }
Unfortunately, there are some problems with this simple scheme. A grammar rule is said to be directly left recursive if the first symbol on the RHS is identical to the LHS nonterminal. Hence, after a directly left recursive rule has been selected, the first action of the corresponding parsing procedure, will be to call itself immediately, without consuming any of the input. It should be clear that such a recursive call will never terminate. Hence a recursive descent parser cannot be written for a grammar which contains such directly (or indirectly) left recursive rules; in fact, the grammar cannot be LL(1) in the presence of such rules.
Fortunately, it is possible to transform the grammar to remove left recursive rules. Consider the left recursive nonterminal A defined by the following rules:
A: A alpha A: betawhere alpha is nonempty and alpha and beta stand for any sequence of terminal and nonterminal grammar symbols. Looking at the above rules, it is clear that any string derived from A must ultimately start with beta. After the beta, the rest of the string must either be empty or consist of a sequence of one or more alpha's. Using a new nonterminal A' to represent the rest of the string we can represent the transformed grammar as follows:
A: beta A' A':/* empty */ A': alpha A'The above rules are not left recursive.
Consider the following simple grammar for assignment statements with expressions involving addition or subtraction:
1) stmt: ID ':=' expr 2) expr: expr '+' NUM 3) expr: expr '-' NUM 4) expr: NUMThis grammar allows statements like
a:= 1 + 2 - 3
.
Since the rules for expr
are left recursive, we need to
transform the grammar to the following:
1) stmt: ID ':=' expr 2) expr: NUM exprRest 3) exprRest: '+' NUM exprRest 4) exprRest: '-' NUM exprRest 5) exprRest: /* empty */Assuming that the current lookahead is in a global variable
tok
, we can write the parsing procedures in pseudo-C as
follows:
stmt() { match(ID); match(':='); expr(); } expr() { match(NUM); exprRest(); } exprRest() { if (tok == '+') { /* rule 3 */ match('+'); match(NUM); exprRest(); } else if (tok == '-') { /* rule 4 */ match('-'); match(NUM); exprRest(); } else { /* rule 5 */ } }Note that
exprRest()
chooses rule 5 if the lookahead cannot
match any of the other rules for exprRest
. It would have also
been correct to choose rule 5 only if the current lookahead was a token
which can follow exprRest
, but the method used above
is simpler with the sole disadvantage that errors are detected later.
The match()
procedure merely needs to check if the current
lookahead in tok
matches its argument and advance the input:
match(Token t) { if (tok == t) { tok= nextToken(); } else { error(): } }where we assume that
nextToken()
is the interface to the
scanner and returns the next token from the input stream.
All that remains is to make sure we startup and terminate the parser correctly:
parse() { tok= nextToken(); stmt(); match('<EOF>'); }The first line merely primes the global variable used for the lookahead token. The second line calls the parsing procedure for
stmt
.
Finally, assuming that '<EOF>'
is the token used to
represent the end of the input stream, the last line ensures that no garbage
follows a legal stmt
.
Copyright (C) 1997 Zerksis D. Umrigar
Feedback: Please email any feedback to zdu@acm.org.
Up to Parsing Demos Main Page