Introduction to Recursive Descent Parsing

Last Update Time-stamp: "97/06/30 13:50:05 umrigar"

This is a brief intuitive introduction to recursive descent parsing. Any compiler text should provide more details. A elementary introduction to grammars and language analysis is also available.

Given a grammar, consider how one could write a parser for it. One possible approach would proceed as follows:

We know that all strings in the language must be derived from the start nonterminal S. Hence if we can write a procedure S() which matches all strings derived from S, then we are done. But how do we do this?

Let's generalize the above question somewhat. Instead of merely posing the problem of writing a procedure which matches strings derived from the specific start nonterminal S, let's look at the more general problem of writing a procedure which matches strings derived from any nonterminal N; i.e., for each nonterminal N, we would like to construct a parsing procedure N() which matches all strings derived from N.

If a nonterminal N has only a single grammar rule, then its parsing procedure is easy to write. To match the rule we know that we need to match each grammar symbol on its right-hand side (RHS). There are two possibilities:

If the RHS symbol is a terminal symbol, check that the current lookahead matches that terminal symbol. If it does, then advance the input, setting the current lookahead to the next input symbol. If it does not, signal an error.
For each occurrence of a nonterminal symbol on the RHS, call the corresponding parsing procedure. If the parsing procedure works as posited above, then it will match a prefix of the input with a rule for the nonterminal.

If a nonterminal N has multiple grammar rules, then the parsing procedure will need to decide which rule to use. It can do so by looking at the current lookahead token to see which of the candidate rules can start with the lookahead. If only a single rule can start with the lookahead, then that rule is chosen. If it is always possible to predict a unique rule for any nonterminal by looking ahead by at most one token, then the grammar is said to be LL(1).

Hence a parsing function N() for nonterminal N will simply contain logic to pick a rule based on the current lookahead, followed by code to match each individual rule. Its structure will look something like the following:

	N() {
	  if (lookahead can start first rule for N) {
	    match rule 1 for N
	  }
	  else if (lookahead can start second rule for N) {
	    match rule 2 for N
	  }
	  ...		...
          else if (lookahead can start n'th rule for N) {
	    match rule n for N
	  else {
	    error();
	  }
	}

Unfortunately, there are some problems with this simple scheme. A grammar rule is said to be directly left recursive if the first symbol on the RHS is identical to the LHS nonterminal. Hence, after a directly left recursive rule has been selected, the first action of the corresponding parsing procedure, will be to call itself immediately, without consuming any of the input. It should be clear that such a recursive call will never terminate. Hence a recursive descent parser cannot be written for a grammar which contains such directly (or indirectly) left recursive rules; in fact, the grammar cannot be LL(1) in the presence of such rules.

Fortunately, it is possible to transform the grammar to remove left recursive rules. Consider the left recursive nonterminal A defined by the following rules:

	A: A alpha
	A: beta

where alpha is nonempty and alpha and beta stand for any sequence of terminal and nonterminal grammar symbols. Looking at the above rules, it is clear that any string derived from A must ultimately start with beta. After the beta, the rest of the string must either be empty or consist of a sequence of one or more alpha's. Using a new nonterminal A' to represent the rest of the string we can represent the transformed grammar as follows:

	A: beta A'
	A':/* empty */
	A': alpha A'

The above rules are not left recursive.

An Example

Consider the following simple grammar for assignment statements with expressions involving addition or subtraction:

1)	stmt:	ID ':=' expr
2)	expr:	expr '+' NUM
3)	expr:	expr '-' NUM
4)	expr:	NUM

This grammar allows statements like a:= 1 + 2 - 3. Since the rules for expr are left recursive, we need to transform the grammar to the following:

1)	stmt:		ID ':=' expr
2)	expr:		NUM exprRest
3)	exprRest:	'+' NUM exprRest
4)	exprRest:	'-' NUM exprRest
5)	exprRest:	/* empty */

Assuming that the current lookahead is in a global variable tok, we can write the parsing procedures in pseudo-C as follows:

stmt() {
  match(ID); match(':='); expr();
}

expr() {
  match(NUM); exprRest();
}

exprRest() {
  if (tok == '+') { /* rule 3 */
    match('+'); match(NUM); exprRest();
  }
  else if (tok == '-') { /* rule 4 */
    match('-'); match(NUM); exprRest();
  }
  else { /* rule 5 */
  }
}

Note that exprRest() chooses rule 5 if the lookahead cannot match any of the other rules for exprRest. It would have also been correct to choose rule 5 only if the current lookahead was a token which can follow exprRest, but the method used above is simpler with the sole disadvantage that errors are detected later.

The match() procedure merely needs to check if the current lookahead in tok matches its argument and advance the input:

match(Token t) {
  if (tok == t) {
    tok= nextToken();
  }
  else {
    error():
  }
}

where we assume that nextToken() is the interface to the scanner and returns the next token from the input stream.

All that remains is to make sure we startup and terminate the parser correctly:

parse() {
  tok= nextToken();
  stmt();
  match('<EOF>');
}

The first line merely primes the global variable used for the lookahead token. The second line calls the parsing procedure for stmt. Finally, assuming that '<EOF>' is the token used to represent the end of the input stream, the last line ensures that no garbage follows a legal stmt.

Feedback: Please email any feedback to zdu@acm.org.

Up to Parsing Demos Main Page