Derived from SERC-TR--76F, Software Engineering Research Center, University of Florida, CSE-301, Gainesville, FL 32611, July/94

Grammar-Based Programming

When you say the word parsing to most programmers the immediate word-association response is "compiler writing". Computer Science students are usually introduced to grammars in a compilers course that focuses on this one narrow application of parsing tools. Since few programmers write compilers after leaving university, we tend to think of parsing as a solution to somebody else's problem.

But parsing technology can be applied in a wide range of situations in which a program is controlled by a sequence of inputs. Parser generators can thus be used to provide clean, maintainable code to solve many programming problems.

A parser generator reads a syntax file containing a grammar and associated actions specified for both normal and error inputs. From these it creates a C or C++ source program. This generated program, when compiled and executed, will accept sequences of inputs that match the grammar and carry out the corresponding actions. If the inputs do not match the grammar, the program will instead perform the specified error handling actions.

Maintainability Benefits of Grammar-Based Programming

The main benefits of organizing a program around a grammar are readability, robustness, flexibility and portability. Together these make for more maintainable code.

Readability

It is generally easy to read and understand a grammar based program because the grammar explains most of the complicated logic the reader might need to worry about. For example, the key part of the AnaGram syntax file in Appendix I, is the following:

grammar 
  -> line?..., eof = writeUnidentifiedFileList ();

line 
  -> trace line, newline 
  -> source line, newline 
  -> other line, newline 

trace line 
  -> "!trace on lines at ", location, not eol?...

location 
  -> my file:i, function spec, line spec:n = writeR2tLine (i, n); 
  -> routine entry point 
  -> meaningless number 
  -> unidentified location 
...

This tells us immediately that the input (called grammar) is expected to be a sequence of lines and that each line may be a trace line, a source line, or an other line. Trace lines start with the string "!trace on lines at " which is followed by location information. If the location is composed of a file name, followed by a function name followed by a line number then the writeR2tLine function is called. Other cases, such as a routine entry point are ignored.

While we may not understand immediately what an r2t line is, the overall behavior of the program is very clear. This program is scanning the input for trace lines and writing something when it finds each one. The grammar provides an excellent top down specification of what it expects as input and what it will do when it gets it.

Robustness

In a conventional program it can be very difficult to check that all contingencies have been handled. What if there are extra blanks in the input? What if the last input line is not complete? Special cases such as these are often found during testing and handled by patching the program's logic. After a few such patches the program becomes incomprehensible.

In a grammar based program, the grammar defines behavior for all such cases. The parser generator then checks for consistency and warns the programmer of possible errors. We have found that many of our logic errors are caught by the parser generator and relatively few get through into testing of the final program.

Flexibility

Grammar based programs can often be extended easily to handle new cases. Often the addition of a couple of alternatives to the grammar is sufficient to take care of a new requirement. Such a change could be very complex to introduce in a conventional program because so many of the program's states could be affected. But the parser generator hides this kind of complication and generates a new, correct, program from the modified grammar.

This flexibility can be especially useful in exploratory programming [WILD.94]. For example, the program in Appendix I takes output from a debugger and translates it for use in one of our projects. The exact form of the debugger's output was known only from examples. As a first cut at the problem the grammar in the appendix was written from one of these examples. The error handling features of the parser generator can then be used to systematically develop a complete and correct program.

The grammar contains the following production for error handling:

other line
  -> error = wrError (PCB.line, PCB.column);

The token error covers anything else not described by this grammar. Any input line that does not match the grammar will produce a call to the wrError function that writes a message to standard error. The offending input can then be tracked down and added to the grammar.

We have used this technique several times. The result is a program that is easy to understand despite the many revisions. The grammar acts as a formal but readable specification at all stages of the process.

Portability

Grammar-based programs usually are designed to process simple ASCII text as input and thus tend to be easily portable. So far, AnaGram generated programs have worked with all of the C or C++ compilers we have tried. Since it can be difficult to predict all the environments a program may eventually face, a conservative grammar-based design can reduce the life-cycle cost of repeated porting.

A Sample AnaGram Syntax File (Simplified)

{
/* File: r2getvd.syn, By: N. Wilde, Nov. 14/93
(mangled by smilex@RusNet, Nov. 17-18/2008)
Purpose: Get Vax Debugger input for RECON.
This is an AnaGram syntax file that generates r2getvd.c, which
converts a log file containing trace output from the VAX debugger
to a RECON "<xx>.r2t" trace file. To help
check consistency, r2getvd writes to standard output a list of
all the file names found in the log file that it could NOT
identify. If any log file line has an unexpected format, a message
is written to standard error.

Design: The Vax debugger log output contains, along with much else,
lines like the following:
!trace on lines at RPN\getop\%LINE 577+11
that indicate that line 577 of function getop in file RPN.C has
been executed. This input line should produce a r2t file line with:
   004 577T
where 004 is the file index of file RPN.C. The mapping between file
indexes and file names is in the "my file" productions in this file. 

R2getvd reads the debugger log file. When it finds a line of
the form shown above it looks up the corresponding file index and
writes the corresponding r2t file line. If the line is of the form
!trace on lines at XXXX ...
but XXXX is not one of the known file names, then XXXX is stored in
a list and written out at the end of processing in the list of
unidentified files.
*/
}
?
grammar
  -> line?..., eof = writeUnidentifiedFileList ();

line
  -> trace line, newline
  -> source line, newline
  -> other line, newline

trace line
  -> "!trace on lines at ", location, not eol?...

location
  -> my file:i, function spec, line spec:n = writeR2tLine (i, n);
  -> routine entry point
  -> meaningless number
  -> unidentified location

(int) my file
  -> "RPN" = 004;
/* ADD other abbreviated file names here */

function spec
  -> not percen...

(int) line spec
  -> "%LINE ", number:n = n;

routine entry point
  -> "routine"

unidentified location
  -> identifier = storeUnidentifiedFile (release ());

meaningless number
  -> number

source line
  -> "! ", blank?..., number, ":", not eol?...

other line
  -> error = wrError (PCB.line, PCB.column);

(int) number
  -> digit:d = d - '0';
  -> number:n, digit:d = 10*n + (d -'0');

identifier
  -> identifier start character:c = collectFirst (c);
  -> identifier, identifier follow character:c = collect (c);

/* lexical definitions */
newline = '\n'
eof = -1 + 0 + ^Z
backslash = '\\'
percen = '%'
blank = ' '
digit = '0-9'
identifier start character = 'a-z' + 'A-Z' + '_' + '$' + '@'
`identifier follow character = identifier start character + digit
not percen = ~(percen + newline + eof)
not eol = ~(newline + eof)

/* C Support Code */
{
?
}

/* ----------- main program ------------ */
#define PATH_LEN 100    /* Should be PATH_MAX? */

int main (int argc, char *argv[]) {
    int i;
    char logFilePath[PATH_LEN];
    char r2tFilePath[PATH_LEN];

    for (i = 1; i < argc; ++i) {
        if (0 == strncmp (argv[i], "-L", 2)) {
            strncpy (logFilePath, argv[i] + 2, PATH_LEN);
            if (NULL == (logFile = fopen (logFilePath, "r"))) {
                fprintf (stderr, "R2GETVD - couldn't open log file %s\n",
                    logFilePath);
                return 1;
            }
        }

        if (0 == strncmp (argv[i], "-R", 2)) {
            strncpy (r2tFilePath, argv[i] + 2, PATH_LEN);
            if (NULL == (r2tFile = fopen (r2tFilePath, "w"))) {
                fprintf (stderr, "R2GETVD - couldn't open r2t file %s\n",
                    r2tFilePath);
                return 1;
            }
        }
    }

    if ((NULL == r2tFile) || (NULL == logFile)) {
        fprintf (stderr, "R2GETVD ABORTING\n");
        return 1;
    }

    /* Write comment line of the r2t file */
    fprintf (r2tFile, "From Vax Log file %s\n", logFilePath);

    /* Call the parser to do the rest */
    r2getvd ();

    /* Close all files and quit */
    fclose (logFile);
    fclose (r2tFile);

    return 0;
}