Derived from "ANTLR" by Ashley J. S. Mills

Language Recognition

Background Information

ANTLR is a compiler construction tool, hence its developer base is generally constrained to those who desire to create translators of some kind. In order to comprehend much of what will be discussed in this tutorial, it is necessary first to get a feel for the terminology used in this area of computer science and for the basic concepts behind the operation of ANTLR. This section begins with a brief discussion of how a compiler operates.

The Lexer

Other names: Scanner, lexical analyser, tokenizer.

Programming languages are made up of keywords and strictly defined constructs. The ultimate aim of the compilation process is to translate the high-level instructions of the programming language into the low-level instructions of the machine or virtual machine that is the intended execution architecture. For example, a native C++ compiler compiles C++ code into machine language instructions that execute directly on the target hardware (or on some simulation of the target hardware). The standard Java compiler distributed by Sun Microsystems compiles Java source code to Java bytecode, the machine language instruction set used by the Java virtual machine; this bytecode can then be executed by any platform that implements the Java virtual machine.

A source program is written using some kind of editing tool that produces a file comprising statements and constructs that are allowed in the programming language being used. The actual text of the file is written using the characters of a particular character set, or a subset of one, so a source file can be thought of as a stream of characters terminated by an EOF (End Of File) marker that signifies the end of the source file.
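
As a concrete illustration, here is a minimal Java sketch (the class name is invented for illustration) that consumes a source file in exactly this way, one character at a time, until the EOF marker is reached:

    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;

    public class CharStreamDemo {
        public static void main(String[] args) throws IOException {
            // read() yields one character at a time and returns -1 once EOF is hit
            try (Reader in = new FileReader(args[0])) {
                int c;
                while ((c = in.read()) != -1) {
                    System.out.print((char) c);
                }
            }
            System.out.println("<EOF>");
        }
    }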

A source file is streamed to a lexer on a character-by-character basis by some kind of input interface. The lexer's job is to quantify the meaningless stream of characters into discrete groups that, when processed by the parser, have meaning. Each character or group of characters quantified in this manner is called a token. Tokens are components of the programming language in question, such as keywords, identifiers, symbols, and operators. The lexer usually also removes comments and whitespace from the program, along with any other content that is of no semantic value to the interpretation of the program. In this way the lexer converts the stream of characters into a stream of tokens, each of which has individual meaning as dictated by the lexer's rules. Similarly, your brain is probably grouping the individual characters that make up each of the words in this sentence into tokens (words, in this case, which have semantic value to you). Your job of determining where one token finishes and another begins is made a little easier, however, because the words in a sentence are already separated by spaces; it could be argued that an English sentence is already tokenised in this sense, though we can assume that some kind of grouping and recognition is occurring at the word level too. The stream of tokens generated by the lexer is received by the parser.
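
To make this concrete, here is a minimal hand-written lexer sketch in Java; the class, token and rule names are all invented for illustration, and a real lexer would of course recognise far more token types. It groups characters into INT, IDENT and PLUS tokens, discards whitespace, and complains about any character it cannot match:

    import java.util.ArrayList;
    import java.util.List;

    public class TinyLexer {
        // The token types this toy language knows about.
        enum Type { INT, IDENT, PLUS }

        static class Token {
            final Type type;
            final String text;
            Token(Type type, String text) { this.type = type; this.text = text; }
            public String toString() { return type + "(" + text + ")"; }
        }

        // Quantify a character stream into a stream of discrete tokens.
        static List<Token> tokenize(String input) {
            List<Token> tokens = new ArrayList<Token>();
            int i = 0;
            while (i < input.length()) {
                char c = input.charAt(i);
                if (Character.isWhitespace(c)) {
                    i++;                               // whitespace has no semantic value: discard it
                } else if (c == '+') {
                    tokens.add(new Token(Type.PLUS, "+"));
                    i++;
                } else if (Character.isDigit(c)) {
                    int start = i;                     // group consecutive digits into one INT token
                    while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                    tokens.add(new Token(Type.INT, input.substring(start, i)));
                } else if (Character.isLetter(c)) {
                    int start = i;                     // group consecutive letters into one IDENT token
                    while (i < input.length() && Character.isLetter(input.charAt(i))) i++;
                    tokens.add(new Token(Type.IDENT, input.substring(start, i)));
                } else {
                    // no rule matches this character: a lexical error
                    throw new RuntimeException("unexpected character '" + c + "'");
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokenize("count + 42"));    // [IDENT(count), PLUS(+), INT(42)]
        }
    }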

A lexer usually generates errors pertaining to sequences of characters it cannot match to a specific token type defined by one of its rules.

The Parser

Other Names: Syntactical analyser.

A lexer groups sequences of characters it recognises in the character stream into tokens with individual semantic worth; it does not consider their semantic worth within the context of the whole program. That is the job of the parser. Languages are described by a grammar: the grammar determines exactly what defines a particular token and which sequences of tokens are decreed valid. The parser organises the tokens it receives into the allowed sequences defined by the grammar of the language. If the language is being used exactly as defined in the grammar, the parser will be able to recognise the patterns that make up certain structures and group these together. If the parser encounters a sequence of tokens that matches none of the allowed sequences, it will issue an error and perhaps try to recover by making a few assumptions about what the error was.

The parser checks to see that the tokens conform to the syntax of the language defined by the grammar. Similarly, your brain knows what kinds of sentences are valid within a particular language such as English, and it could be said that at this moment your brain is parsing the words of this sentence and grouping them into what you understand as valid sequences. For instance, your brain knows that a sentence ends when a full stop is encountered; one would not assume that the text following the full stop was part of the same sentence. In addition to this, your brain is also extracting meaningful information from the sentence. Usually the parser will convert the sequences of tokens it has been deliberately created to match into some other form, such as an Abstract Syntax Tree (AST). An AST is easier to translate to a target language because an AST contains additional information implicitly, by the nature of its structure. Effectively, creating an AST is the most important part of a language translation process.
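
Continuing the sketch, the following toy parser (again, all names are invented for illustration) checks that the token stream conforms to the grammar rule expr : INT (PLUS INT)* and builds an AST whose shape records implicitly how the operands group together:

    import java.util.List;

    public class TinyParser {
        // A minimal AST node: a value or operator, with optional children.
        static class Node {
            final String text;
            final Node left, right;
            Node(String text, Node left, Node right) {
                this.text = text; this.left = left; this.right = right;
            }
            public String toString() {
                return left == null ? text : "(" + text + " " + left + " " + right + ")";
            }
        }

        private final List<TinyLexer.Token> tokens;
        private int pos = 0;

        TinyParser(List<TinyLexer.Token> tokens) { this.tokens = tokens; }

        // Consume the next token if it has the expected type, otherwise issue a syntax error.
        private TinyLexer.Token match(TinyLexer.Type expected) {
            if (pos >= tokens.size() || tokens.get(pos).type != expected) {
                throw new RuntimeException("syntax error: expected " + expected + " at token " + pos);
            }
            return tokens.get(pos++);
        }

        // expr : INT (PLUS INT)* -- each '+' becomes an interior node of the tree
        Node expr() {
            Node tree = new Node(match(TinyLexer.Type.INT).text, null, null);
            while (pos < tokens.size() && tokens.get(pos).type == TinyLexer.Type.PLUS) {
                match(TinyLexer.Type.PLUS);
                Node right = new Node(match(TinyLexer.Type.INT).text, null, null);
                tree = new Node("+", tree, right);     // grouping is implicit in the tree's structure
            }
            return tree;
        }

        public static void main(String[] args) {
            System.out.println(new TinyParser(TinyLexer.tokenize("1 + 2 + 3")).expr());
            // prints (+ (+ 1 2) 3)
        }
    }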

The parser generates one or more symbol tables, which contain information regarding the tokens it encounters, such as whether a token is the name of a procedure or what specific value it had. The symbol tables are used in the generation of object code and in type checking, so that, for example, an integer cannot be assigned to a string. ANTLR uses symbol tables to speed up the matching of tokens: each token is mapped to an integer, and instead of matching the string that would compose a textual description of that token, the integer that represents the token is matched instead, which is a lot quicker. Eventually the AST will be translated to an executable format, and some linking of libraries may be performed; this is not considered the job of the compiler and is not of direct concern here.
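
Both ideas can be sketched in a few lines of Java; the names and integer values below are invented for illustration. Token and data types become small integers, so a comparison is a single integer test, and the symbol table records what is known about each name for later type checking:

    import java.util.HashMap;
    import java.util.Map;

    public class SymbolTableDemo {
        // Types as small integers: matching compares ints, not strings.
        static final int INT_TYPE = 4;
        static final int STRING_TYPE = 5;

        // What the parser has learned about a declared name.
        static class Symbol {
            final String name;
            final int type;
            Symbol(String name, int type) { this.name = name; this.type = type; }
        }

        static final Map<String, Symbol> symbols = new HashMap<String, Symbol>();

        static void declare(String name, int type) {
            symbols.put(name, new Symbol(name, type));
        }

        // Type checking from the symbol table: reject assignments whose types disagree.
        static void checkAssign(String target, int valueType) {
            Symbol s = symbols.get(target);
            if (s == null) throw new RuntimeException("undeclared identifier: " + target);
            if (s.type != valueType) {                 // a single int comparison decides compatibility
                throw new RuntimeException("type error: cannot assign to '" + target + "'");
            }
        }

        public static void main(String[] args) {
            declare("count", INT_TYPE);
            checkAssign("count", INT_TYPE);            // fine: integer assigned to integer
            try {
                checkAssign("count", STRING_TYPE);     // rejected: string assigned to integer
            } catch (RuntimeException e) {
                System.out.println(e.getMessage());
            }
        }
    }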

A parser usually generates errors pertaining to sequences of tokens it cannot match to the specific syntactical arrangements allowed, as decreed by the grammar.

Both lexers and parsers are recognisers: lexers recognise sequences of characters, parsers recognise sequences of tokens. Each takes a stream of elements (be they characters or tokens) and translates it to some other stream of elements, such as tokens representing larger structures or groups of elements, or perhaps nodes in an abstract syntax tree. They are essentially the same kind of thing; traditionally, however, lexers are associated with processing streams of characters and parsers with processing streams of tokens.
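
This symmetry can be expressed directly. The sketch below (reusing the invented toy classes from earlier) treats the lexer and the parser as two functions of the same shape and composes them into a translator from a character stream to a tree:

    import java.util.List;
    import java.util.function.Function;

    public class RecognizerShapes {
        public static void main(String[] args) {
            // A lexer recognises characters and produces tokens...
            Function<String, List<TinyLexer.Token>> lexer = TinyLexer::tokenize;
            // ...and a parser recognises tokens and produces a tree.
            Function<List<TinyLexer.Token>, TinyParser.Node> parser =
                    tokens -> new TinyParser(tokens).expr();
            // Composing the two gives a character-stream-to-tree translator.
            System.out.println(lexer.andThen(parser).apply("1 + 2"));   // prints (+ 1 2)
        }
    }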

It is recommended that you read Building Recognizers By Hand by Terence Parr, the creator of ANTLR, to get an insight into how one would go about creating a recogniser in Java; from this you can abstract how it can be done in any programming language. When you have the time, you should read all the documentation that came with the ANTLR installation.

What is ANTLR's part in all this?

ANTLR lets you define the rules the lexer should use to tokenize a stream of characters and the rules the parser should use to interpret a stream of tokens. ANTLR can then generate a lexer and a parser, which you can use to interpret programs written in your language and translate them into other languages and ASTs. The design of ANTLR affords much extensibility, and it has many applications.
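
For a flavour of what this looks like in practice, here is the typical driver code for a lexer and parser generated by ANTLR 2; the CalcLexer and CalcParser class names and the expr rule are illustrative, standing in for whatever your grammar file defines:

    import java.io.DataInputStream;

    public class Calc {
        public static void main(String[] args) {
            try {
                // The generated lexer reads the character stream...
                CalcLexer lexer = new CalcLexer(new DataInputStream(System.in));
                // ...and the generated parser reads the lexer's token stream.
                CalcParser parser = new CalcParser(lexer);
                parser.expr();                         // begin parsing at the grammar's 'expr' rule
            } catch (Exception e) {
                System.err.println("exception: " + e);
            }
        }
    }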