Lexical analysis is the process of breaking a program into lexemes. A lexeme is a single, indivisible unit in a program.
For example, the following code:
for (int i = 0; i < 10; i++) {
// some comment here
array[i] = array[i] + 2.5;
}
Contains the following lexemes:
Lexemes come in different types called tokens. Tokens include identifiers, reserved words, numbers, strings and symbols.
Identifiers are one of the most common tokens in programs. They are typically composed of ASCII letters, numbers and underscores.
Languages have different rules regarding the form of valid identifiers. C, C++, Ruby and Python use the following rules:
Java also allows dollar signs to be included.
Haskell allows single quotes.
LISP languages allow hyphens, question marks and more.
Keywords are the words in a program that have a special meaning such as "if", "while", "throw" etc.
Reserved words are words that cannot be used as identifiers names. In most modern languages, they are the same thing.
Numbers are another token. Most languages have similar rules about numbers in programs. Integers are simply written as a sequence of digits. Floating point numbers are written using a decimal point or scientific notation.
C and C++ also allow trailing letters to signify the data type of the constant. They also allow for octal and hexadecimal values. The following are all valid numeric constants in C:
42 | decimal |
052 | octal |
0x2A | hex |
42L | long int |
42LL | long long int |
42.0 | double |
42.0f | float |
4.2e10 | scientific |
4.2e10f | scientific float |
The lexical analyzer must be able to recognize every representation for these.
Comments are also handled by the lexer. Different languages have different rules for how comments look. Some have a start symbol and end symbol such as /* and */ in C. Others have only a start symbol and go through the end of the line such as // in Java and # in Python.
Older languages tended to use keywords for comments instead of symbols such as the "REM" in BASIC or "C" in FORTRAN.
Comments with begin and end symbols typically have issues with nested comments.
Strings are typically enclosed in " or ' characters, and contain characters in between.
Perl has two different types of string literals. Strings in single quotes are taken verbatim, while strings in double quotes are interpolated.
Other types of tokens include arithmetic and logic operators:
+ - * / % & | ^ ~ ! <...
as well as structural symbols:
() [ ] { } begin end...
Lexical analysis often ignores whitespace, but there are some cases where it is important. In Python, indentation is used to control blocks instead of braces. To implement this, a lexer must keep track of the indentation level and insert extra INDENT and DEDENT tokens.
There are two ways to write a lexer:
In this class, we will use the ANTLR system which provides both lexer and parser generators. We will look at parsers next week, but use ANTLR for lexer generation now.
ANTLR is installed on the CPSC server, but you must add it to your
CLASSPATH
environment variable. This variable tells the Java
virtual machine where to look for classes that are referenced. Since our
programs will rely on ANTLR classes, we need to tell the JVM where to find them.
To do this, put the following into your .bash_profile
file:
export CLASSPATH=".:/usr/local/lib/antlr-4.13.1-complete.jar:$CLASSPATH"
We also will make aliases for running the ANTLR tool and test program. This will
allow us to invoke these classes from the command-line more conveniently. Add this
to the .bash_profile
as well:
alias antlr4='java -Xmx500M -cp "/usr/local/lib/antlr-4.13.1-complete.jar:$CLASSPATH" org.antlr.v4.Tool'
alias grun='java -Xmx500M -cp "/usr/local/lib/antlr-4.13.1-complete.jar:$CLASSPATH" org.antlr.v4.gui.TestRig'
We should now be set up to use both the ANTLR tools and runtime library!
Regular expressions, or "regexes" are patterns which match sequences of text. There are many slightly different sytaxes for them. Here we will talk about the regex patterns allowed in ANTLR:
'while'
matches the keyword "while" and
nothing else.[+-*/]
would match any of those 4 operators and [aeiou]
would match any vowel.[0-9]
matches any single digit and [a-zA-Z]
matches any letter.'+' | '-'
will match either a plus or
minus sign.a*
matches "", "a", "aa", "aaa", etc. The regex [0-9]*
matches any string of digits (including
zero digits). [0-9]+
matches any string of digits, but enforces that at least one must exist.'-'?[0-9]+
matches a string of digits which may or may not begin with
a negative sign.~'"'
matches any character
except for a double quote and ~[A-Z]
matches anything except for a capital letter.('a' | 'b')*
will match any strings of the letters a and b. On
the other hand 'a' | 'b'*
will match either a single a, or any string of b's.ANTLR lexers are created by making a .g4 file, containing lexer rules. These lexer rules contain the name of the token being matched (which must begin with a capital letter), followed by a colon. Next comes a regular expression giving the pattern to match, and finally a semi-colon.
The calculator lexer example contains lexical rules for matching floating point numbers, and a few symbols needed to write a calculator program:
lexer grammar CalculatorLexer;
PLUS: '+';
MINUS: '-';
DIVIDE: '/';
TIMES: '*';
LPAREN: '(';
RPAREN: ')';
PI: 'pi' | 'PI';
DONE: '\n';
REALVAL: '-'?[0-9]*'.'?[0-9]+;
WS: [ \t]+ -> skip;
Notice that the symbols are all defined with a regular expression matching a single character. The "PI" symbol matches the text "pi" or "PI".
The only really complex regular expression in this lexer is the "REALVAL" token which matches:
Also notice the -> skip
. This tells ANTLR not to actually produce
the WS token in the token stream, but simply discard it. This is done for whitespace
here and is also typically done for comments in program code.
There are algorithms for transforming regular expressions into a computing structure called a "finite state machine". An FSM keeps track of what state we are in, and makes a transition to a new state for each character read. Below is a state machine recognizing an identifier using C rules:
Each new character takes us into another state. When we reach a state that marks the end of one of our tokens, we produce that lexeme and pass it along to the parser. The basic idea behind the algorithm is:
ANTLR uses this algorithm to transform our regexes into a Java class containing an FSM which recognizes the lexemes in our language. This is easier and more maintainable than writing the lexer by hand.
Copyright © 2024 Ian Finlayson | Licensed under a Creative Commons BY-NC-SA 4.0 License.