Lexical analysis is the process of breaking a program into lexemes. A lexeme is a single, indivisible unit in a program.
For example, the following code:
for(int i = 0; i < 10; i++) {
// some comment here
array[i] = array[i] + 2.5;
}
Contains the following lexemes:
Lexemes come in different types called tokens. Tokens include identifiers, reserved words, numbers, strings and symbols.
Lexical analysis also strips comments out of a program and must check for lexical errors. Below is a C++ program with a lexical error in it:
#include <iostream>
using namespace std;
int main() {
int name = ☯;
return 0;
}
Lexical errors are somewhat rare since most combinations of characters are valid lexemes in most languages.
Identifiers are one of the most common tokens in programs. They are typically composed of ASCII letters, numbers and underscores.
Languages have different rules regarding the form of valid identifiers. C, C++, Ruby and Python use the following rules:
Java also allows dollar signs to be included.
Haskell allows single quotes.
LISP languages allow hyphens, question marks and more.
Another common token is key words. Key words are the words in a program that have a special meaning such as "if", "while", "throw" etc.
Reserved words are words that cannot be used as identifiers names. In most modern languages, they are the same thing.
Numbers are another token. Most languages have similar rules about numbers in programs. Integers are simply written as a sequence of digits. Floating point numbers are written using a decimal point or scientific notation.
C and C++ also allow trailing letters to signify the data type of the constant. They also allow for octal and hexadecimal values. The following are all valid numeric constants in C:
42 | decimal |
052 | octal |
0x2A | hex |
42L | long int |
42LL | long long int |
42.0 | double |
42.0f | float |
4.2e10 | scientific |
4.2e10f | scientific float |
The lexical analyzer must be able to recognize every representation for these.
Different languages have different rules for how comments look. Some have a start symbol and end symbol such as /* and */ in C. Others have only a start symbol and go through the end of the line such as // in Java and # in Python.
Older languages tended to use keywords for comments instead of symbols such as the "REM" in BASIC or "C" in FORTRAN.
Comments with begin and end symbols typically have issues with nested comments.
Languages typically use text enclosed in " or ' characters as string literals.
Perl has two different types of string literals. Strings in single quotes are taken verbatim, while strings in double quotes are interpolated.
Other types of tokens include arithmetic and logic operators:
+ - * / % & | ^ ~ ! <...
as well as structural symbols:
() [ ] { } begin end...
Lexical analysis often ignores whitespace, but there are some cases where it is important. In Python, indentation is used to control blocks instead of braces. To implement this, a lexer must keep track of the indentation level and insert extra INDENT and DEDENT tokens.
Regular expressions are used to formally define the lexical elements of a programming language. Regular expressions are composed of a few simple rules:
We can extend regular expressions with other rules:
We can define regular expressions for identifiers, numbers, comments and any other lexical element in a program.
Finite state machines recognize regular expressions. This means we can build a state machine to match the different tokens in our language. Below is a state machine recognizing an identifier using C rules:
By combining state machines for every token, we can build a lexer that analyzes entire languages.
A simpler, more common, way of producing lexical analyzers is to write the patterns as regular expressions, then convert them into finite automata using the following rules:
Because this is a mechanical process, it can be done by a program such as lex.
Copyright © 2022 Ian Finlayson | Licensed under a Attribution-NonCommercial 4.0 International License.