Home CPSC 401

Lexical Analysis

Overview

Lexical analysis is the process of breaking a program into lexemes. A lexeme is a single, indivisible unit in a program.

For example, the following code:


for(int i = 0; i < 10; i++) {
    // some comment here
    array[i] = array[i] + 2.5;
}
Contains the following lexemes:
  1. for
  2. (
  3. int
  4. i
  5. =
  6. 0
  7. ;
  8. i
  9. <
  10. 10
  11. i
  12. ++
  13. )
  14. {
  15. array
  16. [
  17. i
  18. ]
  19. =
  20. array
  21. [
  22. i
  23. ]
  24. +
  25. 2.5
  26. ;
  27. }

Lexemes come in different types called tokens. Tokens include identifiers, reserved words, numbers, strings and symbols.

Lexical analysis also strips comments out of a program and must check for lexical errors. Below is a C++ program with a lexical error in it:


#include <iostream>
using namespace std;

int main() {
  int name = ☯;
  return 0;
}

Lexical errors are somewhat rare since most combinations of characters are valid lexemes in most languages.


Common Tokens

Identifiers

Identifiers are one of the most common tokens in programs. They are typically composed of ASCII letters, numbers and underscores.

Languages have different rules regarding the form of valid identifiers. C, C++, Ruby and Python use the following rules:

Java also allows dollar signs to be included.

Haskell allows single quotes.

LISP languages allow hyphens, question marks and more.

Key words

Another common token is key words. Key words are the words in a program that have a special meaning such as "if", "while", "throw" etc.

Reserved words are words that cannot be used as identifiers names. In most modern languages, they are the same thing.

Numbers

Numbers are another token. Most languages have similar rules about numbers in programs. Integers are simply written as a sequence of digits. Floating point numbers are written using a decimal point or scientific notation.

C and C++ also allow trailing letters to signify the data type of the constant. They also allow for octal and hexadecimal values. The following are all valid numeric constants in C:

42decimal
052octal
0x2Ahex
42Llong int
42LLlong long int
42.0double
42.0ffloat
4.2e10scientific
4.2e10fscientific float

The lexical analyzer must be able to recognize every representation for these.

Comments

Different languages have different rules for how comments look. Some have a start symbol and end symbol such as /* and */ in C. Others have only a start symbol and go through the end of the line such as // in Java and # in Python.

Older languages tended to use keywords for comments instead of symbols such as the "REM" in BASIC or "C" in FORTRAN.

Comments with begin and end symbols typically have issues with nested comments.

Strings

Languages typically use text enclosed in " or ' characters as string literals.

Perl has two different types of string literals. Strings in single quotes are taken verbatim, while strings in double quotes are interpolated.

Symbols & Structure

Other types of tokens include arithmetic and logic operators:


+ - * / % & | ^ ~ ! <...
as well as structural symbols:

() [ ] { } begin end...

Handling Whitespace

Lexical analysis often ignores whitespace, but there are some cases where it is important. In Python, indentation is used to control blocks instead of braces. To implement this, a lexer must keep track of the indentation level and insert extra INDENT and DEDENT tokens.


Regular Expressions

Regular expressions are used to formally define the lexical elements of a programming language. Regular expressions are composed of a few simple rules:

  1. Every character in the alphabet is a regular expression by itself.
  2. XY is a regular expression when X and Y are regular expressions. This means the pattern is X followed by Y.
  3. X | Y is a regular expression when X and Y are regular expressions. This means X or Y is matched by the pattern.
  4. X* is a regular expression when X is a regular expression. This means 0 or more repetitions of X.

We can extend regular expressions with other rules:

  1. Ranges such as A-Z or 0-9.
  2. X+ which is 1 or more repetition of X. Identical to XX*.
  3. . which matches any character.
  4. ^ and $ which match the start and end of a line.

We can define regular expressions for identifiers, numbers, comments and any other lexical element in a program.


Finite State Machines

Finite state machines recognize regular expressions. This means we can build a state machine to match the different tokens in our language. Below is a state machine recognizing an identifier using C rules:

By combining state machines for every token, we can build a lexer that analyzes entire languages.


Converting Regular Expressions into State Machines

A simpler, more common, way of producing lexical analyzers is to write the patterns as regular expressions, then convert them into finite automata using the following rules:

  1. Concatenation is given as sequential states.
  2. Or is given as multiple input on a single transition.
  3. Repetition is given as a transition to the same state.

Because this is a mechanical process, it can be done by a program such as lex.

Copyright © 2018 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.