Home CPSC 401

Using Flex

What is Flex?

Writing a lexical analyzer by hand is tedious and error-prone. A state machine to match all lexemes in a real language will have many states. Flex makes this process easier by automatically producing a lexical analyzer from a specification.

Flex is an improved version of Lex which was originally written in the early 70's for UNIX systems. Flex is a newer, open source, program that adds support for C++.


Overview

Flex refers to both a meta-language for defining a lexer as well as a tool that transforms that language into C (or C++) code.

Flex is often used in conjunction with bison which is a parser generator, but may be used on its own as well.


Flex Input File

Flex input files are broken into four basic sections:

  1. C code header section which will be copied into the top of the C file. It's used for includes, global variables and function declarations.
  2. The definitions section which defines values and options for flex.
  3. The rules section which contains regular expressions describing patterns to be matched. Along with each rule is an action which is a snippet of C code that will be run when the pattern is matched.
  4. The C code body which is copied into the bottom of the C file. This is used for function definitions including main.

The header is at the top of the file enclosed inside %{ and %} symbols. The other three sections are separated by %% symbols:


%{
/* C code header */
%}

/* definitions */

%%

/* rules */

%%

/* C code body */

Definitions

Flex definitions define names and corresponding regular expressions. They are used for defining an expression in one place and using it in multiple places, or for breaking an expression into pieces. The form of the definitions is the name, followed by whitespace, followed by the regular expression. For example:


DIGIT [0-9]

You do not need any definitions in this section.

One handy option is:

%option noyywrap

This means you do not need to provide a "wrapping" function that can be used for reading multiple files of input. This is rarely useful, so the option tells flex to not look for this function.


Rules

The rules section contains regular expressions matched with C code that is executed whenever the regular expression is matched. The syntax for this is the regular expression on the left followed by the action enclosed in curly braces:


[a-z]    {printf("A letter!");}
{DIGIT}+ {printf("A Number!");}

Flex Regular Expression Symbols

""Anything in quotes is matched exactly.
[]Characters in brackets match any expression containing any of the characters in brackets. For example [abc] matches one a, one b, or one c.
[^]If there is a ^ character after the first bracket, it matches any character except those in brackets. For example, [^abc] matches any character except a, b, or c.
.Matches any character except the newline.
\nMatches a newline.
^Matches the beginning of a line.
$Matches the end of a line.
*Matches zero or more copies of the preceding expression.
+Matches one or more copies of the preceding expression.
?Matches zero or one copy of the preceding expression.
|Matches either the preceding expression or the following one.
()Parenthesis are used for grouping operators. For example a(bc|de) matches abc or ade.
\*A iteral * character.
\"A literal " character.
\^A literal ^ character.

Regular Expression Examples


Actions

Actions can consist of any C code. They can set variables, call print statements, and anything else.

They can also reference a few special variables:

Ambiguous Patterns

Consider the "+=" operator in C++ and Java. The lexer has to know that this is a single lexeme rather than the two lexemes "+" and "=". Flex handles this case by always matching the longest pattern it can. If our lexer has patterns for "+=", "+" and "=", flex will choose to match "+=" because it is one character longer.


"+"      {return PLUS;}
"="      {return EQUALS;}
"+="     {return PLUSEQUALS;}

Consider keywords such as "if". This matches the rules for an identifier, so how if we have a flex files like this:


"if"        {return IF;}
[a-zA-Z]+   {return IDENTIFIER;}

Then "if" matches both patterns. Also, the patterns are the same length. Flex makes this decision based on which pattern appears first. So it will match "if" to the keyword and not the identifier.


Example: wc

As a simple flex example, we can emulate the "wc" UNIX program which counts the number of words, characters and lines in a file. This program is very easily written in flex:


%{
#include <stdio.h>
#include <string.h>

int chars = 0;
int words = 0;
int lines = 0;
%}

%option noyywrap

%%

[a-zA-Z]+   {words++; chars += yyleng;}
\n          {chars++; lines++;}
.           {chars++;}

%%

int main() {
  yylex();
  printf("Characters: %d\nWords: %d\nLines: %d\n", chars, words, lines);
  return 0;
}

The actions for the rules simply increment counters when they see words, lines and characters. To compile and run this program, we must first run flex to generate the C code, then compile the C code, and finally run the program:


flex wc.l
gcc lex.yy.c
a.out < input-file

Example: Rot13

As another example, we can write a Rot13 encoder/decoder. Rot13 is an extremely simple encoding, where we rotate each letter 13 positions. Because there are 26 letters, if we apply Rot13 twice, we get back to where we started. Here is the Flex code for this:


%{
#include <stdio.h>
%}

%option noyywrap

%%
[a-mA-M]  {printf("%c", yytext[0] + 13);}
[n-zN-Z]  {printf("%c", yytext[0] - 13);}
.         {printf("%s", yytext);}
%%

int main() {
  yylex();
  return 0;
}

yylex return value

In your interpreter assignment, you will need to have yylex return values to the main program. The values will be codes indicating which type of token was matched. This is done by placing return statements in the actions of rules.

For example:


%{
#define DIGIT 100
%}

%%
[0-9]+    {return DIGIT;}

/* more rules */

%%
int main() {
  int token;
  do {
    toekn = yylex();
    
    /* handle each type of token */

  } while(token != 0);

  return 0;
}
yylex will automatically return 0 when it encounters the end of the input fille.

Conclusion

Flex is an extremely powerful tool that greatly simplifies the task of lexical analysis. It is used for compilers and interpreters, but is also useful for programs that must perform complex input reading.

Copyright © 2018 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.