
Regular Expressions

 

Overview

We have looked at grep, which can search for patterns in its input. However, the patterns we have seen so far were all just simple text. grep, along with tools such as find, vim and sed, can also search for more complex patterns called regular expressions, or simply regexes.

Regexes are somewhat similar to the "wild cards" that the shell recognizes, such as "*.txt" which matches all files ending in ".txt", except that they are more powerful.

Regexes consist of text which is matched exactly, along with other characters which have special meanings for describing patterns. These special characters are covered below.


 

Using Regular Expressions with grep

By default, grep treats many of the special characters for regexes as just regular text. To make grep treat them as regular expression symbols, we must pass the -E flag.

As a first example, we can consider the "|" symbol. Inside of a regex, this means "or", so if we want to match "alligator" or "crocodile", we could use this regex:

ifinlay@cpsc:~$ grep -E 'alligator|crocodile' /usr/share/dict/words
alligator
alligator's
alligators
crocodile
crocodile's
crocodiles

Here we are using grep to search a dictionary file which contains a list of nearly 100 thousand English words. Note that we are passing the -E flag to grep. Without it, grep would find nothing as it would be looking for | as a literal part of the search string.

Also note the use of single quotes around the regex. Those are necessary because otherwise, the shell would see the | symbol as a pipe and try to pass the output of grep to the "crocodile" command which sadly does not exist.
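
Both points can be checked on a small, made-up word list instead of the full dictionary. This is only a sketch: the three sample words are invented for illustration, and the counts assume a grep that follows the usual basic/extended regex rules (such as GNU grep).

```shell
# A small, made-up word list stands in for /usr/share/dict/words.
words=$(printf '%s\n' alligator crocodile lizard)

# With -E, | means "or", so two of the three lines match.
# -c prints the number of matching lines instead of the lines themselves.
with_E=$(printf '%s\n' "$words" | grep -E -c 'alligator|crocodile')

# Without -E, | is a literal character, so no line matches.
# grep exits with a non-zero status when nothing matches, hence "|| true".
without_E=$(printf '%s\n' "$words" | grep -c 'alligator|crocodile' || true)

echo "with -E: $with_E, without -E: $without_E"
```

The single quotes keep the shell from treating | as a pipe in both commands.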


 

Matching any Character

The '.' symbol matches any single character in a regex. So if we wanted to find all words which contain a 'z', then any character, followed by a second 'z', we could use:

ifinlay@cpsc:~$ grep -E 'z.z' /usr/share/dict/words
Azazel
Azazel's
Brzezinski
Brzezinski's
pizazz
pizzazz

Note that the "." matches an 'a' in most of the results, but matches an 'e' in "Brzezinski". There are apparently no other words in this dictionary which fit this pattern.

What words consist of at least 20 characters?

ifinlay@cpsc:~$ grep -E '....................' /usr/share/dict/words
Andrianampoinimerina
Andrianampoinimerina's
counterintelligence's
counterrevolutionaries
counterrevolutionary
counterrevolutionary's
disenfranchisement's
electrocardiograph's
electroencephalogram
electroencephalogram's
electroencephalograms
electroencephalograph
electroencephalograph's
electroencephalographs
oversimplification's
transubstantiation's
uncharacteristically

Note that grep searches line by line. Only lines which contain the pattern above are matched; a pattern cannot span multiple lines.
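
The line-by-line behavior can be seen with a few made-up input lines. In this sketch, the first line ends in 'z' and the second begins with 'z', yet 'z.z' does not match across the line break. The sample words are assumptions chosen for illustration.

```shell
# Each printf argument becomes one input line.
# "quiz" ends in z and "zebra" starts with z, but the '.' cannot match
# the line break between them, so only "pizazz" is printed.
matches=$(printf '%s\n' 'quiz' 'zebra' 'pizazz' | grep -E 'z.z')
echo "$matches"
```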


 

Anchors

Notice that grep will produce matches even in the middle of a word.

Anchors allow us to specify that a match should be anchored at a specific point, but do not consume a character.

Anchor  Meaning
^       Start of a line.
$       End of a line.
\<      Start of a word.
\>      End of a word.

For instance, if we grep for 'x', we will get any line that contains an x anywhere in it. If we grep for '^x', we will match only those lines which start with x:

ifinlay@cpsc:~$ grep -E '^x' /usr/share/dict/words
x
xenon
xenon's
xenophobia
xenophobia's
xenophobic
xerographic
xerography
xerography's
xylem
xylem's
xylophone
xylophone's
xylophones
xylophonist
xylophonists

The following regex will search for words of four letters which both begin and end with 'a':

ifinlay@cpsc:~$ grep -E '^a..a$' /usr/share/dict/words
alga
aqua
area
aria
aura

 

Repetition

What if we wanted to search for any word which both began and ended with an 'a'? We could attempt something like the following:

ifinlay@cpsc:~$ grep -E '(^a$)|(^aa$)|(^a.a$)|(^a..a$)|(^a...a$)' /usr/share/dict/words
a
aha
alga
aloha
alpha
ameba
aorta
aqua
area
arena
aria
aroma
atria
aura

And we could continue on for every case until we had covered all possible word lengths. Notice that, just as in math, parentheses are used to control precedence in regular expressions. It would be better, however, to use one of the regular expression repetition operators:

Operator  Meaning
*         Zero or more of the preceding element.
+         One or more of the preceding element.
?         Zero or one of the preceding element.
{N}       Exactly N of the preceding element, where N is an integer.

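With these, we can shorten our regular expression which finds words beginning and ending with 'a' (note that this version requires at least two characters, so it no longer matches the one-letter word "a"):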

ifinlay@cpsc:~$ grep -E '^a.*a$' /usr/share/dict/words

We could also use the last form to simplify our search for words of at least 20 characters. Because the pattern is not anchored, any line with 20 or more characters contains a match:

ifinlay@cpsc:~$ grep -E '.{20}' /usr/share/dict/words
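
The remaining operators can be compared side by side on a tiny, made-up list of spellings. This is a minimal sketch; the three spellings (one of them deliberately misspelled) are assumptions for illustration.

```shell
# Three made-up spellings, one per line.
spellings=$(printf '%s\n' color colour colouur)

# ? allows zero or one 'u'.
optional_u=$(printf '%s\n' "$spellings" | grep -E '^colou?r$')

# + requires one or more 'u'.
one_or_more_u=$(printf '%s\n' "$spellings" | grep -E '^colou+r$')

# {2} requires exactly two 'u' characters.
exactly_two_u=$(printf '%s\n' "$spellings" | grep -E '^colou{2}r$')

echo "?   : $(echo $optional_u)"
echo "+   : $(echo $one_or_more_u)"
echo "{2} : $(echo $exactly_two_u)"
```

Each operator applies only to the element directly before it, here the single letter 'u'.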

 

Escaping

What if we want to actually search for one of the characters with special meaning? For example, if we want to search for ellipses in a paper we wrote, we could try:

ifinlay@cpsc:~$ grep -E '...' paper.txt

However, this will match every line which has at least three characters on it.

In order to actually match a regex operator literally, we "escape it" with a backslash (\):

ifinlay@cpsc:~$ grep -E '\.\.\.' paper.txt

This allows us to selectively decide whether to treat the operators as literal characters or as regex operators.

Below are some other escape sequences that are useful:

Escape Sequence  Meaning
\s               Any whitespace character.
\S               Any character other than whitespace.
\w               A "word" character (a letter, digit, or underscore).
\W               A "non-word" character (such as punctuation or a space).
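
These escapes can be tried on a handful of made-up lines. The sketch below assumes GNU grep, which supports \s and \w as extensions; the sample lines are invented for illustration.

```shell
# Four made-up lines; the third contains two consecutive spaces.
lines=$(printf '%s\n' 'Wait...' 'a.b.c' 'two  spaces' 'snake_case')

# Escaped dots match only three literal dots in a row.
literal_dots=$(printf '%s\n' "$lines" | grep -E '\.\.\.')

# Two whitespace characters between two word characters.
double_space=$(printf '%s\n' "$lines" | grep -E '\w\s\s\w')

# Lines made up entirely of word characters; the underscore counts as one.
only_word_chars=$(printf '%s\n' "$lines" | grep -E '^\w+$')

echo "$literal_dots"
echo "$double_space"
echo "$only_word_chars"
```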

 

Character Classes

If we want to match any decimal digit, we could do the following:

(0|1|2|3|4|5|6|7|8|9)

However, a simpler way to do this is with a character class. The following regex will match any digit as well:

[0123456789]

We can also use a range:

[0-9]

We can use ranges of letters as well. The following will find all words which both begin and end with a capital letter:

ifinlay@cpsc:~$ grep -E '^[A-Z].*[A-Z]$' /usr/share/dict/words
AOL
BMW
FDR
FNMA
GE
GTE
IBM
JFK
LBJ
LyX
MCI
MGM
MIT
MiG
NORAD
OHSA
OK
PhD
RCA
TWA
UCLA
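
Character classes combine naturally with the repetition operators. The sketch below matches a capital letter followed by one or more digits, using made-up identifiers. It sets LC_ALL=C for grep as a precaution, since the meaning of a range such as [A-Z] can depend on the locale.

```shell
# Made-up identifiers, one per line.
ids=$(printf '%s\n' 'A42' 'b7' 'AB1' '42A')

# One capital letter, then one or more digits, and nothing else on the line.
matches=$(printf '%s\n' "$ids" | LC_ALL=C grep -E '^[A-Z][0-9]+$')
echo "$matches"
```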

 

Inverted Character Classes

Oftentimes we want to match any character except for one or two exceptions. Rather than list all the possibilities, we can list the exceptions. The regex below matches any vowel:

[aeiou]

The regex below matches any single character except for one of those vowels.

[^aeiou]

The caret as the first character after the opening bracket here means the character class is inverted.

To search for words that contain a q followed by any character other than u, we could use the following:

ifinlay@cpsc:~$ grep -iE 'q[^u]' /usr/share/dict/words
Chongqing
Compaq's
Esq's
Iqaluit
Iqaluit's
Iqbal
Iqbal's
Iraq's
Iraqi
Iraqi's
Iraqis
Q's
Qaddafi
Qaddafi's
Qantas
Qantas's
Qatar
Qatar's
Qingdao
Qiqihar
Qiqihar's
Qom
Qom's
Sq's
Urumqi

Notice we are using the '-i' ignore case option here. Otherwise, we would not have gotten the words with a capital 'Q'.
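
Because [^u] accepts any character that is not a u, the results above include entries such as "Iraq's", where the q is followed by an apostrophe. If only letters should count, one option is to list the allowed letters as ranges instead. This is a sketch on a few words taken from the output above plus one made-up counterexample ("quiet"); LC_ALL=C is set so that the ranges cover exactly the ASCII letters.

```shell
words=$(printf '%s\n' "Iraq's" 'Iraqi' 'Qatar' 'quiet')

# q followed by any character other than u (apostrophes included).
any_char=$(printf '%s\n' "$words" | LC_ALL=C grep -iE 'q[^u]')

# q followed by a letter other than u: a through t, or v through z.
letter_only=$(printf '%s\n' "$words" | LC_ALL=C grep -iE 'q[a-tv-z]')

echo "q[^u]     : $(echo $any_char)"
echo "q[a-tv-z] : $(echo $letter_only)"
```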


 

Back References

A back reference allows us to reference some portion of a regex later on in the same regex. To reference some portion of a regex, the portion to reference must be enclosed in parentheses.

The back reference itself is a backslash followed by a number. The number refers to which parenthesized portion we are referencing.

For example, in the following regex:

'(.)(.)(.)\3\2\1'

\1 refers to the text matched by the regex inside the first set of parentheses, \2 refers to the second and \3 refers to the third. A subset of the output of this regex on the dictionary is:

ifinlay@cpsc:~$ grep -E '(.)(.)(.)\3\2\1' /usr/share/dict/words | head 
Brenner
Brenner's
Chattahoochee
Chattahoochee's
assesses
braggart
braggart's
braggarts
cassettes
collocate
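
Smaller back references are easier to follow. The sketch below uses a made-up word list to find doubled letters and four-letter palindromes. It assumes a grep that accepts back references with -E, as GNU grep does.

```shell
words=$(printf '%s\n' noon book deed cat abba)

# (.)\1 matches any character followed immediately by the same character.
doubled=$(printf '%s\n' "$words" | grep -E '(.)\1')

# Two captured characters followed by the same two in reverse order,
# anchored so the whole line must be a four-letter palindrome.
palindromes=$(printf '%s\n' "$words" | grep -E '^(.)(.)\2\1$')

echo "doubled letters : $(echo $doubled)"
echo "palindromes     : $(echo $palindromes)"
```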

 

Vim Regular Expressions

Vim supports regular expressions in its search as well.

Vim regexes differ from those of grep -E in that the following symbols are literal by default and need to be escaped in order to be operators:

+ ? | ( ) {

The following do not need to be escaped, however:

. * [ ] ^ $

Vim can also ignore case when doing searches. To get this behavior, run:

:set ic

 

Sed and Vim Substitutions

Sed and Vim substitutions can also use regular expressions, including back references, to great effect.

Using a back reference in the replace portion is especially powerful. For example, say we are writing a program where we are setting some class variable directly as in the following line:

thing.property = 10;

Suppose we wanted to replace this with a member function (which is considered to be better style by most) so that the above code might look like this:

thing.setProperty(10);

Suppose we had dozens or hundreds of lines to change in this way. This can't be done with a plain search and replace, since we don't know in advance what value is being set. It's 10 here, but it could be different on each line. It might not even be a single numeric value, but a whole expression.

Instead, we could use the following substitute command:

:%s/\.property\s*=\s*\([^;]\+\);/.setProperty(\1);/g

Regular expressions are easier to write than they are to read, so it helps to break this pattern into its parts:

  1. \.property - The literal word .property (the . is escaped so it matches a literal .)
  2. \s*=\s* - An equal sign with optional space on either side.
  3. \([^;]\+\) - The value assigned to property, which is one or more characters other than a ';'. They are inside parentheses so we can back-reference them later.
  4. ; - A literal semicolon at the end.

Then the replacement portion .setProperty(\1); is simply the call to setProperty with whatever value was being assigned inside the parentheses.

Because Vim substitute commands share the same syntax as sed commands, we can use sed to perform this substitution across multiple files at the same time.
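
As a rough sketch of what that could look like, the commands below apply the same substitution to two throwaway files. The file names and their contents are made up for illustration. The pattern is written for sed's -E (extended) mode, so the parentheses and + need no backslashes, and [[:space:]] stands in for \s. The -i flag is used as GNU sed accepts it; BSD sed expects an argument after -i.

```shell
# Work in a throwaway directory with two hypothetical source files.
dir=$(mktemp -d)
printf '%s\n' 'thing.property = 10;' > "$dir/a.cpp"
printf '%s\n' 'other.property=x + 1;' > "$dir/b.cpp"

# -E selects extended regexes; -i edits every listed file in place.
sed -E -i 's/\.property[[:space:]]*=[[:space:]]*([^;]+);/.setProperty(\1);/g' "$dir"/*.cpp

# Show the rewritten lines.
cat "$dir/a.cpp" "$dir/b.cpp"
```

Running the command without -i first prints the result instead of changing the files, which is a safer way to check the pattern.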

Vim also allows us to use '&' in the substitution portion of a substitute command which refers to the entire text which was matched in the regex portion. This is used in the following substitution which comments out all lines containing a call to a "log" function:

:%s/^.*\<log(.*$/\/\/&

This regex contains:

  1. ^.* - The beginning of the line, followed by any number of characters.
  2. \<log( - The call to log(, anchored to the beginning of a word so as not to catch "convertToAnalog(".
  3. .*$ - Any number of characters followed by the end of the line.

The substitution then is \/\/&. Each \/ is a forward slash, escaped because otherwise it would mark the end of the substitute command. Two of them start a C++ or Java comment (use #& for Python). The & then expands to the entire text matched by the regex above, so each matching line is replaced with itself preceded by //.
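
The & also works in sed. The sketch below runs the same substitution on three made-up lines of code. It uses | as the delimiter so the slashes need no escaping, and it assumes GNU sed, which supports the \< word anchor.

```shell
# Three made-up lines of code; only the first and third call log(.
commented=$(printf '%s\n' 'log("start");' 'x = convertToAnalog(y);' '  log(x);' \
  | sed 's|^.*\<log(.*$|//&|')
echo "$commented"
```

The second line is left alone because the "log(" inside "convertToAnalog(" does not begin a word.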


 

Renaming Files with Regular Expressions

We can also use a regular expression search and replace to rename files with the rename command (this is the Perl-based rename; some systems ship a different rename which does not accept substitutions). This command takes a substitution command and a set of files, and applies the substitution to each file name. For instance, if we wish to take a set of files and add "-backup" between the name and the extension, we might use this command:

ifinlay@cpsc:~/temp$ ls
input.txt  output.txt  program.py
ifinlay@cpsc:~/temp$ rename 's/([^.]*)\.(.*)/\1-backup.\2/' * 
ifinlay@cpsc:~/temp$ ls
input-backup.txt  output-backup.txt  program-backup.py

Here the regular expression we are matching is:

  1. ([^.]*) - The main part of the file name: zero or more characters which are not '.'. Inside a character class, the '.' does not have to be escaped. It is in parentheses so it can be referenced as \1.
  2. \. - The '.' between the file name and the extension.
  3. (.*) - Whatever characters comprise the extension. It is in parentheses so it can be referenced as \2.

We then specify the new name as \1-backup.\2 which is the original file name suffixed with "-backup", then the '.', and then the original extension.

rename has a very helpful "-n" flag which makes rename just tell you what changes it would make without actually renaming anything. I recommend using this first to see what will happen.
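
In the same spirit, the substitution can be tried on the names themselves without touching any files. The sketch below is not the rename command: it feeds the three file names from the example through sed -E with the same substitution and prints the old and new names.

```shell
# Preview the new names by applying the substitution to each name as text.
preview=$(for name in input.txt output.txt program.py; do
  new_name=$(printf '%s\n' "$name" | sed -E 's/([^.]*)\.(.*)/\1-backup.\2/')
  echo "$name -> $new_name"
done)
echo "$preview"
```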


 

Conclusion

Regular expressions can look incomprehensible at first, but they are easier to write than read. Writing your first ones can be frustrating as a simple error can be hard to find.

Adding regexes as a tool will be worth it in the long run, however, as they can perform in a few lines what would otherwise be long and tedious tasks.

Copyright © 2024 Ian Finlayson | Licensed under a Creative Commons BY-NC-SA 4.0 License.