We have looked at grep, which is able to search for patterns in its input. However, the patterns we have seen so far were all just simple text. grep, and tools like find, vim, sed and others, are able to search for more complex patterns called regular expressions, or simply regexes.
Regexes are somewhat similar to the "wild cards" that the shell recognizes, such as "*.txt" which matches all files ending in ".txt", except they are more powerful.
Regexes consist of text which is matched exactly, along with other characters which have special meanings for describing patterns. These special characters are covered below.
By default, grep treats many of the special characters for regexes as just regular text. To make grep treat them as regular expression symbols, we must pass the -E flag.
As a first example, we can consider the "|" symbol. Inside of a regex, this means "or", so if we want to match "alligator" or "crocodile", we could use this regex:
ifinlay@cpsc:~$ grep -E 'alligator|crocodile' /usr/share/dict/words
alligator
alligator's
alligators
crocodile
crocodile's
crocodiles
Here we are using grep to search a dictionary file which contains a list of nearly 100 thousand English words. Note that we are passing the -E flag to grep. Without it, grep would find nothing as it would be looking for | as a literal part of the search string.
Also note the use of single quotes around the regex. Those are necessary because otherwise, the shell would see the | symbol as a pipe and try to pass the output of grep to the "crocodile" command which sadly does not exist.
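The effect of the -E flag can be seen on a tiny input. The following is a minimal sketch, assuming GNU grep; the four sample lines are made up for illustration and stand in for the dictionary file.

```shell
# A small stand-in for the dictionary file (hypothetical sample data).
words=$(printf '%s\n' alligator crocodile lizard 'a|b')

# With -E, | means "or", so two lines match.
with_e=$(printf '%s\n' "$words" | grep -E 'alligator|crocodile')

# Without -E, | is a literal character, so the pattern 'a|b' only
# matches a line containing the exact text "a|b".
without_e=$(printf '%s\n' "$words" | grep 'a|b')

printf 'with -E:\n%s\n' "$with_e"
printf 'without -E:\n%s\n' "$without_e"
```

In both commands the single quotes keep the shell from treating | as a pipe.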
The '.' symbol matches any single character in a regex. So if we wanted to find all words which contain a 'z', then any character, followed by a second 'z', we could use:
ifinlay@cpsc:~$ grep -E 'z.z' /usr/share/dict/words
Azazel
Azazel's
Brzezinski
Brzezinski's
pizazz
pizzazz
Note that the "." matches an 'a' in most of the results, but matches an 'e' in "Brzezinski". Apparently no other words in this dictionary fit the pattern.
Which words consist of at least 20 characters?
ifinlay@cpsc:~$ grep -E '....................' /usr/share/dict/words
Andrianampoinimerina
Andrianampoinimerina's
counterintelligence's
counterrevolutionaries
counterrevolutionary
counterrevolutionary's
disenfranchisement's
electrocardiograph's
electroencephalogram
electroencephalogram's
electroencephalograms
electroencephalograph
electroencephalograph's
electroencephalographs
oversimplification's
transubstantiation's
uncharacteristically
Note that grep searches line by line. Only lines which contain the pattern above are matched; a pattern cannot span multiple lines.
Notice that grep will produce matches even in the middle of a word.
Anchors allow us to specify that a match should be anchored at a specific point, but do not consume a character.
Anchor | Meaning |
^ | Start of a line. |
$ | End of a line. |
\< | Start of a word. |
\> | End of a word. |
For instance, if we grep for 'x', we will get any line that contains an x anywhere in it. If we grep for '^x', we will match only those lines which start with x:
ifinlay@cpsc:~$ grep -E '^x' /usr/share/dict/words
x
xenon
xenon's
xenophobia
xenophobia's
xenophobic
xerographic
xerography
xerography's
xylem
xylem's
xylophone
xylophone's
xylophones
xylophonist
xylophonists
The following regex will search for words of four letters which both begin and end with 'a':
ifinlay@cpsc:~$ grep -E '^a..a$' /usr/share/dict/words
alga
aqua
area
aria
aura
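The word anchors \< and \> from the table behave differently from the line anchors. The following is a minimal sketch, assuming GNU grep (which supports \< and \> as an extension); the sample lines are hypothetical.

```shell
# Hypothetical sample lines standing in for real input.
lines=$(printf '%s\n' 'log(x)' 'convertToAnalog(x)' 'catalog' 'log')

# ^ and $ anchor to the whole line: only the line that is exactly "log".
whole=$(printf '%s\n' "$lines" | grep -E '^log$')

# \< anchors to the start of a word, so the "log" inside
# "convertToAnalog" and "catalog" is skipped.
word_start=$(printf '%s\n' "$lines" | grep -E '\<log')

# \> anchors to the end of a word: "log" must be followed by a
# non-word character or the end of the line, which is true on every line.
word_end=$(printf '%s\n' "$lines" | grep -E 'log\>')

printf 'whole line:\n%s\n' "$whole"
printf 'word start:\n%s\n' "$word_start"
printf 'word end:\n%s\n' "$word_end"
```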
What if we wanted to search for any word which both began and ended with an 'a'? We could attempt something like the following:
ifinlay@cpsc:~$ grep -E '(^a$)|(^aa$)|(^a.a$)|(^a..a$)|(^a...a$)' /usr/share/dict/words
a
aha
alga
aloha
alpha
ameba
aorta
aqua
area
arena
aria
aroma
atria
aura
And we could continue on for every case up until we had covered all possible words. Notice that, just as in math, parentheses are used for control of precedence in regular expressions. It would be better, however, to use one of the regular expression repetition operators, which say how many times the preceding element may occur:
Operator | Meaning |
* | Zero or more of the preceding element. |
+ | One or more of the preceding element. |
? | Zero or one of the preceding element. |
{N} | Match exactly N of the preceding element where N is an integer. |
With these, we can shorten our regular expression which finds words beginning and ending with 'a':
ifinlay@cpsc:~$ grep -E '^a.*a$' /usr/share/dict/words
We could also use the last form to simplify our search for words of at least 20 letters:
ifinlay@cpsc:~$ grep -E '.{20}' /usr/share/dict/words
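Two details of these operators are easy to miss, and a small experiment makes them visible. This is a minimal sketch with made-up sample words, assuming any grep that supports -E.

```shell
# Hypothetical sample words.
words=$(printf '%s\n' a aa aha area banana ab)

# ^a.*a$ needs two a's (one first, one last), so the one-letter
# word "a" does not match.
two_as=$(printf '%s\n' "$words" | grep -E '^a.*a$')

# Making the tail optional with ? also accepts the single "a".
any_length=$(printf '%s\n' "$words" | grep -E '^a(.*a)?$')

# Unanchored, .{3} means "at least 3 characters somewhere on the line".
at_least_3=$(printf '%s\n' "$words" | grep -E '.{3}')

# Anchored on both sides, it means "exactly 3 characters".
exactly_3=$(printf '%s\n' "$words" | grep -E '^.{3}$')

printf '%s\n' "$two_as" "--" "$any_length" "--" "$at_least_3" "--" "$exactly_3"
```

This is why '.{20}' finds words of at least 20 characters: without anchors, the 20 characters may sit anywhere inside a longer line.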
What if we want to actually search for one of the characters with special meaning? For example, if we want to search for ellipses in a paper we wrote, we could try:
ifinlay@cpsc:~$ grep -E '...' paper.txt
However, this will match every line which has at least three consecutive characters on it.
In order to actually match a regex operator literally, we "escape it" with a backslash (\):
ifinlay@cpsc:~$ grep -E '\.\.\.' paper.txt
This allows us to selectively decide whether to treat the operators as literal characters or as regex operators.
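The difference between the escaped and unescaped patterns can be checked on a few lines. The sketch below uses hypothetical sample text in place of paper.txt. It also shows grep's -F flag, which treats the whole pattern as fixed text; that is an alternative not covered above, useful when nothing in the pattern should be an operator.

```shell
# Hypothetical sample text standing in for paper.txt.
text=$(printf '%s\n' 'Wait...' 'abc' 'a.b')

# Unescaped: every line with at least three characters matches.
unescaped_count=$(printf '%s\n' "$text" | grep -cE '...')

# Escaped: only three literal dots in a row match.
escaped=$(printf '%s\n' "$text" | grep -E '\.\.\.')

# -F disables regex operators entirely, so no escaping is needed.
fixed=$(printf '%s\n' "$text" | grep -F '...')

printf 'unescaped matches %s lines\n' "$unescaped_count"
printf 'escaped: %s\n' "$escaped"
printf 'fixed:   %s\n' "$fixed"
```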
Below are some other escape sequences that are useful:
Escape Sequence | Meaning |
\s | Any whitespace. |
\S | Anything but whitespace. |
\w | A "word" character (a letter, digit, or underscore). |
\W | A "non-word" character (such as punctuation or space). |
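These escape sequences can be tried out on a handful of lines. The following is a minimal sketch, assuming GNU grep (where \s, \w and \W are supported as extensions); the sample lines are invented.

```shell
# Hypothetical sample lines that differ only in the joining character.
lines=$(printf '%s\n' 'foo bar' 'foo_bar' 'foo-bar' 'foobar')

# \s matches the space between the two words.
space_join=$(printf '%s\n' "$lines" | grep -E 'foo\sbar')

# \w matches a letter, digit, or underscore, so only "foo_bar" fits.
# "foobar" does not, since \w must consume one character of its own.
word_join=$(printf '%s\n' "$lines" | grep -E 'foo\wbar')

# \W matches anything that is not a word character: space and hyphen.
nonword_join=$(printf '%s\n' "$lines" | grep -E 'foo\Wbar')

printf '%s\n' "$space_join" "--" "$word_join" "--" "$nonword_join"
```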
If we want to match any decimal digit, we could do the following:
(0|1|2|3|4|5|6|7|8|9)
However, a simpler way to do this is with a character class. The following regex will match any digit as well:
[0123456789]
We can also use a range:
[0-9]
We can use ranges of letters as well. The following will find all words which both begin and end with a capital letter:
ifinlay@cpsc:~$ grep -E '^[A-Z].*[A-Z]$' /usr/share/dict/words
AOL
BMW
FDR
FNMA
GE
GTE
IBM
JFK
LBJ
LyX
MCI
MGM
MIT
MiG
NORAD
OHSA
OK
PhD
RCA
TWA
UCLA
Oftentimes we want to match any character except for one or two exceptions. Rather than list all the possibilities, we can list the exceptions. The regex below matches any vowel:
[aeiou]
The regex below matches anything except for a vowel.
[^aeiou]
The caret as the first character after the opening bracket here means the character class is inverted.
To search for words that contain a q followed by a character other than u, we could use the following:
ifinlay@cpsc:~$ grep -iE 'q[^u]' /usr/share/dict/words
Chongqing
Compaq's
Esq's
Iqaluit
Iqaluit's
Iqbal
Iqbal's
Iraq's
Iraqi
Iraqi's
Iraqis
Q's
Qaddafi
Qaddafi's
Qantas
Qantas's
Qatar
Qatar's
Qingdao
Qiqihar
Qiqihar's
Qom
Qom's
Sq's
Urumqi
Notice we are using the '-i' ignore case option here. Otherwise, we would not have gotten the words with a capital 'Q'.
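One subtlety of an inverted class is that it still has to match some character, which is why "Iraq's" appears in the output above but a bare "Iraq" does not. The sketch below shows this on made-up sample words and offers one possible way to also accept a q at the end of a line.

```shell
# Hypothetical sample words.
words=$(printf '%s\n' Iraq "Iraq's" queen Qatar)

# [^u] must consume a character, so a q at the very end of the
# line ("Iraq") cannot match.
needs_char=$(printf '%s\n' "$words" | grep -iE 'q[^u]')

# Allowing either a non-u character or the end of the line
# picks up "Iraq" as well.
or_end=$(printf '%s\n' "$words" | grep -iE 'q([^u]|$)')

printf '%s\n' "$needs_char" "--" "$or_end"
```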
A back reference allows us to reference some portion of a regex later on in the same regex. To reference some portion of a regex, the portion to reference must be enclosed in parentheses.
The back reference itself is a backslash followed by a number. The number refers to which parenthesized portion we are referencing.
For example, in the following regex:
'(.)(.)(.)\3\2\1'
\1 refers to the text matched by the regex inside the first set of parentheses, \2 refers to the second and \3 refers to the third. A subset of the output of this regex on the dictionary is:
ifinlay@cpsc:~$ grep -E '(.)(.)(.)\3\2\1' /usr/share/dict/words | head
Brenner
Brenner's
Chattahoochee
Chattahoochee's
assesses
braggart
braggart's
braggarts
cassettes
collocate
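Smaller back-reference patterns may be easier to follow than the six-character one. The following is a minimal sketch, assuming GNU grep (back references with -E are an extension); the sample words are chosen for illustration.

```shell
# Hypothetical sample words.
words=$(printf '%s\n' letter level racecar assesses cat)

# (.)\1 matches any character immediately followed by itself,
# such as the "tt" in letter or the "ss" in assesses.
doubled=$(printf '%s\n' "$words" | grep -E '(.)\1')

# (.)(.).\2\1 matches a five-character stretch that reads the same
# in both directions: "level", the "aceca" in racecar, and the
# "ssess" in assesses.
mirrored=$(printf '%s\n' "$words" | grep -E '(.)(.).\2\1')

printf '%s\n' "$doubled" "--" "$mirrored"
```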
Vim supports regular expressions in its search as well.
Vim regexes differ from those of grep -E in that the following symbols are literal by default and need to be escaped with a backslash in order to be operators: +, ?, |, (, ), and {.
The following do not need to be escaped, however: ., *, ^, $, and [ ].
Vim can also ignore case when doing searches. To get this behavior, run:
:set ic
Sed and Vim substitutions can also use regular expressions, including back references, to great effect.
Using a back reference in the replace portion is especially powerful. For example, say we are writing a program where we are setting some class variable directly as in the following line:
thing.property = 10;
Suppose we wanted to replace this with a member function (which is considered to be better style by most) so that the above code might look like this:
thing.setProperty(10);
Suppose we had dozens or hundreds of lines to change in this way. This can't be done without a regular expression, since we don't know in advance what value is being set. It's 10 here, but could be different on each line. It might not even be a single numeric value, but a whole expression.
Instead, we could use the following substitute command:
:%s/\.property\s*=\s*\([^;]\+\);/.setProperty(\1);/g
Regular expressions are easier to write than they are to read, but this regex pattern is constructed from the following parts:
\.property - The literal word .property (the . is escaped so it matches a literal .)
\s*=\s* - An equal sign with optional space on either side.
\([^;]\+\) - The value assigned to property, which is one or more characters other than a ';'. They are inside parentheses so we can back-reference them later.
; - A literal semicolon at the end.
Then the replacement portion .setProperty(\1); simply is the call to setProperty with whatever value was being assigned inside the parentheses.
Because Vim substitute commands share the same syntax as sed commands, we can use sed to perform this substitution across multiple files at the same time.
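As a rough illustration, the same substitution might look like this in sed. This is a sketch rather than a tested recipe for any particular code base: it uses sed's -E flag so that parentheses and + need no backslashes, writes whitespace as the POSIX class [[:space:]] instead of \s, and runs on invented sample lines instead of real files.

```shell
# Hypothetical source lines to transform.
code=$(printf '%s\n' 'thing.property = 10;' 'other.property=a + b;' 'x = 3;')

# -E selects extended regexes, so ( ) and + are operators as written.
# [[:space:]]* allows optional whitespace around the equal sign.
result=$(printf '%s\n' "$code" |
  sed -E 's/\.property[[:space:]]*=[[:space:]]*([^;]+);/.setProperty(\1);/g')

printf '%s\n' "$result"
```

To change files on disk instead of a pipeline, sed's -i option edits in place; giving it a suffix, as in -i.bak, keeps a backup copy of each file, which is a sensible precaution when rewriting many files at once.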
Vim also allows us to use '&' in the substitution portion of a substitute command which refers to the entire text which was matched in the regex portion. This is used in the following substitution which comments out all lines containing a call to a "log" function:
:%s/^.*\<log(.*$/\/\/&
This regex contains:
^.* - The beginning of the line, followed by any number of characters.
\<log( - The call to log(, anchored to the beginning of a word so as not to catch "convertToAnalog(".
.*$ - Any number of characters followed by the end of the line.
The replacement portion is \/\/&. Here each \/ is an escaped forward slash, which is escaped because otherwise it would mark the end of the substitute command. With two, it makes a C++ or Java comment (use #& for Python). The & then expands to the entire text matched by the regex above. So it replaces each matching line with itself, preceded by //.
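The & replacement works in sed too. The sketch below assumes GNU sed (for the \< word anchor) and uses made-up sample lines. It also picks | as the delimiter of the substitute command, a design choice that avoids having to escape the slashes in //.

```shell
# Hypothetical source lines.
code=$(printf '%s\n' '  log("start");' 'convertToAnalog(x);' 'y = log(2);')

# With | as the delimiter, // can be written without backslashes.
# In sed's default (basic) regex syntax an unescaped ( is literal,
# just as it is in Vim.
result=$(printf '%s\n' "$code" | sed 's|^.*\<log(.*$|//&|')

printf '%s\n' "$result"
```

The line calling convertToAnalog is left alone because its "log" does not start a word.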
We can also do a regular expression search and replace to rename files with the rename command. This command takes a substitution command, and a set of files, and applies the substitution to the file names.
For instance, if we wish to take a set of files and add "-backup" between the name and the extension, we might use this command:
ifinlay@cpsc:~/temp$ ls
input.txt output.txt program.py
ifinlay@cpsc:~/temp$ rename 's/([^.]*)\.(.*)/\1-backup.\2/' *
ifinlay@cpsc:~/temp$ ls
input-backup.txt output-backup.txt program-backup.py
Here the regular expression we are matching is:
([^.]*) - This is the main part of the file name and consists of zero or more characters which are not '.', which does not have to be escaped in a character class. It is in parentheses so it can be referenced as \1.
\. - The '.' between the file name and the extension.
(.*) - Whatever characters comprise the extension. It is in parentheses so it can be referenced as \2.
We then specify the new name as \1-backup.\2 which is the original file name suffixed with "-backup", then the '.', and then the original extension.
rename has a very helpful "-n" flag which makes rename just tell you what changes it would make without actually renaming anything. I recommend using this first to see what will happen.
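The same kind of preview can be approximated without rename at all, which may help on systems where a different rename program is installed. The following sketch is not how rename works internally; it simply feeds each name through sed with the same substitution and prints the result. The file names are hypothetical and nothing on disk is touched.

```shell
# Build a preview of old and new names without renaming anything.
preview=$(
  for f in input.txt output.txt program.py archive.tar.gz; do
    # Apply the same substitution to the name as text.
    new=$(printf '%s\n' "$f" | sed -E 's/([^.]*)\.(.*)/\1-backup.\2/')
    printf '%s -> %s\n' "$f" "$new"
  done
)

printf '%s\n' "$preview"
```

The last name shows what happens with two dots: [^.]* stops at the first '.', so "-backup" lands before ".tar.gz".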
Regular expressions can look incomprehensible at first, but they are easier to write than read. Writing your first ones can be frustrating as a simple error can be hard to find.
Adding regexes as a tool will be worth it in the long run, however, as they can perform in a few lines what would otherwise be long and tedious tasks.
Copyright © 2024 Ian Finlayson | Licensed under a Creative Commons BY-NC-SA 4.0 License.