Introducing regular expressions
We have all used file globbing with wildcard characters like * and ? as a means to select specific files or lines of data from a data stream. These tools are powerful and I use them many times a day. Yet, there are things that cannot be done with wildcards.
Regular expressions (regexes or REs) provide us with more complex and flexible pattern matching capabilities. Just as certain characters take on special meaning when using file globbing, REs also have special characters. There are two main types of regular expressions (REs), Basic Regular Expressions (BREs) and Extended Regular Expressions (EREs).
First, we need some definitions. There are many definitions for the term regular expressions, but many are dry and uninformative. Here are mine.
Regular Expressions are strings of literal characters and metacharacters that can be used as patterns by various Linux utilities to match strings of ASCII plain text data in a data stream. When a match occurs, it can be used to extract or eliminate a line of data from the stream, or to modify the matched string in some way.
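The three outcomes in that definition (extract a line, eliminate a line, modify the matched string) can be sketched with grep and sed. This is only an illustration: the sample data and the variable names are invented for this example and do not come from any real file.

```shell
# Hypothetical three-line data stream, invented for this illustration
stream='alpha 1
beta 2
alpha 3'

# Extract: keep only the lines that start with "alpha" (^ anchors the match to line start)
extracted=$(printf '%s\n' "$stream" | grep '^alpha')

# Eliminate: -v inverts the match, dropping the lines that start with "alpha"
eliminated=$(printf '%s\n' "$stream" | grep -v '^alpha')

# Modify: sed replaces the matched string and passes every other line through unchanged
modified=$(printf '%s\n' "$stream" | sed 's/^alpha/ALPHA/')

printf 'extracted:\n%s\n\n' "$extracted"
printf 'eliminated:\n%s\n\n' "$eliminated"
printf 'modified:\n%s\n' "$modified"
```

The same pattern, ^alpha, drives all three commands; only the action taken on a match differs.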
Basic Regular Expressions (BREs) and Extended Regular Expressions (EREs) are not significantly different in terms of functionality. (See the grep info page’s Section 3.6, "Basic vs. Extended Regular Expressions.") The primary difference is in the syntax used and how metacharacters are specified. In basic regular expressions, the metacharacters ?, +, {, |, (, and ) lose their special meaning. Instead, it is necessary to use the backslashed versions: \?, \+, \{, \|, \(, and \). The ERE syntax is believed by many to be easier to use.
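A short sketch of that syntax difference, assuming GNU grep (the backslashed \? form in a BRE is a GNU extension and may behave differently in other grep implementations). The sample words and variable names are invented for this example.

```shell
# Hypothetical sample data, invented for this illustration
sample='color
colour
colouur'

# BRE (plain grep): a bare ? is an ordinary character, so no line matches.
# grep exits non-zero when nothing matches, hence the "|| true".
bre_plain=$(printf '%s\n' "$sample" | grep -c 'colou?r' || true)

# BRE with the backslashed \? (GNU grep): the preceding u becomes optional
bre_escaped=$(printf '%s\n' "$sample" | grep -c 'colou\?r' || true)

# ERE (grep -E): the bare ? is the metacharacter, no backslash needed
ere=$(printf '%s\n' "$sample" | grep -Ec 'colou?r' || true)

echo "BRE plain: $bre_plain, BRE escaped: $bre_escaped, ERE: $ere"
```

With GNU grep, the first count is 0 and the other two are 2: both color and colour match once the ? is treated as "zero or one of the preceding character."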
Note: When I talk about regular expressions in a general sense, I usually mean to include both basic and extended regular expressions. If a distinction needs to be made, I will use the acronyms BRE for basic regular expressions or ERE for extended regular expressions.
Regular expressions (REs) take the concept of using metacharacters to match patterns in data streams much further than file globbing, and give us even more control over the items we select from a data stream. REs are used by various tools to parse a data stream to match patterns of characters in order to perform some transformation on the data.
Note: One general meaning of parse is to examine something by studying its component parts. For our purposes, we parse a data stream to locate sequences of characters that match a specified pattern.
Regular expressions have a reputation for being obscure and arcane incantations that only those with special wizardly sysadmin powers use. The single line of code below (which I used to transform a file that was sent to me into a usable form) would seem to confirm this:
$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/\]//g" -e "s/)//g" | awk '{print $1" "$2" <"$3">"}' > addresses.txt
This command pipeline appears to be an intractable sequence of meaningless gibberish to anyone without knowledge of regexes. It certainly seemed that way to me the first time I encountered something similar early in my career. As you will see, regexes are relatively simple once they are explained.
We can only begin to touch upon all of the possibilities opened to us by regular expressions in a single article (even in a single series). There are entire books devoted exclusively to regular expressions, so we will explore the basics in a series of articles here on Enable Sysadmin over the coming week. By the end, you will know just enough to get started with tasks common to sysadmins. Hopefully, you’ll be hungry to learn more on your own after that.
Note: This article is a slightly modified version of Chapter 6 from Volume 2 of my Linux book, Using and Administering Linux: Zero to SysAdmin, due out from Apress in late 2019.
David Both
David Both is an open source software and GNU/Linux advocate, trainer, writer, and speaker who lives in Raleigh, NC. He is a strong proponent of and evangelist for the "Linux Philosophy." David has been in the IT industry for over 50 years.