In Introducing regular expressions, I covered what they are and why they’re useful. Now let’s take a deeper look at how they’re created. Because GNU grep is one of the tools I use the most, and because it provides a more or less standardized implementation of regular expressions, I will use that set of expressions as the basis for this article. We will look at sed, another tool that uses regular expressions, in a later article.
The implementations of regular expressions in command-line tools such as grep and sed are line-based. A pattern created by a combination of one or more expressions is compared against each line of a data stream. When a match is made, an action is taken on that line as prescribed by the tool being used.
For example, when a pattern match occurs with grep, the usual action is to pass that line to STDOUT and discard lines that do not match the pattern. As we saw in Getting started with regular expressions: An example, the -v option reverses those actions, so that the lines with matches are discarded.
Each line of the data stream is evaluated on its own. Think of each data stream line as a record, where the tools that use regexes process one record at a time. When a match is made, an action defined by the tool in use is taken on the line that contains the matching string.
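As a minimal sketch of this record-at-a-time behavior, the following commands run grep against a made-up three-line data stream (the sample text and variable names are assumptions for illustration, not part of the lab file):

```shell
# Hypothetical three-line data stream, used only for illustration.
stream=$(printf '%s\n' 'alpha one' 'beta two' 'alpha three')

# Each line is tested on its own; lines containing "alpha" go to STDOUT.
matched=$(printf '%s\n' "$stream" | grep 'alpha')

# -v reverses the action, so the matching lines are discarded instead.
unmatched=$(printf '%s\n' "$stream" | grep -v 'alpha')

printf 'matched:\n%s\n' "$matched"
printf 'unmatched:\n%s\n' "$unmatched"
```

The first command should keep the two "alpha" lines, and the second should keep only the "beta" line.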
Regex building blocks
The following table lists the basic building block expressions and metacharacters implemented by the GNU grep command (and most other regex implementations), along with their descriptions. When used in a pattern, each of these expressions or metacharacters matches a single character in the data stream being parsed, except for the anchors, which match a position in the line rather than a character:
Expression | Description |
Literals: alphanumeric characters A-Z, a-z, 0-9 | All alphanumeric and some punctuation characters are considered literals. Thus the letter a in a regex will always match the letter "a" in the data stream being parsed. There is no ambiguity for these characters. Each literal character matches one and only one character. |
. (dot) | The dot (.) metacharacter is the most basic form of expression. It matches any single character in the position where it is encountered in a pattern. So the pattern b.g would match "big," "bigger," "bag," "baguette," and "bog," but not "dog," "blog," "hug," "lag," "gag," "leg," etc. |
[list of characters] Bracket expression | GNU grep calls this a bracket expression, and it is the same as a set for the Bash shell. The brackets enclose a list of characters to match for a single character location in the pattern. [abcdABCD] matches the letters "a," "b," "c," or "d" in either upper- or lowercase. [a-dA-D] specifies a range of characters that creates the same match. [a-zA-Z] matches the alphabet in upper- and lowercase. |
[:class name:] Character classes | This is a POSIX attempt at regex standardization. The class names are supposed to be obvious. For example, the [:alnum:] class matches all alphanumeric characters. Other classes are [:digit:], which matches any one digit 0-9, [:alpha:], [:space:], and so on. Note that there may be issues due to differences in the sorting sequences in different locales. Read the grep man page for details. |
^ and $ Anchors | These two metacharacters match the beginning and ending of a line, respectively. They are said to anchor the rest of the pattern to either the beginning or end of a line. The expression ^b.g would only match "big," "bigger," "bag," etc., as shown above, if they occur at the beginning of the line being parsed. The pattern b.g$ would match "big" or "bag" only if they occur at the end of the line, but not "bigger." |
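The building blocks in the table can be tried on a small word list before moving to the lab file. This is only a sketch: the word list and variable names are made up for illustration, and each command keeps the lines in which the pattern is found.

```shell
# Hypothetical word list, one entry per line, used only for illustration.
words=$(printf '%s\n' big bigger bag bog blog dog 'a big dog')

# Dot: any single character between "b" and "g".
dot=$(printf '%s\n' "$words" | grep 'b.g')

# Bracket expression: only "a" or "i" is allowed between "b" and "g".
bracket=$(printf '%s\n' "$words" | grep 'b[ai]g')

# Anchors: the match must start the line (^) or end the line ($).
at_start=$(printf '%s\n' "$words" | grep '^b.g')
at_end=$(printf '%s\n' "$words" | grep 'b.g$')

printf '%s\n' "dot:" "$dot" "bracket:" "$bracket" \
    "at_start:" "$at_start" "at_end:" "$at_end"
```

Note that "blog" and "dog" never appear in the results, "bog" drops out once the bracket expression restricts the middle character, and "a big dog" drops out as soon as either anchor is added.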
Let’s explore these building blocks before continuing on with some of the modifiers. The text file we will use for Experiment 3 is from a lab project I created for an old Linux class I used to teach. It was originally in a LibreOffice Writer odt file but I saved it to an ASCII text file. Most of the formatting of things like tables was removed, but the result is a long ASCII text file that we can use for this series of experiments.
Example: TOC entries
Let’s take a look at an example to explore what we’ve just learned. First, make the ~/testing directory your PWD (create it if you didn't already in the previous article in this series), and then download the sample file from GitHub.
[student@studentvm1 testing]$ wget https://raw.githubusercontent.com/opensourceway/reg-ex-examples/master/Experiment_6-3.txt
To begin, use the less command to look at and explore the Experiment_6-3.txt file for a few minutes to get an idea of its content.
Now, let’s use some simple grep expressions to extract lines from the input data stream. The Table of Contents (TOC) contains a list of projects and their respective page numbers in the PDF document. Let’s extract the TOC starting with lines ending in two digits:
[student@studentvm1 testing]$ grep "[0-9][0-9]$" Experiment_6-3.txt
This command is not really what we want. It displays all lines that end in two digits and misses TOC entries with only one digit. We'll look at how to deal with an expression for one or more digits in a later experiment. Looking at the whole file in less, we could do something like this:
[student@studentvm1 testing]$ grep "^Lab Project" Experiment_6-3.txt | grep "[0-9]$"
This command is much closer to what we want, but it is not quite there. We get some lines from later in the document that also match these expressions. If you study the extra lines and look at those in the complete document, you can see why they match while not being part of the TOC.
This command also misses TOC entries that do not start with "Lab Project." Sometimes this result is the best you can do, and it does give a better look at the TOC than we had before. We will look at how to combine these two grep instances into a single one in a later experiment.
Now, let’s modify this command a bit and use the POSIX expression. Note the double square brackets ([[ ]]) around it:
[student@studentvm1 testing]$ grep "^Lab Project" Experiment_6-3.txt | grep "[[:digit:]]$"
This command gives the same results as the previous attempt. Single brackets, as in [:digit:], generate an error message.
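To see that the range and the POSIX class behave the same way here, the two pipelines can be compared on a few made-up TOC-style lines. The sample lines below are assumptions for illustration and are not taken from Experiment_6-3.txt:

```shell
# Hypothetical TOC-style lines, used only for illustration.
toc=$(printf '%s\n' \
    'Lab Project 1: Introduction 3' \
    'Lab Project 2: Commands 12' \
    'Lab Project notes' \
    'Appendix 40')

# Range form: the line starts with "Lab Project" and ends in a digit.
with_range=$(printf '%s\n' "$toc" | grep "^Lab Project" | grep "[0-9]$")

# POSIX class form: the class name sits inside a bracket expression.
with_class=$(printf '%s\n' "$toc" | grep "^Lab Project" | grep "[[:digit:]]$")

printf 'range:\n%s\nclass:\n%s\n' "$with_range" "$with_class"
```

Both pipelines should select the same two "Lab Project" lines that end in a page number.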
Example: systemd
Let’s look for something different in the same file:
[student@studentvm1 testing]$ grep systemd Experiment_6-3.txt
This command lists all occurrences of "systemd" in the file. Try using the -i option to ensure that you get all instances, including those that start with uppercase letters (the official form of "systemd" is all lowercase). Or, you could change the literal expression to Systemd.
Count the number of lines containing the string systemd. I always use -i to ensure that all instances of the search expression are found regardless of case:
[student@studentvm1 testing]$ grep -i systemd Experiment_6-3.txt | wc
20 478 3098
As you can see, I have 20 lines, and you should have the same number.
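If only the line count is needed, the -c option of grep reports the number of matching lines directly, so the pipe to wc is optional. A small sketch on made-up sample text (the lines and variable names are assumptions for illustration):

```shell
# Hypothetical sample text, used only for illustration.
sample=$(printf '%s\n' \
    'systemd starts services' \
    'Systemd is sometimes capitalized' \
    'no match here')

# Count matching lines by piping to wc -l ...
count_wc=$(printf '%s\n' "$sample" | grep -i systemd | wc -l)

# ... or let grep count them itself with -c.
count_c=$(printf '%s\n' "$sample" | grep -ic systemd)

echo "wc -l: $count_wc, grep -c: $count_c"
```

Both approaches count lines rather than individual occurrences, so a line that contains the string twice is still counted once.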
Example: Metacharacters
Here is an example of matching a metacharacter: the left bracket ([). First, let’s try without doing anything special:
[student@studentvm1 testing]$ grep -i "[" Experiment_6-3.txt
grep: Invalid regular expression
This error occurs because [ is interpreted as a metacharacter. We need to escape this character with a backslash (\) so that it is interpreted as a literal character and not as a metacharacter:
[student@studentvm1 testing]$ grep -i "\[" Experiment_6-3.txt
Most metacharacters lose their special meaning when used inside bracket expressions:
- To include a literal ], place it first in the list.
- To include a literal ^, place it anywhere but first.
- To include a literal -, place it last.
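Both ways of matching special characters literally can be tried on a few made-up lines. This is a sketch under the placement rules given in the grep man page (] first, ^ anywhere but first, - last); the sample lines and variable names are assumptions for illustration:

```shell
# Hypothetical lines containing special characters, for illustration only.
lines=$(printf '%s\n' 'array[0]' 'a^b' 'x-y' 'plain text')

# Outside a bracket expression, escape [ so it is treated as a literal.
left_bracket=$(printf '%s\n' "$lines" | grep '\[')

# Inside a bracket expression: ] comes first, ^ is not first, - is last.
specials=$(printf '%s\n' "$lines" | grep '[]^-]')

printf 'left bracket:\n%s\nspecials:\n%s\n' "$left_bracket" "$specials"
```

The second pattern should select every line except "plain text," because each of the other lines contains at least one of the three listed characters.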
Repetition
Regular expressions can be modified using operators that let you specify zero, one, or more repetitions of a character or expression. These repetition operators are placed immediately following the literal character or metacharacter used in the pattern:
Operator | Description |
? | In regexes, the ? means zero or one occurrence of the preceding character. For example, drives? matches "drive" and "drives." This is a bit different from the behavior of ? in a glob. |
* | The character preceding the * will be matched zero or more times without limit. In this example, drives* matches "drive," "drives," and "drivesss." It does not match all of "driver," although a line containing "driver" is still selected by grep because it contains the string drive. Again, this is a bit different from the behavior of * in a glob. |
+ | The character preceding the + will be matched one or more times. The character must exist in the line at least once for a match to occur. As one example, drives+ matches "drives" and "drivesss" but not "drive" or "driver." |
{n} | This operator matches the preceding character exactly n times. The expression drives{2} matches "drivess" but not "drive," "drives," "drivesss," or any larger number of trailing "s" characters. However, because "drivesssss" contains the string drivess, a match occurs on that string, so the line would be a match by grep. |
{n,} | This operator matches the preceding character n or more times. The expression drives{2,} matches "drivess," "drivesss," and any larger number of trailing "s" characters, but not "drive" or "drives." Because "drivesssss" contains the string drivess, a match occurs. |
{,m} | This operator matches the preceding character no more than m times. The expression drives{,2} matches "drive," "drives," and "drivess," but not "drivesss" or any larger number of trailing "s" characters. Once again, because "drivesssss" contains the string drivess, a match occurs. |
{n,m} | This operator matches the preceding character at least n times, but no more than m times. The expression drives{1,3} matches "drives," "drivess," and "drivesss," but not "drivessss" or any larger number of trailing "s" characters. Once again, because "drivesssss" contains a matching string, a match occurs. |
As an example, run each of the following commands and examine the results carefully, so that you understand what is happening:
[student@studentvm1 testing]$ grep -E "files?" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives*" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives+" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives{2}" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives{2,}" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives{,2}" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives{2,3}" Experiment_6-3.txt
Be sure to experiment with these modifiers on other text in the sample file.
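Because grep selects a line whenever the pattern is found anywhere in it, the exact reach of each repetition operator is easier to see when the whole line must match. The sketch below uses the -x option for that purpose on a made-up word list; the -x option, the word list, and the helper function name are choices made for this illustration and are not part of the lab:

```shell
# Hypothetical word list, one entry per line, used only for illustration.
words=$(printf '%s\n' drive drives drivess drivesss driver)

# Helper (hypothetical name): -x selects only lines matched in their entirety.
whole() { printf '%s\n' "$words" | grep -Ex "$1"; }

zero_or_one=$(whole 'drives?')     # drive, drives
zero_or_more=$(whole 'drives*')    # drive, drives, drivess, drivesss
one_or_more=$(whole 'drives+')     # drives, drivess, drivesss
exactly_two=$(whole 'drives{2}')   # drivess
two_or_more=$(whole 'drives{2,}')  # drivess, drivesss

# Without -x, finding the pattern inside a longer string is enough,
# so "drivesss" is selected too because it contains "drivess".
substring=$(printf '%s\n' "$words" | grep -E 'drives{2}')

printf '%s\n' "exactly two:" "$exactly_two" "as substring:" "$substring"
```

Comparing the last two results shows the difference between what an operator matches and which lines grep selects.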
Metacharacter modifiers
There are still some interesting and important modifiers that we need to explore:
Modifier | Description |
\< | This special expression matches the empty string at the beginning of a word. The expression \<fun would match "fun" and "function," but not "refund." |
\> | This special expression matches the empty string at the end of a word, so it matches whether the word is followed by a space, by punctuation, or by the end of the line. So environment\> matches "environment" followed by a space, a comma, or a period, but not "environments" or "environmental." |
^ | In a bracket expression, this operator negates the list of characters when it is the first character in the list. Thus, while the expression [a-c] matches "a," "b," or "c" in that position of the pattern, the expression [^a-c] matches anything but "a," "b," or "c." |
| | When used in a regex, the | metacharacter is a logical "or" operator. It is officially called the infix or alternation operator. We have already encountered this one in Getting started with regular expressions: An example, where we saw that the regex "Team|^\s*$" means, "a line with 'Team' or (|) an empty line that has zero, one, or more whitespace characters such as spaces, tabs, and other unprintable characters." |
( and ) | The parentheses ( and ) allow us to ensure a specific sequence of pattern comparison, like might be used for logical comparisons in a programming language. |
We now have a way to specify word boundaries with the \< and \> metacharacters. This means that we can now be even more explicit with our patterns. We can also use logic in more complex patterns.
As an example, start with a couple of simple patterns. This first one selects all instances of drives but not drive, drivess, or additional trailing "s" characters:
[student@studentvm1 testing]$ grep -Ei "\<drives\>" Experiment_6-3.txt
Now let’s build up a search pattern to locate references to tar (the tape archive command) and related references. The first two iterations display more than just tar-related lines:
[student@studentvm1 testing]$ grep -Ei "tar" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "\<tar" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ein "\<tar\>" Experiment_6-3.txt
The -n option in the last command above displays the line numbers for each line in which a match occurred. This option can assist in locating specific instances of the search pattern.
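The narrowing effect of the three tar patterns can be reproduced on a few made-up lines. The sample text and variable names below are assumptions for illustration only:

```shell
# Hypothetical lines, used only for illustration.
text=$(printf '%s\n' \
    'use tar to archive' \
    'restart the server' \
    'hit the target' \
    'unpack with tar.')

# "tar" anywhere in the line, including inside "restart" and "target".
anywhere=$(printf '%s\n' "$text" | grep -Ei "tar")

# "tar" only at the beginning of a word, which still allows "target".
word_start=$(printf '%s\n' "$text" | grep -Ei "\<tar")

# "tar" as a complete word; -n prefixes each result with its line number.
whole_word=$(printf '%s\n' "$text" | grep -Ein "\<tar\>")

printf '%s\n' "whole word:" "$whole_word"
```

Only the first and last lines should survive the final pattern, and the trailing period in the last line does not prevent the match because \> matches at the end of the word.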
Tip: Matching lines of data can extend beyond a single screen, especially when searching a large file. You can pipe the resulting data stream through the less utility and then use the less search facility, which also implements regexes, to highlight the occurrences of matches to the search pattern. The search argument in less is: \<tar\>
This next pattern searches for "shell script," "shell program," "shell variable," "shell environment," or "shell prompt" in our test document. The parentheses alter the logical order in which the pattern comparisons are resolved:
[student@studentvm1 testing]$ grep -Eni "\<shell (script|program|variable|environment|prompt)" Experiment_6-3.txt
Note: This article is a slightly modified version of Chapter 6 from Volume 2 of my Linux book, "Using and Administering Linux: Zero to SysAdmin," due out from Apress in late 2019.
Remove the parentheses from the preceding command and run it again to see the difference.
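The effect of the parentheses can also be seen on a few made-up lines. In this sketch the sample text and variable names are assumptions for illustration; with the grouping removed, the alternation splits the whole pattern, so only the first alternative still requires the word "shell":

```shell
# Hypothetical lines, used only for illustration.
lines=$(printf '%s\n' \
    'write a shell script' \
    'the program crashed' \
    'set the prompt' \
    'shell games')

# Grouped: "shell " must be followed by one of the listed words.
grouped=$(printf '%s\n' "$lines" |
    grep -Ei "\<shell (script|program|variable|environment|prompt)")

# Ungrouped: "shell script" OR "program" OR "variable" OR ... anywhere.
ungrouped=$(printf '%s\n' "$lines" |
    grep -Ei "\<shell script|program|variable|environment|prompt")

printf 'grouped:\n%s\nungrouped:\n%s\n' "$grouped" "$ungrouped"
```

The grouped pattern should select only the "shell script" line, while the ungrouped pattern also picks up the lines that mention "program" or "prompt" without "shell."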
Wrapping up
Although we have now explored the basic building blocks of regular expressions in grep, there are an infinite variety of ways in which they can be combined to create complex yet elegant search patterns. However, grep is a search tool, and does not provide any direct capability to edit or modify a line of text in the data stream when a match is made. For that purpose, we need a tool like sed, which I cover in my next article.
About the author
David Both is an open source software and GNU/Linux advocate, trainer,
writer, and speaker who lives in Raleigh, NC. He is a strong
proponent of and evangelist for the "Linux Philosophy."
David has been in the IT industry for over 50 years. He has taught RHCE
classes for Red Hat and has worked at MCI Worldcom, Cisco, and the State
of North Carolina. He has been working with Linux and open source
software for over 20 years.
David likes to purchase the components and build his own computers from
scratch to ensure that each new computer meets his exacting
specifications. His primary workstation is an ASUS TUF X299 motherboard
and an Intel i9 CPU with 16 cores (32 CPUs) and 64GB of RAM in a
CoolerMaster MasterFrame 700.
David has written articles for magazines including Linux Magazine and
Linux Journal. His article "Complete Kickstart," co-authored with a
colleague at Cisco, was ranked 9th in the Linux Magazine Top Ten Best
System Administration Articles list for 2008. David currently writes
prolifically for OpenSource.com and Enable Sysadmin.
David currently has five books published with Apress, "The Linux
Philosophy for SysAdmins," a self-study training course in three
volumes "Using and Administering Linux: Zero to SysAdmin," that was
released in late 2019, and "Linux for Small Business Owners" with
co-author Cyndi Bulka.
David can be reached at LinuxGeek46@both.org or on Twitter @LinuxGeek46.