Regex and grep: Data flow and building blocks

Read this third of four articles to learn how to make tighter matches with your regexes.

In Introducing regular expressions, I covered what they are and why they’re useful. Now let’s take a deeper look at how they’re created. GNU grep is one of the tools I use the most, and it provides a more or less standardized implementation of regular expressions, so I will use its set of expressions as the basis for this article. We will look at sed, another tool that uses regular expressions, in a later article.

The regex implementations in tools such as grep and sed are line-based. A pattern created by a combination of one or more expressions is compared against each line of a data stream. When a match is made, an action is taken on that line as prescribed by the tool being used.

For example, when a pattern match occurs with grep, the usual action is to pass that line to STDOUT and discard lines that do not match the pattern. As we saw in Getting started with regular expressions: An example, the -v option reverses those actions, so that the lines with matches are discarded.

Each line of the data stream is evaluated on its own. Think of each data stream line as a record, where the tools that use regexes process one record at a time. When a match is made, an action defined by the tool in use is taken on the line that contains the matching string.
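
The record-at-a-time flow described above can be sketched with a tiny inline data stream. The three sample lines are made up for illustration and are not taken from the sample file used later in this article:

```shell
# Hypothetical three-line data stream; each line is one record.
data=$(printf '%s\n' 'alpha one' 'beta two' 'alpha three')

# Lines that contain "alpha" are passed to STDOUT; the others are discarded.
matched=$(printf '%s\n' "$data" | grep 'alpha')

# -v reverses the action: only lines WITHOUT a match are passed on.
inverted=$(printf '%s\n' "$data" | grep -v 'alpha')

printf 'matched:\n%s\n' "$matched"
printf 'inverted:\n%s\n' "$inverted"
```

Each line is tested independently, so the second "alpha" line is selected even though a non-matching line sits between the two.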

Regex building blocks

The following table contains a list of the basic building block expressions and metacharacters implemented by the GNU grep command (and most other regex implementations), and their descriptions. When used in a pattern, each of these expressions or metacharacters matches a single character in the data stream being parsed, except for the anchors, which match a position on the line rather than a character:

Expression Description
A-Z, a-z, 0-9 (literals) All alphanumeric and some punctuation characters are treated as literals. Thus the letter a in a regex will always match the letter "a" in the data stream being parsed. There is no ambiguity for these characters. Each literal character matches one and only one character.
. (dot) The dot (.) metacharacter is the most basic form of expression. It matches any single character in the position it is encountered in a pattern. So the pattern b.g would match "big," "bigger," "bag," "baguette," and "bog," but not "dog," "blog," "hug," "lag," "gag," "leg," etc.
[list of characters] (bracket expression) GNU grep calls this a bracket expression, and it is the same as a set for the Bash shell. The brackets enclose a list of characters to match for a single character location in the pattern. [abcdABCD] matches the letters "a," "b," "c," or "d" in either upper- or lowercase. [a-dA-D] specifies ranges of characters that create the same match. [a-zA-Z] matches the alphabet in upper- and lowercase.
[:class name:] (character class) This is a POSIX attempt at regex standardization. The class names are supposed to be obvious. For example, the [:alnum:] class matches all alphanumeric characters. Other classes are [:digit:], which matches any one digit 0-9, [:alpha:], [:space:], and so on. A class name is valid only inside a bracket expression, so in a pattern it is written with double brackets, as in [[:digit:]]. Note that there may be issues due to differences in the sorting sequences in different locales. Read the grep man page for details.
^ and $ (anchors) These two metacharacters match the beginning and ending of a line, respectively. They are said to anchor the rest of the pattern to either the beginning or end of a line. The expression ^b.g would only match "big," "bigger," "bag," etc., as shown above if they occur at the beginning of the line being parsed. The pattern b.g$ would match "big" or "bag" only if they occur at the end of the line, but not "bigger."

Let’s explore these building blocks before continuing on with some of the modifiers. The text file we will use for Experiment 3 is from a lab project I created for an old Linux class I used to teach. It was originally in a LibreOffice Writer odt file but I saved it to an ASCII text file. Most of the formatting of things like tables was removed, but the result is a long ASCII text file that we can use for this series of experiments.
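
Before turning to the sample file, here is a minimal sketch of the building blocks applied to a short word list. The word list and the pick helper are hypothetical and exist only for this illustration:

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' big bigger bag bog dog blog handbag bag2)

# Helper (hypothetical): print the words selected by a pattern on one line.
pick() { printf '%s\n' "$words" | grep "$1" | paste -sd ' ' -; }

dot=$(pick 'b.g')            # any single character between b and g
bracket=$(pick 'b[ia]g')     # only i or a between b and g
class=$(pick '[[:digit:]]')  # any line containing a digit
start=$(pick '^b.g')         # b.g anchored to the beginning of the line
end=$(pick 'b.g$')           # b.g anchored to the end of the line

echo "b.g         : $dot"
echo "b[ia]g      : $bracket"
echo "[[:digit:]] : $class"
echo "^b.g        : $start"
echo "b.g\$        : $end"
```

Note how "handbag" is selected by b.g and b.g$ but not by ^b.g, and how "bigger" is selected by ^b.g but not by b.g$.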

Example: TOC entries

Let’s take a look at an example to explore what we’ve just learned. First, make the ~/testing directory your PWD (create it if you didn't already do so in the previous article in this series), and then download the sample file from GitHub.

[student@studentvm1 testing]$  wget https://raw.githubusercontent.com/opensourceway/reg-ex-examples/master/Experiment_6-3.txt

To begin, use the less command to look at and explore the Experiment_6-3.txt file for a few minutes to get an idea of its content.

Now, let’s use some simple grep expressions to extract lines from the input data stream. The Table of Contents (TOC) contains a list of projects and their respective page numbers in the PDF document. Let’s extract the TOC starting with lines ending in two digits:

[student@studentvm1 testing]$ grep "[0-9][0-9]$" Experiment_6-3.txt

This command is not really what we want. It displays every line that ends in two digits, whether or not the line belongs to the TOC, and it misses TOC entries whose page number has only one digit. We'll look at how to deal with an expression for one or more digits in a later experiment. After looking at the whole file in less, we could try something like this:

[student@studentvm1 testing]$ grep "^Lab Project" Experiment_6-3.txt | grep "[0-9]$"

This command is much closer to what we want, but it is not quite there. We get some lines from later in the document that also match these expressions. If you study the extra lines and look at those in the complete document, you can see why they match while not being part of the TOC.

This command also misses TOC entries that do not start with "Lab Project." Sometimes this result is the best you can do, and it does give a better look at the TOC than we had before. We will look at how to combine these two grep instances into a single one in a later experiment.

Now, let’s modify this command a bit and use the POSIX expression. Note the double square brackets ([[ ]]) around it; single brackets generate an error message:

[student@studentvm1 testing]$ grep "^Lab Project" Experiment_6-3.txt | grep "[[:digit:]]$"

This command gives the same results as the previous attempt.
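
To see that the range and the POSIX class select the same lines, here is a small sketch that runs both pipelines on invented TOC-like lines. The four lines are assumptions for illustration and are not taken from Experiment_6-3.txt:

```shell
# Hypothetical TOC-like lines; only the first two are real "entries."
toc=$(printf '%s\n' \
  'Lab Project 1: Installing Linux 7' \
  'Lab Project 2: Using the shell 12' \
  'Lab Project 3 is described in this section.' \
  'Appendix A 103')

# Range form: line starts with "Lab Project" and ends in a digit.
range=$(printf '%s\n' "$toc" | grep "^Lab Project" | grep "[0-9]$")

# POSIX class form of the same test.
posix=$(printf '%s\n' "$toc" | grep "^Lab Project" | grep "[[:digit:]]$")

printf '%s\n' "$range"
[ "$range" = "$posix" ] && echo "both forms select the same lines"
```

The third line starts with "Lab Project" but ends in a period, and the fourth ends in a digit but does not start with "Lab Project," so neither passes both filters.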

Example: systemd

Let’s look for something different in the same file:

[student@studentvm1 testing]$ grep systemd Experiment_6-3.txt

This command lists all occurrences of "systemd" in the file. Try using the -i option to ensure that you get all instances, including those that start with uppercase letters (the official form of "systemd" is all lowercase). Or, you could change the literal expression to Systemd.

Count the number of lines containing the string systemd. I always use -i to ensure that all instances of the search expression are found regardless of case:

[student@studentvm1 testing]$ grep -i systemd Experiment_6-3.txt | wc
20      478     3098

As you can see, I have 20 lines, and you should have the same number.
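
With no options, wc prints line, word, and byte counts, so the first number is the one of interest. If only the number of matching lines is needed, grep's -c option reports it directly. A short sketch on made-up lines (not the sample file):

```shell
# Hypothetical sample lines with mixed capitalization.
sample=$(printf '%s\n' \
  'systemd starts services' \
  'Systemd is sometimes capitalized at the start of a sentence' \
  'SysV init is older' \
  'the systemd journal')

# wc -l counts the lines that grep passes through the pipe.
piped=$(printf '%s\n' "$sample" | grep -i 'systemd' | wc -l)

# -c asks grep itself to print the number of matching lines.
counted=$(printf '%s\n' "$sample" | grep -ci 'systemd')

# Without -i, the capitalized line is missed.
sensitive=$(printf '%s\n' "$sample" | grep -c 'systemd')

echo "grep -i | wc -l : $piped"
echo "grep -ci        : $counted"
echo "grep -c         : $sensitive"
```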

Example: Metacharacters

Here is an example of matching a metacharacter: the left bracket ([). First, let’s try without doing anything special:

[student@studentvm1 testing]$ grep -i "[" Experiment_6-3.txt
grep: Invalid regular expression

This error occurs because [ is interpreted as a metacharacter. We need to escape this character with a backslash (\) so that it is interpreted as a literal character and not as a metacharacter:

[student@studentvm1 testing]$ grep -i "\[" Experiment_6-3.txt

Most metacharacters lose their special meaning when used inside bracket expressions:

  • To include a literal ], place it first in the list.
  • To include a literal ^, place it anywhere but first.
  • To include a literal -, place it last.
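
The following sketch exercises the backslash escape and the placement rules that the grep man page gives for ], ^, and - inside a bracket expression. The four input lines are invented for the illustration:

```shell
# Hypothetical input lines containing characters that are special in regexes.
lines=$(printf '%s\n' 'array[0]' 'x^2' 'a-b' 'plain')

# Outside a bracket expression, [ must be escaped to be literal.
open=$(printf '%s\n' "$lines" | grep '\[')

# A ] placed first in the list is literal.
close=$(printf '%s\n' "$lines" | grep '[]]')

# A ^ placed anywhere but first is literal.
caret=$(printf '%s\n' "$lines" | grep '[%^]')

# A - placed last is literal rather than a range operator.
dash=$(printf '%s\n' "$lines" | grep '[_-]')

echo "\\[    : $open"
echo "[]]   : $close"
echo "[%^]  : $caret"
echo "[_-]  : $dash"
```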

Repetition

Regular expressions can be modified using operators that let you specify zero, one, or more repetitions of a character or expression. These repetition operators are placed immediately following the literal character or metacharacter used in the pattern:

Operator Description
? In regexes, the ? means zero or one occurrence of the preceding character. So, for example, drives? matches "drive" and "drives." In "driver," it matches only the leading "drive," so grep still selects a line that contains "driver." This result is a bit different from the behavior of ? in a glob.
* The character preceding the * will be matched zero or more times without limit. In this example, drives* matches "drive," "drives," and "drivesss." As with ?, it matches only the "drive" portion of "driver." Again, this is a bit different from the behavior of * in a glob.
+ The character preceding the + will be matched one or more times. The character must exist in the line at least once for a match to occur. As one example, drives+ matches "drives" and "drivesss" but not "drive" or "driver."
{n} This operator matches the preceding character exactly n times. The expression drives{2} matches "drivess" but not "drive" or "drives." However, because "drivesss" and "drivesssss" contain the string drivess, a match occurs on that string, so grep selects those lines as well.
{n,} This operator matches the preceding character n or more times. The expression drives{2,} matches "drivess," "drivesss," and any longer run of trailing "s" characters, but not "drive" or "drives."
{,m} This operator matches the preceding character no more than m times. The expression drives{,2} matches "drive," "drives," and "drivess," but not all of "drivesss" or a longer run of trailing "s" characters. Once again, because "drivesssss" contains the string drivess, a match occurs.
{n,m} This operator matches the preceding character at least n times, but no more than m times. The expression drives{1,3} matches "drives," "drivess," and "drivesss," but not "drive" and not all of "drivessss" or a longer run of trailing "s" characters. Once again, because "drivesssss" contains a matching string, a match occurs.

As an example, run each of the following commands and examine the results carefully, so that you understand what is happening:

[student@studentvm1 testing]$ grep -E "files?" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives*" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives+" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives{2}" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives{2,}" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives{,2}" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "drives{2,3}" Experiment_6-3.txt

Be sure to experiment with these modifiers on other text in the sample file.
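
Because grep selects a line whenever any part of it matches, it can be hard to see exactly what a repetition operator accepts. One way to check, sketched here on an invented word list, is the -x option, which requires the whole line to match the pattern. The whole helper is hypothetical:

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' drive drives drivess drivesss driver)

# Helper (hypothetical): list the words that match a pattern in full (-x).
whole() { printf '%s\n' "$words" | grep -Ex "$1" | paste -sd ' ' -; }

optional=$(whole 'drives?')     # zero or one s
star=$(whole 'drives*')         # zero or more s
plus=$(whole 'drives+')         # one or more s
exactly=$(whole 'drives{2}')    # exactly two s
atleast=$(whole 'drives{2,}')   # two or more s
between=$(whole 'drives{1,2}')  # one or two s

# Without -x, a substring match is enough, so every word is selected.
substring=$(printf '%s\n' "$words" | grep -E 'drives?' | paste -sd ' ' -)

echo "drives?      : $optional"
echo "drives*      : $star"
echo "drives+      : $plus"
echo "drives{2}    : $exactly"
echo "drives{2,}   : $atleast"
echo "drives{1,2}  : $between"
echo "no -x        : $substring"
```

The last line shows why "driver" appears in the output of an unanchored search for drives?: the "drive" portion alone satisfies the pattern.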

Metacharacter modifiers

There are still some interesting and important modifiers that we need to explore:

Modifier Description
\< This special expression matches the empty string at the beginning of a word. The expression \<fun would match "fun" and, with the -i option, "Function," but not "refund."
\> This special expression matches the empty string at the end of a word, so the word may be followed by a space, by punctuation, or by the end of the line. So environment\> matches "environment" whether it is followed by a space, a comma, or a period, but not "environments" or "environmental."
^ In a bracket expression, this operator negates the list of characters when it is the first character in the list. Thus, while [a-c] matches "a," "b," or "c," in that position of the pattern, [^a-c] matches anything but "a," "b," or "c."
| When used in a regex, the | metacharacter is a logical "or" operator. It is officially called the infix or alternation operator. We have already encountered this one in Getting started with regular expressions: An example, where we saw that the regex "Team|^\s*$" means, "a line with 'Team' or (|) an empty line that has zero, one, or more whitespace characters such as spaces, tabs, and other unprintable characters."
( and ) The parentheses ( and ) group part of a pattern so that it is evaluated as a unit, much as parentheses are used in logical comparisons in a programming language.

We now have a way to specify word boundaries with the \< and \> metacharacters. This means that we can now be even more explicit with our patterns. We can also use logic in more complex patterns.

As an example, start with a couple of simple patterns. This first one selects all instances of drives but not drive, drivess, or additional trailing "s" characters:

[student@studentvm1 testing]$ grep -Ei "\<drives\>" Experiment_6-3.txt

Now let’s build up a search pattern to locate references to tar (the tape archive command) and related references. The first two iterations display more than just tar-related lines:

[student@studentvm1 testing]$ grep -Ei "tar" Experiment_6-3.txt
[student@studentvm1 testing]$ grep -Ei "\<tar" Experiment_6-3.txt
[student@studentvm1 testing]$  grep -Ein "\<tar\>" Experiment_6-3.txt

The -n option in the last command above displays the line numbers for each line in which a match occurred. This option can assist in locating specific instances of the search pattern.
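
The narrowing effect of the three tar patterns can be sketched on a few invented lines. The lines are assumptions for illustration, and \< and \> are the GNU grep word-boundary expressions described above:

```shell
# Hypothetical lines in which "tar" appears in different positions.
text=$(printf '%s\n' \
  'tar -cvf backup.tar /home' \
  'start the service' \
  'the target directory' \
  'a tarball of the source')

# "tar" anywhere, including inside "start."
anywhere=$(printf '%s\n' "$text" | grep -Ec 'tar')

# "tar" only at the beginning of a word: drops "start."
wordstart=$(printf '%s\n' "$text" | grep -Ec '\<tar')

# "tar" as a complete word, with line numbers.
wholeword=$(printf '%s\n' "$text" | grep -En '\<tar\>')

echo "tar       : $anywhere lines"
echo "\\<tar     : $wordstart lines"
echo "\\<tar\\>   : $wholeword"
```

Each added boundary removes lines: "start" fails the \< test, and "target" and "tarball" fail the \> test.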

Tip: Matching lines of data can extend beyond a single screen, especially when searching a large file. You can pipe the resulting data stream through the less utility and then use its search facility, which also implements regexes, to highlight the occurrences of matches to the search pattern. The search argument in less is: \<tar\>.

This next pattern searches for "shell script," "shell program," "shell variable," "shell environment," or "shell prompt" in our test document. The parentheses alter the logical order in which the pattern comparisons are resolved:

[student@studentvm1 testing]$ grep -Eni "\<shell (script|program|variable|environment|prompt)" Experiment_6-3.txt

Note: This article is a slightly modified version of Chapter 6 from Volume 2 of my Linux book, "Using and Administering Linux: Zero to SysAdmin," due out from Apress in late 2019.

Remove the parentheses from the preceding command and run it again to see the difference.
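
The difference can be sketched on a few invented lines with a shortened list of alternatives. Without the parentheses, the alternation splits the whole pattern, so "program" and "prompt" match on their own without a preceding "shell":

```shell
# Hypothetical lines; only two mention a shell script or shell prompt.
text=$(printf '%s\n' \
  'write a shell script' \
  'the program crashed' \
  'set the shell prompt' \
  'a shell game')

# Grouped: "shell " followed by one of the alternatives.
grouped=$(printf '%s\n' "$text" | grep -Ec '\<shell (script|program|prompt)')

# Ungrouped: "\<shell script" OR "program" OR "prompt."
ungrouped=$(printf '%s\n' "$text" | grep -Ec '\<shell script|program|prompt')

echo "with parentheses    : $grouped lines"
echo "without parentheses : $ungrouped lines"
```

The ungrouped pattern picks up "the program crashed," which has nothing to do with the shell.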

Wrapping up

Although we have now explored the basic building blocks of regular expressions in grep, there is an infinite variety of ways in which they can be combined to create complex yet elegant search patterns. However, grep is a search tool, and does not provide any direct capability to edit or modify a line of text in the data stream when a match is made. For that purpose, we need a tool like sed, which I cover in my next article.


David Both

David Both is an Open Source Software and GNU/Linux advocate, trainer, writer, and speaker who lives in Raleigh, North Carolina. He is a strong proponent of and evangelist for the "Linux Philosophy." David has been in the IT industry for nearly 50 years.
