
Getting started with regular expressions: An example

Dive right into a regular expression example in this second of four regular expression articles.

In Introducing regular expressions, I covered what they are and why they’re useful. Now, we need a real-world example to use as a learning tool. Here is one I encountered several years ago.

This example highlights the power and flexibility of the Linux command line, especially regular expressions, for their ability to automate common tasks. I have administered several listservs during my career and still do. People send me email addresses to add to those lists. In more than one case, I have received a list of names and email addresses in a Microsoft Word format to be added to one of the lists.

The troublesome list

The list itself was not very long, but it was inconsistent in its formatting. An abbreviated version of that list, with name and domain changes, is shown here:

Team 1	Apr 3 
Leader  Virginia Jones  vjones88@example.com	
Frank Brown  FBrown398@example.com	
Cindy Williams  cinwill@example.com	
Marge smith   msmith21@example.com 
 [Fred Mack]   edd@example.com	

Team 2	March 14
leader  Alice Wonder  Wonder1@example.com	
John broth  bros34@example.com	
Ray Clarkson  Ray.Clarks@example.com	
Kim West    kimwest@example.com	
[JoAnne Blank]  jblank@example.com	

Team 3	Apr 1 
Leader  Steve Jones  sjones23876@example.com	
Bullwinkle Moose bmoose@example.com	
Rocket Squirrel RJSquirrel@example.com	
Julie Lisbon  julielisbon234@example.com	
[Mary Lastware) mary@example.com

The original list had extra lines, characters like brackets and parentheses that need to be deleted, whitespace such as spaces and tabs, and some empty lines. The format required to add these emails to the list is <first> <last> <email@example.com>. Our task is to transform this list into a format usable by the mailing list software.

It was obvious that I needed to manipulate the data to get it into a format the list software would accept. It is possible to use a text editor or a word processor such as LibreOffice Writer to make the necessary changes to this small file. However, people send me files like this quite often, so making these changes in a word processor becomes a chore. Although Writer has a good search and replace function, each character or string must be replaced individually, and there is no way to save previous searches.

Writer does have a powerful macro feature, but I am not familiar with either of its two languages: LibreOffice Basic or Python. I do know Bash shell programming.

I did what comes naturally to a sysadmin: I automated the task. First, I copied the address data to a text file so I could work on it using command-line tools. After a few minutes of work, I developed the Bash command-line program shown in the previous article:

$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/\]//g" -e "s/)//g" | awk '{print $1" "$2" <"$3">"}' > addresses.txt

This code produced the desired output as the file addresses.txt. I used my normal approach to writing command-line programs like this by building up the pipeline one command at a time.

Let’s break this pipeline down into its component parts to see how it works and fits together. All of the experiments in this series should be performed as a non-privileged user. I also did this on a VM that I created for testing: studentvm1.

The sample file

First, we need to create the sample file. Create a directory named testing on your local machine, and then copy the text below into a new text file named Experiment_6-1.txt. It contains the three team entries shown above.

Team 1  Apr 3 
Leader  Virginia Jones  vjones88@example.com
Frank Brown  FBrown398@example.com
Cindy Williams  cinwill@example.com
Marge smith   msmith21@example.com 
 [Fred Mack]   edd@example.com  

Team 2  March 14
leader  Alice Wonder  Wonder1@example.com
John broth  bros34@example.com  
Ray Clarkson  Ray.Clarks@example.com
Kim West    kimwest@example.com 
[JoAnne Blank]  jblank@example.com

Team 3  Apr 1 
Leader  Steve Jones  sjones23876@example.com
Bullwinkle Moose bmoose@example.com
Rocket Squirrel RJSquirrel@example.com  
Julie Lisbon  julielisbon234@example.com
[Mary Lastware) mary@example.com

Removing unnecessary lines with grep

A couple of easy steps come first. Since the team names and dates are on lines by themselves, we can use the following to remove the lines that contain the word "Team":

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team

I won’t reproduce the results of each stage of building this Bash program, but you should be able to see the changes in the data stream as it shows up on STDOUT, the terminal session. We won’t save it in a file until the end.

In this first step in transforming the data stream into one that is usable, we use the grep command with a simple literal pattern, Team. Literals are the most basic type of pattern we can use as a regular expression, because a literal matches only itself: here, the exact string Team, wherever it appears on a line.
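Because a literal matches anywhere on a line, grep -v Team would also discard a person whose name or address happens to contain that string. The following is a minimal sketch of that behavior, not part of the original pipeline; the "Tom Teamster" entry is invented for illustration, and the caret anchor used in the second command is explained later in this article.

```shell
# Hypothetical four-line sample; "Tom Teamster" is invented for illustration.
sample=$(printf '%s\n' \
    'Team 1  Apr 3' \
    'Leader  Virginia Jones  vjones88@example.com' \
    'Tom Teamster  tteam@example.com' \
    'Frank Brown  FBrown398@example.com')

# The literal pattern drops every line that contains "Team" anywhere,
# including the Teamster line.
kept_literal=$(printf '%s\n' "$sample" | grep -v Team)

# Anchoring the literal to the start of the line drops only the heading.
kept_anchored=$(printf '%s\n' "$sample" | grep -v '^Team')

printf '%s\n' "$kept_literal"
printf -- '--\n'
printf '%s\n' "$kept_anchored"
```

For the sample file in this article the two commands give the same result, so the plain literal is sufficient here.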

We need to discard empty lines, so we can use another grep statement to eliminate them. I find that enclosing the regular expression for the second grep command in quotes ensures that it gets interpreted properly:

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$"
Leader  Virginia Jones  vjones88@example.com
Frank Brown  FBrown398@example.com
Cindy Williams  cinwill@example.com
Marge smith   msmith21@example.com 
 [Fred Mack]   edd@example.com  
leader  Alice Wonder  Wonder1@example.com
John broth  bros34@example.com  
Ray Clarkson  Ray.Clarks@example.com
Kim West    kimwest@example.com 
[JoAnne Blank]  jblank@example.com
Leader  Steve Jones  sjones23876@example.com
Bullwinkle Moose bmoose@example.com
Rocket Squirrel RJSquirrel@example.com  
Julie Lisbon  julielisbon234@example.com
[Mary Lastware) mary@example.com
[student@studentvm1 testing]$

The expression "^\s*$" illustrates anchors and the use of the backslash (\) as an escape character. The caret (^) anchors the match to the beginning of the line, and the dollar sign ($) anchors it to the end. The backslash changes the meaning of a literal "s" to a metacharacter that matches any whitespace character, such as a space or a tab. We cannot see these characters in the file, but it does contain some of them.

The asterisk, aka splat (*), specifies that we are to match zero or more of the whitespace characters. This addition would match multiple tabs, multiple spaces, or any combination of those in an otherwise empty line.
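The \s metacharacter is a GNU extension. As a hedged alternative for systems where it might not be available, the POSIX character class [[:space:]] can be used in its place. The sketch below uses a small invented sample rather than the full file.

```shell
# Hypothetical sample: a name line, an empty line, a line holding only a
# space and a tab, and another name line.
sample=$(printf 'Kim West kimwest@example.com\n\n \t\nJohn broth bros34@example.com\n')

# Count the lines that are empty or contain only whitespace.
blank_count=$(printf '%s\n' "$sample" | grep -c '^[[:space:]]*$')

# Discard those lines and keep the rest.
kept=$(printf '%s\n' "$sample" | grep -v '^[[:space:]]*$')

printf 'blank lines: %s\n' "$blank_count"
printf '%s\n' "$kept"
```

Single quotes are used here so that the shell passes the pattern to grep unchanged.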

Viewing extra whitespace with Vim

Next, I configured my Vim editor to display whitespace using visible characters. Do this by adding the following line to your own ~/.vimrc file, or to the global /etc/vimrc configuration file:

set listchars=eol:$,nbsp:_,tab:<->,trail:~,extends:>,space:+

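Then start, or restart, Vim. The listchars setting takes effect only while the list option is on, so if the whitespace is still not displayed, enable it with :set list.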

I have found a lot of bad, incomplete, and contradictory information on the internet in my searches for how to do this. The built-in Vim help has the best information, and the data line I created from that above is one that works for me.

Note: In the example below, regular spaces are shown as + and trailing spaces as ~; tabs are shown as >, <>, or <->, and fill the length of the space that the tab covers. The end of line (EOL) character is shown as $.

The result, before any operation on the file, is shown here:

Team+1<>Apr+3~$
Leader++Virginia+Jones++vjones88@example.com<-->$
Frank+Brown++FBrown398@example.com<---->$
Cindy+Williams++cinwill@example.com<--->$
Marge+smith+++msmith21@example.com~$
+[Fred+Mack]+++edd@example.com<>$
$
Team+2<>March+14$
leader++Alice+Wonder++Wonder1@example.com<----->$
John+broth++bros34@example.com<>$
Ray+Clarkson++Ray.Clarks@example.com<-->$
Kim+West++++kimwest@example.com>$
[JoAnne+Blank]++jblank@example.com<---->$
$
Team+3<>Apr+1~$
Leader++Steve+Jones++sjones23876@example.com<-->$
Bullwinkle+Moose+bmoose@example.com<--->$
Rocket+Squirrel+RJSquirrel@example.com<>$
Julie+Lisbon++julielisbon234@example.com<------>$
[Mary+Lastware)+mary@example.com$
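If Vim is not available, a rough equivalent can be had on the command line. This is a sketch using the sed l command instead of Vim, which is a different technique from the one described above: it prints each line with tabs shown as \t and the end of the line marked with $. The two sample lines are taken from the list, with one trailing tab and one trailing space.

```shell
# Two sample lines: the first ends in a tab, the second in a space.
visible=$(printf 'Kim West\t\nMarge smith \n' | sed -n 'l')

# Tabs appear as \t and each line end is marked with $.
printf '%s\n' "$visible"
```

Spaces are not replaced with a visible character in this view, but a space before the closing $ reveals trailing whitespace.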

Removing unnecessary characters with sed

You can see that there are a lot of whitespace characters that need to be removed from our file. We also need to get rid of the word "leader," which appears three times and is capitalized in two of them. Let's get rid of "leader" first. This time, we will use sed (stream editor) to perform this task by substituting a new string, or a null string in our case, for the pattern it matches.

Adding sed -e "s/[Ll]eader//" to the pipeline does this:

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//"

In this sed command, -e means that the quote-enclosed expression is a script for sed to run against each line. In the expression, the s means that this is a substitution. The basic form of a substitution is s/<regex>/<replacement string>/, so [Ll]eader is our search pattern.

The set [Ll] matches L or l, so [Ll]eader matches leader or Leader. In this case, the replacement string is null, which is why the expression ends in two forward slashes with no characters or whitespace between them (//).
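The effect of this substitution can be checked on a few lines in isolation. The sketch below uses shortened sample lines; note that the spaces that followed the deleted word remain at the start of the line.

```shell
# Three shortened sample lines: capitalized, lowercase, and no match.
result=$(printf '%s\n' 'Leader  Steve Jones' 'leader  Alice Wonder' 'Frank Brown' |
    sed -e 's/[Ll]eader//')

# The first two lines now begin with the two spaces that followed the word;
# the third line is unchanged because the pattern does not match it.
printf '%s\n' "$result"
```

Without the g flag, sed replaces only the first match on each line, which is enough here because the word appears at most once per line.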

Let’s also get rid of some of the extraneous characters like []() that will not be needed:

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g"

We have added four new expressions to the sed statement. Each one removes a single character. The first of these additional expressions is a bit different, because the left square bracket ([) character can mark the beginning of a set. We need to escape the bracket with a backslash to ensure that sed interprets it as a regular character and not a special one.
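The four single-character expressions could also be folded into one set. This is an alternative sketch, not the form used in the article: inside a set, a right square bracket placed first is taken literally, so [][()] matches any one of ], [, (, or ).

```shell
# Two sample lines from the list that contain stray brackets and parentheses.
sample=$(printf '%s\n' ' [Fred Mack]   edd@example.com' '[Mary Lastware) mary@example.com')

# The article's form: one expression per character.
separate=$(printf '%s\n' "$sample" |
    sed -e 's/\[//g' -e 's/]//g' -e 's/)//g' -e 's/(//g')

# Alternative form: a single set that lists all four characters.
combined=$(printf '%s\n' "$sample" | sed -e 's/[][()]//g')

printf '%s\n' "$combined"
```

Both forms give the same result; the separate expressions are easier to read when you are building the pipeline one step at a time.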

Tidying up with awk

We could use sed to remove the leading spaces from some of the lines, but the awk command can do that, reorder the fields if necessary, and add the <> characters around the email address:

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'

The awk utility is actually a powerful programming language that can accept data streams on its STDIN. This fact makes it extremely useful in command-line programs and scripts.

The awk utility works on data fields, and the default field separator is whitespace: any run of spaces or tabs. The data stream we have created so far has three fields separated by whitespace (<first>, <last>, and <email>):

awk '{print $1" "$2" <"$3">"}'

This little program takes each of the three fields ($1, $2, and $3) and extracts them without leading or trailing whitespace. It then prints them in sequence, adding a single space between each as well as the <> characters needed to enclose the email address.
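The awk program assumes that every line has exactly three fields. The sketch below shows what happens when that assumption fails and adds an NF == 3 guard that skips such lines; the guard and the two-field "Cher" line are my additions for illustration and are not part of the article's pipeline.

```shell
# Sample lines with leading and trailing whitespace, plus one invented
# line that has only two fields.
sample=$(printf '%s\n' \
    '   Virginia Jones  vjones88@example.com' \
    ' Fred Mack   edd@example.com  ' \
    'Cher cher@example.com')

# The article's program: a two-field line produces an empty <>.
formatted=$(printf '%s\n' "$sample" | awk '{print $1" "$2" <"$3">"}')

# Hypothetical guard: print only the lines that have exactly three fields.
guarded=$(printf '%s\n' "$sample" | awk 'NF == 3 {print $1" "$2" <"$3">"}')

printf '%s\n' "$formatted"
printf -- '--\n'
printf '%s\n' "$guarded"
```

Whether to skip or report malformed lines depends on the list; for a hand-checked file like the one in this article, the unguarded program is adequate.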

Wrapping up

The last step here would be to redirect the output data stream to a file, as the complete pipeline near the beginning of this article does with > addresses.txt. That is trivial, so I leave it to you to perform that step. It is not really necessary that you do so.

I saved the Bash program in an executable file, and now I can run this program anytime I receive a new list. Some of those lists are fairly short, as is the one in this example. Others have been quite long, sometimes containing up to several hundred addresses and many lines of "stuff" that do not contain addresses to be added to the list.
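One way to save the pipeline for reuse is sketched below. The function name clean_addresses and the argument handling are assumptions for illustration, not the author's actual script; the commands themselves are the ones built up in this article, with grep reading the file directly instead of through cat.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the pipeline from this article.
# clean_addresses FILE: print "<first> <last> <email>" lines to STDOUT.
clean_addresses() {
    local infile=$1
    grep -v Team "$infile" |
        grep -v "^\s*$" |
        sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" |
        awk '{print $1" "$2" <"$3">"}'
}

# When a file name is given on the command line, write the result to
# addresses.txt in the current directory.
if [ "$#" -ge 1 ]; then
    clean_addresses "$1" > addresses.txt
fi
```

Saved under a name of your choosing and made executable with chmod +x, the script would be run with the list file as its only argument.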

Note: This article is a slightly modified version of Chapter 6 from Volume 2 of my Linux book, Using and Administering Linux: Zero to SysAdmin, due out from Apress in late 2019.


David Both

David Both is an Open Source Software and GNU/Linux advocate, trainer, writer, and speaker who lives in Raleigh, North Carolina. He is a strong proponent of and evangelist for the "Linux Philosophy." David has been in the IT industry for nearly 50 years.
