Issue #5 March 2005

How I learned to stop worrying and love the command line, part 2

Introduction

In part 1 of this article, we explored a number of powerful techniques and commands for abusing the command line to quickly manipulate and extract the information we were after. That was a good start, and it proved we could extract quite a bit of valuable information with relatively simple tools, but it was only the appetizer to our main course—Perl.

Perl has been called the Swiss Army knife in a variety of contexts, including sysadminery, and it is a title that fits. On the whole, Perl is an exceedingly versatile language that, unlike many languages, grew and evolved to fit practical needs rather than to satisfy academic curiosities. In particular, it originally came about as a tool for system administrators, especially as a more robust and powerful supplement to, and replacement for, tools like sed and awk. This type of evolution has its upsides and downsides. On the upside, it means that you end up with a very powerful and flexible language; on the downside, it sometimes aims to please too many audiences and lacks some of the rigor more traditional languages provide.

It is that flexibility we aim to exploit here, making Perl an excellent choice. The concepts we use, though, will be applicable to a class of other languages. While you likely will never use Java for command line trickery (bless your heart if you try), other languages are well suited. In particular, Ruby, a relatively new language, even goes so far as to mimic some of Perl's command line options. For now, though, we focus on Perl.

Getting started

At the conclusion of our previous article, an invocation of Perl was offered, but with no explanation of how it worked. That command was:

find /tmp/library -print0 | xargs -0 perl -p -i -e 's/XFree86/x.org/g'

The find and xargs are already familiar to us, as is the -e option to perl. What was left unexplained were the -p and the -i options. Let us begin with -p. Simply put, -p tells Perl to execute whatever you tell it (either a script or, more commonly, an expression supplied with -e) on each and every line of input, and then print out the $_ variable. Let us examine a simpler version:

seq 1 10 | perl -p -e '$_ = "prefix: $_"'

The seq command simply writes the numbers 1 through 10 to standard output, which is piped to Perl, which then reads each line and executes what is specified by -e. In this case, we overwrite the variable $_ with a new value, "prefix: $_"—in other words, we put "prefix: " in front of it. The resulting output is:

prefix: 1
prefix: 2
prefix: 3
... snip ...
prefix: 10

Simple, but important, so let's see another example, one that is perhaps a bit more practical. In this one, we will process /etc/passwd, removing everything after (and including) the first colon:

perl -p -e 's/:.*//' /etc/passwd

This command displays the usernames of all users on your system. In this case, the expression is 's/:.*//'—a regular expression substitution. In Perl, $_ is a magical variable that many operators and functions use by default when not given an explicit argument. In this case, when using the -p operator, it is set to the entire current line of input.

By itself, the -p operator is useful for transforming streams of data, but when combined with -i, it is half of one of the most powerful command line tricks in any arsenal. In effect, -i means to perform changes in place. So instead of -p printing to standard out, the effect is that it reads a line from a file, evaluates the given expression on the line, and prints the line back to the original file. This mechanic is the core of a number of exceedingly powerful time saving tricks.

We now have the requisite pieces to determine what the command line from the previous article did. Revisiting:

find /tmp/library -print0 | xargs -0 perl -p -i -e 's/XFree86/x.org/g'

We know that the find/xargs combination feeds each file found to the Perl command perl -p -i -e 's/XFree86/x.org/g'. We also know that the -p and -i mean that the given expression is evaluated for each line and that line is replaced with the results. The expression in this case replaces the string XFree86 with x.org. The ultimate result is that any file in /tmp/library will have XFree86 replaced with x.org.

The -p operator has a cousin that acts very similarly, except that it does not print $_ after evaluating the expression. That cousin is the -n operator, and it is somewhat useful on its own but more useful in conjunction with other operators we shall see shortly. It still has a use, though, with the operators we have seen so far:

perl -n -e '@fields = split /:/; print "$fields[0]\n"' /etc/passwd

This, much like one of the earlier examples, prints all of the users listed in /etc/passwd. If you think about it a moment, though, when working with Linux you very commonly encounter files, like /etc/passwd, whose fields are delimited by a common character—a colon in this case. So common, in fact, that Perl offers you a shortcut. Let us consider the simplest example—a file with fields delimited by whitespace. One such file is a typical Apache access_log file:

172.31.29.101 - - [04/Jan/2005:21:56:44 -0500] "GET /favicon.ico HTTP/1.1" 404 291 "-" "Mozilla/5.0"

That's a lot of information! Sometimes we only want some subset of it. For instance, suppose we want a list of the actual requested URLs. Counting whitespace, that is the seventh field. We can get at those fields by combining the -n parameter with -a, which instructs Perl to automatically split each line. Now inside of the evaluated expression you can not only access the entire line via $_ but you can also access each field through the @F array (counterintuitive, I know). In this case, the @F array contains:

$F[0] = '172.31.29.101'
$F[1] = '-'
$F[2] = '-'
$F[3] = '[04/Jan/2005:21:56:44'
$F[4] = '-0500]'
$F[5] = '"GET'
$F[6] = '/favicon.ico'
$F[7] = 'HTTP/1.1"'
$F[8] = '404'
$F[9] = '291'
$F[10] = '"-"'
$F[11] = '"Mozilla/5.0"'

So now to get a list of all of the URLs that were hit:

perl -l -a -n -e 'print $F[6]' /var/log/httpd/access_log

The -l and -e are familiar (recall that -l strips the newline from each input line and adds one back when printing); all that is new is the -a working in conjunction with -n.

Practical magic

The last example begins to show us the power of combining tools together. Let's build on that example and do some log file exploration.

The problem

What are my website's most popular pages?

The solution
cat /var/log/httpd/access_log | perl -l -a -n -e 'print $F[6]' | sort | uniq -c | sort -n | tail -10
The explanation

This builds on the previous example; all that is new is the sequence of sort, uniq, and tail. The first sort is merely to prep for uniq (uniq expects its input to already be sorted). In this case, though, we're not just asking uniq for the unique lines; with the -c parameter, we're also asking for the number of times each line occurred. The format of uniq's output in this case is:

18 /robots.txt
37 /favicon.ico

The first number is the count; the second is the line. We then pipe this back into sort, this time sorting numerically, then we use tail to extract the last ten lines (which in this case are the top ten most popular pages).
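The counting half of the pipeline is easy to experiment with on synthetic input (the URLs below are made up for illustration):

```shell
# Three /b entries, two /c, one /a; sort | uniq -c | sort -n
# surfaces the most frequent entries last.
printf '/b\n/a\n/b\n/c\n/b\n/c\n' | sort | uniq -c | sort -n | tail -2
# the final line shows the most popular entry: a count of 3 next to /b
```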

The problem

Who is attacking my web server?

The solution
cat /var/log/httpd/access_log | perl -l -a -n -e 'print $F[0]' | sort | uniq -c | sort -n | tail -10
The explanation

This is exactly the same as the previous example, except instead of the seventh field ($F[6]), we are extracting the first field ($F[0]), which, in this case, is the requesting IP address.

The problem

You need to make changes to a large number of files in-place.

The solution
find . -type f -name '*.txt' -print0 | xargs -0 perl -p -i -e 's/PLACEHOLDER/new_value/g'
The explanation

The key to this example is that we use find to select the files and xargs to pass them to Perl. This is just an extension of what we saw before with xargs, but now we use it with Perl to make large numbers of changes, far more than might fit on a single command line. The alternative would be to load your favorite editor (Emacs, of course), maybe make a macro, and execute it on each file... something you may not have time for if you need to change hundreds of files. Not to mention, it is much more error prone.

The problem

You suspect there may be multiple users with the same numeric uid in /etc/passwd.

The solution
perl -F: -l -a -n -e 'print $F[2]' /etc/passwd | sort | uniq -d
The explanation

The new concept here is the -F: parameter. This just tells Perl to split on colons instead of the default whitespace (or, in general, on any string or regular expression immediately following the -F parameter). In /etc/passwd, the user's numeric id is the third field (which is $F[2]. Remember, engineers count from zero!). We then sort it in preparation to pass to uniq, where we use the -d switch (which prints only duplicates; uniq certainly is versatile).

The problem

You want to know how much data your webserver transferred today.

The solution
cat /var/log/httpd/access_log | perl -l -a -n -e '$n += $F[9]; } END { print $n'
The explanation

Things are starting to get weird. The parameters to Perl are all ones we know, but the expression is peculiar. In fact, it's nearly unreadable unless you already know how it works. How it works, though, is a bit odd. In effect, the -n operator has Perl do this for you:

while (<>) {
  # perl's magic to split $_ into @F

  # code specified by -e goes here
}

Basically Perl just places what is specified with -e into a while loop. So in this case, let's perform the substitution manually:

while (<>) {
  # perl's magic to split $_ into @F

  $n += $F[9]; } END { print $n
}

Adjusting formatting and we have...

while (<>) {
  # perl's magic to split $_ into @F

  $n += $F[9];
}

END {
  print $n
}

Ah, now it gets to be a little clearer. In Perl, a block specified via END { ... } will be executed when the Perl interpreter exits. So in this case, for each line of the input, Perl adds the tenth field to a variable called $n. When the interpreter exits, we print that value.

This is a pretty odd one, but it turns out to be extremely useful. Unlike before, where we just manipulated each line, here we accumulate information and display it at the end.

The problem

You have a logfile that has timestamps in epoch time but your brain doesn't read Unix epoch time. An example line is:

1104903117 0.3
The solution
cat /tmp/weirdlog | perl -l -a -n -e 'print scalar(localtime $F[0]), " @F[1..$#F]"'
The explanation

Sometimes, log files have epoch time instead of human readable time. That's okay, considering how complicated timezones and locales can be, but it makes it hard on our brain's wetware to process effectively. So, we resort to another bit of magic. In this case, we see a few new Perl constructs but nothing new on the way we invoke Perl.

The first construct is turning $F[0], which contains a number like 1104903117, into a human readable date. This is accomplished by the localtime function. localtime can return either an array of values containing various parts of the date such as the month and year or it can return a nice string representation; we force that string representation with the scalar function.

The other construct is " @F[1..$#F]". This is basically the same as "@F" except we only get entries 1, 2, 3, etc., through $#F. $#F is Perl's way of saying "the last valid index in the @F array." Another way of accomplishing this would be:

cat /tmp/weirdlog | perl -l -a -n -e '$F[0] = scalar localtime $F[0]; print "@F"'

Same idea, except we just change @F in place and then print it. One key thing to remember is we aren't using -i, so we won't actually change /tmp/weirdlog (though in this case, the use of cat protects us from that as well; even with -i, Perl can only change files in place if they were on the command line and not if they were streamed to it via stdin).

Conclusion

Hopefully you now have a taste for some of the kinds of magic tricks that can be performed with creative use of Perl and bash. These articles present a set of building blocks that you should take and creatively reuse on your own. Next time you find yourself editing a file and making a large number of the same changes or trying to string together grep and sed and other commands in painful ways, think about how you might solve the problem just by bringing Perl into the mix.

A word of caution, though: with great power comes great responsibility, and the techniques presented here are indeed powerful. One of the main reasons we can get away with such unreadable, arguably ugly code is that no one else will ever read it. We aren't saving it to a file for reuse, and it isn't as if someone else will take our bash history and extend it. The corollary, though, is that if you are writing a script, you might want to shy away from overly clever constructs like some of what we've seen here. The only thing worse than saving one of these constructs and having to debug it six months later is running across one you didn't write and having to debug it six months later! Have compassion for your coworkers when it comes to scripting.

One of the great things about using a tool like Perl in this case is that everything you learn about Perl can be applied both in larger, more formal scripts and on the command line. In effect, you double the use of everything you learn, which is a very good thing. The 'scalar localtime' idiom from the last example, for instance, is very common in non-command line Perl, but it ports right over from regular scripting to the command line.

One of the best ways to learn more creative command line magic is to learn more about Perl itself. There are a variety of resources for this, but assuming you know basic Perl, then the Perl Cookbook is a wonderful resource well-suited for tricks like the ones demonstrated in this article. Likewise, the Perl man pages (which are excellent in quality) provide quite a bit of useful information, most especially man perlrun which documents the various command line switches used in this article.

About the author

Chip Turner has been with Red Hat for three years and is the Lead Architect for Red Hat Network. He also maintains the perl package, all perl modules, and spamassassin for Fedora Core and Red Hat Enterprise Linux as well as authors the RPM2 perl bindings for RPM and a handful of other CPAN perl modules. In his spare time he enjoys playing with his dog and arguing for no apparent point.