Issue #4 February 2005

How I learned to stop worrying and love the command line, part 1

Introduction

You've always been told to write maintainable code. All of those fancy books on Extreme Programming and every computer science course you've ever taken have emphasized commenting and clarity and all of those other broccoli-is-good-for-you-so-clean-your-plate directives. This article, and its second half, are about the opposite of that—unreadable code, inscrutable code, and disposable code. But, also, indispensable code. The dominating factor in the way we write this code, however, will be the editor we use, and that editor is the bash command line prompt.

The first part of this series focuses on that versatile editor and the magic you can weave by combining fundamental concepts of UNIX with a healthy disregard for public safety. The second part will specialize a bit and focus on combining what you learn in the first article with the ubiquitous system administrator survival knife of a language, perl. The goal, however, isn't just to walk on the wrong side of the tracks and live to tell the tale; quite the opposite. The goal is to become more efficient, solve problems that otherwise would be very time consuming, and maybe, just maybe, impress people while we're at it. After all, "any sufficiently advanced technology is indistinguishable from magic."1


The players

Before we dive into those specific facilities, though, we should discuss what bash brings to the party, because if perl is the magic we brew here, then bash is the cauldron we brew it in. Central to all UNIX shells (and even non-UNIX shells, though to a lesser extent) is the ability to take the output of one command and save it to a file or send it as input to another program—that is, input and output redirection. Nearly every trick we use here involves redirection through one or more pipes. Even something as simple as grep https /etc/services | grep -v udp shows the power of the idea: running grep twice in a row is much simpler than coming up with a possibly complex regular expression to achieve the same result—print every line of /etc/services that contains the string https but not the string udp.
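
The other half of redirection, saving output to a file, is just as terse. Here is a quick sketch (the output filename is arbitrary):

grep https /etc/services > https-services.txt

The > sends grep's output into https-services.txt instead of the terminal; >> would append to the file rather than overwrite it.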

bash isn't just about running external programs, however. Built into bash is as full-featured a set of programming constructs as you'd expect in any language—conditionals such as if and case, variables (even arrays), and iteration constructs such as for and while. bash even documents itself via the help command—if at any time you are unsure of the syntax of one of bash's constructs, such as whether if clauses end with fi or endif, all you need do is invoke help with that construct as a parameter—help if.
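
Just to put a couple of those constructs on the page, here is a throwaway sketch (it does nothing useful beyond illustrating an array and a case statement):

FRUITS=(apple banana cherry)
for FRUIT in "${FRUITS[@]}"
do
  case $FRUIT in
    apple)  echo "$FRUIT is a pome" ;;
    cherry) echo "$FRUIT is a drupe" ;;
    *)      echo "$FRUIT is neither" ;;
  esac
done

help case and help for describe both constructs in detail.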

For example, a very common task is to perform some operation on a number of files. Quite often, if that operation is simple, the facility may already be there—rm *, for instance, removes every file in the current directory. However, if your operation is more complex, such as 'delete all of the symlinks in this directory' or 'delete every file containing the word "violet" in the current directory', then chances are no single command will solve the problem.

Using bash constructs such as for and if, though, makes this easy. Take the first example, deleting all of the symlinks in the current directory. Quite simple, with bash:


for FILE in *
do
  if [ -L "$FILE" ]
  then
    rm "$FILE"
  fi
done

In English, this just means: Iterate over every file in the current directory, assigning the name of the file to the variable FILE. If that file is a symlink, then remove the file.

Perhaps the least obvious part of this construct is the if [ -L "$FILE" ] statement. Contrary to many languages, the [ and ] are not grouping like parentheses; instead, [ is actually the name of a bash built-in command, and the closing ] is simply its required final argument. The [ command is the same as the test command, which performs a variety of tests such as string equality, file existence, and, in this case (-L), whether a given file is a symlink. (The quotes around $FILE keep filenames containing spaces from confusing the test, a problem we will run into again later.) The full list of operations can be seen via help test—definitely worth a read to see just how many checks that would take many lines in other languages are quite simple with bash.
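
For instance, borrowing the same if syntax, a directory check and a string comparison look like this (the particular paths and values are chosen purely for illustration):

if [ -d /tmp ]
then
  echo "/tmp is a directory"
fi

if [ "$USER" = "root" ]
then
  echo "running as root"
fi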

Sometimes you will see that symlink-removal loop shortened into this more compact form:


for FILE in *
do
  [ -L "$FILE" ] && rm "$FILE"
done

or even:


for i in *; do [ -L "$i" ] && rm "$i"; done

Both of these simply make the loop more compact. In the first case, the && operates much like it would in C or Perl—if the first condition is true (the file is a symlink), then evaluate the second (remove the file). Also like C and Perl, bash won't evaluate the right hand side of a && (or a ||) if it doesn't have to, so if the conditional check fails, it never reaches the rm command. The third form simply changes FILE to i (a common iteration variable) and crams it all onto one line. Note the placement of semicolons—they are crucial, else bash won't consider the command well-formed.
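
The same pattern handles the other task mentioned earlier, deleting every file in the current directory that contains the word "violet". This sketch leans on grep, which is covered in more detail shortly; its -q option prints nothing and merely reports, via the exit status, whether a match was found:

for FILE in *
do
  if [ -f "$FILE" ] && grep -q violet "$FILE"
  then
    rm "$FILE"
  fi
done

The extra [ -f "$FILE" ] test keeps grep from being pointed at directories.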

Stringing commands together

Another common bash construct you will see is inline expansion, known more formally as command substitution. Simply put, this runs the given command and places its output inside the current command line. For instance:


echo "The current time is: $(date)"

displays something to the effect of:


The current time is: Thu Feb  3 20:50:35 EST 2005

You also often see the so-called backtick operator:


echo "The current time is: `date`"

The two forms do essentially the same thing; however, the $() form allows nesting, is considerably easier to read, and is encouraged over the backtick form.
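
For example, one substitution can sit happily inside another, something that requires awkward escaping with backticks:

echo "This directory is named $(basename $(pwd))"

Here $(pwd) expands to the full path of the current directory, and basename trims that down to its last component.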

Since this expansion can occur at any point, it can be used in a for statement. The basic syntax for the for statement is for VARIABLE in LIST; do COMMANDS; done where LIST is a space separated list of values. Here is an example of how to create ten files, 1.txt through 10.txt:


for i in $(seq 1 10); do touch $i.txt; done

That creates 10 files named 1.txt through 10.txt each of which is empty. This is a quick and easy way to repeat a command N times as well (there is no requirement that the variable used for iteration be referenced in the command).
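
For instance, to run a command five times without ever touching the loop variable:

for i in $(seq 1 5); do echo "hello, world"; done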

Another use of $() is to provide a list of files to another command. For instance, to see how many lines each text file in the current directory containing the string fedora has, you could simply use:


wc -l $(grep -l fedora *.txt)

This introduces the grep command. grep is a standard utility and not a bash built-in command, and it is one of the most important commands you will use when doing complex scripting. Basically grep searches files for a given string or pattern. The invocation is simple—grep PATTERN FILE [ FILE ... ]. PATTERN is a regular expression and can be rather complex, but for our purposes here, the pattern fedora simply matches the literal string fedora. Usually grep prints both the file and the matching line, but in this case, we pass the -l option, which tells it simply to print the filename—quite useful when wanting to operate on files that contain a pattern.
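
To make the difference concrete, assuming the same directory of .txt files, compare:

grep fedora *.txt
grep -l fedora *.txt

The first prints every matching line, prefixed with the name of the file it came from; the second prints each matching filename exactly once.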

There is a dirty secret to this $() approach, though: like most things when it comes to computers, there is a limit. In this case, the limit is on how large the kernel allows a single command line to be. Although the default is quite spacious for commands you type by hand, operations like $() and even just normal wildcard expansion can blow past that limit very quickly. If there were, say, 5000 files in the current directory, and they all contained fedora, then we could run out of space on the command line. There is an answer, though, and it, like grep, isn't part of bash but is exceedingly useful.

That command is xargs. Although an entire article could likely be written on xargs alone, in a nutshell, xargs is simply a way to do $() without worrying about command line size limits. For example, the previous grep -l fedora *.txt would transform into:

find . -maxdepth 1 -name '*.txt' | xargs grep -l fedora

Whew, that got more complicated. For the moment, ignore the find bit and pretend it just lists all of the .txt files in the current directory and prints them to stdout (which is the same thing ls *.txt would do, but remember we can't use *.txt because there are too many files; find gets around this because it receives the pattern in quotes, so the shell doesn't expand it, and find does the matching itself). Next comes xargs, followed by the command we want to run. That part, at least, is fairly simple, but what is going on under the hood is a bit more involved.

What xargs does is read from stdin and then construct a command to execute. The parameters to xargs determine what the command starts with. In other words, it begins a command with grep -l fedora and then starts tacking on everything it reads from stdin. It knows the limit on the command line, though, so once it has packed in as many parameters as will fit without exceeding that limit, it executes the command. Once the command completes, it begins again, repeating this and executing the command as many times as necessary to process all of the input.
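
One easy way to watch this batching happen is to hand xargs a long stream of input and count how many times it runs its command (the exact count depends on your system's limit):

seq 1 100000 | xargs echo | wc -l

Each invocation of echo produces a single line of output, so the final number is how many separate commands xargs had to build to consume all 100,000 arguments.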

But we still don't have the line counts we were looking for—all we have is a list of files containing fedora. Ah-hah! That list is the output of the xargs command. We can take that output and make it the input of the wc command, once again using xargs:

find . -maxdepth 1 -name '*.txt' | xargs grep -l fedora | xargs wc -l

There we go, just what we were looking for. It certainly became more complicated, but it also became more robust; it would work on any number of files, be it one or one million.

find

At first glance, the find command above was rather complicated, not to mention fairly unlike most commands. In particular, the parameters were specified with one dash, not two. However, idiosyncrasies aside, find is a tremendously useful command. Unlike bash and perl, of which there is only one implementation each, it is one of those tools that has major variants depending on your UNIX of choice. The variants divide into two categories, though—GNU find and everyone else's. GNU find lets you get away with some laziness, and this article uses GNU find syntax, but be aware it may not translate exactly to other UNIX-like operating systems.

As the name suggests, find is good for finding things. In this case, the things turn out to be files. find excels at very selectively finding files matching some set of rules. One particularly useful feature of find is that, by default, it recurses into subdirectories. That means:

find /tmp -name '*.txt'

locates all .txt files in /tmp and in all subdirectories of /tmp. Usually this is what you want; quite often, you will find yourself working with entire trees. Sometimes, though, you only want files in the current directory, or those at most one directory deep. That is where the -maxdepth option that appeared in a previous example comes in; it limits how deep find recurses.

Quite often, you will see find used in conjunction with xargs, even without considering the issue of maximum command line lengths. It is not easy with bash alone to specify all of the .txt files in the current directory and below. Sure, you could use *.txt */*.txt */*/*.txt, but that only goes three levels deep, and for a deeply nested set of directories, that simply isn't enough.

There is a hidden trap, though. Suppose we want to compress all .txt files in an entire subtree. Simple enough, using what we've learned before:

find /path/to/tree -name '*.txt' | xargs gzip

The trap, though, is what if one of the files had a space in its name? Whitespace is what xargs uses to delimit files in its input. Therefore, it treats /foo/bar/a file.txt as two arguments—/foo/bar/a and file.txt. Certainly not what we want. As this is a rather common problem, find and xargs both come with what is needed to make our example function properly—using null characters (ASCII 0) to delimit the files, instead of whitespace.

find /path/to/tree -name '*.txt' -print0 | xargs -0 gzip

To see what is going on, try running the find command alone; depending on your terminal, you will likely see what looks like one rather huge single line with all of the filenames smashed together. Actually, though, there is a hidden null between each filename. The -0 parameter on xargs tells xargs that nulls will delimit incoming filenames. Problem solved. In practice, most filenames don't contain spaces, but when they occur, it is essential to know the proper response (much like find and xargs themselves being the proper response when you run into command line length limits).
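
If you would rather see the nulls than take them on faith, pipe the same output through od, which prints unprintable characters symbolically (the \0 entries between the names are the delimiters):

find /path/to/tree -name '*.txt' -print0 | od -c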

Another extremely useful way find can sift through files is by when they were last modified. Often you want to know what has changed recently. For instance, to list all of the files in your home directory that were modified within the past two days:

find ~/ -mtime -2

To find the files that haven't been modified in the past two days, you can change the -mtime parameter:

find ~/ -mtime +2

You can also select files by the last time they were accessed (atime) or by the last time their inode was changed (ctime—despite the name, this is not creation time). Like bash's test command, find has a wide variety of options; reading the manpage is advised (not just for reference, either; it will give you an idea of the flexibility of this peculiar command).
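
These tests combine freely with the ones we have already seen; multiple tests on one command line are ANDed together. For instance, a sketch that narrows the search to regular .txt files modified within the past two days:

find ~/ -type f -name '*.txt' -mtime -2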

Sometimes output isn't in the order you want it. For instance, find doesn't print in alphabetical (or, more accurately, lexicographical) order. Likewise, du doesn't display files largest to smallest (or vice versa). Instead, we must use another command to sort such output—a command appropriately named sort. Effectively, sort can take input of any size (subject to local disk space) and sort it numerically or lexicographically, keyed on any position within each line (not just starting at the first character). For example, suppose you want to find the files with the most lines inside a directory tree:

find /usr -type f | xargs wc -l | sort -n

Here we see our friends find and xargs; what they do should be fairly clear this time (the -type f keeps find from handing directories to wc). Next comes the sort. wc -l produces, for each file, the number of lines followed by the filename. sort will, by default, begin sorting at the first character of each line, and, by default, it sorts lexicographically. The -n option tells it to sort numerically instead. The result is that the files with the fewest lines come first, all the way to the largest at the very bottom.

But that certainly is a lot of output, especially if we just want to know, say, the five longest files. Fortunately, there is a way to take the output of a program and throw out all but the first or last few lines. Those commands are head and tail. Both operate basically the same way—they read from stdin and print only the first few lines (head) or the last few lines (tail). So in this case, we would transform the command to:

find /usr -type f | xargs wc -l | sort -n | tail

That shows the last ten lines of the output—in this case, the ten files with the most lines (tail -n 5 would trim it to just five).
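
As an aside, sort really can key off something other than the start of the line. For example, to sort /etc/passwd numerically by its third colon-separated field, the numeric user ID:

sort -t : -k 3 -n /etc/passwd

The -t option names the field separator and -k picks the field to sort on.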

A taste of perl

The other big standalone program we care about is, of course, perl itself. The perl executable is not only the binary we use to launch regular perl scripts, but it also has a number of command line options that make it very well suited for intermixing on the command line to filter or alter the input and output of other programs. But first, we need to be able to execute arbitrary perl, which is the very core of what we will do. This is done via the -e switch:

perl -l -e 'print 1024 * 1024'

The result here is that we see what 1024 times 1024 is (the -l, which we will almost always use, tells perl to print a newline by default with every print statement; leave it off to see what happens). In fact, I frequently just do a quick perl -le instead of reaching for a calculator when I need a simple calculation—it almost always is faster to drop to a command line than start a separate application.
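
This also combines nicely with the $() expansion from earlier, letting a quick calculation land in the middle of another command:

echo "There are $(perl -le 'print 60 * 60 * 24 * 365') seconds in a non-leap year"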

We finish with a small taste of what we will see in the next article. One of the most common changes one might want to make to a file or set of files is to replace one string with another. Certainly, you could open your favorite editor and do such a change to a single file, or even a handful, but what if you need to change dozens, or even hundreds? Thankfully, there is an easy way with a nice command line perl trick:

perl -p -i -e 's/XFree86/x.org/g' file1.txt file2.txt ...

Simply put, this replaces every occurrence of XFree86 in the listed files with x.org. One can easily imagine combining this with find and xargs to change huge sets of files:

find /tmp/library -type f -print0 | xargs -0 perl -p -i -e 's/XFree86/x.org/g'

But what do the -p and the -i mean? Stay tuned; those questions and more will be answered in the second part of this article.

Conclusion

Hopefully you now have a taste for some of the kinds of magic tricks that can be performed with creative use of bash and some of the more common utilities one finds in most UNIXes. This article is simply a set of building blocks that you should take and creatively reuse on your own. The second part will build upon this foundation and explore some more advanced tricks, including in-depth coverage of everything you need to become a command line deity.

About the author

Chip Turner has been with Red Hat for three years and is the Lead Architect for Red Hat Network. He also maintains the perl package, all perl modules, and spamassassin for Fedora Core and Red Hat Enterprise Linux, and he authors the RPM2 perl bindings for RPM and a handful of other CPAN perl modules. In his spare time he enjoys playing with his dog and arguing for no apparent point.

1 Arthur C. Clarke