Linux DevCenter    
 Published on Linux DevCenter (http://www.linuxdevcenter.com/)


Building Unix Tools with Ruby

by Jacek Artymiak
09/18/2003

This article demonstrates how to write Ruby scripts that work like typical, well-behaved Unix commands. To make it more fun and useful, we'll write a command-line tool for processing data stored in the comma separated values (CSV) file format. CSV (not CVS) is used to exchange data between databases, spreadsheets, and securities analysis software, as well as between some scientific applications. That format is also used by payment processing sites that provide downloadable sales data to vendors who use their services.

CSV files are plain-text files in which each line represents one row of data and columns are separated by commas. A sample CSV file is shown below.

ticker,per,date,open,high,low,close,vol
XXXX,D,3-May-02,83.01,83.58,71.13,78.04,9645300
XXXX,D,2-May-02,82.47,85.76,82.05,83.84,7210000
XXXX,D,1-May-02,86.80,90.83,81.74,85.50,14253300
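To make the format concrete, here is a minimal sketch that splits sample rows into columns. It assumes the simple flavor of CSV used in this article (no quoted fields and no commas inside values); the variable names sample, rows, and header are chosen only for illustration.

```ruby
# The first rows of the sample data, embedded as a string for illustration.
sample = <<EOS
ticker,per,date,open,high,low,close,vol
XXXX,D,3-May-02,83.01,83.58,71.13,78.04,9645300
XXXX,D,2-May-02,82.47,85.76,82.05,83.84,7210000
EOS

# One line of text is one row; commas separate the columns.
rows = sample.lines.map { |line| line.chomp.split(",") }

header = rows.first
puts header.length                    # prints 8
puts rows[1][header.index("close")]   # prints 78.04
```

A plain split(",") is enough for data like this, but it would mis-split a field that contains a quoted comma.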

What Is the Script Supposed to Do?

The script, csvt, will extract selected columns of data from a CSV file. The output will also be a CSV file, and the user will be able to specify the order in which the columns of data will be printed. A simple data integrity test will make csvt fail when the number of columns in one line differs from the number of columns in the previous line. The source of data will be either a file or standard input (STDIN), as is customary for many Unix command-line tools.

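The utility will support the following options:

    -e col[,col]...  or  --extract col[,col]...   print only the listed columns, in the order given
    -r col[,col]...  or  --remove col[,col]...    print all columns except the listed ones
    -h  or  --help                                print the help screen
    -u  or  --usage                               print the help screen
    -v  or  --version                             print version information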

When csvt finds an unsupported option, or when it is run without any options, it will default to the behavior determined by --help.

Before You Begin

To complete this tutorial you will need an OS capable of running the Ruby interpreter, the Ruby interpreter itself, and a text editor. The operating system can be any POSIX-compatible system, either commercial (AIX, Solaris, QNX, Microsoft NT/2000, Mac OS X, and others) or free (Linux, FreeBSD, NetBSD, OpenBSD, or Darwin). The Ruby interpreter should be the latest release of Ruby. You can check if Ruby has been installed on your system with the following command:

$ ruby --version

If the system reports that there is no such file or directory, you can download the latest Ruby binaries either from the Ruby site or from one of the repositories of ports and packages for your operating system (check the list of resources at the end of this article).

If ready-made binaries are not available, you can always build Ruby from the original sources found at the Ruby site. Detailed instructions for building Ruby can be found in the README file in the interpreter's source archive. If you get stuck, support is available on comp.lang.ruby as well as on the Ruby-talk mailing list. (Subscription details are on the Ruby site.)

The choice of text editor is largely a matter of personal preference. The author is a devoted vi user, but any text editor will do.

Start with the Help Screen

Every tool, no matter how small, should come with a manual or, at the very least, it should print a short help screen that explains its usage. It is a good habit to write documentation before writing the first line of code.

Since csvt is a simple tool with only five options, you can be forgiven for not writing the manual, but you should embed basic documentation in the script itself. This should be mandatory for even a short script that you are writing for your own use, because chances are good that you will forget what it does in two weeks.

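The help screen will be printed by csvt after the user makes a mistake or runs csvt without specifying any options. Since it should fit on one standard text terminal screen (80 by 25 characters), it must be terse but informative. Ideally, it should present the following information:

    the name of the tool and a one-line description of what it does
    the usage syntax
    the list of supported options
    a few examples
    where to send bug reports
    where to find the licensing terms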

Your help screen could look like this (and it's okay just to type this stuff in a text editor and wrap it in code later):

csvt -- extract columns of data from a CSV (Comma-Separated Values) file
Usage: csvt [POSIX or GNU style options] file ...

POSIX options                     GNU long options
    -e col[,col][,col]...             --extract col[,col][,col]...
    -r col[,col][,col]...             --remove col[,col][,col]...
    -h                                --help
    -u                                --usage
    -v                                --version

Examples:
csvt -e 1,5,6 file             print column 1, 5 and 6 from file
csvt --extract 4,1 file        print column 4 and 1 from file
csvt -r 2,7,1 file             print all columns except 2, 7 and 1 from file
csvt --remove 6,0 file         print all columns except 6 and 0 from file
cat file | csvt --remove 6,0   print all columns except 6 and 0 from file

Send bug reports to bugs@foo.bar
For licensing terms, see source code

Because there are several cases where it might be necessary to display the help screen, you will need to put the code that displays it in a separate method. We'll call it printusage(). (It helps to have the source code of csvt handy.)

def printusage(error_code)
    print "csvt -- extract columns of data from a CSV (Comma-Separated Values) file\n"
    print "Usage: csvt [POSIX or GNU style options] file ...\n\n"
    print "POSIX options                     GNU long options\n"
    print "    -e col[,col][,col]...             --extract col[,col][,col]...\n"
    print "    -r col[,col][,col]...             --remove col[,col][,col]...\n"
    print "    -h                                --help\n"
    print "    -u                                --usage\n"
    print "    -v                                --version\n\n"

    print "Examples: \n"
    print "csvt -e 1,5,6 file             print column 1, 5 and 6 from file\n"
    print "csvt --extract 4,1 file        print column 4 and 1 from file\n"
    print "csvt -r 2,7,1 file             print all columns except 2, 7 and 1 from file\n"
    print "csvt --remove 6,0 file         print all columns except 6 and 0 from file\n"
    print "cat file | csvt --remove 6,0   print all columns except 6 and 0 from file\n\n"
    print "Send bug reports to bugs@foo.bar\n"
    print "For licensing terms, see source code\n"

    exit(error_code)
end

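printusage() takes one argument, error_code, which is later passed to exit(), a built-in Ruby method used to stop the script and return an error code. In your script printusage() will be called in two cases:

    when the user asks for help with --help or --usage (error code 0)
    when the user makes a mistake, such as using an unsupported option or no options at all (error code 1)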

You should always remember to write code that returns appropriate error codes. When your script returns meaningful error codes, it is much easier to write scripts that can handle critical situations.
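The following sketch shows what a calling script sees when a command returns an error code. It does not run csvt itself: a tiny child Ruby process stands in for it, and the helper name child_exit_status is made up for this illustration.

```ruby
require 'rbconfig'

# Hypothetical helper: runs a child Ruby process that exits with the
# given code and returns the status seen by the parent. The child
# stands in for a tool such as csvt.
def child_exit_status(code)
  system(RbConfig.ruby, "-e", "exit(#{Integer(code)})")
  $?.exitstatus
end

[0, 1].each do |code|
  status = child_exit_status(code)
  puts(status == 0 ? "success" : "failed with code #{status}")
end
```

Shell scripts can make the same decision by testing $? after running the command.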

Read Command-Line Options and Arguments

The specification presented in an earlier section lists several options that csvt should understand. Your script can access the list of options and arguments in two ways: by reading them directly from the ARGV array (filled in automatically from the command line), or by using the GetoptLong class to parse ARGV for you. The latter method is preferred: it's easier and saves time.

GetoptLong is defined in a separate library file, so it must be explicitly loaded before you can use it:

require 'getoptlong'

After your script imports getoptlong, you will also need to create a new instance of GetoptLong:

opts = GetoptLong.new(
    [ "--extract",          "-e",   GetoptLong::REQUIRED_ARGUMENT ],
    [ "--remove",           "-r",   GetoptLong::REQUIRED_ARGUMENT ],
    [ "--help",             "-h",   GetoptLong::NO_ARGUMENT ],
    [ "--usage",            "-u",   GetoptLong::NO_ARGUMENT ],
    [ "--version",          "-v",   GetoptLong::NO_ARGUMENT ]
)

The arguments passed to GetoptLong.new are the names of the long and the short options, and the argument flags that fine-tune the behavior of the option parser implemented in GetoptLong. The example above shows how the csvt option specification is turned into code. It is a good habit to define both long and short options, but if for some reason it isn't possible or desired, you can omit them and put "" in place of either the long or the short option that you wish to leave undefined. The argument flags can be set to REQUIRED_ARGUMENT, NO_ARGUMENT, or OPTIONAL_ARGUMENT. The GetoptLong option and argument parser uses these settings to decide how it should interpret the contents of ARGV.
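To see the parser at work without typing a command line, the sketch below feeds it a hand-made argument list. The parse_args helper and the file name data.csv are invented for this illustration, and only two of csvt's options are declared to keep it short.

```ruby
require 'getoptlong'

# Hypothetical helper: parses the given argument list and returns
# [options_seen, remaining_arguments].
def parse_args(argv)
  ARGV.replace(argv)   # GetoptLong reads the global ARGV array
  opts = GetoptLong.new(
    ["--extract", "-e", GetoptLong::REQUIRED_ARGUMENT],
    ["--help",    "-h", GetoptLong::NO_ARGUMENT]
  )

  seen = {}
  # Each option is reported under its long name, whichever form was typed.
  opts.each { |opt, arg| seen[opt] = arg }
  [seen, ARGV.dup]
end

seen, rest = parse_args(["-e", "1,5", "data.csv"])
p seen["--extract"]   # prints "1,5"
p rest                # prints ["data.csv"]
```

Once the options have been consumed, whatever remains in ARGV is the list of non-option arguments, here the input file name.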

Once you have a properly initialized instance of the option parser, you can add code that checks which options have been selected and what mistakes have been made. GetoptLong provides a lot of help here; your job is limited to defining a few variables at the top level of the script and handling any errors that may occur at this stage.

First, let's define those variables:

version        = "0.0.1" # used by the --version or -v option handler
extract_f      = false   # set to true when --extract or -e are used
extract_args   = []      # stores the list of arguments of --extract or -e
remove_f       = false   # set to true when --remove or -r are used
remove_args    = []      # stores the list of arguments of --remove or -r
ex_options_n   = 0       # used to store the number of mutually exclusive
                         # options, when > 1, the script will terminate
have_options_f = false   # set to true when at least one option is used

Next, you need to check which options have been used. The block of code that tests this, and sets the parameters that change the behavior of csvt, follows the pattern shown below:

begin
    opts.each do |opt, arg|
        case opt
            when option
                 ... option handler ...
            when option
                 ... option handler ...
        end
    end

rescue
    ... handle exceptions ...
end

The begin-rescue-end construct that wraps the opts.each do loop is required to add the exception handler, rescue-end, that provides a way to gracefully handle unexpected situations. We need that handler, because we do not want the user to see the trace messages printed by the Ruby interpreter when GetoptLong raises an exception. A short error message and a help screen are much more user friendly.
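As a small illustration of that handler, the sketch below passes an unsupported option (-w) to the parser and rescues the resulting exception. It assumes that rescuing GetoptLong::Error, the parent class of the parser's exceptions, is specific enough; the outcome variable exists only for this demonstration, and quiet is switched on so the parser's own message is not printed twice.

```ruby
require 'getoptlong'

ARGV.replace(["-w"])   # an option the parser does not know about

opts = GetoptLong.new(["--help", "-h", GetoptLong::NO_ARGUMENT])
opts.quiet = true      # suppress GetoptLong's own error message

names = []
begin
  opts.each { |opt, _arg| names << opt }
  outcome = :ok
rescue GetoptLong::Error => e
  # A short message instead of an interpreter backtrace.
  $stderr.puts "csvt: #{e.message}"
  outcome = :usage_error
end

p outcome   # prints :usage_error
```

In csvt the rescue clause calls printusage(1) instead of setting a variable.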

Let's get down to the details. The opts.each do |opt, arg| loop reads options and their arguments, if any are expected:

begin
    opts.each do |opt, arg|

Should the value of opt be some undefined option (e.g., -w), GetoptLong will display an error message about the unsupported option and raise an exception, which stops the script unless it is handled. This sounds a bit drastic, but as you will see in a moment, you can handle that situation easily.

If the value of opt is one of the known options (e.g., --extract), it will be examined by the following case control structure, which sets the extract_f flag and checks which columns from the source file the user wants to print.

Notice that it does not matter if the user uses the long or the short version of the --extract option. GetoptLong treats them both as the same option, which means that you only need to write one handler.

case opt
    when "--extract"
        extract_f    = true
        extract_args = arg.split(",")

        tmp = 0
        extract_args.each do |column|
            begin
                extract_args[tmp] = Integer(column)
                tmp += 1
            rescue
                $stderr.print "csvt: non-integer column index\n"
                printusage(1)
            end
        end

        ex_options_n   += 1
        have_options_f  = true

The --extract option handler sets the extract_f flag, splits the arguments that follow it (remember, these are numbers separated with commas), and checks if all arguments of --extract are numerical, integer indexes. When all goes well, the ex_options_n exclusive options counter is incremented and the have_options_f flag is set to indicate that at least one option was selected by the user. This is used to avoid ambiguity when the user selects mutually exclusive options.
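The conversion step can also be written more compactly. The sketch below is an alternative, not the code used in csvt: parse_columns is a hypothetical helper, and it passes base 10 to Integer() explicitly, which is a deliberate difference from the handler above.

```ruby
# Hypothetical helper: turns "1,5,6" into [1, 5, 6].
# Integer() raises ArgumentError for anything that is not a whole number;
# base 10 is passed explicitly so that "08" is read as 8, not as bad octal.
def parse_columns(arg)
  arg.split(",").map { |column| Integer(column, 10) }
end

p parse_columns("4,1")   # prints [4, 1]

begin
  parse_columns("2,x")
rescue ArgumentError
  $stderr.puts "csvt: non-integer column index"
end
```

Because map builds a new array, there is no need for the tmp counter.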

Because the --extract and --remove options are quite similar in the way they work, their handlers are also almost identical (see below).

    when "--remove"
        remove_f    = true
        remove_args = arg.split(",")

        tmp = 0
        remove_args.each do |column|
            begin
                remove_args[tmp] = Integer(column)
                tmp += 1
            rescue
                $stderr.print "csvt: non-integer column index\n"
                printusage(1)
            end
        end

        ex_options_n   += 1
        have_options_f  = true

Requests for csvt version information are handled by the code shown below. Notice that it doesn't matter if other options were used. Once --version or -v is found, csvt prints version information and exits with 0 (no errors).

    when "--version"
        print $0, ", version ", version, "\n"
        exit(0)

Should the user need some help on csvt usage, our script displays the help screen and exits with 0.

    when "--help"
        printusage(0)

    when "--usage"
        printusage(0)
    end
end

Once the loop ends, it's time to check for possible errors like mutually exclusive and missing options. Both are considered errors and result in displaying an error message followed by the help screen.

#################################################################
# test for mutually exclusive options: --extract and --remove

if ex_options_n > 1
    $stderr.print $0, ": cannot use --extract (-e) and --remove (-r) together\n"
    printusage(1)
end

#################################################################
# test for missing options

if have_options_f == false
    printusage(1)
end

The last piece of the option-processing block of code is the exception handler, which prints the help screen, exits csvt, and returns error code 1.

rescue
    # all other errors
    printusage(1)
end

Your code should look like this now:

require 'getoptlong'

    version        = "0.0.1" # used by the --version or -v option handler
    extract_f      = false   # set to true when --extract or -e are used
    extract_args   = []      # stores the list of arguments of --extract or -e
    remove_f       = false   # set to true when --remove or -r are used
    remove_args    = []      # stores the list of arguments of --remove or -r
    ex_options_n   = 0       # used to store the number of mutually exclusive
                             # options, when > 1, the script will terminate
    have_options_f = false   # set to true when at least one option is used 

    def printusage(error_code)
        print "csvt -- extract columns of data from a CSV (Comma-Separated Values) file\n"
        print "Usage: csvt [POSIX or GNU style options] file ...\n\n"
        print "POSIX options                     GNU long options\n"
        print "    -e col[,col][,col]...             --extract col[,col][,col]...\n"
        print "    -r col[,col][,col]...             --remove col[,col][,col]...\n"
        print "    -h                                --help\n"
        print "    -u                                --usage\n"
        print "    -v                                --version\n\n"

        print "Examples: \n"
        print "csvt -e 1,5,6 file            print column 1,5 and 6 from file\n"
        print "csvt --extract 4,1 file       print column 4 and 1 from file\n"
        print "csvt -r 2,7,1 file            print all columns except 2,7 and 1 from file\n"
        print "csvt --remove 6,0 file        print all columns except 6 and 0 from file\n"
        print "cat file | csvt --remove 6,0  print all columns except 6 and 0 from file\n\n"
        print "Send bug reports to bugs@foo.bar\n"
        print "For licensing terms, see source code\n"
        exit(error_code)
    end

    opts = GetoptLong.new(
        [ "--extract",     "-e",   GetoptLong::REQUIRED_ARGUMENT ],
        [ "--remove",      "-r",   GetoptLong::REQUIRED_ARGUMENT ],
        [ "--help",        "-h",   GetoptLong::NO_ARGUMENT ],
        [ "--usage",       "-u",   GetoptLong::NO_ARGUMENT ],
        [ "--version",     "-v",   GetoptLong::NO_ARGUMENT ]
    )

    begin
        opts.each do |opt, arg|
            case opt
                when "--extract"
                    extract_f    = true
                    extract_args = arg.split(",")

                    tmp = 0
                    extract_args.each do |column|
                        begin
                            extract_args[tmp] = Integer(column)
                            tmp += 1
                        rescue
                            $stderr.print "csvt: non-integer column index\n"
                            printusage(1)
                        end
                    end

                    ex_options_n   += 1
                    have_options_f  = true

                when "--remove"
                    remove_f    = true
                    remove_args = arg.split(",")

                    tmp = 0
                    remove_args.each do |column|
                        begin
                            remove_args[tmp] = Integer(column)
                            tmp += 1
                        rescue
                            $stderr.print "csvt: non-integer column index\n"
                            printusage(1)
                        end
                    end

                    ex_options_n   += 1
                    have_options_f  = true

                when "--help"
                    printusage(0)

                when "--usage"
                    printusage(0)

                when "--version"
                    print "csvt, version ", version, "\n"
                    exit(0)
            end
        end

        #################################################################
        # test for mutually exclusive options: --extract and --remove

        if ex_options_n > 1
            $stderr.print "csvt: cannot use --extract (-e) and --remove (-r) together\n"
            printusage(1)
        end

        #################################################################
        # test for missing options 

        if have_options_f == false
            printusage(1)
        end

    rescue 
        printusage(1)
    end

Get the Plumbing Right

With option parsing code in place, you are now ready to add code for processing CSV files and for making your script behave like a proper command line tool.

It is an old Unix tradition that commands can be piped together to create more complex tools. Your script should obey that convention; doing so will make it more flexible and allow other users do things the authors of the software have never dreamed of.

Writing a Ruby script that fits into that scheme is actually very simple. The simplest piece of code that copies everything from STDIN to STDOUT is just three lines long:

while gets
    print 
end

Add it at the end of your script and see how it works. You do not need to worry about the way data is sent to your script. Both examples shown below give the same results, all without writing additional code.
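The reason no additional code is needed is that gets reads from the files named in ARGV when there are any, and from STDIN otherwise. The sketch below demonstrates the file case; the two temporary files stand in for file1 and file2, and ARGV is set by hand to mimic what is left after the options have been parsed.

```ruby
require 'tempfile'

# Two throwaway input files stand in for file1 and file2.
files = ["a,b\n", "c,d\n"].map do |content|
  f = Tempfile.new("csvt-demo")
  f.write(content)
  f.close
  f
end

# Mimic the state after option parsing: only file names remain in ARGV.
ARGV.replace(files.map(&:path))

lines = []
while (line = gets)   # reads the files listed in ARGV one after another
  lines << line
end

print(*lines)
```

With an empty ARGV the same loop would read from STDIN, which is what makes the piped form of the command work.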

$ cat file1 file2 | csvt -e 2,0
$ csvt -e 2,0 file1 file2

Processing Input

The simple loop shown in the previous section is not very useful, because it does not do any processing of input, but it does illustrate the general concept. The csvt script will use two such loops, one for --extract and one for --remove. Both start with a test of the appropriate flag, extract_f for --extract and remove_f for --remove.

if extract_f == true
    first_f = true

The first_f flag is used to avoid the "off by one" error inside the while loop:

while gets
        data   = $_.chomp   # strip the end of line character, if there is one
        data   = data.split(",")
        data_n = data.length

Every loop cycle starts with a call to gets, which reads a new line from STDIN and stores it in $_. Next the script removes the end of line character and splits the line into an array of separate columns.

        if first_f
            old_data_n = data_n
            first_f    = false
        end

The size of the array is stored in data_n. For the first line there is no previous line to compare against, so the script sets old_data_n to the number of columns on the first line. This lets the first line pass the data integrity check, which compares the number of columns in the previous and the current line.

        if data_n != old_data_n
            # the + must end the line, otherwise Ruby starts a new statement
            $stderr.print "csvt: the number of fields on the " +
                          "following line does not match the number " +
                          "of fields on the previous line\n"
            $stderr.print $_
            exit(1)
        end

Should the data integrity test fail, the error message followed by the offending line will be printed to standard error (STDERR) and the execution of csvt will stop. It is tempting to relax the rules a little and introduce an option for skipping such errors, but that's a job for a separate tool; namely, a specialized data integrity checker, which is usually written with a particular data set in mind and is therefore outside the scope of csvt's specification.

When everything goes well, we can begin constructing a line of output. This starts with initializing the line variable:

line = ""

Next we traverse the array of arguments for the --extract option. As you will notice, there is a test that checks whether the column index is less than the number of fields in the line we just read. If it is not, csvt will complain, suggest the allowed range of indexes, and exit with code 1.

        extract_args.each do |column|

            if !(column < data_n)
                # the + must end the line, otherwise Ruby starts a new statement
                $stderr.print "csvt: column index out of range, " +
                              "use numbers between 0 and ",
                              data_n - 1, "\n"
                exit(1)
            end

If all goes well, we use the value of column as the index into the data array and add the result to the string stored in line, followed by a comma.

            line += data[column] + ","
        end

Once all columns listed as arguments of --extract have been processed, we can print the contents of the line variable, less the last character, which we replace with the end of line character.

        print line[0, line.length-1], "\n"

The last thing is setting the old_data_n variable to the number of columns in the currently processed line, so the data integrity check can spot any errors.

        old_data_n = data_n
    end
end

So it goes until the end of the file or data stream. When all data is processed, our script ends with a call to exit(0).
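For comparison, the per-line work of the --extract loop can be condensed with Array#values_at and Array#join. This is a sketch of an alternative, not the code in csvt: extract_columns is a hypothetical helper, it raises an exception instead of calling exit, and unlike the check above it also rejects negative indexes.

```ruby
# Hypothetical helper: returns the selected columns of one CSV line,
# in the order given, or raises IndexError for an out-of-range index.
def extract_columns(text, columns)
  data = text.chomp.split(",")

  columns.each do |column|
    unless (0...data.length).cover?(column)
      raise IndexError,
            "column index out of range, use numbers between 0 and #{data.length - 1}"
    end
  end

  # values_at picks the fields in the requested order; join adds the commas.
  data.values_at(*columns).join(",")
end

puts extract_columns("XXXX,D,3-May-02,83.01\n", [2, 0])   # prints 3-May-02,XXXX
```

Using join avoids having to trim the trailing comma afterwards.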

The code used to process STDIN when the user chooses the --remove option is similar to the --extract handler, with a small twist after the line variable initialization.

if remove_f == true
    first_f = true

    while gets
        data   = $_.chomp   # strip the end of line character, if there is one
        data   = data.split(",")
        data_n = data.length

        if first_f
            old_data_n = data_n
            first_f    = false
        end

        if data_n != old_data_n
            # the + must end the line, otherwise Ruby starts a new statement
            $stderr.print "csvt: the number of fields on the following " +
                          "line does not match the number of fields on " +
                          "the previous line\n"
            $stderr.print $_
            exit(1)
        end

        line = ""

There is an additional loop that sets the columns whose indexes are listed as arguments of --remove to "".

        remove_args.each do |column|

            if !(column < data_n)
                # the + must end the line, otherwise Ruby starts a new statement
                $stderr.print "csvt: field index out of range, " +
                              "use numbers between 0 and ",
                              data_n - 1, "\n"
                exit(1)
            end

            data[column] = ""
        end

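The rest of the code is similar to the code in the --extract handler: it builds the output line from the columns that are left, skipping those set to "", and prints it. Note that this approach also drops columns that were already empty in the input.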

        data.each do |column|
            if column == ""
                next
            else
                line += column + ","
            end
        end

        print line[0, line.length-1], "\n"

        old_data_n = data_n
    end
end
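If empty fields matter in your data, columns can be removed by position instead of by blanking them. The sketch below shows one way to do that; remove_columns is a hypothetical helper and not part of the csvt script described here.

```ruby
# Hypothetical helper: drops the listed columns from one CSV line by
# position, so fields that are legitimately empty survive.
def remove_columns(text, columns)
  # The -1 limit keeps trailing empty fields that split would otherwise drop.
  data = text.chomp.split(",", -1)

  kept = data.each_with_index.reject { |_field, index| columns.include?(index) }
  kept.map { |field, _index| field }.join(",")
end

puts remove_columns("a,,c,d\n", [0, 3])   # prints ,c
```

The empty second field is kept in the output because the decision is based on the column index, not on the field's content.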

We now have a complete script to help us filter CSV files. It may grow in the future, but for now it is quite complete. Your script plays well with other command-line Unix tools and is a well-behaved Unix citizen. The complete script is here.

Make csvt Executable

Your script is working now and you could call it quits, but for greater convenience in the future, try to make an extra effort and make csvt executable, so you can type just this:

$ csvt

instead of this:

$ ruby csvt.rb

If you are using Unix, simply add this code on the first line of your script:

#!/usr/local/bin/ruby

The actual path to the ruby interpreter binary might be different on your system. The easiest way to find out is to use the locate or which command:

$ locate ruby
$ which ruby

If both fail, use find:

$ find / -name "ruby"

This might take a while because find is searching the whole directory tree. Once you know the path to the ruby binary, paste it after #! and save the script to disk. Remember that you need to place this instruction on the very first line of your script, or the shell will not be able to recognize it as a request to use the Ruby interpreter. If you need to pass options to the interpreter, you can list them after the path, but remember that there is no need to list the name of the script itself.

Now save csvt to disk, and make it executable with $ chmod u+x csvt.

The u+x argument tells chmod to mark csvt as executable only by the owner of the script (that would be you ...). Other possibilities include g+x, which marks the script as executable by all members of the group that the script is assigned to (ls -l reveals the script's group); o+x, which would make the script executable by all other users (not a good idea); finally, a+x would make it executable by all users (this should be avoided as well).

Note that neither the #! notation nor the chmod command can be used in the Microsoft Windows environment unless you install the Cygwin package, which turns Windows into a pretty good Unix look-alike. When installing Cygwin is not an option, you can still use csvt, but it must be preceded with the ruby command, as in ruby csvt -e 1,5 file instead of csvt -e 1,5 file.

Resources

The following places should be on the list of favorite destinations for everyone learning and using Ruby:

Books

If you want to enhance your knowledge of Ruby, you should take a look at Ruby in a Nutshell from O'Reilly or Programming Ruby from Addison-Wesley. Safari has at least half a dozen Ruby titles, from O'Reilly as well as other publishers.

Jacek Artymiak started his adventure with computers in 1986 with Sinclair ZX Spectrum. He's been using various commercial and Open Source Unix systems since 1991. Today, Jacek runs devGuide.net, writes and teaches about Open Source software and security, and tries to make things happen.


Copyright © 2009 O'Reilly Media, Inc.