Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Unix Review Column 4 (September 1995)

Perl has a substantial collection of data recognizing and capturing operators and features. In this column, I'm going to build a parser for a small somewhat free-form database, like a mailing list.

First, let's take a look at some sample data:

        Name: Randal L. Schwartz
        Company: Stonehenge Consulting Services
        Street: 4470 SW Hall Suite 107
        City: Beaverton
        State: Oregon
        Zip: 97005
        Phone: 503-777-0095
        
        Name: John Big-booty
        City: San Angeles
        State: California
        Zip: 93021
        Phone: 291-555-2213
        
        Company: Lips, Inc.
        Street: 4221 Wayback Lane
        City: Springfield
        State: Kansas
        Zip: 65554

Each entry is delimited from the next by a blank line. Note that not all fields are present in each entry, so we'll have to consider that when we're reading it into a database. (Please note that these entries are for illustrative purposes only -- any similarity to actual persons, living or dead, is merely a coincidence.)

I'm going to put the data into a list of associative arrays. (This is not possible in earlier versions of Perl, so if these examples don't work, make sure you have Perl version 5.000 or later.) Each associative array represents one entry from the database. The keys of the associative array are the names of the fields (like ``Name'' or ``City'') with the values being the corresponding data.
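
To make the target concrete, here is a minimal sketch of what such a list of associative arrays might look like once it is built. The hand-written values are taken from the sample data, and the variable names @entries and $first are only illustrative assumptions at this point:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A hand-built list of references to anonymous associative arrays,
# standing in for what the parser is meant to produce.
my @entries = (
    { "Name" => "Randal L. Schwartz", "City" => "Beaverton",   "State" => "Oregon" },
    { "Company" => "Lips, Inc.",      "City" => "Springfield", "State" => "Kansas" },
);

# Each element is a reference; dereference it to reach a field.
my $first = $entries[0];
print $$first{"Name"}, "\n";
print scalar(@entries), " entries\n";
```

Note that the second entry simply has no ``Name'' key, which is how a missing field shows up in this representation.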

First, we'll need to break the data up into entries. The simplest way is to take advantage of ``paragraph mode'' while reading the file. In paragraph mode, each ``line'' read from the file is actually any number of lines up to the next blank line or end of file. (Isn't it convenient that there's a Perl mode to read this kind of data? Some would probably accuse me of selecting this structure to match the Perl mode, but I'm not telling. :-) Paragraph mode is selected by setting $/ to an empty string:

        #!/usr/bin/perl
        $/ = "";
        while (<>) {
                ## one entry per $_ here
        }
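
If you'd like to watch paragraph mode at work without a data file, the following sketch reads from a string through an in-memory filehandle. The in-memory open needs Perl 5.8 or later, and the shortened data and the names $data, $fh, and @paragraphs are assumptions made up for the illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data held in a string instead of a file.
my $data = "Name: A\nCity: X\n\nName: B\nCity: Y\n";

# Open the string as an in-memory filehandle (Perl 5.8 or later).
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

my @paragraphs;
{
    local $/ = "";          # paragraph mode, only inside this block
    @paragraphs = <$fh>;    # one element per blank-line-delimited entry
}
close $fh;

print scalar(@paragraphs), " entries read\n";   # prints "2 entries read"
```

Using local here keeps the change to $/ from leaking into any other code that reads files in the ordinary line-at-a-time way.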

Wow, we're almost done (not!). The text in $_ will now look like a number of lines that are separated by \n. We need to parse this data, which is most easily done using multi-line match mode. Multi-line match mode (indicated by a trailing m on the regular expression) allows the caret (^) to match not only the beginning of the string, but also just after any embedded newline in the string. Now, to grab the Name, Street, and City fields, it'd look like this:

        #!/usr/bin/perl
        $/ = "";
        while (<>) {
                %entry = (); # initialize empty entry
                if (/^Name: (.*)/m) {
                        $entry{"Name"} = $1;
                }
                if (/^Street: (.*)/m) {
                        $entry{"Street"} = $1;
                }
                if (/^City: (.*)/m) {
                        $entry{"City"} = $1;
                }
                ## save %entry here
        }
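
The effect of that trailing m is easy to check in isolation. Here is a small sketch, using one entry from the sample data stored in a variable I've called $entry (an assumption for the illustration), that tries the same pattern with and without it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One entry as it would arrive in paragraph mode: several lines in one string.
my $entry = "Name: John Big-booty\nCity: San Angeles\nState: California\n";

# Without m, ^ anchors only at the very start of the string,
# so a field on the second line is not found.
my $without_m = ($entry =~ /^City: (.*)/)  ? $1 : "(no match)";

# With m, ^ also matches just after each embedded newline.
my $with_m    = ($entry =~ /^City: (.*)/m) ? $1 : "(no match)";

print "without m: $without_m\n";
print "with m:    $with_m\n";
```

Because a dot never matches a newline by default, the captured value stops at the end of the City line.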

As you can see, the code for parsing the entry is looking rather repetitive, and I haven't even done all of the fieldnames yet. This was my first approach, so let me throw it away and try something more general.

It's easier to think of the data as ``field: value'', and grab both at the same time. Let's try that direction:

        #!/usr/bin/perl
        $/ = "";
        while (<>) {
                # for each entry:
                %entry = (); # initialize
                foreach (split /\n/) {
                        # for each line in entry:
                        $entry{$1} = $2
                                if /^(.*): (.*)$/;
                }
                ## save %entry here
        }

Ahh! This is getting close. However, there's a bug here. Can you tell what it is before reading ahead? Nope? Well, suppose the value contains a colon, as in:

        Company: White Elephants: A division of Trunks-R-Us

The first .* will match all the way up to the second colon, because regular expression quantifiers are greedy by default. To fix that, just put a ? after that first star (giving .*?), which says to match as few characters as possible (be ``stingy'') rather than as many as possible (``greedy''). (To save space, I won't repeat the program part -- just look for the change in later revisions.)
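
The difference between the greedy .* and the stingy .*? can be seen by trying both on that troublesome line. This is only a sketch; the variable names are mine, chosen for the illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "Company: White Elephants: A division of Trunks-R-Us";

# Greedy: the first .* grabs as much as it can, so the split
# happens at the LAST colon-space in the line.
my ($greedy_field, $greedy_value) = $line =~ /^(.*): (.*)$/;

# Stingy: the first .*? grabs as little as it can, so the split
# happens at the FIRST colon-space in the line.
my ($stingy_field, $stingy_value) = $line =~ /^(.*?): (.*)$/;

print "greedy: [$greedy_field] => [$greedy_value]\n";
print "stingy: [$stingy_field] => [$stingy_value]\n";
```

Only the stingy version yields ``Company'' as the field name, with the rest of the line, embedded colon and all, as the value.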

Note that the order of data, or omission of some of the fields, is irrelevant. I could put the city before the name, or whatever.

Now I have a valid %entry for each entry. All I have to do is save it. I can't save the actual associative array into a list, but I can save a pointer to the associative array using a reference. At first thought, you might think to do this:

        #!/usr/bin/perl
        $/ = "";
        @entries = (); # master list
        while (<>) {
                # for each entry:
                %entry = (); # initialize
                foreach (split /\n/) {
                        # for each line in entry:
                        $entry{$1} = $2
                                if /^(.*?): (.*)$/;
                }
                # save %entry to @entries:
                push @entries, \%entry;
        }

And this will indeed make a list of references to associative arrays. However, it's a list of references to the same associative array, %entry! What we need instead is to make a new anonymous associative array for each of the entries. There are a couple of ways we can do that. One is to use the data from %entry inside the initialization of a brand new associative array:

        #!/usr/bin/perl
        $/ = "";
        @entries = (); # master list
        while (<>) {
                # for each entry:
                %entry = (); # initialize
                foreach (split /\n/) {
                        # for each line in entry:
                        $entry{$1} = $2
                                if /^(.*?): (.*)$/;
                }
                # save %entry to @entries:
                push @entries, {%entry};
        }

The other, probably spiffier way is to create a new anonymous associative array to begin with, and get rid of %entry altogether:

        #!/usr/bin/perl
        $/ = "";
        @entries = (); # master list
        while (<>) {
                # for each entry:
                $ref_entry = {}; # anon hash
                foreach (split /\n/) {
                        # for each line in entry:
                        $$ref_entry{$1} = $2
                                if /^(.*?): (.*)$/;
                }
                # save this entry to @entries:
                push @entries, $ref_entry;
        }

I like this one better, even though the syntax of de-referencing the reference to the anonymous associative array gets slightly uglier in the middle of the loop. The syntax $$ref_entry{$1} says to use $ref_entry as the ``name'' of an associative array, and look up the key $1 in that associative array.

Whichever you prefer, we now have what we started out to get -- a list of references to (anonymous) associative arrays representing the original data entries.
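
If the difference between a reference to one reused associative array and a fresh anonymous copy still seems abstract, this small sketch puts the two side by side. The two-name loop and the list names @shared and @copied are assumptions invented for the comparison:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my (@shared, @copied);
my %entry;
foreach my $name ("Randal", "John") {
    %entry = ("Name" => $name);
    push @shared, \%entry;    # every element points at the same %entry
    push @copied, {%entry};   # each element is a fresh anonymous copy
}

# Both elements of @shared see whatever %entry held last.
print "shared: $shared[0]{Name}, $shared[1]{Name}\n";   # shared: John, John
print "copied: $copied[0]{Name}, $copied[1]{Name}\n";   # copied: Randal, John
```

The references in @shared are all the same pointer, so the earlier data is gone; the copies in @copied each keep their own data.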

For my first trick, I'll print the data in a mailing list form. To do this, I'll have to walk through the data, looking for things with complete addresses. Let's give it a whirl:

        #!/usr/bin/perl
        [parsing code from above goes here]
        foreach $ref (@entries) {
                $name = $$ref{"Name"};
                $company = $$ref{"Company"};
                $street = $$ref{"Street"};
                $city = $$ref{"City"};
                $state = $$ref{"State"};
                $zip = $$ref{"Zip"};
                next unless defined $street;
                print "$name\n$street\n";
                print "$city, $state $zip\n";
        }

Here, I'm looking for entries that have an address, because there are entries in my database for which I have only a phone number. Those funky $$ref{"Something"} things are once again using $ref as the ``name'' of an associative array. Recall that each element of the @entries list is a reference to an associative array.

The entries here will come out in the order that I've defined them. What if I want them in zip code order? No problem (well, no major problem) -- just sort the entries list before using it:

        #!/usr/bin/perl
        [parsing code from above]
        @entries = sort {
                $$a{"Zip"} <=> $$b{"Zip"}
        } @entries;
        [printing code from above]

Here, I'm using a user-defined (that'd be me) sort to rearrange the contents of the @entries list. A user-defined sort routine is handed two elements of the list to be sorted (here, @entries) as $a and $b. Since each of the elements is a reference to an associative array, I can dereference them to get the ``Zip'' field from each one.
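
To try the zip code sort on its own, here is a self-contained sketch. The three abbreviated entries reuse zip codes from the sample data, and @by_zip and @zips are names I've made up for the illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Abbreviated entries carrying only the fields needed here.
my @entries = (
    { "City" => "Beaverton",   "Zip" => "97005" },
    { "City" => "Springfield", "Zip" => "65554" },
    { "City" => "San Angeles", "Zip" => "93021" },
);

# $a and $b are references, so dereference each to reach its Zip.
my @by_zip = sort { $$a{"Zip"} <=> $$b{"Zip"} } @entries;

# $_->{"Zip"} is the arrow form of the same dereference.
my @zips = map { $_->{"Zip"} } @by_zip;
print "@zips\n";   # prints "65554 93021 97005"
```

The numeric <=> comparison suits five-digit zip codes; for keys that are not purely numeric, cmp would probably be the safer choice.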

If I wanted something more complicated, like state first, and then city within state, it's just a small matter of typing a little more:

        #!/usr/bin/perl
        [parsing code from above]
        @entries = sort {
                $$a{"State"} cmp $$b{"State"} or
                $$a{"City"} cmp $$b{"City"}
        } @entries;
        [printing code from above]

Here, the left part of the or operator will be evaluated first. If the cmp returns -1 or +1, then the right part of the or operator can be skipped. This happens when the states differ. If the states are the same, then the right part of the or operator must be evaluated (because the left part is 0, which is false), yielding the comparison between cities (a difficult task to do in real life).
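
Here is a sketch of that two-level comparison run against a tiny list. The Portland entry is hypothetical, added only so that two entries share a state, and @sorted and @labels are names invented for the illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @entries = (
    { "City" => "Portland",    "State" => "Oregon" },      # hypothetical entry
    { "City" => "San Angeles", "State" => "California" },
    { "City" => "Beaverton",   "State" => "Oregon" },
);

# Compare states first; only when they are equal (cmp gives 0,
# which is false) does the "or" fall through to compare cities.
my @sorted = sort {
    $$a{"State"} cmp $$b{"State"} or
    $$a{"City"}  cmp $$b{"City"}
} @entries;

my @labels = map { $_->{"State"} . "/" . $_->{"City"} } @sorted;
print join(", ", @labels), "\n";
```

This should list California first, then the two Oregon entries with Beaverton ahead of Portland.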

For my last trick, I'll give you the entire program that prints only the phone numbers for persons or companies for which we have phone numbers, sorted by phone number:

        #!/usr/bin/perl
        $/ = "";
        @entries = (); # master list
        while (<>) {
                # for each entry:
                $ref_entry = {}; # anon hash
                foreach (split /\n/) {
                        # for each line in entry:
                        $$ref_entry{$1} = $2
                                if /^(.*?): (.*)$/;
                }
                # save this entry to @entries:
                push @entries, $ref_entry;
        }
        @entries = sort {
                $$a{"Phone"} cmp $$b{"Phone"}
        } @entries;
        foreach $ref (@entries) {
                $phone = $$ref{"Phone"};
                next unless defined $phone;
                $name = $$ref{"Name"};
                $company = $$ref{"Company"};
                print "$name ($company) $phone\n";
        }

This is just built up from pieces from the previous snippets, hacked just slightly to meet the new goal.

Perl's richness of language features, while possibly initially intimidating, can yield tremendous flexibility in the long run if you take the time to explore them. Hopefully, you've seen a few new cool tricks in this column. Keep reading for future cool tricks and basic techniques.

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.
