Perl.com

Perl Unicode Cookbook: Demo of Unicode Collation and Printing

By Tom Christiansen on June 22, 2012 6:00 AM

Editor's note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.

℞ 44: PROGRAM: Demo of Unicode collation and printing

The past several weeks of Unicode recipes have explained how Unicode works and shown how to use it in your programs. If you've gone through those recipes, you now understand more about Unicode than most programmers.

How about putting everything together?

Here's a full program showing how to use locale-sensitive sorting and Unicode casing, and how to manage print widths when some characters take up zero or two columns rather than exactly one. When run, the following program produces this nicely aligned output (though the quality of the alignment depends on the quality of your Unicode font, of course):

    Crème Brûlée....... €2.00
    Éclair............. €1.60
    Fideuà............. €4.20
    Hamburger.......... €6.00
    Jamón Serrano...... €4.45
    Linguiça........... €7.00
    Pâté............... €4.15
    Pears.............. €2.00
    Pêches............. €2.25
    Smørbrød........... €5.75
    Spätzle............ €5.50
    Xoriço............. €3.00
    Γύρος.............. €6.50
    막걸리............. €4.00
    おもち............. €2.65
    お好み焼き......... €8.00
    シュークリーム..... €1.85
    寿司............... €9.99
    包子............... €7.50

Here's that program; tested on v5.14.

 #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting and unicode_strings
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     "γύρος"             => 6.50, # gyros, Greek
     "pears"             => 2.00, # like um, pears
     "linguiça"          => 7.00, # spicy sausage, Portuguese
     "xoriço"            => 3.00, # chorizo sausage, Catalan
     "hamburger"         => 6.00, # burgermeister meisterburger
     "éclair"            => 1.60, # dessert, French
     "smørbrød"          => 5.75, # sandwiches, Norwegian
     "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
     "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
     "jamón serrano"     => 4.45, # country ham, Spanish
     "pêches"            => 2.25, # peaches, French
     "シュークリーム"    => 1.85, # cream-filled pastry like éclair, Japanese
     "막걸리"            => 4.00, # makgeolli, Korean rice wine
     "寿司"              => 9.99, # sushi, Japanese
     "おもち"            => 2.65, # omochi, rice cakes, Japanese
     "crème brûlée"      => 2.00, # tasty broiled cream, French
     "fideuà"            => 4.20, # more noodles, Valencian (Catalan=fideuada)
     "pâté"              => 4.15, # gooseliver paste, French
     "お好み焼き"        => 8.00, # okonomiyaki, Japanese
 );

 # find the widest allowed width for the name column
 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won't freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll  = Unicode::Collate::Locale->new( locale => "ja" );

 for my $item ($coll->sort(keys %price)) {
     print pad(entitle($item), $width, ".");
     printf " €%.2f\n", $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString->new($str)->columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str     =~ s{ (?=\pL)(\S)     (\S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }

Simple enough, isn't it? Put together, everything just works nicely.
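
Two of the ideas in the program can be tried without any CPAN modules. What follows is a minimal sketch, not part of the original program. It uses the core Unicode::Collate module with its default collation (no locale tailoring) to show why Pâté lands ahead of Pears, and a hypothetical helper named approx_columns to show why length is the wrong measure for padding. The helper is an assumption made for illustration: it only approximates what Unicode::GCString's columns method computes, by counting combining marks and format characters as zero columns and East Asian Wide or Fullwidth characters as two.

```perl
use strict;
use warnings;
use Unicode::Collate;    # core module; Unicode::Collate::Locale adds tailoring

# \N{U+...} escapes keep this sketch independent of the file's encoding.
my @items = (
    "p\N{U+EA}ches",          # pêches
    "pears",
    "p\N{U+E2}t\N{U+E9}",     # pâté
);

# The built-in sort compares code points, so "â" (U+00E2) sorts after "e".
my @by_codepoint = sort @items;

# The default collation compares base letters first and looks at
# accents only to break ties.
my @by_collation = Unicode::Collate->new->sort(@items);

# Hypothetical helper (not from the article): a rough column count
# built only from core Unicode properties.
sub approx_columns {
    my ($str) = @_;
    my $cols = 0;
    for my $char (split //, $str) {
        # combining marks and format characters occupy no column
        next if $char =~ /[\p{Mn}\p{Me}\p{Cf}]/;
        # East Asian Wide and Fullwidth characters occupy two columns
        $cols += $char =~ /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/
               ? 2 : 1;
    }
    return $cols;
}

binmode(STDOUT, ":encoding(UTF-8)");
print "code point order: @by_codepoint\n";
print "collation order:  @by_collation\n";

# sushi in kanji, then "eclair" spelled with a combining acute accent
for my $name ("\N{U+5BFF}\N{U+53F8}", "e\N{U+301}clair") {
    print "$name: length=", length($name),
          " columns=", approx_columns($name), "\n";
}
```

For real output, Unicode::GCString remains the better choice because it works on grapheme clusters; the helper above is only meant to make the difference between characters and columns visible.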

Perl Unicode Cookbook: Unicode Text in DBM Files (the easy way)

By Tom Christiansen on June 20, 2012 6:00 AM

℞ 43: Unicode text in DBM hashes, the easy way

Some Perl libraries require you to jump through hoops to handle Unicode data. Would that everything worked as easily as Perl's open pragma!

For DBM files, here's how to implicitly manage the translation; all encoding and decoding is done automatically, just as with streams that have a particular encoding attached to them. The DBM_Filter module allows you to apply filters to keys and values to manipulate their contents before storing or fetching. The module includes a "utf8" filter. Use it like:

    use DB_File;
    use DBM_Filter;

    my $dbobj = tie %dbhash, "DB_File", "pathname";
    $dbobj->Filter_Push("utf8");  # this is the magic bit; filters keys and values

    # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    $dbhash{$uni_key} = $uni_value;

    # FETCH

    # $uni_key holds a normal Perl string (abstract Unicode)
    my $uni_value = $dbhash{$uni_key};
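
To see the filter at work end to end, here is a minimal round-trip sketch. It rests on a few assumptions that are not in the recipe: it uses the core SDBM_File module instead of DB_File so that it runs without Berkeley DB installed, it writes to a temporary directory, and the function name roundtrip_through_dbm is made up for illustration. It calls Filter_Push, which DBM_Filter documents as applying to both keys and values (Filter_Key_Push and Filter_Value_Push cover one side each). After storing, it reopens the file with no filter to show that the bytes on disk are UTF-8.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;     # assumption: stands in for DB_File in this sketch
use DBM_Filter;

# Hypothetical helper: store one pair through the utf8 filter, then
# return the filtered fetch plus the raw key and value as stored.
sub roundtrip_through_dbm {
    my ($uni_key, $uni_value) = @_;
    my $dir  = tempdir(CLEANUP => 1);
    my $path = "$dir/menu";

    # Pass 1: filtered access, characters in and characters out.
    my %dbhash;
    my $dbobj = tie %dbhash, "SDBM_File", $path, O_RDWR | O_CREAT, 0666
        or die "cannot tie $path: $!";
    $dbobj->Filter_Push("utf8");        # encode on store, decode on fetch
    $dbhash{$uni_key} = $uni_value;
    my $fetched = $dbhash{$uni_key};
    undef $dbobj;                       # drop the object before untie
    untie %dbhash;

    # Pass 2: unfiltered access, to inspect what was actually stored.
    my %raw;
    tie %raw, "SDBM_File", $path, O_RDWR, 0666
        or die "cannot reopen $path: $!";
    my ($raw_key) = keys %raw;
    my $raw_value = $raw{$raw_key};
    untie %raw;

    return ($fetched, $raw_key, $raw_value);
}

# sushi in kanji as the key, a euro price as the value
my ($fetched, $raw_key, $raw_value) =
    roundtrip_through_dbm("\x{5BFF}\x{53F8}", "\x{20AC}9.99");

print "value survived: ", ($fetched eq "\x{20AC}9.99" ? "yes" : "no"), "\n";
print "stored key bytes:   ", unpack("H*", $raw_key),   "\n";
print "stored value bytes: ", unpack("H*", $raw_value), "\n";
```

The second pass is only there for inspection. In ordinary use you would keep the filter attached for the whole life of the tied hash, so that nothing reads or writes the file unfiltered.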

Perl Unicode Cookbook: Unicode Text in Stubborn Libraries

By Tom Christiansen on June 18, 2012 6:00 AM

℞ 42: Unicode text in DBM hashes, the tedious way

While Perl 5 has long been very careful about handling Unicode correctly inside the world of Perl itself, every time you leave the Perl internals, you cross a boundary at which something may need to handle decoding and encoding. This happens when performing IO across a network or to files, when speaking to a database, or even when using XS to use a shared library from Perl.

For example, consider the core module DB_File, which allows you to use Berkeley DB files from Perl—persistent storage for key/value pairs.

Using a regular Perl string as a key or value for a DBM hash will trigger a wide character exception if any codepoints won't fit into a byte. Here's how to manually manage the translation:

    use DB_File;
    use Encode qw(encode decode);
    tie %dbhash, "DB_File", "pathname";

    # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = encode("UTF-8", $uni_value, 1);
    $dbhash{$enc_key} = $enc_value;

    # FETCH

    # assume $uni_key holds a normal Perl string (abstract Unicode)
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = $dbhash{$enc_key};
    my $uni_value = decode("UTF-8", $enc_value, 1);  # stored bytes back to characters

By performing this manual encoding and decoding yourself, you know that your storage file will have a consistent representation of your data. The correct encoding depends on the type of data you store and the capabilities of the external code, of course.
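
The same boundary discipline can be wrapped in a pair of small functions. This is a sketch under stated assumptions rather than a drop-in for DB_File: an ordinary hash named %byte_store stands in for the byte-oriented store, and the names store_text and fetch_text are invented for illustration. One detail worth knowing is that the Encode documentation warns that a CHECK value without the LEAVE_SRC bit may overwrite the source string in place, so the sketch combines FB_CROAK with LEAVE_SRC to keep the caller's strings intact while still dying on malformed data.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Die on malformed input, but leave the source string untouched.
my $CHECK = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Hypothetical stand-in for a store that only understands bytes,
# such as a tied DBM hash.
my %byte_store;

# Characters go in; only UTF-8 bytes ever reach the store.
sub store_text {
    my ($uni_key, $uni_value) = @_;
    my $enc_key   = encode("UTF-8", $uni_key,   $CHECK);
    my $enc_value = encode("UTF-8", $uni_value, $CHECK);
    $byte_store{$enc_key} = $enc_value;
    return;
}

# Encode the key to find the entry, then decode the value on the way out.
sub fetch_text {
    my ($uni_key) = @_;
    my $enc_key   = encode("UTF-8", $uni_key, $CHECK);
    my $enc_value = $byte_store{$enc_key};
    return undef unless defined $enc_value;
    return decode("UTF-8", $enc_value, $CHECK);
}

my $key = "\x{5BFF}\x{53F8}";             # sushi in kanji
store_text($key, "\x{20AC}9.99");         # euro sign plus price
my $value = fetch_text($key);

print "key still usable after encoding: ", (length($key) == 2 ? "yes" : "no"), "\n";
print "round trip ok: ", ($value eq "\x{20AC}9.99" ? "yes" : "no"), "\n";
```

Keeping the encode and decode calls inside two functions means the rest of the program only ever sees character strings, which is the same effect the DBM filter gives you automatically.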

