Perl.com

Perl Unicode Cookbook: Further Resources

By Tom Christiansen on June 29, 2012 6:00 AM

Editor's note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to have serialized this list over the past weeks.

This series has shown you several features of Unicode by example, as well as several techniques for working with Unicode correctly and easily with recent releases of Perl 5. By now you know more than many programmers do about Unicode, but your journey to mastery continues.

Perl 5 includes several pieces of documentation which explain Unicode and Perl's Unicode support. See perlunicode, perluniprops, perlre, perlrecharclass, perluniintro, perlunitut and perlunifaq.

Perl 5 and the CPAN provide several modules and distributions to allow the effective use of Unicode. As of Perl 5.16, many of these are in the core library. Many of them work just as well with earlier versions of Perl 5, though for the best and most correct support for Unicode as a whole, consider using Perl 5.14 or 5.16.

These modules include:

The CPAN distribution Unicode::Tussle includes many command-line programs to help with working with Unicode, including these programs to fully or partly replace standard utilities: tcgrep instead of egrep, uniquote instead of cat -v or hexdump, uniwc instead of wc, unilook instead of look, unifmt instead of fmt, and ucsort instead of sort. For exploring Unicode character names and character properties, see its uniprops, unichars, and uninames programs. It also supplies these programs, all of which are general filters that do Unicode-y things: unititle and unicaps; uniwide and uninarrow; unisupers and unisubs; nfd, nfc, nfkd, and nfkc; and uc, lc, and tc.
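
The nfd and nfc filters named above correspond to the normalization forms that the core Unicode::Normalize module provides as functions. The following is a minimal sketch of that idea, not code from Unicode::Tussle; the sample string is chosen only for illustration.

```perl
use v5.14;
use Unicode::Normalize qw(NFC NFD);

# "crème" with è as the single code point U+00E8
my $composed   = "cr\x{E8}me";

# NFD splits è into "e" followed by U+0300 COMBINING GRAVE ACCENT
my $decomposed = NFD($composed);

say length $composed;      # 5 code points
say length $decomposed;    # 6 code points

# NFC recombines the pair, so the round trip restores the original
say NFC($decomposed) eq $composed ? "round trip ok" : "mismatch";
```

Both strings look identical when printed, which is why normalizing before comparing or storing text is usually worthwhile.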

Finally, see the published Unicode Standard (page numbers are from version 6.0.0), including these specific annexes and technical reports:

§3.13 Default Case Algorithms, page 113
§4.2 Case, pages 120-122
Case Mappings, pages 166-172, especially Caseless Matching starting on page 170
UAX #44: Unicode Character Database
UTS #18: Unicode Regular Expressions
UAX #15: Unicode Normalization Forms
UTS #10: Unicode Collation Algorithm
UAX #29: Unicode Text Segmentation
UAX #14: Unicode Line Breaking Algorithm
UAX #11: East Asian Width
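
As one small illustration of what these annexes cover, Perl's \X regex escape matches an extended grapheme cluster in the sense of UAX #29. The sketch below assumes nothing beyond core Perl; the sample string is made up for the purpose.

```perl
use v5.14;

# "e" + U+0301 COMBINING ACUTE ACCENT, then "x":
# three code points, but only two user-visible characters
my $str = "e\x{301}x";

my @code_points = split //, $str;     # splits between every code point
my @graphemes   = $str =~ /\X/g;      # keeps base + combining mark together

say scalar @code_points;   # 3
say scalar @graphemes;     # 2
```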

Tom Christiansen <tchrist@perl.com> wrote this series, with occasional kibitzing from Larry Wall and Jeffrey Friedl in the background.

Most of these examples came from the current edition of the "Camel Book"; that is, from the 4th Edition of Programming Perl, Copyright © 2012 Tom Christiansen et al., 2012-02-13 by O'Reilly Media. The code itself is freely redistributable, and you are encouraged to transplant, fold, spindle, and mutilate any of the examples in this series however you please for inclusion into your own programs without any encumbrance whatsoever. Acknowledgement via code comment is polite but not required.

Perl Unicode Cookbook: Demo of Unicode Collation and Printing

By Tom Christiansen on June 22, 2012 6:00 AM

Editor's note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.

℞ 44: PROGRAM: Demo of Unicode collation and printing

The past several weeks of Unicode recipes have explained how Unicode works and shown how to use it in your programs. If you've gone through those recipes, you now understand more than most programmers.

How about putting everything together?

Here's a full program showing how to make use of locale-sensitive sorting, Unicode casing, and managing print widths when some of the characters take up zero or two columns, not just one column each time. When run, the following program produces this nicely aligned output (though the quality of the alignment depends on the quality of your Unicode font, of course):

    Crème Brûlée....... €2.00
    Éclair............. €1.60
    Fideuà............. €4.20
    Hamburger.......... €6.00
    Jamón Serrano...... €4.45
    Linguiça........... €7.00
    Pâté............... €4.15
    Pears.............. €2.00
    Pêches............. €2.25
    Smørbrød........... €5.75
    Spätzle............ €5.50
    Xoriço............. €3.00
    Γύρος.............. €6.50
    막걸리............. €4.00
    おもち............. €2.65
    お好み焼き......... €8.00
    シュークリーム..... €1.85
    寿司............... €9.99
    包子............... €7.50

Here's that program; tested on v5.14.

 #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting and unicode_strings
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     "γύρος"             => 6.50, # gyros, Greek
     "pears"             => 2.00, # like um, pears
     "linguiça"          => 7.00, # spicy sausage, Portuguese
     "xoriço"            => 3.00, # chorizo sausage, Catalan
     "hamburger"         => 6.00, # burgermeister meisterburger
     "éclair"            => 1.60, # dessert, French
     "smørbrød"          => 5.75, # sandwiches, Norwegian
     "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
     "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
     "jamón serrano"     => 4.45, # country ham, Spanish
     "pêches"            => 2.25, # peaches, French
     "シュークリーム"    => 1.85, # cream-filled pastry like éclair, Japanese
     "막걸리"            => 4.00, # makgeolli, Korean rice wine
     "寿司"              => 9.99, # sushi, Japanese
     "おもち"            => 2.65, # omochi, rice cakes, Japanese
     "crème brûlée"      => 2.00, # tasty broiled cream, French
     "fideuà"            => 4.20, # more noodles, Valencian (Catalan=fideuada)
     "pâté"              => 4.15, # gooseliver paste, French
     "お好み焼き"        => 8.00, # okonomiyaki, Japanese
 );

 # find the widest allowed width for the name column
 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won't freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll  = Unicode::Collate::Locale->new( locale => "ja" );

 for my $item ($coll->sort(keys %price)) {
     print pad(entitle($item), $width, ".");
     printf " €%.2f\n", $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString->new($str)->columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str     =~ s{ (?=\pL)(\S)     (\S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }

Simple enough, isn't it? Put together, everything just works nicely.
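
To see what the collator contributes on its own, here is a minimal sketch that compares Perl's built-in code point sort with a Unicode Collation Algorithm sort. It uses the plain Unicode::Collate class with its default settings rather than the "ja" locale tailoring in the program above, and the three-item list is only a sample.

```perl
use v5.14;
use utf8;                        # source literals are UTF-8
use open qw(:std :utf8);         # encode output as UTF-8
use Unicode::Collate;

my @items = qw(pêches pears pâté);

# Built-in sort compares code points, so "e" (U+0065) sorts before
# "â" (U+00E2), which sorts before "ê" (U+00EA).
my @by_codepoint = sort @items;

# The collator compares base letters first and accents only as a
# tie-breaker, which is the order a reader expects on a menu.
my @by_uca = Unicode::Collate->new->sort(@items);

say "code point order: @by_codepoint";   # pears pâté pêches
say "UCA order:        @by_uca";         # pâté pears pêches
```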

Perl Unicode Cookbook: Unicode Text in DBM Files (the easy way)

By Tom Christiansen on June 20, 2012 6:00 AM

℞ 43: Unicode text in DBM hashes, the easy way

Some Perl libraries require you to jump through hoops to handle Unicode data. Would that everything worked as easily as Perl's open pragma!

For DBM files, here's how to implicitly manage the translation; all encoding and decoding is done automatically, just as with streams that have a particular encoding attached to them. The DBM_Filter module allows you to apply filters to keys and values to manipulate their contents before storing or fetching. The module includes a "utf8" filter. Use it like:

    use DB_File;
    use DBM_Filter;

    my $dbobj = tie %dbhash, "DB_File", "pathname";
    $dbobj->Filter_Push("utf8");  # this is the magic bit; filters both keys and values

    # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    $dbhash{$uni_key} = $uni_value;

    # FETCH

    # $uni_key holds a normal Perl string (abstract Unicode)
    my $uni_value = $dbhash{$uni_key};