Perl.com

Perl Unicode Cookbook: Unicode Normalization

By Tom Christiansen on May 18, 2012 6:00 AM

Editor's note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.

℞ 27: Unicode normalization

Prescription one reminded you to always decompose and recompose Unicode data at the boundaries of your application. Unicode::Normalize can do much more for you. It supports multiple Unicode Normalization Forms.

Normalization, of course, takes Unicode data of arbitrary forms and canonicalizes it to a standard representation. (Where a composite character may be composed of multiple characters, normalized decomposition arranges those characters in a canonical order. Normalized composition combines those characters to a single composite character, where possible. Without this normalization, you can imagine the difficulty of determining whether one string is logically equivalent to another.)

Typically, you should render your data into NFD (the canonical decomposition form) on input and NFC (canonical decomposition followed by canonical composition) on output. Using NFKC or NFKD functions improves recall on searches, assuming you've already done the same normalization to the text to be searched.

Note that this normalization is about much more than just splitting or joining pre-combined compatibility glyphs; it also reorders marks according to their canonical combining classes and weeds out singletons.

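 use Unicode::Normalize;
 my $nfd  = NFD($orig);   # canonical decomposition
 my $nfc  = NFC($orig);   # canonical decomposition, then canonical composition
 my $nfkd = NFKD($orig);  # compatibility decomposition
 my $nfkc = NFKC($orig);  # compatibility decomposition, then canonical composition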
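
As a rough illustration of the points above, the following sketch compares differently encoded spellings of the same text. The sample strings, variable names, and the codepoints() helper are assumptions made up for this example; only NFD, NFC, and NFKC come from Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKC);

# Hypothetical helper: render a string as a list of U+XXXX code points,
# so the output stays plain ASCII.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

# Two spellings of "e with acute": one precomposed code point,
# or a plain "e" followed by COMBINING ACUTE ACCENT.
my $precomposed = "\x{E9}";
my $decomposed  = "e\x{301}";

# Without normalization the strings differ; after NFC they compare equal.
my $raw_equal = ($precomposed eq $decomposed)           ? 1 : 0;
my $nfc_equal = (NFC($precomposed) eq NFC($decomposed)) ? 1 : 0;

# Canonical reordering: COMBINING DOT BELOW (combining class 220) sorts
# before COMBINING DOT ABOVE (combining class 230).
my $reordered = NFD("q\x{307}\x{323}");

# Singleton: ANGSTROM SIGN (U+212B) is replaced by
# LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5).
my $angstrom = NFC("\x{212B}");

# Compatibility mapping: the "fi" ligature (U+FB01) survives NFC
# but folds to the two letters "fi" under NFKC.
my $ligature      = "\x{FB01}";
my $nfc_ligature  = NFC($ligature);
my $nfkc_ligature = NFKC($ligature);

printf "raw equal: %d, NFC equal: %d\n", $raw_equal, $nfc_equal;
printf "reordered: %s\n", codepoints($reordered);
printf "angstrom:  %s\n", codepoints($angstrom);
printf "ligature:  NFC %s, NFKC %s\n",
    codepoints($nfc_ligature), codepoints($nfkc_ligature);
```

The ligature case suggests why the compatibility forms help search recall: a search for "fi" can only find text containing U+FB01 if both the query and the text went through NFKC or NFKD first.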

Perl Unicode Cookbook: Custom Character Properties

By Tom Christiansen on May 17, 2012 6:00 AM

℞ 26: Custom character properties

Match Unicode Properties in Regex explained that every Unicode character has one or more properties, specified by the Unicode Consortium. You may extend these rules by defining your own properties for Perl to use.

A custom property is a function given a name beginning with In or Is which returns a string conforming to a special format. The "User-Defined Character Properties" section of perldoc perlunicode describes this format in more detail.

To define at compile-time your own custom character properties for use in regexes:

 # using private-use characters
 sub In_Tengwar { "E000\tE07F\n" }

 if (/\p{In_Tengwar}/) { ... }

 # blending existing properties
 sub Is_GraecoRoman_Title {<<'END_OF_SET'}
 +utf8::IsLatin
 +utf8::IsGreek
 &utf8::IsTitle
 END_OF_SET

 if (/\p{Is_GraecoRoman_Title}/) { ... }
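
For readers who want something to run, here is a minimal sketch that defines both properties from the recipe and tries them against a few code points. The matching_properties() helper and the sample code points are assumptions added for illustration. The sketch also assumes the property subs live in the same package as the regexes that use them.

```perl
use strict;
use warnings;

# A range is written as two tab-separated hexadecimal code points.
# U+E000..U+E07F is a private-use range, as in the recipe.
sub In_Tengwar { return "E000\tE07F\n" }

# "+" adds a set and "&" intersects with everything collected so far,
# so this means (Latin or Greek) and titlecase.
sub Is_GraecoRoman_Title {
    return <<'END_OF_SET';
+utf8::IsLatin
+utf8::IsGreek
&utf8::IsTitle
END_OF_SET
}

# Hypothetical helper: list which of the two custom properties a
# single character matches.
sub matching_properties {
    my ($char) = @_;
    my @names;
    push @names, 'In_Tengwar'           if $char =~ /\p{In_Tengwar}/;
    push @names, 'Is_GraecoRoman_Title' if $char =~ /\p{Is_GraecoRoman_Title}/;
    return @names;
}

# U+E000 is private use, U+01C5 is a Latin titlecase letter,
# U+1F88 is a Greek titlecase letter, U+0041 is plain uppercase "A".
for my $cp (0xE000, 0x01C5, 0x1F88, 0x0041) {
    my $matched = join(', ', matching_properties(chr $cp)) || 'none';
    printf "U+%04X: %s\n", $cp, $matched;
}
```

Note that "A" matches neither property: it is Latin, but its general category is uppercase letter rather than titlecase letter, so the intersection excludes it.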

Perl Unicode Cookbook: Match Unicode Properties in Regex

By Tom Christiansen on May 16, 2012 6:00 AM

℞ 25: Match Unicode properties in regex with `\p`, `\P`

Every Unicode codepoint has one or more properties, indicating the rules which apply to that codepoint. Perl's regex engine is aware of these properties. Use the \p{} metacharacter sequence to match a codepoint possessing a given property, and its inverse, \P{}, to match a codepoint lacking that property.

Each property has a short name and a long name. For example, to match any codepoint which has the Letter property, you may use \p{Letter} or \p{L}. Similarly, you may use \P{Uppercase} or \P{Upper}. perldoc perlunicode's "Unicode Character Properties" section describes these properties in greater detail.

Examples of these properties useful in regex include:

 \pL, \pN, \pS, \pP, \pM, \pZ, \pC
 \p{Sk}, \p{Ps}, \p{Lt}
 \p{alpha}, \p{upper}, \p{lower}
 \p{Latin}, \p{Greek}
 \p{script=Latin}, \p{script=Greek}
 \p{East_Asian_Width=Wide}, \p{EA=W}
 \p{Line_Break=Hyphen}, \p{LB=HY}
 \p{Numeric_Value=4}, \p{NV=4}
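
To see a few of these in action, the sketch below counts matches in a short sample string. The string and the variable names are assumptions chosen for this example; the property names are taken from the list above.

```perl
use strict;
use warnings;

# Sample text: GREEK CAPITAL LETTER SIGMA (U+03A3) followed by ASCII text,
# the ASCII digit 4, and ARABIC-INDIC DIGIT FOUR (U+0664).
my $text = "\x{3A3}igma and Sigma, 4 and \x{664}";

# The "() =" idiom forces list context so each scalar receives a match count.
my $letters_long  = () = $text =~ /\p{Letter}/g;   # long property name
my $letters_short = () = $text =~ /\pL/g;          # short name, same property
my $non_letters   = () = $text =~ /\P{L}/g;        # inverse: anything not a letter
my $uppercase     = () = $text =~ /\p{Lu}/g;       # uppercase letters in any script
my $greek         = () = $text =~ /\p{Greek}/g;    # characters in the Greek script
my $fours         = () = $text =~ /\p{NV=4}/g;     # numeric value 4 in any script

printf "letters: %d (long name), %d (short name)\n", $letters_long, $letters_short;
printf "non-letters: %d\n", $non_letters;
printf "uppercase: %d, Greek: %d, fours: %d\n", $uppercase, $greek, $fours;
```

Every character is either a letter or not, so the \p{L} and \P{L} counts add up to the length of the string. The \p{NV=4} count includes both digits because the match is on numeric value rather than on a particular script.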

