Perl.com
The Wayback Machine - http://web.archive.org/web/20120520134122/http://www.perl.com:80/

Editor's note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.

℞ 27: Unicode normalization

Prescription one reminded you to always decompose and recompose Unicode data at the boundaries of your application. Unicode::Normalize can do much more for you. It supports multiple Unicode Normalization Forms.

Normalization, of course, takes Unicode data of arbitrary forms and canonicalizes it to a standard representation. (Where a composite character may be composed of multiple characters, normalized decomposition arranges those characters in a canonical order. Normalized composition combines those characters to a single composite character, where possible. Without this normalization, you can imagine the difficulty of determining whether one string is logically equivalent to another.)

Typically, you should render your data into NFD (the canonical decomposition form) on input and NFC (canonical decomposition followed by canonical composition) on output. Using NFKC or NFKD functions improves recall on searches, assuming you've already done the same normalization to the text to be searched.

Note that this normalization is about much more than just splitting or joining pre-combined compatibility glyphs; it also reorders marks according to their canonical combining classes and weeds out singletons.

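 use Unicode::Normalize;
 my $nfd  = NFD($orig);   # canonical decomposition
 my $nfc  = NFC($orig);   # canonical decomposition, then canonical composition
 my $nfkd = NFKD($orig);  # compatibility decomposition
 my $nfkc = NFKC($orig);  # compatibility decomposition, then canonical composition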
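
As a rough illustration of why this matters, the sketch below compares two spellings of the same text before and after normalization, then checks the mark reordering, singleton replacement, and compatibility folding mentioned above. The sample strings and variable names are chosen for this example only; they are not part of the original recipe.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKC);

# Two spellings of the same letter: precomposed U+00E9, and "e" followed
# by COMBINING ACUTE ACCENT (U+0301).
my $composed   = "\x{E9}";
my $decomposed = "e\x{301}";

# The raw strings differ, but either canonical form makes them equal.
my $raw_equal = $composed eq $decomposed           ? 1 : 0;
my $nfd_equal = NFD($composed) eq NFD($decomposed) ? 1 : 0;
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? 1 : 0;

# Marks are reordered by canonical combining class: COMBINING DOT BELOW
# (class 220) sorts before COMBINING ACUTE ACCENT (class 230).
my $reordered = NFD("a\x{301}\x{323}") eq "a\x{323}\x{301}" ? 1 : 0;

# Singletons are replaced: ANGSTROM SIGN (U+212B) becomes U+00C5 under NFC.
my $singleton = NFC("\x{212B}") eq "\x{C5}" ? 1 : 0;

# Compatibility forms fold presentation variants such as the "fi"
# ligature (U+FB01) into plain letters.
my $folded = NFKC("\x{FB01}") eq "fi" ? 1 : 0;

print "raw=$raw_equal nfd=$nfd_equal nfc=$nfc_equal ",
      "reordered=$reordered singleton=$singleton folded=$folded\n";
# prints: raw=0 nfd=1 nfc=1 reordered=1 singleton=1 folded=1
```

Only the compatibility forms fold the ligature, which is why NFKC and NFKD suit search indexes better than text you intend to round-trip unchanged.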

℞ 26: Custom character properties

Match Unicode Properties in Regex explained that every Unicode character has one or more properties, specified by the Unicode Consortium. You may extend these rules to define your own properties such that Perl can use them.

A custom property is a function given a name beginning with In or Is which returns a string conforming to a special format. The "User-Defined Character Properties" section of perldoc perlunicode describes this format in more detail.

To define at compile time your own custom character properties for use in regexes:

 # using private-use characters
 sub In_Tengwar { "E000\tE07F\n" }

 if (/\p{In_Tengwar}/) { ... }

 # blending existing properties
 sub Is_GraecoRoman_Title {<<'END_OF_SET'}
 +utf8::IsLatin
 +utf8::IsGreek
 &utf8::IsTitle
 END_OF_SET

 if (/\p{Is_GraecoRoman_Title}/) { ... }
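
The two definitions above can be exercised in a small standalone script. This is a sketch rather than a definitive test: the sample characters (U+E000, "A", and the titlecase letter U+01C5) and the result variable names are assumptions made for illustration, and the heredoc is written flush left so that its terminator is recognized.

```perl
use strict;
use warnings;

# A property defined by an explicit range: two hex code points separated
# by a tab, one range per line. U+E000..U+E07F lies in the private-use area.
sub In_Tengwar { "E000\tE07F\n" }

# A property blended from existing ones: "+" adds a set, "&" intersects.
# The result is (Latin or Greek) restricted to titlecase letters.
sub Is_GraecoRoman_Title { <<'END_OF_SET' }
+utf8::IsLatin
+utf8::IsGreek
&utf8::IsTitle
END_OF_SET

my $tengwar_hit  = "\x{E000}" =~ /\p{In_Tengwar}/ ? 1 : 0;
my $tengwar_miss = "A"        =~ /\p{In_Tengwar}/ ? 1 : 0;

# U+01C5 is a Latin titlecase letter; "A" is Latin but uppercase.
my $title_hit  = "\x{1C5}" =~ /\p{Is_GraecoRoman_Title}/ ? 1 : 0;
my $title_miss = "A"       =~ /\p{Is_GraecoRoman_Title}/ ? 1 : 0;

print "tengwar: $tengwar_hit/$tengwar_miss title: $title_hit/$title_miss\n";
```

Because each property sub is declared before the regex that uses it, Perl can resolve the property when the pattern is compiled.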

℞ 25: Match Unicode properties in regex with \p, \P

Every Unicode codepoint has one or more properties, indicating the rules which apply to that codepoint. Perl's regex engine is aware of these properties: use the \p{} metacharacter sequence to match a codepoint possessing a given property, and its inverse, \P{}, to match a codepoint lacking it.

Each property has a short name and a long name. For example, to match any codepoint which has the Letter property, you may use \p{Letter} or \p{L}. Similarly, you may use \P{Uppercase} or \P{Upper}. perldoc perlunicode's "Unicode Character Properties" section describes these properties in greater detail.

Examples of these properties useful in regex include:

 \pL, \pN, \pS, \pP, \pM, \pZ, \pC
 \p{Sk}, \p{Ps}, \p{Lt}
 \p{alpha}, \p{upper}, \p{lower}
 \p{Latin}, \p{Greek}
 \p{script=Latin}, \p{script=Greek}
 \p{East_Asian_Width=Wide}, \p{EA=W}
 \p{Line_Break=Hyphen}, \p{LB=HY}
 \p{Numeric_Value=4}, \p{NV=4}
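
To see a few of these properties in action, the following sketch counts the characters of a sample string that match each one. The sample text and variable names are made up for this example, and the string is written with escapes so the script itself stays ASCII.

```perl
use strict;
use warnings;

# Sample text: an accented Latin word, two digits, a Greek word, and "!".
my $text = "\x{DC}n\x{EF}c\x{F6}d\x{E9} 42 \x{3A9}\x{3BC}\x{3AD}\x{3B3}\x{3B1}!";

# Assigning a /g match in list context to an empty list yields the match count.
my $letters     = () = $text =~ /\p{L}/g;      # long form: \p{Letter}
my $upper       = () = $text =~ /\p{Lu}/g;     # uppercase letters only
my $greek       = () = $text =~ /\p{Greek}/g;  # characters in the Greek script
my $digits      = () = $text =~ /\p{Nd}/g;     # decimal digits
my $non_letters = () = $text =~ /\P{L}/g;      # \P inverts the property

print "letters=$letters upper=$upper greek=$greek ",
      "digits=$digits non_letters=$non_letters\n";
# prints: letters=12 upper=2 greek=5 digits=2 non_letters=5
```
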