Monument ([personal profile] marnanel) wrote in [community profile] perl 2009-04-16 05:36 pm

Useful bit of code to mark a string as UTF8

When you have data in a string, Perl keeps track of whether it holds decoded characters or just raw bytes. If you grab UTF-8 data out of a database or from HTTP parameters, Perl usually has no way of knowing that it is UTF-8, so it treats every byte as a separate character and gets non-ASCII text wrong. This function returns the strings you passed it, concatenated and marked as UTF-8:

sub mark_utf8 { pack "U0C*", unpack "C*", join('',@_); }
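
To see what the marking changes, here is a small sketch that runs the function on a byte string and compares it before and after. The sample input is made up for illustration (the word "café" with "é" as the two UTF-8 bytes 0xC3 0xA9), and utf8::is_utf8 is only used to peek at Perl's internal flag.

```perl
use strict;
use warnings;

# The function from the post, unchanged.
sub mark_utf8 { pack "U0C*", unpack "C*", join('', @_); }

# Hypothetical input: "café" as raw UTF-8 bytes, the way it might
# arrive from a database driver or an HTTP parameter.
my $bytes  = "caf\xC3\xA9";
my $marked = mark_utf8($bytes);

# Unmarked, Perl counts every byte; marked, it counts characters.
printf "before: length %d, utf8 flag %s\n",
    length($bytes),  (utf8::is_utf8($bytes)  ? "on" : "off");
printf "after:  length %d, utf8 flag %s\n",
    length($marked), (utf8::is_utf8($marked) ? "on" : "off");
```

Note that the function does not validate anything: it assumes the bytes really are well-formed UTF-8.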

[personal profile] sophie 2009-04-17 10:53 am (UTC)
Actually, the better way to do this, on Perl v5.8 or higher, is:

use Encode;

sub mark_utf8 { return decode("UTF-8", shift); }


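This is because Perl's internal "utf8" encoding is very slightly different from standard "UTF-8". (I don't know all the differences, but one is that Perl's version is more lax about what it accepts.) In the Encode module, the name "UTF-8" selects the strict, standards-conforming encoding, so decode actually checks the input instead of just flagging the bytes. By default, anything malformed is replaced with U+FFFD rather than passed through.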
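
As a rough illustration of the strict/lax split, the sketch below asks Encode which encoding each spelling resolves to, then decodes a deliberately malformed string. The input is invented for the example; FB_CROAK and LEAVE_SRC are Encode's standard CHECK flags, used here only to show that decode can be told to die instead of substituting.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding FB_CROAK LEAVE_SRC);

# The two spellings resolve to different encoding objects.
my $strict_name = find_encoding("UTF-8")->name;  # strict variant
my $lax_name    = find_encoding("utf8")->name;   # Perl's lax variant
print "UTF-8 resolves to: $strict_name\n";
print "utf8  resolves to: $lax_name\n";

# Hypothetical malformed input: the byte 0xFF never occurs in UTF-8.
my $bad = "abc\xFF";

# With no CHECK argument, the bad byte is replaced with U+FFFD.
my $replaced = decode("UTF-8", $bad);

# With FB_CROAK, decode dies instead; LEAVE_SRC keeps $bad untouched.
my $ok = eval { decode("UTF-8", $bad, FB_CROAK | LEAVE_SRC); 1 };
print "decode with FB_CROAK died: ", ($ok ? "no" : "yes"), "\n";
```

Whether you want substitution or an exception depends on where the data comes from; for untrusted input, dying early is often easier to debug.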

Similarly, to unmark:

use Encode;

sub unmark_utf8 { return encode("UTF-8", shift); }
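
A quick round trip shows how the two helpers fit together. This is only a sketch with a made-up input (the word "naïve" with "ï" as the UTF-8 bytes 0xC3 0xAF): decode on the way in, work with characters, encode on the way out.

```perl
use strict;
use warnings;
use Encode;

# The two helpers from the comment above.
sub mark_utf8   { return decode("UTF-8", shift); }
sub unmark_utf8 { return encode("UTF-8", shift); }

# Hypothetical input: "naïve" as raw UTF-8 bytes.
my $bytes = "na\xC3\xAFve";

my $chars = mark_utf8($bytes);     # character string
my $back  = unmark_utf8($chars);   # byte string again

printf "bytes in:   %d\n", length($bytes);
printf "characters: %d\n", length($chars);
printf "round trip %s\n", ($back eq $bytes ? "matches" : "differs");
```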