perl | Useful bit of code to mark a string as UTF8

marnanel posting in perl

When you have data in a string, Perl remembers whether the string holds decoded characters or raw bytes. If you grab UTF-8 data out of a database or from HTTP parameters, Perl has no way of knowing what the encoding is, so it treats every byte as a separate character and gets it wrong. This function returns the strings you passed it, concatenated and marked as UTF-8:

sub mark_utf8 { pack "U0C*", unpack "C*", join('',@_); }

From: sophie

Actually, the better way to do this, on Perl v5.8 or higher, is:

use Encode;

sub mark_utf8 { return decode("UTF-8", shift); }

This is because Perl's internal "utf8" encoding is subtly different from regular "UTF-8". (I don't know all the differences, but one is that Perl's "utf8" is more lax: it accepts some byte sequences that the strict "UTF-8" encoding rejects.) The Encode module knows how to handle these differences, so decoding with "UTF-8" gives you properly validated character data.
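To make the strict versus lax distinction concrete, here is a small sketch that asks Encode which encoding object each spelling resolves to, then feeds both the same questionable input. The helper name accepts and the sample bytes are illustrative assumptions, not part of the original post; the script prints what your installed Encode reports rather than assuming a result.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding FB_CROAK);

# The two spellings resolve to different encoding objects.
my $strict_name = find_encoding("UTF-8")->name;
my $lax_name    = find_encoding("utf8")->name;

# Hypothetical helper: returns 1 if $octets decodes cleanly under $encoding,
# 0 otherwise. FB_CROAK makes decode die on rejected input instead of
# substituting U+FFFD. $octets is a private copy, so the caller's data is
# never modified.
sub accepts {
    my ($encoding, $octets) = @_;
    return eval { decode($encoding, $octets, FB_CROAK); 1 } ? 1 : 0;
}

# 0xED 0xA0 0x80 is the byte sequence a naive encoder would produce for the
# surrogate U+D800, which strict UTF-8 does not allow.
my $suspect = "\xED\xA0\x80";

printf "%-5s resolves to %-12s accepts suspect bytes: %d\n",
    "UTF-8", $strict_name, accepts("UTF-8", $suspect);
printf "%-5s resolves to %-12s accepts suspect bytes: %d\n",
    "utf8", $lax_name, accepts("utf8", $suspect);
```

For data arriving from outside your program, the strict "UTF-8" spelling is usually the safer default, since it refuses input that is not valid UTF-8.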

Similarly, to unmark:

use Encode;

sub unmark_utf8 { return encode("UTF-8", shift); }
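As a quick check of how the two directions fit together, the following sketch round-trips a short string. The sample bytes (the UTF-8 encoding of "café") and the variable names are assumptions chosen for illustration, standing in for data read from a database or an HTTP parameter.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "café": the é is the two bytes 0xC3 0xA9.
my $bytes = "caf\xC3\xA9";

# Decoding turns the five bytes into four characters.
my $chars = decode("UTF-8", $bytes);

# Encoding goes back the other way and yields the original bytes.
my $round_trip = encode("UTF-8", $chars);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
# prints "bytes: 5, characters: 4"

print "round trip matches\n" if $round_trip eq $bytes;
```

The usual pattern is to decode once at the point where data enters the program and encode once where it leaves, so that everything in between works on characters.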

From: marnanel

Oh, thanks a lot-- that's far better.
