Bug #77363 for XML-Writer: ENCODING is misleading, xmlDecl() needs its own new() setter

Tue May 22 12:30:48 2012 dmuey [...] cpan.org - Ticket created

Subject:

ENCODING is misleading, xmlDecl() needs its own new() setter

Regarding these two realities:

1. An ENCODING of 'utf-8' double encodes utf-8 data (see “see double encoding” below for details).
2. xmlDecl() can only be set via ENCODING or by argument.

It should be clearer that you only need ENCODING when you want your data massaged when written to an OUTPUT that is a file handle (i.e. when being written to a scalar there is no IO layer involved, so it's not double encoded).

It'd be nice to be able to set the default encoding for xmlDecl() in new() without corrupting the data.

Suggestion:

If 'IOLAYER' was passed:
  a. treat its value like ENCODING is currently treated as far as binmode goes
  b. use ENCODING for xmlDecl() only

or add an option for new() that sets the default for xmlDecl() but does not touch OUTPUT's IO layer (would need to check for conflict, e.g. one is ascii and the other is utf-8)

or add an option to new() to not binmode() the handle even if ENCODING was passed (no conflict resolution).

Current workarounds:

  a. do not pass ENCODING to new() && call xmlDecl() w/ an argument of "utf-8".
  b. pass ENCODING so it is set for xmlDecl() && call “binmode $output;” after new() so the handle itself is set back to the :raw IO layer.

[ -- “see double encoding” -- ]

1. Taking the example from the synopsis, adding an xmlDecl() call, and adding some utf8 bytes to the characters() call (i.e. curly quotes):

  my $writer = XML::Writer->new(OUTPUT => $output);
  $writer->xmlDecl();
  …
  $writer->characters("Hello, “world”!");

correctly results in:

  <?xml version="1.0"?>
  <greeting class="simple">Hello, “world”!</greeting>

2. adding ENCODING=>"utf-8" to new() makes xmlDecl() work as expected but garbles the quotes:

  <?xml version="1.0" encoding="utf-8"?>
  <greeting class="simple">Hello, âworldâ!</greeting>

3. not having ENCODING but calling xmlDecl() w/ "UTF-8" works as expected:

  <?xml version="1.0" encoding="UTF-8"?>
  <greeting class="simple">Hello, “world”!</greeting>

4. having ENCODING=>"utf-8", 'binmode $output;' right after new(), and a call to xmlDecl() w/ no arg has the same expected and correct result as #3:

  <?xml version="1.0" encoding="utf-8"?>
  <greeting class="simple">Hello, “world”!</greeting>
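
The double encoding reported here can be reproduced with core Perl alone. The following is a minimal sketch, not XML::Writer's actual code: it assumes that ENCODING => 'utf-8' ends up putting a UTF-8 encoding layer on the output handle, as the report describes. The variable names and the in-memory handles are illustrative only.

```perl
use strict;
use warnings;

# UTF-8 *bytes* for: Hello, “world”!  (a byte string, never decoded)
my $bytes = "Hello, \xE2\x80\x9Cworld\xE2\x80\x9D!";

my ($raw_out, $layered_out) = ('', '');

# 1. Raw handle: the bytes pass through untouched.
open my $raw_fh, '>:raw', \$raw_out or die "open: $!";
print {$raw_fh} $bytes;
close $raw_fh;

# 2. Handle with an encoding layer: every byte >= 0x80 is treated as a
#    Latin-1 character and encoded again, producing two bytes each.
open my $utf8_fh, '>:encoding(UTF-8)', \$layered_out or die "open: $!";
print {$utf8_fh} $bytes;
close $utf8_fh;

printf "raw:     %d bytes\n", length $raw_out;       # 19
printf "layered: %d bytes\n", length $layered_out;   # 25
```

The six extra bytes in the second result come from the six non-ASCII bytes of the two curly quotes, each of which is encoded a second time.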

Wed May 23 08:57:06 2012 joe [...] kafsemo.org - Correspondence added

On Tue May 22 12:30:48 2012, DMUEY wrote:
...

> 1. An ENCODING of 'utf-8' double encodes utf-8 data (see “see double
> encoding” below for details).

...

> 2. adding ENCODING=>"utf-8" to new() makes xmlDecl() work as expected
> but garbles the quotes:

Thanks for reporting this. Getting character encoding right is one of the aims of XML::Writer.

XML/examples/writing-unicode.pl is an example in the distribution that works for me - it declares utf-8 and outputs a stream with Unicode characters correctly encoded as utf-8 (i.e., the left quotation mark is encoded as the three-byte sequence $'\xE2\x80\x9C').

Does this work in your environment? If not, I'd have to know more about your environment - Perl version, character encoding, etc.

Wed May 23 08:57:08 2012 The RT System itself - Status changed from 'new' to 'open'

Wed May 23 09:11:24 2012 dmuey [...] cpan.org - Correspondence added

On Wed May 23 08:57:06 2012, JOSEPHW wrote:

> On Tue May 22 12:30:48 2012, DMUEY wrote:
> ...

> > 1. An ENCODING of 'utf-8' double encodes utf-8 data (see “see double
> > encoding” below for details).

> ...

> > 2. adding ENCODING=>"utf-8" to new() makes xmlDecl() work as expected
> > but garbles the quotes:

>
> Thanks for reporting this. Getting character encoding right is one of
> the aims of XML::Writer.
>
> XML/examples/writing-unicode.pl is an example in the distribution that
> works for me - it declares utf-8 and outputs a stream with Unicode
> characters correctly encoded as utf-8 (i.e., the left quotation mark is
> encoded as the three-byte sequence $'\xE2\x80\x9C').
>
> Does this work in your environment? If not, I'd have to know more about
> your environment - Perl version, character encoding, etc.

That does highlight the problem: that example starts w/ Unicode strings (AKA codepoints, wide chars, etc) *not* utf-8 bytes strings:

  my $unicodeString = "\x{201C}This\x{201D} is a test - \$ \x{00A3} \x{20AC}";

Change that to a utf-8 bytes string and it will be double encoded w/ ENCODING => utf-8.

It should only set the IO Layer if the data contains Unicode strings that need to be re-encoded as utf-8. Again, this report is about having utf-8 strings to begin with:

  my $utf8_bytes_string = "“This” is a test - \$ £ €";
  my $utf8_bytes_string = "\xE2\x80\x9CThis\xE2\x80\x9D is a test - \$ \xC2\xA3 \xE2\x82\xAC";

HTH!
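
The difference between the two kinds of string can be checked directly. A small sketch using only core Perl (the variable names are illustrative); it shows that the character string and the byte string carry the same text but differ in what length() counts:

```perl
use strict;
use warnings;

my $chars = "\x{201C}This\x{201D}";            # character string: 6 code points
my $bytes = "\xE2\x80\x9CThis\xE2\x80\x9D";    # the same text as 10 UTF-8 bytes

printf "chars: length %d\n", length $chars;    # 6
printf "bytes: length %d\n", length $bytes;    # 10

# Encoding the character string yields exactly the byte string.
my $encoded = $chars;
utf8::encode($encoded);
print "same bytes\n" if $encoded eq $bytes;
```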

Wed May 23 10:15:51 2012 joe [...] kafsemo.org - Correspondence added

On Wed May 23 09:11:24 2012, DMUEY wrote:
...

> That does highlight the problem: that example starts w/ Unicode
> strings (AKA codepoints, wide chars, etc) *not* utf-8 bytes strings:

Yes; the API is intended to be in terms of Unicode strings. This seemed like the best way to encourage correct encoding of the output. Otherwise, byte strings with other encodings could pass into output with an inappropriate utf-8 declaration.

> It should only set the IO Layer if the data contains Unicode strings
> that need to be re-encoded as utf-8. Again, this report is about
> having utf-8 strings to begin with:
>
> my $utf8_bytes_string = "“This” is a test - \$ £ €";
> my $utf8_bytes_string = "\xE2\x80\x9CThis\xE2\x80\x9D is a test - \$
> \xC2\xA3 \xE2\x82\xAC";

Is the overhead of utf8::decode acceptable? Otherwise your first workaround looks like what I'd do.
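
For reference, the utf8::decode route looks roughly like this. It is a sketch under the assumption that the writer's handle carries a UTF-8 encoding layer; an in-memory handle stands in for it here and the variable names are illustrative. utf8::decode is a core function and does not require the utf8 pragma.

```perl
use strict;
use warnings;

my $bytes = "\xE2\x80\x9CThis\xE2\x80\x9D is a test";   # 20 UTF-8 bytes

# Decode a copy in place: bytes become a 16-character Unicode string.
my $text = $bytes;
utf8::decode($text) or die "input is not well-formed UTF-8";

# Writing the character string through an encoding layer re-encodes it
# once, so the output matches the original bytes.
my $out = '';
open my $fh, '>:encoding(UTF-8)', \$out or die "open: $!";
print {$fh} $text;
close $fh;

print "round trip ok\n" if $out eq $bytes;
```

The cost is one decode on the way in and one encode on the way out.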

Wed May 23 11:16:19 2012 webmaster [...] simplemood.com - Correspondence added

CC:	dmuey [...] cpan.org
Subject:	Re: [rt.cpan.org #77363] ENCODING is misleading, xmlDecl() needs its own new() setter
Date:	Wed, 23 May 2012 10:16:09 -0500
To:	bug-XML-Writer [...] rt.cpan.org
From:	Dan Muey <webmaster [...] simplemood.com>

On May 23, 2012, at 9:15 AM, Joseph Walton via RT wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=77363 >
>
> On Wed May 23 09:11:24 2012, DMUEY wrote:
> ...

>> That does highlight the problem: that example starts w/ Unicode
>> strings (AKA codepoints, wide chars, etc) *not* utf-8 bytes strings:

>
> Yes; the API is intended to be in terms of Unicode strings. This seemed
> like the best way to encourage correct encoding of the output.

Sure, the POD needs to say that though (or maybe I missed it until I figured it out). ENCODING made me think I was telling it what my data was, not what I wanted it turned into. We avoid Unicode strings for the confusion they cause and just do simple utf-8 bytes all the time everywhere (no warnings, no decode/encode, no io layers, etc etc), fiddling w/ the Unicode version in a small scope if we need character semantics like the number of characters.

> Otherwise, byte strings with other encodings could pass into output with
> an inappropriate utf-8 declaration.
>

>> It should only set the IO Layer if the data contains Unicode strings
>> that need to be re-encoded as utf-8. Again, this report is about
>> having utf-8 strings to begin with:
>>
>> my $utf8_bytes_string = "“This” is a test - \$ £ €";
>> my $utf8_bytes_string = "\xE2\x80\x9CThis\xE2\x80\x9D is a test - \$
>> \xC2\xA3 \xE2\x82\xAC";

>
> Is the overhead of utf8::decode acceptable?

utf8 pragma is problematic, I'd say to avoid that at all costs :)

> Otherwise your first
> workaround looks like what I'd do.

I'd say simply:

a. make the ENCODING POD very clear that it is only when the data coming in are Unicode strings.

b. If we can just have something like

  DATA_ENCODING => "utf-8"

in new() that is used for xmlDecl() and does not do anything w/ the IO Layer then I'd simply use 'DATA_ENCODING' instead of 'ENCODING' (if both are given I'd say fail since they are mutually ambiguous)

or:

  ENCODING_IN => "utf-8"
  ENCODING_OUT => "utf-8"

then only re-encode (i.e. the IO-Layer) if they are different.

Something like that that doesn't break existing code using Unicode strings but allows for utf-8 bytes to be utf-8 bytes.

Tue May 29 10:20:47 2012 joe [...] kafsemo.org - Correspondence added

On Wed May 23 11:16:19 2012, webmaster@simplemood.com wrote:

> > Is the overhead of utf8::decode acceptable?

> utf8 pragma is problematic, I'd say to avoid that at all costs :)

Okay, although it's worth noting that utf8::decode is available without using the utf8 pragma.

> I'd say simply:
>
> a. make the ENCODING POD very clear that it is only when the data
> coming in are Unicode strings.

I've modified the POD to:

  ENCODING
      A character encoding to use for the output; currently this must
      be one of 'utf-8' or 'us-ascii'. If present, it will be used
      for the underlying character encoding and as the default in the
      XML declaration. All character data should be passed as
      Unicode strings when an encoding is set.

> b. If we can just have something like
> DATA_ENCODING => "utf-8" in new() that is used for xmlDecl() and
> does not do anything w/ the IO Layer

It seems like the use of that property would be to set the default for a later call to xmlDecl; I'm not convinced of the value of a new configuration parameter with potentially confusing semantics when you can pass the same string to 'xmlDecl' directly.

Tue May 29 10:49:19 2012 dmuey [...] cpan.org - Correspondence added

> > > Is the overhead of utf8::decode acceptable?

> > utf8 pragma is problematic, I'd say to avoid that at all costs :)

>
> Okay, although it's worth noting that utf8::decode is available without
> using the utf8 pragma.

Right, I was also referring to the functions associated with it. Also, we don't want to fiddle w/ Unicode strings, so decoding bytes strings into Unicode strings is way too much work only to change them back (via IO layer) with more work.

> > I'd say simply:
> >
> > a. make the ENCODING POD very clear that it is only when the data
> > coming in are Unicode strings.

>
> I've modified the POD to:
>
> ENCODING
>     A character encoding to use for the output; currently this must
>     be one of 'utf-8' or 'us-ascii'. If present, it will be used
>     for the underlying character encoding and as the default in the
>     XML declaration. All character data should be passed as
>     Unicode strings when an encoding is set.

Great!

> > b. If we can just have something like
> > DATA_ENCODING => "utf-8" in new() that is used for xmlDecl() and
> > does not do anything w/ the IO Layer

>
> It seems like the use of that property would be to set the
> default for a later call to xmlDecl; I'm not convinced of the value of a
> new configuration parameter with potentially confusing semantics
> when you can pass the same string to 'xmlDecl' directly.

Then how about:

  NO_IOLAYER => 1,

so ENCODING works like it does w/ xmlDecl() but does not mangle the data when it's already utf-8?

Anything to not mangle the data and not have to maintain arguments to xmlDecl() calls.

Sun Jun 03 07:07:47 2012 joe [...] kafsemo.org - Correspondence added

On Tue May 29 10:49:19 2012, DMUEY wrote:

> > I'm not convinced of the value of a
> > new configuration parameter with potentially confusing semantics
> > when you can pass the same string to 'xmlDecl' directly.

>
> Then how about:
>
> NO_IOLAYER => 1, so ENCODING works like it does w/ xmlDecl() but does
> not mangle the data when it's already
> utf-8?
>
> anything to not mangle the data and not have to maintain arguments to
> xmlDecl() calls.

Okay, I think I see where you're coming from: you're not happy with the performance overhead of encoding/decoding Unicode when you know it's redundant and you don't trust Perl's Unicode layer to do the right thing.

I'm sympathetic, but I don't think I want to encourage that style with users who may not have as much understanding of the implications. The overhead of re-specifying 'utf-8' doesn't seem like too much maintenance, but you could also try a custom subclass of XML::Writer where you can override xmlDecl to use 'utf-8' as the default.

Sorry I couldn't be more help on this one, and thanks for helping to improve the documentation.
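
The subclass suggestion could look something like the sketch below. It is an illustration only, not code from the distribution: the package name Utf8DeclWriter is hypothetical, and the sketch assumes XML::Writer is installed, that new() blesses into the calling class, and that xmlDecl() takes the encoding as its first argument. ENCODING is deliberately not passed to new(), so no encoding layer is applied to the output.

```perl
package Utf8DeclWriter;    # hypothetical name, for illustration only
use strict;
use warnings;
use parent 'XML::Writer';

# Default the declared encoding to 'utf-8' when the caller gives none.
sub xmlDecl {
    my ($self, $encoding, @rest) = @_;
    $encoding = 'utf-8' unless defined $encoding;
    return $self->SUPER::xmlDecl($encoding, @rest);
}

package main;

my $xml    = '';
my $writer = Utf8DeclWriter->new(OUTPUT => \$xml);

$writer->xmlDecl();    # no argument needed at the call site
$writer->startTag('greeting', class => 'simple');
$writer->characters('Hello, world!');
$writer->endTag('greeting');
$writer->end();

print $xml;
```

With this arrangement the declaration names utf-8 while the caller stays responsible for handing over data that really is utf-8.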

Sun Jun 03 07:07:49 2012 joe [...] kafsemo.org - Status changed from 'open' to 'resolved'