
This queue is for tickets about the XML-Twig CPAN distribution.

Report information
The Basics
Id: 39849
Status: open
Priority: 0/
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: AZED [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 3.32
Fixed in: (no value)



Subject: XML::Twig output-encoding utf-8 doesn't play nice with binmode :utf8
I've attached a small test case that should illustrate the problem. If you leave :utf8 on the filehandle, the twig unicode characters get encoded twice, and end up garbled. If you take it off, the perl print output doesn't get encoded, and ends up garbled.

You can get around this by not specifying output-encoding in new(), but if you do that, you lose your XML declaration and have to print it in by hand.

The only way to make this work currently, it seems, is to make sure never to share filehandles, and make sure that any filehandles used for Twig output have no layer set, and any filehandles used for anything else do.

I suspect this might be what bit the guy who reported #18284, but since he never wrote back, there's no way to tell for sure.
Subject: twig-unicode.pl
#!/usr/bin/perl

use XML::Twig;
use utf8;

my $twig = XML::Twig->new(
    keep_atts_order => 1,
    output_encoding => 'utf-8',
    pretty_print    => 'record'
);
$twig->parse(\*DATA);

my $greet = $twig->root->insert_new_elt('last_child','greeting');
$greet->set_text("Gr\x{00FC}\x{00DF} Dich!");

open(my $fh,">:encoding(utf8)",'twig-unicode-out.xml');
$twig->print(\*$fh);
print {*$fh} "<copyright>Copyright \x{00A9} 2008 Me</copyright>\n";
close($fh);

__DATA__
<root>
</root>
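The double encoding described in the report can be reproduced without XML::Twig. The following is a minimal sketch, assuming only core Perl (Encode and in-memory filehandles); the helper name `bytes_written` is made up for illustration. It pushes the same text through a `:utf8` layer twice: once as a character string, and once as bytes that were already UTF-8 encoded, which is roughly what happens when an output filter encodes and the layer then encodes again.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: print $data to an in-memory file opened with $mode
# and return the raw bytes that ended up in the "file".
sub bytes_written {
    my ($mode, $data) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "open failed: $!";
    print {$fh} $data;
    close($fh) or die "close failed: $!";
    return $buffer;
}

my $text  = "Gr\x{00FC}\x{00DF}";       # character string
my $bytes = encode('UTF-8', $text);     # the same text, already encoded to bytes

# Encoded once: the layer turns characters into UTF-8.
my $once  = bytes_written('>:utf8', $text);

# Encoded twice: the layer treats each already-encoded byte as a
# character and encodes it again.
my $twice = bytes_written('>:utf8', $bytes);

printf "once:  %s\n", unpack('H*', $once);
printf "twice: %s\n", unpack('H*', $twice);
```

The second hex dump should be longer than the first, because every byte above 0x7F has been expanded into two bytes.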
Subject: Re: [rt.cpan.org #39849] XML::Twig output-encoding utf-8 doesn't play nice with binmode :utf8
Date: Wed, 08 Oct 2008 09:50:18 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: mirod <xmltwig [...] gmail.com>
Zed Pobre via RT wrote:
> Mon Oct 06 21:38:36 2008: Request 39849 was acted upon.
> Transaction: Ticket created by AZED
> Queue: XML-Twig
> Subject: XML::Twig output-encoding utf-8 doesn't play nice with binmode :utf8
> Broken in: 3.32
> Severity: Normal
> Owner: Nobody
> Requestors: AZED@cpan.org
> Status: new
> Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=39849 >
>
> I've attached a small test case that should illustrate the problem. If
> you leave :utf8 on the filehandle, the twig unicode characters get
> encoded twice, and end up garbled. If you take it off, the perl print
> output doesn't get encoded, and ends up garbled.
>
> You can get around this by not specifying output-encoding in new(), but
> if you do that, you lose your XML declaration and have to print it in by
> hand.
>
> The only way to make this work currently, it seems, is to make sure
> never to share filehandles, and make sure that any filehandles used for
> Twig output have no layer set, and any filehandles used for anything
> else do.
I see the problem. output-encoding creates the xml declaration, but also a filter to convert the output to the proper encoding.

By default what the parser gets is already in utf8, so nothing needs to be done on it. The filter should do nothing, and the binmode :utf8 should just keep the output from being downgraded to latin1 (which is what it does for the regular print).

I have tried a couple of fixes so far, but they don't pass the regression tests. I'll keep you posted.

Thanks

--
mirod
Subject: Re: [rt.cpan.org #39849] XML::Twig output-encoding utf-8 doesn't play nice with binmode :utf8
Date: Tue, 14 Oct 2008 14:37:26 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: mirod <xmltwig [...] gmail.com>
OK, I think I have it. You can try the development version, it passes both the test you sent (with the output file opened with ':utf8') and the regression tests, so it looks good. -- mirod
Very short answer for now is no, something isn't right. I installed today's build, ran the regression tests on my current project and something broke, and it works fine with yesterday's build. It looks like some output switched to Latin-1, but it will take me a little while to write you a simplified test case.
Okay, now that I've had the time to poke at it again, the problem is that XML::Twig is no longer doing the right thing when the filehandle *isn't* binmode :utf8 -- no conversion is being applied at all, even when encoding is set for utf-8. I'm attaching a Test::More case that simply writes and reads back an element with UTF8 text under three filehandle conditions -- you can see the results flip-flop between yesterday's build and today's.
use warnings;
use strict;
use utf8;

binmode(STDOUT,':utf8');
binmode(STDERR,':utf8');

use Test::More tests => 7;
binmode(Test::More->builder->failure_output,':utf8');

BEGIN { use_ok('XML::Twig') };

my $twig = XML::Twig->new(
    output_encoding => 'utf-8',
    pretty_print    => 'record'
);

my $u8string = "Gr\x{00FC}\x{00DF} Dich!";
utf8::upgrade($u8string);
my $twigstring = "<twiggreet>" . $u8string . "</twiggreet>\n";
my $fh;

$twig->parse($twigstring);
open($fh,">:encoding(utf8)",'twig-unicode-openencutf8.xml') or die;
$twig->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twig-unicode-openencutf8.xml'),
   'output via ">:encoding(utf8)" is parseable');
is($twig->root->text,$u8string,
   'output via ">:encoding(utf8)" is correct');

$twig->parse($twigstring);
open($fh,">:utf8",'twig-unicode-openutf8.xml') or die;
$twig->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twig-unicode-openutf8.xml'),
   'output via ">:utf8" is parseable');
is($twig->root->text,$u8string,
   'output via ">:utf8" is correct');

$twig->parse($twigstring);
open($fh,">",'twig-unicode-open.xml') or die;
$twig->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twig-unicode-open.xml'),
   'output via ">" is parseable');
is($twig->root->text,$u8string,
   'output via ">" is correct');

unlink('twig-unicode-openencutf8.xml');
unlink('twig-unicode-openutf8.xml');
unlink('twig-unicode-open.xml');
An additional thought: While poking at your code, I thought that perhaps I could solve the problem by creating an additional converter method using utf8::upgrade (which is idempotent) to use if $enc is utf-8, but I haven't been able to make it work. Inserting a debugging line, I can see the damn thing activating in all three cases, but it doesn't change the output, so there's something I fundamentally don't understand about how XML::Twig is using those filters. Maybe you can take that idea and run with it better than I did.
Scratch that last thought. The strings are already considered utf8 internally, which is why upgrade wasn't doing anything. I'm not quite sure what's going on, but my attempt to move the detection to inside encode_convert showed that is_utf8($_[0],1) always returned true for the user string and false for the XML declaration. I'm now thinking about forcing the binmode at the XML::Twig::Elt->print stage if the encoding is utf-8.
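To illustrate why utf8::upgrade made no visible difference: it only changes how Perl stores the string internally, not which characters the string contains. A minimal sketch in core Perl follows; it uses utf8::is_utf8 rather than the two-argument Encode::is_utf8 mentioned above, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# All characters are below 0x100, so Perl stores this string as
# single bytes and leaves the internal UTF8 flag off.
my $latin = "Gr\x{00FC}\x{00DF}";

my $copy = $latin;
utf8::upgrade($copy);    # switch internal storage to UTF-8, flag goes on
utf8::upgrade($copy);    # idempotent: the second call changes nothing

printf "flag before: %d, after: %d\n",
    (utf8::is_utf8($latin) ? 1 : 0), (utf8::is_utf8($copy) ? 1 : 0);
printf "same characters: %s\n", ($latin eq $copy ? 'yes' : 'no');
printf "length: %d vs %d\n", length($latin), length($copy);
```

The two strings compare equal and have the same length, so a filter that only upgrades its input hands the same characters to print and cannot change what a filehandle without a layer writes.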
Got it. The attached patch (against the 2008.10.14 dev release) fixes the issue, at least for XML::Twig->print, and passes all unit tests.

I have left the code that would have done the same for XML::Twig::Elt->print commented out, because it causes the IO::Scalar unit tests to break near the end, with the error:

Can't locate object method "BINMODE" via package "IO::Scalar"

Googling around a bit, this is almost certainly a bug in IO::Scalar, but I have no idea how to fix it, and given your other responses, I imagine you'd rather wait it out than implement a fix to XML::Twig that triggers an IO::Scalar bug for existing users.

I've rewritten the unit test file for this issue to include tests for XML::Twig::Elt->print, but it has the two failing cases inside a TODO block, so it won't interrupt the normal build process until this issue gets resolved. I'll attach that to the next update.

One final note: could you PLEASE have the 'speedup' script copy the existing Twig.pm to a backup, and document in the README that you have to edit Twig_pm.slow to get your changes to stick? I'm very happy I was regularly diffing out my changes into patch files, because I got a nasty surprise the first time I ran 'make test'.
--- XML-Twig-3.33/Twig_pm.slow 2008-10-14 06:45:01.000000000 -0400
+++ libxml-twig-perl-3.33~20081014/Twig_pm.slow 2008-10-14 22:43:22.000000000 -0400
@@ -2865,6 +2865,10 @@
 sub print
   { my $t= shift;
     my $fh= _is_fh( $_[0]) ? shift : undef;
+    if(defined $fh)
+      { binmode $fh,":utf8"
+          if($t->{output_encoding} && $t->{output_encoding} =~ /^utf-8$/i);
+      }
     my %args= _normalize_args( @_);
 
     my $old_select = defined $fh ? select $fh : undef;
@@ -7695,6 +7699,9 @@
 
     my $pretty;
     my $fh= _is_fh( $_[0]) ? shift : undef;
+# This breaks IO::Scalar tests, but that may be a bug in IO::Scalar
+# if(defined $fh) { binmode $fh,":utf8" unless($output_filter); }
+
     my $old_select= defined $fh ? select $fh : undef;
     my $old_pretty= defined ($pretty= shift) ? set_pretty_print( $pretty) : undef;
     $pretty ||=0;
The new unit test is attached here. And I'm an idiot because I missed the 'Add More Files' button right in front of my face. Ah well.
use strict;
use utf8;

use Test::More tests => 13;
binmode(Test::More->builder->failure_output,':utf8');

BEGIN { use_ok('XML::Twig') };

my $twig = XML::Twig->new(
    output_encoding => 'utf-8',
    pretty_print    => 'record'
);

my $u8string = "Gr\x{00FC}\x{00DF} Dich!";
utf8::upgrade($u8string);
my $twigstring = "<twiggreet>" . $u8string . "</twiggreet>\n";
utf8::upgrade($twigstring);
my $fh;

# XML::Twig->print
$twig->parse($twigstring);
open($fh,">:encoding(utf8)",'twig-unicode-openencutf8.xml') or die;
$twig->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twig-unicode-openencutf8.xml'),
   'twig output via ">:encoding(utf8)" is parseable');
is($twig->root->text,$u8string,
   'twig output via ">:encoding(utf8)" is correct');

$twig->parse($twigstring);
open($fh,">:utf8",'twig-unicode-openutf8.xml') or die;
$twig->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twig-unicode-openutf8.xml'),
   'twig output via ">:utf8" is parseable');
is($twig->root->text,$u8string,
   'twig output via ">:utf8" is correct');

$twig->parse($twigstring);
open($fh,">",'twig-unicode-open.xml') or die;
$twig->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twig-unicode-open.xml'),
   'twig output via ">" is parseable');
is($twig->root->text,$u8string,
   'twig output via ">" is correct');

# XML::Twig::Elt->print
$twig->parse($twigstring);
open($fh,">:encoding(utf8)",'twigelt-unicode-openencutf8.xml') or die;
print {$fh} $twig->prolog;
$twig->root->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twigelt-unicode-openencutf8.xml'),
   'twigelt output via ">:encoding(utf8)" is parseable');
is($twig->root->text,$u8string,
   'twigelt output via ">:encoding(utf8)" is correct');

$twig->parse($twigstring);
open($fh,">:utf8",'twigelt-unicode-openutf8.xml') or die;
print {$fh} $twig->prolog;
$twig->root->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twigelt-unicode-openutf8.xml'),
   'twigelt output via ">:utf8" is parseable');
is($twig->root->text,$u8string,
   'twigelt output via ">:utf8" is correct');

TODO: {
    local $TODO = "Fixing utf-8 output for XML::Twig::Elt->print breaks IO::Scalar tests";
    $twig->parse($twigstring);
    open($fh,">",'twigelt-unicode-open.xml') or die;
    print {$fh} $twig->prolog;
    $twig->root->print(\*$fh);
    close($fh) or die;
    ok($twig->safe_parsefile('twigelt-unicode-open.xml'),
       'twigelt output via ">" is parseable');
    is($twig->root->text,$u8string,
       'twigelt output via ">" is correct');
}

unlink('twig-unicode-openencutf8.xml');
unlink('twig-unicode-openutf8.xml');
unlink('twig-unicode-open.xml');
unlink('twigelt-unicode-openencutf8.xml');
unlink('twigelt-unicode-openutf8.xml');
unlink('twigelt-unicode-open.xml');
Subject: Re: [rt.cpan.org #39849] XML::Twig output-encoding utf-8 doesn't play nice with binmode :utf8
Date: Wed, 15 Oct 2008 10:31:39 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: mirod <xmltwig [...] gmail.com>
Zed Pobre via RT wrote:
> Queue: XML-Twig
> Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=39849 >
>
> Okay, now that I've had the time to poke at it again, the problem is
> that XML::Twig is no longer doing the right thing when the filehandle
> *isn't* binmode :utf8 -- no conversion is being applied at all, even
> when encoding is set for utf-8.
The goal is to be able to mix XML::Twig's print method and regular Perl print.

This can be done by opening the filehandle in :utf8 mode. So you have a way to get the output you want. Beyond that, I am not sure what I can do, especially as there is no way that I know of to determine what the IO layer applied to an open filehandle is (I might be wrong).

If I understand correctly, all these complex tests that try to figure out whether the output filter needs or needs not be applied are there only to avoid you printing the XML declaration. Which you don't need if you're working in unicode. So I am a bit hesitant to go much further down that road for what I see as a not-so-compelling reason.

Unicode is tricky in regular Perl. If you print a string to a filehandle that is not open as :utf8, then it will be converted to latin-1 (in order for older code to still work the same). Which is what will happen here, whether the print is done by XML::Twig or by a regular print. That's consistent.

Bottom line, if you're working with utf8 and your filehandle is not open as :utf-8, you are just asking for trouble. XML::Twig cannot make things easier than they are in Perl.

In short, you should open all of your output files with a :utf8 layer, and the code should work, and maybe drop the output_encoding option, which was created to allow output in non-utf8, IIRC in the days before 5.8, when encoding conversions weren't as easy as specifying the encoding layer when opening the file. I might add something about it, or even write a piece in the docs about how to handle encodings.

I will look at your patch, but it doesn't seem to take into account the keep_encoding option, that keeps the original encoding of the XML, and which is a feature that's (still) quite popular. So I have to test it some more.

--
mirod
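The latin-1 downgrade described above can be checked in plain Perl, independently of XML::Twig. This is a minimal sketch using in-memory filehandles; the helper name `written_hex` is hypothetical. It prints the same upgraded string to a handle with no layer and to a handle with a :utf8 layer, then dumps the bytes each one received.

```perl
use strict;
use warnings;

# Hypothetical helper: print $data to an in-memory file opened with $mode
# and return a hex dump of the bytes that were written.
sub written_hex {
    my ($mode, $data) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "open failed: $!";
    print {$fh} $data;
    close($fh) or die "close failed: $!";
    return unpack('H*', $buffer);
}

my $text = "Gr\x{00FC}\x{00DF} Dich!";
utf8::upgrade($text);    # internally UTF-8, as in the test files in this ticket

# No layer: characters below 0x100 go out as single latin-1 bytes.
my $raw_hex  = written_hex('>', $text);

# :utf8 layer: the same characters go out as UTF-8 byte sequences.
my $utf8_hex = written_hex('>:utf8', $text);

print "no layer: $raw_hex\n";
print ":utf8:    $utf8_hex\n";
```

A string containing characters above 0xFF behaves differently on a handle without a layer (Perl warns about a wide character), which this sketch does not cover.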
> If I understand correctly, all these complex tests that try to figure
> out whether the output filter needs or needs not be applied are there
> only to avoid you printing the XML declaration. Which you don't need
> if you're working in unicode. So I am a bit hesitant to go much
> further down that road for what I see as a not-so-compelling reason.
Oh, if you're only concerned about my personal problem, sure. As long as I can universally set :utf8 on my filehandles and stop thinking about them, I'd be perfectly happy.

I went to the trouble of trying to improve your patch because after our previous conversation on the other bug report I thought that _you_ would be very unhappy to discover that what you had done changed the output between previous versions of XML::Twig and your latest version. I figured that the general goal here is to have XML::Twig with "output_encoding => utf-8" produce the correct output whether or not it is printing to a handle with the binmode layer set.

You appear to be correct that there is no way to detect the presence of a binmode layer -- but you *can* force it, with minimal side effects, so that's what I did.

> Unicode is tricky in regular Perl. If you print a string to a
> filehandle that is not open as :utf8, then it will be converted to
> latin-1 (in order for older code to still work the same). Which is
> what will happen here, whether the print is done by XML::Twig or by
> a regular print. That's consistent.
You're aware that versions of XML::Twig prior to 20081014 *don't* convert to latin-1 when printing to a filehandle without a :utf8 layer set, right? Previously, if you set output_encoding => utf-8, and then printed to a filehandle with no output layer set, you got utf-8 output (you can see this by running the unit test I provided both against the current version and any prior version, including the stable version -- the 'twig output via ">"' tests pass).

> Bottom line, if you're working with utf8 and your filehandle is not
> open as :utf-8, you are just asking for trouble.
Well, this is what I ordinarily would have thought, yes, but I'm a little surprised to hear you take that stance given your strong reaction to breaking backwards-compatibility in my other bug report. Proceeding with the 20081014 code as is will break the output of every program relying on output_encoding=>utf-8 to generate utf-8 output when printing to a filehandle with no layer.

> In short, you should open all of your output files with a :utf8 layer,
> and the code should work, and maybe drop the output_encoding option,
> which was created to allow output in non-utf8, IIRC in the days
> before 5.8, when encoding conversions weren't as easy as specifying
> the encoding layer when opening the file. I might add something about
> it, or even write a piece in the docs about how to handle encodings.
This is a perfectly fine solution by me, and in fact, it tends towards the way that I would have been inclined to solve the problem, because it seems like the Perl Way to handle things, and because XML::Twig::Elt->print is looking like an increasingly sticky problem to solve.

I've already comment-tagged all of the spots in my current project where Twig is printing to a filehandle just to make sure that I didn't accidentally :utf8 them in the future, so it wouldn't take me at all long to force them back to :utf8 and declare a dependency on Twig >= 3.34, or whatnot, but again, I'm a little surprised to hear you talking this way given your dedication to not changing existing behaviour. Maybe I shouldn't be trying to talk you out of it, though, since it would make my life much easier. :P

If you drop the output_encoding option entirely, however, please at least replace it with 'encoding' or some other way to have set_encoding called automatically, so you can still request that XML declarations be printed via a single new().

> I will look at your patch, but it doesn't seem to take into account
> the keep_encoding option, that keeps the original encoding of the
> XML, and which is a feature that's (still) quite popular. So I have
> to test it some more.
Hm, this is true of the commented-out XML::Twig::Elt->print changes. That's hard to fix, unless there's some way of finding out, from inside the element, whether keep_encoding was set on the twig containing it. It looks like you'd have to have every element inherit the keep_encoding attribute from its parent twig somehow.

The XML::Twig->print section, however, assumes that if someone set output_encoding => utf-8 that they wouldn't also set keep_encoding and vice versa (the patch only triggers if output_encoding matches utf-8). Are you expecting that someone would try to use both output_encoding => utf-8 and keep_encoding => 1 at the same time?

I've been playing around with this a bit today, and I made an interesting discovery when I went to update the patch to explicitly recognize keep_encoding: it doesn't matter if my code triggers or not -- if you set both output_encoding => utf-8 and keep_encoding => 1, text from another codepage, whether set via set_text or imported from an external file, ends up not only utf-8 but *double-encoded* utf-8, even when the output filehandle has no layer. This holds all the way back to Twig 3.32. If you were intending keep_encoding to continue to function with output_encoding set, there's a deeper issue than what we've been working on. Personally, I think it would not be unwarranted to croak (or at least carp a warning) if someone set both of those options at the same time during a new().

I'm attaching the latest version of my patch, which makes explicit the XML::Twig->print handling of keep_encoding, and adds a note about keep_encoding to the commented-out addition in XML::Twig::Elt->print. I'm also attaching a new test file that explicitly tests for breakage using keep_encoding => 1, and also has a TODO test covering when keep_encoding => 1 and output_encoding => utf-8 are simultaneously set, just as an informative reference.
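The suggestion to croak when both options are passed could look something like the sketch below. This is not XML::Twig code: the function `check_encoding_options` is hypothetical, and how such a check would be wired into new() is left open. It only shows the argument test itself, using the core Carp module.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical validation helper, not part of XML::Twig: reject a
# constructor argument list that asks for both keep_encoding and
# output_encoding, since the two were observed above to interact badly.
sub check_encoding_options {
    my (%args) = @_;
    if ($args{keep_encoding} && defined $args{output_encoding}) {
        croak "keep_encoding and output_encoding should not be combined";
    }
    return 1;
}

# Either option alone passes the check.
check_encoding_options(output_encoding => 'utf-8');
check_encoding_options(keep_encoding => 1);

# Both together are rejected; eval traps the exception for display.
my $ok = eval {
    check_encoding_options(output_encoding => 'utf-8', keep_encoding => 1);
};
print "rejected: $@" unless $ok;
```

Replacing croak with carp would turn the hard failure into a warning, which is the softer alternative mentioned above.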
use strict;
use utf8;

use Cwd qw(chdir getcwd);
use File::Basename qw(basename);
use Test::More tests => 18;
binmode(Test::More->builder->failure_output,':utf8');
binmode(Test::More->builder->todo_output,':utf8');
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";

BEGIN { use_ok('XML::Twig') };                        # test 1

ok( (basename(getcwd()) eq 't') || chdir('t/'),       # test 2
    "Working in 't/" ) or die;

my $twig = XML::Twig->new(
    output_encoding => 'utf-8',
    pretty_print    => 'record'
);

my $cp1252file = 'test_unicode_fh-1252_input.xml';
my $latin1string = "Gr\x{00FC}\x{00DF} Dich!";
my $u8string = "Gr\x{00FC}\x{00DF} Dich!";
utf8::upgrade($u8string);
my $twigstring = "<greet>" . $u8string . "</greet>\n";
utf8::upgrade($twigstring);
my $fh;
my @result;

$twig->parse($twigstring);
open($fh,">:encoding(utf8)",'twig-unicode-openencutf8.xml') or die;
$twig->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twig-unicode-openencutf8.xml'),       # test 3
   'twig output via ">:encoding(utf8)" is parseable');
is($twig->root->text,$u8string,                                 # test 4
   'twig output via ">:encoding(utf8)" is correct');

$twig->parse($twigstring);
open($fh,">:utf8",'twig-unicode-openutf8.xml') or die;
$twig->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twig-unicode-openutf8.xml'),          # test 5
   'twig output via ">:utf8" is parseable');
is($twig->root->text,$u8string,                                 # test 6
   'twig output via ">:utf8" is correct');

$twig->parse($twigstring);
open($fh,">",'twig-unicode-open.xml') or die;
$twig->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twig-unicode-open.xml'),              # test 7
   'twig output via ">" is parseable');
is($twig->root->text,$u8string,                                 # test 8
   'twig output via ">" is correct');

$twig->parse($twigstring);
open($fh,">:encoding(utf8)",'twigelt-unicode-openencutf8.xml') or die;
print {$fh} $twig->prolog;
$twig->root->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twigelt-unicode-openencutf8.xml'),    # test 9
   'twigelt output via ">:encoding(utf8)" is parseable');
is($twig->root->text,$u8string,                                 # test 10
   'twigelt output via ">:encoding(utf8)" is correct');

$twig->parse($twigstring);
open($fh,">:utf8",'twigelt-unicode-openutf8.xml') or die;
print {$fh} $twig->prolog;
$twig->root->print(\*$fh);
close($fh) or die;
ok($twig->safe_parsefile('twigelt-unicode-openutf8.xml'),       # test 11
   'twigelt output via ">:utf8" is parseable');
is($twig->root->text,$u8string,                                 # test 12
   'twigelt output via ">:utf8" is correct');

TODO: {
    local $TODO = "Fixing utf-8 output for XML::Twig::Elt->print breaks IO::Scalar tests";
    $twig->parse($twigstring);
    open($fh,">",'twigelt-unicode-open.xml') or die;
    print {$fh} $twig->prolog;
    $twig->root->print(\*$fh);
    close($fh) or die;
    ok($twig->safe_parsefile('twigelt-unicode-open.xml'),       # test 13
       'twigelt output via ">" is parseable');
    is($twig->root->text,$u8string,                             # test 14
       'twigelt output via ">" is correct');
}

# Check for twig_keep_encoding not altering output on filehandles
# with no layer
$twig = XML::Twig->new(keep_encoding => 1);
$twig->parse('<greet>Hi!</greet>');
$twig->root->set_text($latin1string);
open($fh,">",'twig-keepenc-open.xml') or die;
$twig->print(\*$fh);
close($fh) or die;
open($fh,"<",'twig-keepenc-open.xml') or die;
{
    local $/;
    @result = <$fh>;
}
close($fh) or die;
$result[0] =~ s/^\s+//;
chomp($result[0]);
is($result[0],"<greet>" . $latin1string . "</greet>",           # test 15
   'twig (keep_encoding) output via ">" is correct');

# Checking XML::Twig::Elt->print
open($fh,">",'twigelt-keepenc-open.xml') or die;
$twig->root->print(\*$fh);
close($fh) or die;
open($fh,"<",'twigelt-keepenc-open.xml') or die;
{
    local $/;
    @result = <$fh>;
}
close($fh) or die;
$result[0] =~ s/^\s+//;
chomp($result[0]);
is($result[0],"<greet>" . $latin1string . "</greet>",           # test 16
   'twigelt (keep_encoding) output via ">" is correct');

# Check for twig_keep_encoding not altering output on filehandles with
# no layer even if both keep_encoding and output_encoding are set
$twig = XML::Twig->new(
    output_encoding => 'utf-8',
    keep_encoding   => 1,
);
$twig->parsefile($cp1252file);
open($fh,">",'twig-utf8keepenc-open.xml') or die;
$twig->print(\*$fh);
close($fh) or die;
open($fh,"<",'twig-utf8keepenc-open.xml') or die;
{
    local $/;
    @result = <$fh>;
}
close($fh) or die;

# Split XML declaration and element
# (There's no way to ensure that XML::Twig outputs this in two lines?)
$result[0] =~ /^ (<.*?>) \s* (<.*>) $/x;
$result[0] = $1;
$result[1] = $2;
is($result[0],'<?xml version="1.0" encoding="utf-8"?>',         # test 17
   'twig (keep_encoding+utf8) output via ">" has utf-8 XML declaration');

TODO: {
    local $TODO = "keep-encoding and output_encoding are mutually exclusive?";
    is($result[1],"<greet>" . $u8string . "</greet>",           # test 18
       'twig (keep_encoding+utf8) output via ">" is correct');
}

unlink('twig-unicode-openencutf8.xml');
unlink('twig-unicode-openutf8.xml');
unlink('twig-unicode-open.xml');
unlink('twig-keepenc-open.xml');
unlink('twig-utf8keepenc-open.xml');
unlink('twigelt-unicode-openencutf8.xml');
unlink('twigelt-unicode-openutf8.xml');
unlink('twigelt-unicode-open.xml');
unlink('twigelt-keepenc-open.xml');
<?xml version="1.0" encoding="windows-1252"?>
<greet>Grüß Dich!</greet>
--- XML-Twig-3.33/Twig_pm.slow 2008-10-14 06:45:01.000000000 -0400
+++ libxml-twig-perl-3.33~20081014/Twig_pm.slow 2008-10-15 17:47:36.000000000 -0400
@@ -2865,6 +2865,11 @@
 sub print
   { my $t= shift;
     my $fh= _is_fh( $_[0]) ? shift : undef;
+    if( defined $fh
+        && !$t->{twig_keep_encoding}
+        && $t->{output_encoding}
+        && ($t->{output_encoding} =~ /^utf-8$/i) )
+      { binmode $fh,":utf8"; }
     my %args= _normalize_args( @_);
 
     my $old_select = defined $fh ? select $fh : undef;
@@ -7695,6 +7700,11 @@
 
     my $pretty;
     my $fh= _is_fh( $_[0]) ? shift : undef;
+# This maintains output consistency with older versions, but breaks
+# IO::Scalar tests, but that may be a bug in IO::Scalar
+# It also breaks keep_encoding => 1 output
+# if(defined $fh) { binmode $fh,":utf8" unless($output_filter); }
+
     my $old_select= defined $fh ? select $fh : undef;
     my $old_pretty= defined ($pretty= shift) ? set_pretty_print( $pretty) : undef;
     $pretty ||=0;
On Wed Oct 15 04:31:33 2008, xmltwig@gmail.com wrote:
> ...
> This can be done by opening the filehandle in :utf8 mode. So you have a way
> to get the output you want. Beyond that, I am not sure what I can do, especially
> as there is no way that I know of to determine what the IO layer applied to an
> open filehandle is (I might be wrong).
> ...
Saw this thread by accident - I think this is what you are looking for: http://search.cpan.org/~nwclark/perl-5.8.9/lib/PerlIO.pm#Querying_the_layers_of_filehandles I myself used this before, but can't seem to find the code in question right now. Cheers
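For reference, the layer query documented at that link is PerlIO::get_layers, which is built into Perl 5.8 and later. A minimal sketch of how it might be used to detect a utf8 layer follows; the helper name `has_utf8_layer` is made up for illustration, and in-memory filehandles stand in for real files.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the output side of $fh carries a utf8
# layer. PerlIO::get_layers is a core function and needs no "use" line.
sub has_utf8_layer {
    my ($fh) = @_;
    my @layers = PerlIO::get_layers($fh, output => 1);
    return (grep { $_ eq 'utf8' } @layers) ? 1 : 0;
}

my ($plain_buf, $utf8_buf) = ('', '');
open(my $plain, '>', \$plain_buf)     or die "open failed: $!";
open(my $utf8,  '>:utf8', \$utf8_buf) or die "open failed: $!";

printf "plain handle has utf8 layer: %d\n", has_utf8_layer($plain);
printf "utf8 handle has utf8 layer:  %d\n", has_utf8_layer($utf8);

# binmode after the fact is visible to the same query.
binmode($plain, ':utf8');
printf "plain handle after binmode:  %d\n", has_utf8_layer($plain);

close($plain) or die "close failed: $!";
close($utf8)  or die "close failed: $!";
```

Whether tied handles such as IO::Scalar objects answer this query usefully is not something this sketch checks.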