
This queue is for tickets about the XML-Twig CPAN distribution.

Report information
The Basics
Id: 43486
Status: open
Priority: 0/
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: drieux [...] wetware.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 3.32
Fixed in: (no value)



Subject: XML::Twig keep_encoding option fails if previous XML::Twig found a special case.
I have attached a tarball that contains problem_case.t, as well as other demonstration code and the associated test_data files.

The problem arises only if one needs to use the keep_encoding option.

We need to use it for processing XML documents that can contain special Unicode characters, such as bullets, as well as XML that may have been saved in an encoding other than UTF-8.

The problem occurs if an XML::Twig is constructed without keep_encoding set, and twig->parse() is called on content that contains the XML element <Chapter>.

Subsequent XML::Twigs that do need keep_encoding set will then fail to process special characters properly:

# got: '<Chapter>The Bullet• you install:</Chapter>'
# expected: '<Chapter>The Bullet• you install:</Chapter>'

I have not had time to work out why <Chapter> causes this problem. But it appears to be unique, since <Chapters>, <chapter> and <Chapte> do not cause the error.
Subject: xml_twig_error.tar
application/x-tar 40k


On Fri Feb 20 22:25:01 2009, DRIEUX wrote:
> I have attached a tarball that contains the problem_case.t, as well as
> other demonstration code, and associated test_data files.
>
> The problem arises only if one needs to use the keep_encoding option.
>
> We need to use it for processing XML documents that can contain special
> unicode characters, such as bullets, as well as XML that may have been
> saved in a file format other than utf-8.
>
> The problem occurs if an XML::Twig is constructed without the
> keep_encoding set, and the twig->parse() is called on content that
> contains the XML element <Chapter>.
>
> Subsequent XML::Twigs, which need to have the keep_encoding set will
> fail to process special characters properly:
>
> # got: '<Chapter>The Bullet• you install:</Chapter>'
> # expected: '<Chapter>The Bullet• you install:</Chapter>'
>
> I have not had time to resolve why it is the <Chapter> causes this
> problem. But it appears to be uniq, since <Chapters>, <chapter> and
> <Chapte> will not cause the error.
I am sorry I did not reply before, but for some reason I did not get the RT notification.

I looked into it, and it looks like a tricky one. Encoding problems often are. I'll keep looking at it, and see if I can figure something out.

__
mirod
This is one of the weirdest bugs I have ever seen. Even <d><e>The Bullet• you install:</e><Chapter/></d> causes the problem, but any other spelling of Chapter doesn't.

I am a bit stumped. I have no idea why 'Chapter' would be special and cause the string to be treated any differently from any other tag. Especially as the problem occurs even *before* the Chapter tag is parsed, as you can see if you add a handler to output the element content during the parsing:

    my $twig2 = XML::Twig->new(%params, twig_handlers => { _all_ => sub { warn "DUMP: ", $_->gi, ": ", $_->sprint }});

There are so many stars that need to be exactly aligned for this bug to occur that I think it's either something completely obvious that I can't see for now, or a random interaction between expat at the C level and some perl encoding oddity, triggered by XML::Twig's clumsy attempts at working sort of the same both in utf8 mode and natively with other encodings.

__
mirod
Subject: Re: [rt.cpan.org #43486] XML::Twig keep_encoding option fails if previous XML::Twig found a special case.
Date: Thu, 7 May 2009 16:07:57 -0700
To: bug-XML-Twig [...] rt.cpan.org
From: drieux <drieux [...] wetware.com>
On May 7, 2009, at 10:55 AM, MIROD via RT wrote:
[..]
> This is one of the weirdest bug I have ever seen. Even <d><e>The Bullet•
> you install:</e><Chapter/></d> causes the problem, but anything other
> spelling for Chapter doesn't.
>
> I am a bit stumped. I have no idea why 'Chapter' would be special and
> cause the string to be treated any different than any other tag.

That was the sheer madness that I ran into. Also why I logged the bug, since, well, it is reproducible.

> Especially as the problem occurs even *before* the Chapter tag is
> parsed, as you can see if you add a handler to output the element
> content during the parsing:
> my $twig2 = XML::Twig->new(%params, twig_handlers => { _all_ =>
> sub { warn "DUMP: ", $_->gi, ": ", $_->sprint }});

HUM.....

> There are so many stars that need to be exactly aligned for this bug to
> occur that I think it's either something completely obvious that I can't
> see for now, or a random interaction between expat at the C level and
> some perl encoding oddity, triggered by XML::Twig's clumsy attempts at
> working sort of the same both in utf8 mode and natively with other
> encodings.

That was my presumption, but at the time I was working on a 'single source document' system that used <Chapter> for every chapter in a book, so I had no time then to do the work on it.

What I will do is see if I can get any time to go into the process and figure out if there is more detail there.

ciao
drieux

----
Subject: Re: [rt.cpan.org #43486] XML::Twig keep_encoding option fails if previous XML::Twig found a special case.
Date: Fri, 08 May 2009 18:00:39 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: mirod <xmltwig [...] gmail.com>
So I got a little further.

The problem doesn't happen when I replace the call to parse (on a string) by a call to parsefile, on the original file.

So I displayed the utf8'edness of the strings before being parsed (checked with Encode::is_utf8), and the result is below (note that I changed Chapter to tChapter in the first test): the problem comes when the string want_round_trip contains utf8 characters and it has the utf8 flag... still weird. I don't know why it loses the utf8 flag in the first test, and in any case I would think that the test would fail when it does NOT have the flag, not when it has it.

At least it looks like if you parse from the file you don't have a problem. Do you find this to be the case when you do this?

# string is NOT utf8
# xml: '<tChapter>The Bullet• you install:</tChapter>'
ok 1 - roundTripSafe - xml from noHeading.xml equals sprint from twig root
# round trip: '<tChapter>The Bullet• you install:</tChapter>'
# want_round_trip string is NOT utf8
ok 2 - roundTripSafe - noHeading.xml after processing 'specialCase'
# string is NOT utf8
# xml: '<d><e>The Bullet• you install:</e><Chapter>The Bullet• you install:</Chapter></d>'
ok 3 - roundTripSafe - xml from heading.xml equals sprint from twig root
# round trip: '<d><e>The Bullet• you install:</e><Chapter>The Bullet• you install:</Chapter></d>'
# want_round_trip string is utf8
not ok 4 - roundTripSafe - heading.xml after processing 'specialCase'
# Failed test 'roundTripSafe - heading.xml after processing 'specialCase''
# at problem_case.t line 124.
# got: '<d><e>The Bullet⢠you install:</e><Chapter>The Bullet⢠you install:</Chapter></d>'
# expected: '<d><e>The Bullet• you install:</e><Chapter>The Bullet• you install:</Chapter></d>'
# string is NOT utf8
# xml: '<foo><Chapter>&lt;PsuedoElement> start&#160;finish</Chapter></foo>'
ok 5 - roundTripSafe - xml from compound.xml equals sprint from twig root
# round trip: '<foo><Chapter>&lt;PsuedoElement> start&#160;finish</Chapter></foo>'
# want_round_trip string is utf8
ok 6 - roundTripSafe - compound.xml after processing 'specialCase'
1..6
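For reference, the garbled 'got' string in test 4 is consistent with the bullet's UTF-8 octets (E2 80 A2) being reread as three ISO-8859-1 characters and then written out as UTF-8 a second time. The sketch below reproduces only that shape, using core Encode and nothing from XML::Twig or expat. The helper name non_ascii is made up for illustration, and the sketch says nothing about where in the stack the misreading actually happens.

```perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

# Hypothetical helper: hex code points of the non-ASCII characters in a string.
sub non_ascii {
    my ($s) = @_;
    return join ' ', map { sprintf '%04X', ord($_) }
                     grep { ord($_) > 0x7F } split //, $s;
}

# Character string: U+2022 BULLET is a single character, utf8 flag on.
my $chars = "The Bullet\x{2022} you install:";

# Octet string: the same text as UTF-8, the bullet is E2 80 A2, flag off.
my $octets = encode('UTF-8', $chars);

# Rereading those octets as ISO-8859-1 turns each byte into its own character.
my $garbled = decode('ISO-8859-1', $octets);

# Writing the garbled characters back out as UTF-8 gives the double-encoded form.
my $double = encode('UTF-8', $garbled);

printf "chars:   flag=%d non-ascii=%s\n", is_utf8($chars)   ? 1 : 0, non_ascii($chars);
printf "octets:  flag=%d non-ascii=%s\n", is_utf8($octets)  ? 1 : 0, non_ascii($octets);
printf "garbled: flag=%d non-ascii=%s\n", is_utf8($garbled) ? 1 : 0, non_ascii($garbled);
printf "double:  flag=%d non-ascii=%s\n", is_utf8($double)  ? 1 : 0, non_ascii($double);
```

The middle character of the garbled triple is U+0080, a control character that most terminals do not display, which would explain why only two odd glyphs are visible where the bullet should be.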
Subject: Re: [rt.cpan.org #43486] XML::Twig keep_encoding option fails if previous XML::Twig found a special case.
Date: Fri, 8 May 2009 15:33:06 -0700
To: bug-XML-Twig [...] rt.cpan.org
From: drieux <drieux [...] wetware.com>
On May 8, 2009, at 9:01 AM, xmltwig@gmail.com via RT wrote:
> The problem doesn't happen when I replace the call to parse (on a
> string) by a call to parsefile, on the original file.

ok, not sure about that. So that you understand a bit more of the process we used, there are Test::Class based modules that would need to create XML from 'strings' with parse() because these did not need to actually have a file, so we were not using parsefile().

> So I displayed the utf8'edness of the strings before being parsed
> (checked with Encode::is_utf8), and the result is below (note that I
> changed Chapter to tChapter in the first test): the problem comes when
> the string want_round_trip contains utf8 characters and it has the utf8
> flag... still weird. I don't know why it loses the utf8 flag in the
> first test, and in any case I would think that the test would fail when
> it does NOT have the flag, not when it has it.

ouch. Whatever is happening there seems to be the majik that needs to be fixed.

> At least it looks like if you parse from the file you don't have a
> problem. Do you find this to be the case when you do this?

I am not on that project at this moment, but a part of the problem is going to be the need to use both parse() and parsefile() and still have the data round trip safe. I cannot recall what our workaround was; it may have been that all data read from files went through parsefile().

ciao
drieux

----
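Since the failure above was only seen when the string handed to parse() carried the utf8 flag, one thing that could be tried is to give parse() octets, which is what a parser reading from a file sees. The helper below is a sketch under that assumption: to_octets is a hypothetical name, it is not part of XML::Twig, and it has not been run against the attached problem_case.t. It uses only perl's built-in utf8:: functions.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of XML::Twig.
# A string with the utf8 flag on is encoded to UTF-8 octets; a string
# without the flag is assumed to already hold octets in the document's
# own encoding and is returned unchanged.
sub to_octets {
    my ($s) = @_;                           # works on a copy of the caller's string
    utf8::encode($s) if utf8::is_utf8($s);  # encodes in place and clears the flag
    return $s;
}

# Flagged character string becomes E2 80 A2 for the bullet.
my $from_chars = to_octets("The Bullet\x{2022} you install:");

# Unflagged octet string passes through untouched.
my $from_bytes = to_octets("The Bullet\xE2\x80\xA2 you install:");

printf "same octets: %s\n", ($from_chars eq $from_bytes ? 'yes' : 'no');
```

Usage would be along the lines of $twig->parse(to_octets($xml)). Whether this is enough to keep parse() and parsefile() round trip safe together under keep_encoding is untested here.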