Bug #49528 for App-SD: publish --html HTML still doesn't handle non-ASCII correctly

Tue Sep 08 15:53:38 2009 nelhage [...] mit.edu - Ticket created

Subject:	publish --html HTML still doesn't handle non-ASCII correctly
Date:	Tue, 8 Sep 2009 15:53:10 -0400
To:	bug-App-SD [...] rt.cpan.org
From:	Nelson Elhage <nelhage [...] MIT.EDU>

c.f. http://nelhage.com/sd/barnowl/ticket/36934c17-1c54-5678-988b-9b16f177160d/view.html

It looks fine using 'sd server'.

Thu Sep 10 04:35:15 2009 jesse [...] fsck.com - Correspondence added

Subject:	Re: [rt.cpan.org #49528] publish --html HTML still doesn't handle non-ASCII correctly
Date:	Thu, 10 Sep 2009 04:35:07 -0400
To:	Nelson Elhage via RT <bug-App-SD [...] rt.cpan.org>
From:	Jesse Vincent <jesse [...] fsck.com>

Just fixed in prophet git.

On Tue 8.Sep'09 at 15:53:39 -0400, Nelson Elhage via RT wrote:

> Tue Sep 08 15:53:38 2009: Request 49528 was acted upon.
> Transaction: Ticket created by nelhage
> Queue: App-SD
> Subject: publish --html HTML still doesn't handle non-ASCII correctly
> Broken in: (no value)
> Severity: (no value)
> Owner: Nobody
> Requestors: nelhage@mit.edu
> Status: new
> Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=49528 >
>
> c.f. http://nelhage.com/sd/barnowl/ticket/36934c17-1c54-5678-988b-9b16f177160d/view.html
>
> It looks fine using 'sd server'.

Thu Sep 10 04:35:17 2009 The RT System itself - Status changed from 'new' to 'open'

Thu Sep 10 11:02:33 2009 nelhage [...] mit.edu - Correspondence added

Still not quite right for some reason. I re-published, and even though the content-type is now set, it looks like the Unicode characters are getting output as a byte stream, not a character stream -- something is converting each multibyte character into a sequence of HTML entities representing its underlying bytes.

Re-published version at http://nelhage.com/sd/barnowl/ticket/36934c17-1c54-5678-988b-9b16f177160d/view.html
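For readers following along, the symptom described above can be reproduced in isolation. The following is a minimal Python sketch, not the actual Prophet/SD publishing code (which is Perl); the function names and the sample string are hypothetical. It contrasts emitting one numeric entity per UTF-8 byte with emitting one entity per character.

```python
import html


def entities_from_bytes(text: str) -> str:
    """Hypothetical illustration of the symptom: the string is encoded to
    UTF-8 first, then every non-ASCII *byte* becomes its own entity."""
    return "".join(
        chr(b) if b < 0x80 else f"&#{b};"
        for b in text.encode("utf-8")
    )


def entities_from_chars(text: str) -> str:
    """Expected behaviour: every non-ASCII *character* (code point)
    becomes a single numeric entity."""
    return "".join(
        ch if ord(ch) < 0x80 else f"&#{ord(ch)};"
        for ch in text
    )


sample = "caf\u00e9"  # "café"; U+00E9 is the bytes C3 A9 in UTF-8

per_byte = entities_from_bytes(sample)
per_char = entities_from_chars(sample)

print(per_byte)                 # caf&#195;&#169;
print(per_char)                 # caf&#233;
print(html.unescape(per_byte))  # cafÃ© (what a browser would render)
print(html.unescape(per_char))  # café
```

Setting the content-type cannot repair the per-byte output, because the entities themselves already name the wrong code points; that would be consistent with the page still looking wrong after the first fix.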

Thu Sep 17 18:12:59 2009 jesse [...] fsck.com - Correspondence added

CC:	undisclosed-recipients: ;
Subject:	Re: [rt.cpan.org #49528] publish --html HTML still doesn't handle non-ASCII correctly
Date:	Thu, 17 Sep 2009 11:58:48 -0400
To:	Nelson Elhage via RT <bug-App-SD [...] rt.cpan.org>
From:	Jesse Vincent <jesse [...] fsck.com>

At this point, I suspect it's a bug in our HTML parsing chain. That path should definitely be rewritten, as it's insanely slow and a C dep for a core feature.

On Thu 10.Sep'09 at 11:02:34 -0400, Nelson Elhage via RT wrote:

> Queue: App-SD
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=49528 >
>
> Still not quite right for some reason. I re-published, and even though
> the content-type is now set, it looks like the unicode characters are
> getting output as a byte stream, not a character stream -- something is
> converting the multibyte character into a sequence of HTML entities
> representing the underlying bytes.
>
> Re-published version at
> http://nelhage.com/sd/barnowl/ticket/36934c17-1c54-5678-988b-9b16f177160d/view.html

Thu Jan 06 09:03:05 2011 spang [...] mit.edu - Correspondence added

I also believe that this is a bug in our HTML parsing chain. I can reproduce with HTML::TreeBuilder 3.23, but not with the recently-released 4.1. I'm going to bump our HTML::TreeBuilder dep and close this bug.

Spang

Thu Jan 06 09:03:06 2011 spang [...] mit.edu - Status changed from 'open' to 'resolved'