Bug #72659 for HTML-Scrubber: utf8 issues


This queue is for tickets about the HTML-Scrubber CPAN distribution.

Report information

The Basics

Id:	72659
Status:	rejected
Priority:	0
Queue:	HTML-Scrubber

People

Owner:	Nobody in particular
Requestors:	JIRA [...] cpan.org
Cc:
AdminCc:

Bug Information

Severity:	(no value)
Broken in:	(no value)
Fixed in:	(no value)

History

Wed Nov 23 06:55:00 2011 JIRA [...] cpan.org - Ticket created

Subject:	utf8 issues

There seems to be an issue with scrubbing UTF-8 encoded HTML. The returned data is not in Perl's internal encoding, so one has to decode it.

Wed Nov 23 08:18:57 2011 nigel.metheringham [...] gmail.com - Correspondence added

I just wrote a test for this and am not seeing issues... which quite likely means I do not understand things correctly, since UTF tends to be subtle and vengeful! Could you send me a failing test for this? It will make it much easier to fix, and to show that it's fixed. Failing that, some sample code. Nigel.

Wed Nov 23 08:18:58 2011 The RT System itself - Status changed from 'new' to 'open'

Tue Feb 07 16:01:44 2012 nigel.metheringham [...] gmail.com - Correspondence added

Still awaiting some failure examples for this - if the input string is correctly labeled as utf8 then there should be no issues. If, however, you have a byte string with utf8 content, you are lying about the character set to the code and nasty things may happen - in that sort of case you should set the input filehandle encoding or explicitly decode/encode the string. I intend to close this off unless I get some form of further info, as I cannot reproduce an issue.
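
For illustration only, here is a minimal sketch of the explicit decode/encode step described in the comment above. It is not taken from the ticket: the sample string and variable names are assumptions, and it uses only the core Encode module. It decodes a UTF-8 byte string into a Perl character string before any processing and encodes it again on output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: a byte string holding UTF-8 encoded HTML,
# such as data read from a filehandle with no encoding layer.
my $bytes = "<p>caf\xc3\xa9</p>";

# Decode once at the input boundary to get a Perl character string.
my $chars = decode('UTF-8', $bytes);

# The character string, not the raw bytes, is what would be handed
# to the scrubber at this point.

# Encode again at the output boundary.
my $out = encode('UTF-8', $chars);

# The two forms differ in length: the accented letter is two bytes
# but a single character.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

Reading the input through a filehandle opened with an `:encoding(UTF-8)` layer would likely make the explicit decode call unnecessary, which is the other option the comment mentions.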

Sat Dec 22 14:39:56 2012 nigel.metheringham [...] gmail.com - Correspondence added

Tests I have run show that the module is UTF-8 clean, and there has been no response from the original reporter giving any further information regarding the bug.

Sat Dec 22 14:39:57 2012 nigel.metheringham [...] gmail.com - Status changed from 'open' to 'rejected'