Bug #77108 for HTML-Parser: [HTML::Entities] BUG: decoding valid UTF-8 when decodes multiple entities

Thu May 10 09:30:21 2012 v.virvilis [...] biovista.com - Ticket created

CC:	Miltiadis Koutsokeras <m.koutsokeras [...] biovista.com>
Subject:	[HTML::Entities] BUG: decoding valid UTF-8 when decodes multiple entities
Date:	Thu, 10 May 2012 16:30:02 +0300
To:	bug-HTML-Parser [...] rt.cpan.org
From:	Vassilis Virvilis <v.virvilis [...] biovista.com>

Hi,

I am running Debian unstable with html-parser 3.69.

When HTML::Entities' decode_entities encounters the valid UTF-8 character CF87 (Greek chi), it leaves it unchanged, as it should (input-correct).

When the input file contains both an HTML entity for 𝒮 (input-bug) and CF87, it correctly transforms the HTML entity, but it also transforms CF87 to C38FC287, which is wrong.

You can run the examples with:

$>./bug_html_decode_entities.pl < input-correct > output-correct
$>./bug_html_decode_entities.pl < input-bug > output-bug

Hope that helps.

Best regards,

Vassilis Virvilis Ph.D.
Head of IT, Biovista Inc.


Sun May 13 08:24:09 2012 GAAS [...] cpan.org - Correspondence added

The string passed to decode_entities() needs to be decoded first to become a proper Unicode string. When you read it directly from a file, it is still UTF-8 encoded bytes.

$ perl -MHTML::Entities -MEncode -le 'print encode_utf8(decode_entities(decode_utf8("\xCF\x87𝒮")))'

You can ask perl to do this on input/output automatically with the -CS option. If you run:

$ perl -CS bug_html_decode_entities.pl <input-bug.txt

I believe you will see the expected output (instead of the "bug").

Sun May 13 08:24:54 2012 The RT System itself - Status changed from 'new' to 'open'

Sun May 13 08:24:54 2012 GAAS [...] cpan.org - Status changed from 'open' to 'rejected'

Mon May 14 03:19:17 2012 v.virvilis [...] biovista.com - Correspondence added

Subject:	Re: [rt.cpan.org #77108] [HTML::Entities] BUG: decoding valid UTF-8 when decodes multiple entities
Date:	Mon, 14 May 2012 10:19:03 +0300
To:	bug-HTML-Parser [...] rt.cpan.org
From:	Vassilis Virvilis <v.virvilis [...] biovista.com>

On 13/05/2012 03:24 PM, Gisle_Aas via RT wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=77108>
>
> $ perl -CS bug_html_decode_entities.pl <input-bug.txt
>
> I believe you will see the expected output (instead of the "bug").

I can confirm that this works. I wasn't aware of the -CS family of options. Thank you very much for the insight.

Vassilis

Mon May 14 03:19:18 2012 The RT System itself - Status changed from 'rejected' to 'open'

Tue May 15 17:19:05 2012 GAAS [...] cpan.org - Status changed from 'open' to 'rejected'