CC: | Miltiadis Koutsokeras <m.koutsokeras [...] biovista.com> |
Subject: | [HTML::Entities] BUG: decoding valid UTF-8 when decodes multiple entities |
Date: | Thu, 10 May 2012 16:30:02 +0300 |
To: | bug-HTML-Parser [...] rt.cpan.org |
From: | Vassilis Virvilis <v.virvilis [...] biovista.com> |
Hi,
I am running Debian unstable with HTML-Parser 3.69.
When decode_entities from HTML::Entities encounters the valid UTF-8
byte sequence CF87 (Greek chi), it leaves it unchanged, as it should
(see input-correct).
When the input file contains an HTML entity 𝒮 together with CF87
(see input-bug), it correctly decodes the entity but also transforms
CF87 into C38FC287, which is wrong.
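For reference, the byte change above is what you would get if each byte of the UTF-8 input were treated as a Latin-1 character and then encoded as UTF-8 a second time. This is only an assumption about the cause, not something confirmed from the module source. The sketch below is not the attached Perl script; it is a small Python illustration of the byte arithmetic, and the helper name double_encode is hypothetical.

```python
def double_encode(raw: bytes) -> bytes:
    """Treat each input byte as a Latin-1 character, then encode as UTF-8.

    Assumed failure mode only; it reproduces the bytes in this report.
    """
    return raw.decode("latin-1").encode("utf-8")


# UTF-8 encoding of GREEK SMALL LETTER CHI (U+03C7).
chi_utf8 = "\u03c7".encode("utf-8")
mangled = double_encode(chi_utf8)

print(chi_utf8.hex().upper())  # CF87
print(mangled.hex().upper())   # C38FC287
```

If this is indeed what happens, decoding the input bytes to a character string before calling decode_entities, and encoding the result on output, should avoid the problem.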
You can run the examples with:
$>./bug_html_decode_entities.pl < input-correct > output-correct
$>./bug_html_decode_entities.pl < input-bug > output-bug
Hope that helps.
Best regards,
--
__________________________________
Vassilis Virvilis Ph.D.
Head of IT
Biovista Inc.
US Offices
2421 Ivy Road
Charlottesville, VA 22903
USA
T: +1.434.971.1141
F: +1.434.971.1144
European Offices
34 Rodopoleos Street
Ellinikon, Athens 16777
GREECE
T: +30.210.9629848
F: +30.210.9647606
www.biovista.com
Biovista is a privately held biotechnology company that finds novel uses
for existing drugs, and profiles their side effects using their
mechanism of action. Biovista develops its own pipeline of drugs in CNS,
oncology, auto-immune and rare diseases. Biovista is collaborating with
biopharmaceutical companies on indication expansion and de-risking of
their portfolios and with the FDA on adverse event prediction.
Message body is not shown because sender requested not to inline it.