Bug #13561 for HTML-WikiConverter: UTF-8 not iso8859-1

Thu Jul 07 03:23:07 2005 Guest - Ticket created

Subject:

UTF-8 not iso8859-1

Hi,

Many thanks for your HTML::WikiConverter, very impressive.

Unfortunately, I haven't found anything about character encoding and UTF-8.

When using your Web form and calling an external URL (HTML in UTF-8) to be converted to wiki markup, I do not get my "e-acute"s in the Wiki output area but the two 8-bit characters corresponding to e-acute in UTF-8. That's not fully operational, but not wrong.

Now when I use your html2wiki Perl script on the same raw HTML file, I get "dÃ©mographiques" for "démographiques", i.e. the two 8-bit characters are each translated (probably from ISO-8859-1) into Atilde and the copyright sign.

I tried to look at the source code, but I don't know where the transformation is done, i.e. during which phase.

Also, where can I specify that the source is in UTF-8 and not ISO-8859-1? An environment variable? Or a meta tag in the HTML source?

Many thanks,
Nicolas
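The garbling described in this report can be reproduced in a few lines. The following is only a minimal sketch using the core Encode module, not code taken from html2wiki; the variable names are illustrative. It assumes the symptom comes from UTF-8 bytes being read as ISO-8859-1, which matches the "Atilde and copyright sign" output reported above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "démographiques" as a Perl character string (é is U+00E9).
my $text = "d\x{e9}mographiques";

# Encoding to UTF-8 turns é into the two bytes 0xC3 0xA9.
my $utf8_bytes = encode('UTF-8', $text);

# Misreading those bytes as ISO-8859-1 maps each byte to one character:
# 0xC3 becomes A-tilde and 0xA9 becomes the copyright sign.
my $garbled = decode('ISO-8859-1', $utf8_bytes);

# Reading the same bytes as UTF-8 recovers the original text.
my $correct = decode('UTF-8', $utf8_bytes);

binmode(STDOUT, ':encoding(UTF-8)');
print "garbled: $garbled\n";
print "correct: $correct\n";
```

The input bytes are identical in both cases; only the encoding assumed when decoding them differs.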

Fri Jul 08 02:40:41 2005 Guest - Correspondence added

From:

brouard [...] ined.fr

Looking at the source code in more detail, it appears that HTML::Entities only handles ISO-8859-1 in decode (from HTML entities to ISO-8859-1 numerics) and encode. The only solution, without an additional %entity2char table, is to convert the file with iconv before processing it with html2wiki:

    iconv -f UTF-8 -t ISO-8859-1 foo-utf8.htm > foo-88591.htm

To re-encode the output into UTF-8, you can use postprocess_output:

    sub postprocess_output {
        my( $self, $outref ) = @_;
        $$outref =~ s/é/Ã©/g;
        $$outref =~ s/è/Ã¨/g;
        $$outref =~ s/ê/Ãª/g;
        $$outref =~ s/à/Ã /g;
        $$outref =~ s/â/Ã¢/g;
        $$outref =~ s/ç/Ã§/g;
        $$outref =~ s/î/Ã®/g;
        $$outref =~ s/ô/Ã´/g;
        $$outref =~ s/ù/Ã¹/g;
        # etc.
    }

Cheers,
Nicolas
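The per-character substitutions in this workaround can probably be replaced by a single re-encoding step. The following is a hedged sketch using from_to from the core Encode module. The helper name latin1_to_utf8 is hypothetical, and the sketch assumes the output buffer holds ISO-8859-1 bytes, as it would after the iconv step. It is a standalone function, not a tested drop-in for postprocess_output.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: re-encode a buffer of ISO-8859-1 bytes as UTF-8,
# modifying the referenced scalar in place.
sub latin1_to_utf8 {
    my ($outref) = @_;
    from_to($$outref, 'ISO-8859-1', 'UTF-8');
    return;
}

# Sample ISO-8859-1 bytes: \xE9 is e-acute, \xE7 is c-cedilla.
my $output = "d\xE9mographiques, fran\xE7ais";
latin1_to_utf8(\$output);

# Each accented byte is now a two-byte UTF-8 sequence
# (\xE9 became \xC3\xA9, \xE7 became \xC3\xA7).
printf "%d bytes after conversion\n", length($output);
```

This covers every ISO-8859-1 character at once, so the substitution list does not need to be extended for each new accented letter.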

Wed Nov 09 16:57:43 2005 Guest - Correspondence added

[guest - Fri Jul 8 02:40:41 2005]:

> Looking into more details at the sources code, it appears that
> HTML::entities does only transform into ISO-8859-1 with decode (from
> HTML entities to ISO-8859-1 numerics) and encode.

The HTML::Entities documentation now mentions that the problem lies in Perl versions prior to 5.6, so upgrading Perl to 5.6 should allow correct escaping of Unicode.

Mon Jan 09 13:08:10 2006 diberri [...] cpan.org - Correspondence added 120 min

[guest - Thu Jul 7 03:23:07 2005]:

> Unfortunately, I haven't found anything about coding and UTF-8.

Multiple encodings are supported in 0.40. The default is to treat input as utf8.

> I tried to look at the source code but I don't know where the
> transformation is done. I mean during which phases.

It's done in the __encode_entities() method in HTML::WikiConverter.

> Also, where can I specify that the source is in UTF-8 and not
> iso-8859-1. Any environment? or meta in the html source?

Try again in version 0.40, passing the --encoding option to html2wiki, e.g.

    html2wiki --encoding utf8 input.html

Or in the HTML::WikiConverter constructor:

    my $wc = new HTML::WikiConverter( dialect => 'MediaWiki', encoding => 'utf8' );

I'm releasing 0.40 as we speak. Please test it and let me know if there's any way H::WC's encoding support can be improved.

-- David Iberri

Mon Jan 09 13:08:13 2006 diberri [...] cpan.org - Status changed from 'new' to 'open'

Mon Jan 09 13:08:15 2006 diberri [...] cpan.org - Given to DIBERRI

Sat Feb 25 19:21:10 2006 diberri [...] cpan.org - Status changed from 'open' to 'resolved'