
This queue is for tickets about the HTML-WikiConverter CPAN distribution.

Report information
The Basics
Id: 13561
Status: resolved
Worked: 3 hours (180 min)
Priority: 0/
Queue: HTML-WikiConverter

People
Owner: diberri [...] cpan.org
Requestors: brouard [...] ined.fr
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 0.30
Fixed in: (no value)



Subject: UTF-8 not iso8859-1
Hi,

Many thanks for your HTML::WikiConverter, very impressive.

Unfortunately, I haven't found anything about character encoding and UTF-8.

When using your Web form and calling an external URL (HTML in UTF-8) to be converted to wiki markup, I do not really get my "e-acute"s in the Wiki output area but the two 8-bit characters corresponding to e-acute in UTF-8. That's not fully operational but not wrong.

Now when I use your html2wiki Perl script on the same raw HTML file, I get "dÃ©mographiques" for "démographiques", i.e. the translation (probably from ISO-8859-1) of both 8-bit characters into "Atilde" and "copy" (Ã and ©).

I tried to look at the source code but I don't know where the transformation is done, i.e. during which phase.

Also, where can I specify that the source is in UTF-8 and not ISO-8859-1? An environment variable? Or a meta tag in the HTML source?

Many thanks,
Nicolas
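The symptom described above is what one would expect when UTF-8 bytes are read as ISO-8859-1. A minimal sketch of that effect, using only Perl's core Encode module (this is an illustration, not code from html2wiki or HTML::WikiConverter, and the variable names are made up for the example):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "e-acute" (U+00E9) as a Perl character string.
my $char = "\x{E9}";

# UTF-8 represents U+00E9 as two bytes: 0xC3 0xA9.
my $utf8_bytes = encode('UTF-8', $char);

# Misreading those bytes as ISO-8859-1 yields two characters:
# U+00C3 (A-tilde) and U+00A9 (copyright sign).
my $misread = decode('ISO-8859-1', $utf8_bytes);

# Reading the same bytes as UTF-8 recovers the single character.
my $correct = decode('UTF-8', $utf8_bytes);

printf "bytes:   %s\n", join ' ', map { sprintf '%02X',   ord($_) } split //, $utf8_bytes;
printf "misread: %s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;
printf "correct: %s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $correct;
# prints:
#   bytes:   C3 A9
#   misread: U+00C3 U+00A9
#   correct: U+00E9
```

The "misread" line corresponds to the "Atilde" plus "copy" pair seen in the html2wiki output.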
From: brouard [...] ined.fr
Looking in more detail at the source code, it appears that HTML::Entities only converts to and from ISO-8859-1, with decode (from HTML entities to ISO-8859-1 character codes) and encode.

The only solution, without an additional %entity2char table, is to use iconv:

    iconv -f UTF-8 -t ISO-8859-1 foo-utf8.htm > foo-88591.htm

before processing it via html2wiki. In order to re-encode into UTF-8 you can use postprocess_output:

    sub postprocess_output {
        my( $self, $outref ) = @_;
        $$outref =~ s/Ã©/é/g;
        $$outref =~ s/Ã¨/è/g;
        $$outref =~ s/Ãª/ê/g;
        $$outref =~ s/Ã /à/g;
        $$outref =~ s/Ã¢/â/g;
        $$outref =~ s/Ã§/ç/g;
        $$outref =~ s/Ã®/î/g;
        $$outref =~ s/Ã´/ô/g;
        $$outref =~ s/Ã¹/ù/g;
        # etc.
    }

Cheers,
Nicolas
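As an alternative to listing one substitution per accented character, the whole conversion could be done with Perl's core Encode module. The sketch below is an illustration under stated assumptions, not part of HTML::WikiConverter: the helper name latin1_to_utf8 is hypothetical, and it assumes the input really is ISO-8859-1 bytes (for example, text that went through the iconv step above).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert a byte string from ISO-8859-1 to UTF-8.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('ISO-8859-1', $bytes);   # bytes -> characters
    return encode('UTF-8', $chars);             # characters -> UTF-8 bytes
}

# "demographiques" with an e-acute, as ISO-8859-1 bytes (single byte 0xE9).
my $latin1 = "d\xE9mographiques";
my $utf8   = latin1_to_utf8($latin1);

printf "%d bytes in, %d bytes out\n", length($latin1), length($utf8);
# prints: 14 bytes in, 15 bytes out
```

Because the decode/encode pair covers every ISO-8859-1 character, no per-character table is needed; whether it can be called from a postprocess_output hook in a given HTML::WikiConverter version is not something this sketch verifies.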
[guest - Fri Jul 8 02:40:41 2005]:
> Looking in more detail at the source code, it appears that
> HTML::Entities only converts to and from ISO-8859-1, with decode (from
> HTML entities to ISO-8859-1 character codes) and encode.
HTML::Entities now mentions that the problem occurs in Perl prior to 5.6, so upgrading Perl to 5.6 or later should allow correct escaping of Unicode.
[guest - Thu Jul 7 03:23:07 2005]:
> Unfortunately, I haven't found anything about coding and UTF-8.

Multiple encodings are supported in 0.40. The default is to treat input as utf8.

> I tried to look at the source code but I don't know where the
> transformation is done. I mean during which phases.

It's done in the __encode_entities() method in HTML::WikiConverter.

> Also, where can I specify that the source is in UTF-8 and not
> iso-8859-1. Any environment? or meta in the html source?

Try again in version 0.40, passing the --encoding option to html2wiki, e.g.

    html2wiki --encoding utf8 input.html

Or in the HTML::WikiConverter constructor:

    my $wc = new HTML::WikiConverter( dialect => 'MediaWiki', encoding => 'utf8' );

I'm releasing 0.40 as we speak. Please test it and let me know if there's any way H::WC's encoding support can be improved.

-- David Iberri