Bug #8017 for Text-Unidecode: Text::Unidecode for any charset

Thu Oct 14 16:15:12 2004 SREZIC [...] cpan.org - Ticket created

Subject:

Text::Unidecode for any charset

It would be nice if Text::Unidecode could decode into any arbitrary charset, not only ASCII. A sample, probably inefficient, implementation could look like this:

    use Text::Unidecode;
    use Encode qw(encode);
    use charnames qw(:full);

    $tocharset = "iso-8859-1";
    $x = "\xfc\x{20ac}\N{HORIZONTAL ELLIPSIS}\N{LEFT DOUBLE QUOTATION MARK}";
    $res = "";
    for (split //, $x) {
        my $conv = encode($tocharset, $_);
        # For iso-8859-1, encode() substitutes "?" for unmappable characters
        if ($_ ne "?" && $conv eq "?") {
            $res .= unidecode($_);
        } else {
            $res .= $conv;
        }
    }
    print $res, "\n";
    __END__

Regards,
    Slaven

Thu Jun 23 05:54:16 2005 SREZIC [...] cpan.org - Correspondence added

From:

srezic [...] cpan.org

[SREZIC - Thu Oct 14 16:15:12 2004]:

> It would be nice if Text::Unidecode could decode into any arbitrary
> charset, not only ascii. A sample, probably inefficient implementation
> could look like this:
>

[...]

This is slightly better, as it does not require "?" being the substitution character:

    use Text::Unidecode;
    use Encode qw(encode);
    use charnames qw(:full);

    $tocharset = "iso-8859-1";
    $x = "\xfc\x{20ac}\N{HORIZONTAL ELLIPSIS}\N{LEFT DOUBLE QUOTATION MARK}";
    $res = "";
    for (split //, $x) {
        # FB_CROAK makes encode() die on an unmappable character
        my $conv = eval { encode($tocharset, $_, Encode::FB_CROAK) };
        if ($@) {
            $res .= unidecode($_);
        } else {
            $res .= $conv;
        }
    }
    print $res, "\n";
    __END__

Fri May 25 15:52:16 2007 cjm [...] cpan.org - Correspondence added

From:

perl [...] cjmweb.net

On Thu Jun 23 05:54:16 2005, SREZIC wrote:

> > It would be nice if Text::Unidecode could decode into any arbitrary
> > charset, not only ascii. A sample, probably inefficient implementation
> > could look like this:

I haven't benchmarked it, but I'll bet this is a lot faster:

    use Text::Unidecode;
    use Encode 2.12 qw(encode _utf8_off); # need v2.12 to support coderef
    use charnames qw(:full);

    my $tocharset = "iso-8859-1";
    my $x = "\xfc\x{20ac}\N{HORIZONTAL ELLIPSIS}\N{LEFT DOUBLE QUOTATION MARK}";
    my $res = encode($tocharset, $x, sub {
        # called with the ordinal of each character $tocharset cannot represent
        my $ascii = unidecode(chr $_[0]);
        _utf8_off($ascii);
        $ascii
    });
    print $res, "\n";
    __END__

Technically, it probably shouldn't be using _utf8_off, but I get weird results if I call encode there. I don't think it's reentrant.

Of course, all these versions assume that $tocharset is some form of extended ASCII.
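For comparison, here is a minimal sketch of a variant that avoids both _utf8_off and the extended-ASCII assumption, at the cost of going back to a per-character loop (so it is likely slower than the coderef version). The function name to_charset_sketch is hypothetical and is not part of Text::Unidecode. The idea is to run the ASCII fallback through encode() as well, in a separate call outside any CHECK callback, so that the fallback ends up in the target charset's own byte representation:

```perl
use strict;
use warnings;
use Encode qw(encode);
use Text::Unidecode;

# Hypothetical helper, not part of Text::Unidecode: encode $text into
# $charset, transliterating via unidecode() whatever cannot be mapped.
sub to_charset_sketch {
    my ($charset, $text) = @_;
    my $out = '';
    for my $char (split //, $text) {
        # With CHECK set, encode() may modify its source argument in place,
        # so hand it a copy.
        my $copy  = $char;
        my $bytes = eval { encode($charset, $copy, Encode::FB_CROAK()) };
        if (!defined $bytes) {
            # Unmappable: transliterate to ASCII first, then encode that
            # ASCII string into the target charset instead of assuming the
            # ASCII bytes are already valid there.
            $bytes = encode($charset, unidecode($char));
        }
        $out .= $bytes;
    }
    return $out;
}

my $res = to_charset_sketch("iso-8859-1", "\xfc\x{2026}");
print $res, "\n";
```

With a non-ASCII-based target such as "cp037" (EBCDIC), the transliterated characters should come out as EBCDIC bytes rather than raw ASCII. Multi-byte encodings that emit a byte-order mark (for example "UTF-16") would not work with a per-character loop like this one.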

Fri May 25 15:52:17 2007 The RT System itself - Status changed from 'new' to 'open'

Fri May 25 15:55:18 2007 cjm [...] cpan.org - Cc CJM added

Mon Jan 24 21:58:06 2011 DOHERTY [...] cpan.org - Correspondence added

On Thu Oct 14 16:15:12 2004, SREZIC wrote:

> It would be nice if Text::Unidecode could decode into any arbitrary
> charset, not only ascii.

Commit 0c93623 adds this feature via an exportable unidecode_to_charset function. It is implemented essentially the way perl@cjmweb.net did it.

https://github.com/doherty/Text-Unidecode/commit/0c93623e2c003646f9f62f585638c68f9f8de4a1