
This queue is for tickets about the Text-Unidecode CPAN distribution.

Report information
The Basics
Id: 8017
Status: open
Priority: 0/
Queue: Text-Unidecode

People
Owner: Nobody in particular
Requestors: SREZIC [...] cpan.org
Cc: cjm [...] cpan.org
AdminCc:

Bug Information
Severity: Wishlist
Broken in: (no value)
Fixed in: (no value)



Subject: Text::Unidecode for any charset
It would be nice if Text::Unidecode could decode into any arbitrary charset, not only ascii. A sample, probably inefficient implementation could look like this:

    use Text::Unidecode;
    use Encode qw(encode);
    use charnames qw(:full);

    $tocharset = "iso-8859-1";
    $x = "\xfc\x{20ac}\N{HORIZONTAL ELLIPSIS}\N{LEFT DOUBLE QUOTATION MARK}";
    $res = "";
    for (split //, $x) {
        my $conv = encode($tocharset, $_);
        if ($_ ne "?" && $conv eq "?") {
            $res .= unidecode($_);
        } else {
            $res .= $conv;
        }
    }
    print $res, "\n";
    __END__

Regards,
Slaven
From: srezic [...] cpan.org
[SREZIC - Thu Oct 14 16:15:12 2004]:
> It would be nice if Text::Unidecode could decode into any arbitrary
> charset, not only ascii. A sample, probably inefficient implementation
> could look like this:
>
[...]

This is slightly better, as it does not require "?" being the substitution character:

    use Text::Unidecode;
    use Encode qw(encode);
    use charnames qw(:full);

    $tocharset = "iso-8859-1";
    $x = "\xfc\x{20ac}\N{HORIZONTAL ELLIPSIS}\N{LEFT DOUBLE QUOTATION MARK}";
    $res = "";
    for (split //, $x) {
        my $conv = eval { encode($tocharset, $_, Encode::FB_CROAK) };
        if ($@) {
            $res .= unidecode($_);
        } else {
            $res .= $conv;
        }
    }
    print $res, "\n";
    __END__
From: perl [...] cjmweb.net
On Thu Jun 23 05:54:16 2005, SREZIC wrote:
> > It would be nice if Text::Unidecode could decode into any arbitrary
> > charset, not only ascii. A sample, probably inefficient implementation
> > could look like this:

I haven't benchmarked it, but I'll bet this is a lot faster:

    use Text::Unidecode;
    use Encode 2.12 qw(encode _utf8_off); # need v2.12 to support coderef
    use charnames qw(:full);

    my $tocharset = "iso-8859-1";
    my $x = "\xfc\x{20ac}\N{HORIZONTAL ELLIPSIS}\N{LEFT DOUBLE QUOTATION MARK}";
    my $res = encode($tocharset, $x, sub {
        my $ascii = unidecode(chr $_[0]);
        _utf8_off($ascii);
        $ascii
    });
    print $res, "\n";
    __END__

Technically, it probably shouldn't be using _utf8_off, but I get weird results if I call encode there. I don't think it's reentrant.

Of course, all these versions assume that $tocharset is some form of extended ASCII.
On Thu Oct 14 16:15:12 2004, SREZIC wrote:
> It would be nice if Text::Unidecode could decode into any arbitrary
> charset, not only ascii.

0c93623 adds this feature via exportable unidecode_to_charset. It is implemented basically as perl@cjmweb.net did it.

https://github.com/doherty/Text-Unidecode/commit/0c93623e2c003646f9f62f585638c68f9f8de4a1
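
For anyone who wants to try the coderef approach discussed in this ticket without installing anything, here is a minimal sketch that uses only core Encode (2.12 or later). The %fallback table and the helper name encode_with_fallback are hypothetical stand-ins for unidecode(), invented for illustration. This is not the code from the commit above.

```perl
use strict;
use warnings;
use Encode 2.12 qw(encode);   # a coderef as CHECK needs Encode 2.12+

# Hypothetical stand-in for unidecode(): a tiny ASCII replacement table.
my %fallback = (
    0x20AC => 'EUR',   # EURO SIGN
    0x2026 => '...',   # HORIZONTAL ELLIPSIS
    0x201C => '"',     # LEFT DOUBLE QUOTATION MARK
);

# Hypothetical helper: encode $text into $charset and let the callback
# supply a replacement for each character the charset cannot represent.
sub encode_with_fallback {
    my ($charset, $text) = @_;
    return encode($charset, $text, sub {
        my ($codepoint) = @_;   # Encode passes the ordinal of the character
        return exists $fallback{$codepoint} ? $fallback{$codepoint} : '?';
    });
}

my $res = encode_with_fallback('iso-8859-1',
    "\xfc\x{20ac}\x{2026}\x{201c}");
print $res, "\n";
```

Replacing the table lookup with unidecode(chr $codepoint) should give roughly the behaviour proposed in the thread. As noted above, that version also needed _utf8_off, and everything here still assumes the target charset is some form of extended ASCII.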