Subject: utf8 in MARC record not handled properly
Position 9 of the MARC leader allows you to define Unicode as the
character set being used in the record. Vendors such as OCLC are moving
towards UTF-8 rather than MARC-8 for character representation.
Currently MARC::Record uses length() to calculate directory offsets, and
substr() to extract fields from the record based on the directory
offsets. This works fine for MARC-8 character encodings, but breaks once
a character can occupy more than one byte: the directory counts bytes,
while length() and substr() count characters. A TODO test (utf8.t) that
illustrates the problem has been added to the test suite.
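The mismatch can be seen in a minimal sketch like the one below. It is
not taken from MARC::Record or from utf8.t; the field value is a made-up
example, and Encode's encode_utf8() is used here only to count the bytes
the string would occupy once written out as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical field data: "\x{012b}" (i with macron) is one character
# but two bytes when encoded as UTF-8.
my $field = "R\x{012b}ga";

# length() counts characters in a character string.
my $char_count = length($field);

# Encoding first gives the number of bytes a directory entry must record.
my $byte_count = length(encode_utf8($field));

printf "characters=%d bytes=%d\n", $char_count, $byte_count;
```

A directory entry built from the character count would be one byte short
for this field, and every later offset in the record would drift with it.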
On the positive side, Jarkko indicates that 5.8.1 will have
bytes::substr() to complement 5.8.0's bytes::length(). Appropriate use
of these will be able to ensure MARC::Record can handle utf8 in MARC
data. However, doing so will break backwards compatibility. Perhaps a
patch for a utf8-safe MARC::Record distro will be the way to go.
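As a rough sketch of what that could look like, assuming a Perl where
both bytes::length() and bytes::substr() are available: the two field
values and the variable names are invented for illustration, and this is
not the actual MARC::Record code.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length / bytes::substr without enabling the pragma

# Two hypothetical fields stored back to back, as in a record body.
my $first  = "R\x{012b}ga";   # 4 characters, 5 bytes in UTF-8
my $second = "Latvia";
my $data   = $first . $second;

# Directory-style offset and length, counted in bytes.
my $offset = bytes::length($first);
my $len    = bytes::length($second);

# Character-based extraction starts one character too late.
my $by_chars = substr($data, $offset, $len);

# Byte-based extraction lands on the start of the second field.
my $by_bytes = bytes::substr($data, $offset, $len);

print "substr:        $by_chars\n";
print "bytes::substr: $by_bytes\n";
```

Here substr() returns "atvia" while bytes::substr() returns "Latvia",
which is the behaviour the directory offsets rely on.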
--
From: Jarkko Hietaniemi <jhi@iki.fi>
To: ed-perluni@inkdroid.org, perl-unicode@perl.org
Subject: Re: bytes::substr() ?
Perl 5.8.1, whenever that happens, will have bytes::substr().
--
Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen