Subject: MARC::File::USMARC gets tripped up if fields contain 0x1D
Date: Tue, 9 Aug 2011 15:12:24 +0100
To: <bug-MARC-Record [...] rt.cpan.org>
From: "PHILLIPS M.E." <m.e.phillips [...] durham.ac.uk>
I have been using the MARC::Record Perl module to process some MARC
records exported from Millennium. For some reason, a few records
actually contain the character 0x1D within field values, not just as an
end-of-record marker. These can occur because Millennium extends the
multi-byte CJK character encoding to allow arbitrary 16-bit Unicode
characters to appear. We mainly see this with directional quotes pasted
into our records by cataloguers.
Anyhow, MARC::File::USMARC gets tripped up by this because in "sub next"
the record is read by setting $/ to 0x1D and reading a "line" from the
file:
local $/ = END_OF_RECORD;
my $usmarc = <$fh>;
I found that by replacing those two lines with the following I was able
to overcome the problem:
my $length;
read($fh, $length, 5) || return;        # first five bytes of the leader: record length
return unless $length >= 5;
my $record;
read($fh, $record, $length - 5) || return;  # read the rest of the record
my $usmarc = $length . $record;
This works by reading the first five bytes of the record, which hold the
record length, and then reading exactly the number of remaining bytes
that the length stipulates, so any stray 0x1D inside the record is
carried along rather than treated as a terminator.
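To illustrate the difference, here is a small self-contained sketch
(not part of MARC::File::USMARC itself; the five-byte length prefix and
the stray 0x1D are made up for the demonstration). Reading with $/ set
to 0x1D stops at the embedded byte, while the length-prefixed read
recovers the whole record:

```perl
use strict;
use warnings;

# Hypothetical record: a five-byte length field, then a body containing
# a stray 0x1D as well as the terminating 0x1D.
my $body = "stray\x1Dbyte\x1D";
my $data = sprintf("%05d", 5 + length $body) . $body;

# Reading a "line" with $/ set to 0x1D stops at the embedded byte...
open my $fh, '<', \$data or die $!;
my $short;
{
    local $/ = "\x1D";
    $short = <$fh>;
}
print "line read:   ", length($short), " bytes\n";   # 11 bytes, truncated

# ...whereas the length-prefixed read recovers the record intact.
open $fh, '<', \$data or die $!;
my $len;
read($fh, $len, 5) or die "no record";
my $record;
read($fh, $record, $len - 5);
my $usmarc = $len . $record;
print "length read: ", length($usmarc), " bytes\n";  # 16 bytes, intact
```

The embedded 0x1D survives the second read because the loop never treats
it as a separator; only the byte count from the leader matters.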
Perhaps you might consider incorporating this change into the next
version of MARC::File::USMARC?
Other than this minor niggle, I find the MARC::Record module to be a
really powerful tool: great stuff!
Matthew
--
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941