
This queue is for tickets about the MARC-Record CPAN distribution.

Report information
The Basics
Id: 70169
Status: open
Priority: 0/
Queue: MARC-Record

People
Owner: Nobody in particular
Requestors: m.e.phillips [...] durham.ac.uk
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: MARC::File::USMARC gets tripped up if fields contain 0x1D
Date: Tue, 9 Aug 2011 15:12:24 +0100
To: <bug-MARC-Record [...] rt.cpan.org>
From: "PHILLIPS M.E." <m.e.phillips [...] durham.ac.uk>
I have been using the MARC::Record Perl module to process some MARC records exported from Millennium. For some reason, a few records actually have the character 0x1D as part of field values, not just as an end of record marker. These can occur because Millennium extends the multi-byte character encoding of CJK to allow arbitrary 16-bit Unicode characters to appear. We mainly see this with directional quotes pasted into our records by cataloguers.

Anyhow, MARC::File::USMARC gets tripped up by this because in "sub next" the record is read by setting $/ to 0x1D and reading a "line" from the file:

    local $/ = END_OF_RECORD;
    my $usmarc = <$fh>;

I found that by replacing those two lines with the following I was able to overcome the problem:

    my $length;
    read($fh, $length, 5) || return;
    return unless $length>=5;
    my $record;
    read($fh, $record, $length-5) || return;
    my $usmarc = $length.$record;

This works by reading the first five bytes of the record, which signify the record length, and then reading the remaining number of bytes as stipulated by the record length.

Perhaps you might consider incorporating this change into the next version of MARC::File::USMARC?

Other than this minor niggle, I find the MARC::Record module to be a really powerful tool: great stuff!

Matthew

--
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941
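For reference, a minimal standalone sketch of the same length-prefixed read applied outside the module; the file name and the use of MARC::Record->new_from_usmarc to decode each blob are illustrative choices, not part of the suggested patch:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use MARC::Record;

    # Read each record by trusting Leader/00-04 (the five-digit record
    # length) rather than scanning for the 0x1D record terminator, then
    # hand the raw blob to MARC::Record to decode.
    open my $fh, '<:raw', 'records.mrc' or die "records.mrc: $!";

    while (1) {
        my $length;
        read($fh, $length, 5) or last;
        last unless $length =~ /\A\d{5}\z/;   # Leader/00-04 must be digits

        my $rest;
        read($fh, $rest, $length - 5) or last;

        my $record = MARC::Record->new_from_usmarc($length . $rest);
        next unless $record;
        print $record->title() // '(no 245)', "\n";
    }

    close $fh;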
Hi,

On Tue Aug 09 10:12:50 2011, m.e.phillips@durham.ac.uk wrote:
> I have been using the MARC::Record Perl module to process some MARC
> records exported from Millennium. For some reason, a few records
> actually have the character 0x1D as part of field values, not just as an
> end of record marker. These can occur because Millennium extends the
> multi-byte character encoding of CJK to allow arbitrary 16-bit Unicode
> characters to appear. We mainly see this with directional quotes pasted
> into our records by cataloguers.

Could you attach such a record for use as a test case? I also maintain MARC::Charset, so I'm also interested in the III character encoding extensions in general.
> This works by reading the first five bytes of the record, which signify
> the record length, and then reading the remaining number of bytes as
> stipulated by the record length.
>
> Perhaps you might consider incorporating this change into the next
> version of MARC::File::USMARC?

Yes, though there will need to be a switch controlling how MARC::File::USMARC slurps records, since unfortunately there are plenty of MARC records in the wild whose Leader/00-04 is not trustworthy but where splitting on \x1D and (loosely) parsing the record can be made to work.
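One possible shape for such a switch, sketched as a standalone helper rather than actual MARC::File::USMARC code; the flag name and subroutine are illustrative only, not an existing interface:

    # Choose between the historical "treat 0x1D as the separator" slurp
    # and the length-prefixed read proposed in the report above.
    our $TRUST_LEADER_LENGTH = 0;    # hypothetical flag

    sub _read_raw_record {
        my ($fh) = @_;

        if ($TRUST_LEADER_LENGTH) {
            my ($length, $rest);
            read($fh, $length, 5) or return;
            return unless $length =~ /\A\d{5}\z/;   # Leader/00-04 must be digits
            read($fh, $rest, $length - 5) or return;
            return $length . $rest;
        }

        # Fallback for records whose Leader/00-04 cannot be trusted:
        # split the input on the record terminator, as the module does now.
        local $/ = "\x1d";
        return scalar <$fh>;
    }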
> Other than this minor niggle, I find the MARC::Record module to be a really powerful tool: great stuff!
Thanks!
Subject: RE: [rt.cpan.org #70169] MARC::File::USMARC gets tripped up if fields contain 0x1D
Date: Wed, 10 Aug 2011 16:52:16 +0100
To: <bug-MARC-Record [...] rt.cpan.org>
From: "PHILLIPS M.E." <m.e.phillips [...] durham.ac.uk>
> Could you attach such a record for use as a test case? I also maintain
> MARC::Charset, so I'm also interested in the III character encoding
> extensions in general.
I've attached a zip file, clergy.zip, which contains clergy.out, a file with a single unblocked MARC record output from Millennium. The record can be seen on our OPAC at http://library.dur.ac.uk/record=b2660297~S1

Rather than hunt for a record containing a 0x1d in the field data I have cheated by doctoring this record. The 0x1d appears as part of the closing double quotes round the words "Online Journal" in the 520 note field. Here is an excerpt using hexdump -C:

    000008a0 28 42 4f 6e 6c 69 6e 65 20 4a 6f 75 72 6e 61 6c |(BOnline Journal|
    000008b0 1b 24 31 7f 20 1d 1b 28 42 20 63 6f 6e 74 61 69 |.$1. ..(B contai|

It appears that Millennium subverts the CJK character set in order to put 16-bit Unicode characters into the records. The sequence 1b 24 31 7f 20 1d 1b 28 42 equates to:

    1b 24 31 = set G0 to CJK character set
    7f 20 1d = invalid CJK code, made up of 7f followed by 20 1d (big-endian UTF-16 code)
    1b 28 42 = set G0 to Basic Latin (ASCII)
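As a quick illustration of that byte layout, a small sketch that extracts the embedded UTF-16 value from the example sequence above; the pattern is based on this single example, not a general MARC-8 decoder:

    use strict;
    use warnings;

    binmode STDOUT, ':encoding(UTF-8)';

    # The doctored sequence from the hexdump above:
    # ESC $ 1 (G0 -> CJK), 0x7F + big-endian UTF-16, ESC ( B (G0 -> ASCII).
    my $bytes = "\x1b\x24\x31\x7f\x20\x1d\x1b\x28\x42";

    if ($bytes =~ /\x1b\x24\x31 \x7f (..) \x1b\x28\x42/xs) {
        my $code = unpack 'n', $1;    # big-endian 16-bit value
        printf "embedded code point: U+%04X (%s)\n", $code, chr $code;
        # prints U+201D, the right double quotation mark
    }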
> Yes, though there will need to be a switch controlling how
> MARC::File::USMARC slurps records, since unfortunately there are plenty
> of MARC records in the wild whose Leader/00-04 is not trustworthy but
> where splitting on \x1D and (loosely) parsing the record can be made to
> work.
Yes, I'd forgotten that problem, which I have met before! Another approach would be to check the record length by examining the directory, which has to be pretty accurate in order to parse the fields at all.

Incidentally, could I contact you via e-mail to ask one or two questions about MARC::Charset as I am a bit puzzled by the implementation in one or two places. Is your gmail address as shown on CPAN the best way?

Matthew
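A rough sketch of the directory-based length check Matthew suggests, assuming a raw MARC21 record string and taking the per-entry length and offset values at face value:

    # Leader = 24 bytes; Leader/12-16 = base address of data; directory
    # entries are 12 bytes each (tag 3, field length 4, starting offset 5)
    # and the directory is closed by a field terminator (0x1E).
    sub length_from_directory {
        my ($raw) = @_;

        my $base    = substr($raw, 12, 5);
        my $dir_end = index($raw, "\x1e");
        return unless $dir_end > 24;

        my $data_end = 0;
        for (my $i = 24; $i + 12 <= $dir_end; $i += 12) {
            my $len    = substr($raw, $i + 3, 4);
            my $offset = substr($raw, $i + 7, 5);
            $data_end = $offset + $len if $offset + $len > $data_end;
        }

        return $base + $data_end + 1;   # +1 for the record terminator (0x1D)
    }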
Attachment: clergy.zip (application/x-zip-compressed, 1.3k)

Subject: Re: [rt.cpan.org #70169] MARC::File::USMARC gets tripped up if fields contain 0x1D
Date: Wed, 10 Aug 2011 12:45:57 -0400
To: bug-MARC-Record [...] rt.cpan.org
From: Galen Charlton <gmcharlt [...] gmail.com>
Hi,

On Wed, Aug 10, 2011 at 11:52 AM, PHILLIPS M.E. via RT <bug-MARC-Record@rt.cpan.org> wrote:
>       Queue: MARC-Record
>  Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=70169 >
>
> I've attached a zip file, clergy.zip, which contains clergy.out, a file
> with a single unblocked MARC record output from Millennium.  The record
> can be seen on our OPAC at http://library.dur.ac.uk/record=b2660297~S1
Thanks for supplying the example and the additional information regarding III's hack of MARC-8.
> Yes, I'd forgotten that problem, which I have met before!  Another approach would be to check the record length by examining the directory, which has to be pretty accurate in order to parse the fields at all.
You'd be surprised. I've run into cases where the length and offset values in the directory were completely wrong, but as long as the number of directory entries corresponds to the number of field terminator characters, I've been able to successfully parse such records. Might be worth adding a parsing mode to MARC::File::USMARC to support that, not that encouraging such sloppy MARC records is a good idea. :)
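A sketch of that recovery mode, assuming a raw record string: ignore the directory's length and offset values entirely and simply pair each directory tag with the corresponding field-terminator-delimited chunk, giving up if the counts disagree.

    sub loose_field_pairs {
        my ($raw) = @_;

        # Everything before the first field terminator is leader + directory.
        my ($head, @chunks) = split /\x1e/, $raw;
        pop @chunks if @chunks and $chunks[-1] =~ /\A\x1d/;   # drop the record terminator

        # Keep only the 3-byte tag from each 12-byte directory entry.
        my @tags = substr($head, 24) =~ /(.{3}).{9}/gs;

        return unless @tags == @chunks;   # the sanity check described above
        return map { [ $tags[$_], $chunks[$_] ] } 0 .. $#tags;
    }

The returned chunks still carry indicators and subfield delimiters (0x1F), which a caller would then need to split apart.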
> Incidentally, could I contact you via e-mail to ask one or two questions about MARC::Charset as I am a bit puzzled by the implementation in one or two places.  Is your gmail address as shown on CPAN the best way?
Yes, it is.

Regards,

Galen

--
Galen Charlton
gmcharlt@gmail.com