Bug #2165 for MARC-Record: Incorrect record/field lengths for records with UTF-8 characters

Fri Feb 28 09:20:05 2003 Guest - Ticket created

Subject:

Incorrect record/field lengths for records with UTF-8 characters

I am using MARC::Record together with MARC::Charset to convert data to the UTF-8 character set. The record written out by MARC::Record, however, has an incorrect record length in the leader and incorrect field lengths in the directory for any field containing multi-byte UTF-8 characters. A comparison of the records with MARC-8 data and UTF-8 data shows that the leader and directory are exactly the same. Thus, it appears that MARC::Record is counting *characters*, not *bytes*, in the record output. (I am currently working only with records having ASCII/ANSEL data, so there is a direct one-to-one correspondence between the characters in the two records.)

The latest version of marcdump (from MARC::Record 1.20) does allow the record to be printed, but with some fields truncated because of the incorrect lengths.

Richard A. Lammert
Technical Services Librarian
Concordia Theological Seminary
6600 N. Clinton St.
Fort Wayne, IN 46825-4998
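To illustrate the character/byte distinction described above, here is a minimal sketch in Python. It is not MARC::Record code; the sample string and the helper `directory_entry` are hypothetical, and the only facts assumed about the format are that a directory entry is 12 characters (3-character tag, 4-digit field length, 5-digit starting position) and that the field length includes the one-byte field terminator.

```python
# Illustration only: this is not how MARC::Record is implemented.
# "ä" is a single character but occupies two bytes in UTF-8.
field_data = "Universit\u00e4t"

char_len = len(field_data)                  # counts characters
byte_len = len(field_data.encode("utf-8"))  # counts bytes, as the directory requires


def directory_entry(tag: str, data: str, start: int) -> str:
    """Hypothetical helper: build one 12-character directory entry.

    Layout: tag (3) + field length in bytes (4) + starting position (5).
    The length includes the one-byte field terminator.
    """
    length = len(data.encode("utf-8")) + 1  # +1 for the field terminator
    return f"{tag}{length:04d}{start:05d}"


print(char_len, byte_len)
print(directory_entry("245", field_data, 0))
```

A writer that used `char_len` here would record a length one byte too short for this field, which matches the truncation seen in marcdump.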

Mon Mar 10 15:34:34 2003 lammertra [...] mail.ctsfw.edu - Correspondence added

From:	"Lammert, Richard" <lammertra [...] mail.ctsfw.edu>
To:	bug-MARC-Record [...] rt.cpan.org
Subject:	RE: [cpan #2165]: Incorrect record/field lengths for records with UTF-8 characters
Date:	Mon, 10 Mar 2003 15:33:23 -0500
RT-Send-Cc:

Additional information on MARC::Record and UTF-8 characters (what works and what doesn't):

The procedure I was using to produce UTF-8 encoded MARC records was to go through a batch of MARC records, passing each subfield to to_utf8() of MARC::Charset to get the UTF-8 encoding, and replacing each field of the record as the subfields were finished. The result, as I mentioned, was an incorrect record length for any record with characters outside the ASCII range, and incorrect field lengths for any field with characters outside the ASCII range.

Trying something different, I started with a text format (OCLC-like) of the MARC records. I passed each line of the text file to to_utf8() to produce perfectly formatted UTF-8 text. I then wrote a Perl program to read in the lines, making subfields and fields of the lines, and appending the resulting fields to a new MARC record. The result: a MARC record with no errors. All the lengths are recorded properly in the leader/directory.

Thinking that perhaps a subtle difference between replacing and appending fields resulted in the incorrect lengths, I revised the original program that read through a MARC record: instead of replacing the field with the content converted to UTF-8 encoding, I deleted the original field and appended the converted field. The result: a MARC record with incorrect lengths, as before.

Thus far stands my research on the issue.

Richard

Rev. Richard A. Lammert           e-mail: lammertra@mail.ctsfw.edu
Technical Services Librarian      mail: 6600 N. Clinton St.
Walther Library                   Fort Wayne, IN 46825-4996
Concordia Theological Seminary    phone: 260-452-3148
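The kind of length error reported in this ticket can be detected mechanically. The following Python sketch is an illustration under stated assumptions, not part of MARC::Record or marcdump: `build_record` and `check_lengths` are hypothetical helpers, the leader values other than the lengths are placeholders, and the record layout assumed is the standard one (24-byte leader with the record length in bytes 0-4 and the base address of data in bytes 12-16, followed by 12-byte directory entries).

```python
FT = "\x1e"  # field terminator
RT = "\x1d"  # record terminator


def build_record(fields, count_bytes=True):
    """Hypothetical writer. fields is a list of (tag, text) pairs.

    With count_bytes=False it measures characters instead of bytes,
    reproducing the kind of error described in this ticket.
    """
    measure = (lambda s: len(s.encode("utf-8"))) if count_bytes else len
    directory, body, start = "", "", 0
    for tag, text in fields:
        data = text + FT
        length = measure(data)
        directory += f"{tag}{length:04d}{start:05d}"
        body += data
        start += length
    directory += FT
    base = 24 + len(directory)  # the directory is pure ASCII
    tail = directory + body + RT
    total = 24 + measure(tail)
    leader = f"{total:05d}nam a22{base:05d} a 4500"  # placeholder leader values
    return (leader + tail).encode("utf-8")


def check_lengths(raw: bytes) -> list:
    """Compare the declared lengths with the actual bytes of the record."""
    problems = []
    declared = int(raw[0:5])
    if declared != len(raw):
        problems.append(f"leader says {declared} bytes, record has {len(raw)}")
    base = int(raw[12:17])
    directory = raw[24:base - 1]  # drop the directory's own terminator
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length, start = int(entry[3:7]), int(entry[7:12])
        field = raw[base + start:base + start + length]
        # A correct slice ends with exactly one terminator, at the last byte.
        if not field.endswith(b"\x1e") or b"\x1e" in field[:-1]:
            problems.append(f"field {tag}: length/offset does not match data")
    return problems


sample = [("245", "Universit\u00e4t"), ("100", "Lammert")]
good = build_record(sample, count_bytes=True)
bad = build_record(sample, count_bytes=False)
print(check_lengths(good))
print(check_lengths(bad))
```

Note that one mis-measured field shifts the starting position of every field after it, so a single non-ASCII character can make the checker flag later ASCII-only fields as well.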

Thu May 01 12:43:57 2003 esummers [...] cpan.org - Status changed from 'new' to 'resolved'