Bug #61448 for Text-BibTeX: Very odd mangling of lowercase a with grave accent

Sat Sep 18 08:53:55 2010 PHILKIME [...] cpan.org - Ticket created

Subject:

Very odd mangling of lowercase a with grave accent

For some reason, a lowercase a with grave accent (à) in a bib string is mangled when it appears on any line other than the first of a field that is split over several lines. This is highly odd and only appears to happen with lowercase a-grave. To reproduce, use the small code snippet below (it happens no matter which field this occurs in; I am using a preamble for simplicity). I tried all versions back to 0.40 and see the same issue. I don't think it has to do with the UTF-8 changes I made, as they were for names only. It doesn't seem to be in bt_postprocess either, as it happens before that. Essentially, it seems like every a-grave after the first newline is mangled (it looks like the second byte is turned into a space).

------
#!/usr/bin/perl
use Text::BibTeX;

$string = q|@PREAMBLE{"à
à"}|;
print $string, "\n\n";
my $entry = new Text::BibTeX::Entry;
$entry->parse_s($string);
$pstring = $entry->value;
print "$pstring\n";

Sat Sep 18 09:11:29 2010 ambs [...] cpan.org - Correspondence added

Doing some debugging...

[ambs@rachmaninoff tmp]$ perl _.pl | od -cb
0000000   @   P   R   E   A   M   B   L   E   {   "   ?   ?  \n   ?   ?
        100 120 122 105 101 115 102 114 105 173 042 303 240 012 303 240
0000020   "   }  \n  \n   ?   ?       ?  \n
        042 175 012 012 303 240 040 303 012
0000031

So à is the byte pair '303 240' (octal). After processing, the first à is printed correctly; the second loses its second byte.
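To make the byte values in the dump easier to check, here is a minimal sketch of how à (U+00E0) ends up as the octal pair 303 240 in UTF-8. The function name `utf8_encode_small` is made up for this illustration and is not part of btparse; it only handles code points below U+0800.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of btparse: encode a code point below
   U+0800 as UTF-8. Returns the number of bytes written (1 or 2), or 0
   if the code point is outside the range this sketch handles. */
size_t utf8_encode_small(uint32_t cp, unsigned char out[2])
{
    assert(out != NULL);

    if (cp < 0x80) {
        /* ASCII range: a single byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* Two-byte form: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;
}
```

Called with 0xE0, this writes 0xC3 0xA0, which is 303 240 in octal. The second byte, 0xA0, is the one that goes missing in the output above.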

Sat Sep 18 09:11:29 2010 The RT System itself - Status changed from 'new' to 'open'

Sat Sep 18 09:49:35 2010 ambs [...] cpan.org - Correspondence added

No luck. It seems to be a problem in the parser...

Sat Sep 18 10:05:37 2010 ambs [...] cpan.org - Correspondence added

On Sat Sep 18 09:49:35 2010, AMBS wrote:

> No luck. It seems to be a problem in the parser...

When void zzcr_attr (Attrib *a, int tok, char *txt) is called, txt is already borked.

Sat Sep 18 10:29:43 2010 PHILKIME [...] cpan.org - Correspondence added

Solved. This is a well-known issue with old code using isspace(), which on many modern systems gets its information from ctype.h. Many of these implementations are broken and misclassify byte 0xA0 (160) specifically as a space. It was fixed in the Linux kernel in 2007 but most other OSes haven't caught up. The attached patch fixes it in btparse by not treating byte 160 as a space. If you search Google, you'll see that many other codebases have had to work around this specifically for byte 160. Byte 160 (0xA0) is the second byte of a-grave in UTF-8 ... I'll ask the maintainer to apply this patch and hopefully release 0.47.
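The effect described above can be reproduced deterministically with a small sketch. `broken_isspace` is a hypothetical stand-in for a platform isspace() that reports 0xA0 as whitespace, and `standardize_whitespace_buggy` is an illustrative name, not the btparse function. Unlike the real lexer, which only mangled text after the first newline, this sketch rewrites the whole buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an isspace() that wrongly classifies
   byte 0xA0 as whitespace, in addition to the ASCII whitespace set. */
static int broken_isspace(unsigned char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r') || c == 0xA0;
}

/* Replace every byte the classifier calls whitespace with a plain
   space, in the spirit of a "standardize whitespace" pass. */
void standardize_whitespace_buggy(char *s)
{
    size_t i, len;

    assert(s != NULL);
    len = strlen(s);
    for (i = 0; i < len; i++) {
        if (broken_isspace((unsigned char)s[i]))
            s[i] = ' ';
    }
}
```

Given the UTF-8 bytes C3 A0 0A C3 A0 (à, newline, à), this produces C3 20 20 C3 20: each lone C3 is no longer a valid character, which matches the kind of damage seen in the od dump.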

Subject:

lex_auxiliary.c.patch

--- Text-BibTeX-0.46/btparse/src/lex_auxiliary.c  2010-08-24 18:12:44.000000000 +0200
+++ lex_auxiliary.c  2010-09-18 16:24:01.000000000 +0200
@@ -870,12 +870,14 @@
       zzline++;
    }
-   /* standardize whitespace (convert all to space) */
+
+   /* standardize whitespace (convert all to space) but don't accept ascii 160
+      as space which most broken ctype.h do as this breaks lots of Unicode things */
    len = strlen (zzbegexpr);
    for (i = 0; i < len; i++)
    {
-      if (isspace (zzbegexpr[i]))
+      if (isspace (zzbegexpr[i]) && zzbegexpr[i] != 160)
          zzbegexpr[i] = ' ';
    }
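For readers adapting the same guard elsewhere, here is a standalone sketch of the idea, not the code shipped in btparse. It assumes the buffer is a plain `char *`. On platforms where `char` is signed, byte 0xA0 reads back as a negative value, so comparing it directly with 160 never matches, and the C standard only defines isspace() for values representable as `unsigned char` (or EOF). The sketch therefore converts each byte to `unsigned char` first. The name `standardize_whitespace_guarded` is made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Convert whitespace to plain spaces, but never touch byte 0xA0,
   even if the platform's isspace() reports it as whitespace. */
void standardize_whitespace_guarded(char *s)
{
    size_t i, len;

    assert(s != NULL);
    len = strlen(s);
    for (i = 0; i < len; i++) {
        /* Work on the unsigned value so that 0xA0 compares as 160
           and isspace() receives an argument it is defined for. */
        unsigned char c = (unsigned char)s[i];

        if (isspace(c) && c != 0xA0)
            s[i] = ' ';
    }
}
```

With the bytes C3 A0 0A C3 A0 as input, only the newline is replaced, giving C3 A0 20 C3 A0, so both à characters survive.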

Sat Sep 18 10:29:45 2010 PHILKIME [...] cpan.org - Status changed from 'open' to 'resolved'

Sat Sep 18 10:35:42 2010 ambs [...] cpan.org - Taken

Sat Sep 18 10:39:06 2010 ambs [...] cpan.org - Correspondence added

0.47 on the way to cpan. Have fun :)

Sat Sep 18 10:39:08 2010 The RT System itself - Status changed from 'resolved' to 'open'

Sat Sep 18 10:39:17 2010 ambs [...] cpan.org - Status changed from 'open' to 'resolved'