
This queue is for tickets about the Text-BibTeX CPAN distribution.

Report information
The Basics
Id: 61448
Status: resolved
Priority: 0/
Queue: Text-BibTeX

People
Owner: ambs [...] cpan.org
Requestors: PHILKIME [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in:
  • 0.40
  • 0.40_2
  • 0.40_3
  • 0.41
  • 0.42
  • 0.43
  • 0.44
  • 0.45
  • 0.46
Fixed in: (no value)



Subject: Very odd mangling of lowercase a with grave accent
For some reason, a lowercase a with grave accent (à) in a bib string is mangled when it appears on any line other than the first of a field that is split over several lines. This is highly odd and only appears to happen with lowercase a-grave. To reproduce, use the small code snippet below (it happens no matter which field this occurs in; I am using a preamble for simplicity). I tried all versions back to 0.40 and see the same issue. I don't think it has to do with the UTF-8 changes I made, as they were for names only. It doesn't seem to be in bt_postprocess either, as it happens before that. Essentially, it seems like every a-grave after the first newline is mangled (it looks like the second byte is turned into a space).

------
#!/usr/bin/perl
use Text::BibTeX;

$string = q|@PREAMBLE{"à
à"}|;
print $string, "\n\n";
my $entry = new Text::BibTeX::Entry;
$entry->parse_s($string);
$pstring = $entry->value;
print "$pstring\n";
Doing some debugging...

[ambs@rachmaninoff tmp]$ perl _.pl | od -cb
0000000   @   P   R   E   A   M   B   L   E   {   "   ?   ?  \n   ?   ?
        100 120 122 105 101 115 102 114 105 173 042 303 240 012 303 240
0000020   "   }  \n  \n   ?   ?       ?  \n
        042 175 012 012 303 240 040 303 012
0000031

So, à is the byte pair '303 240' (octal). After processing, the first à is printed correctly. The second loses its second byte.
No luck. It seems to be a problem in the parser...
On Sat Sep 18 09:49:35 2010, AMBS wrote:
> No luck. It seems to be a problem in the parser...
When void zzcr_attr (Attrib *a, int tok, char *txt) is called, txt is already borked.
Solved. This is a well-known issue with old code using isspace(), which on many modern systems gets its information from ctype.h. These are mostly broken and mis-classify specifically hex A0 (160) as a space. It was fixed in the Linux kernel in 2007 but most other OSes haven't caught up. The attached patch fixes it in btparse by not treating byte 160 as a space. If you search Google, you'll see that many other codebases have had to work around this specifically for byte 160. Byte 160 (0xA0) is not ASCII at all (it is the non-breaking space in Latin-1), and it is also the second byte of a-grave in UTF-8 .... I'll ask the maintainer to apply this patch and hopefully release 0.47.
Subject: lex_auxiliary.c.patch
--- Text-BibTeX-0.46/btparse/src/lex_auxiliary.c 2010-08-24 18:12:44.000000000 +0200
+++ lex_auxiliary.c 2010-09-18 16:24:01.000000000 +0200
@@ -870,12 +870,14 @@
       zzline++;
    }
-   /* standardize whitespace (convert all to space) */
+
+   /* standardize whitespace (convert all to space) but don't accept ascii 160
+      as space which most broken ctype.h do as this breaks lots of Unicode things */
    len = strlen (zzbegexpr);
    for (i = 0; i < len; i++)
    {
-      if (isspace (zzbegexpr[i]))
+      if (isspace (zzbegexpr[i]) && zzbegexpr[i] != 160)
          zzbegexpr[i] = ' ';
    }
0.47 on the way to cpan. Have fun :)