Bug #92864 for Text-BibTeX: Incorrect application of islower() to UTF-8 characters in btparse

Sun Feb 09 13:28:09 2014 broker5 [...] rambler.ru - Ticket created

Subject: Incorrect application of islower() to UTF-8 characters in btparse

OS: Windows XP SP3
Perl: ActivePerl 5.16.3.1603
Text-BibTeX: 0.66
Biber: 1.8
Biblatex: 2.8a
TeX Live 2013

Biber is a backend bibliography processor for biblatex (a LaTeX bibliography package) that uses Text::BibTeX to process .bib files. Biber's manual declares support for UTF-8 in .bib files. I use Biber with biblatex and have a .bib file with some author names in Russian, encoded in UTF-8. Occasionally Biber is unable to figure out the author's last name from a value of the form "Иванов, И. И." when it is in Russian, and emits a warning: "Couldn't determine Last Name for name "Иванов, И. И." - ignoring name".

I investigated the problem a bit and the results are as follows. Biber constructs a Text::BibTeX::Name object to split names into parts (first name, last name, von, jr). Text::BibTeX::Name delegates this task to the btparse library (the bt_split_name() function), which is written in C and shipped with Text::BibTeX. bt_split_name() eventually calls find_lc_tokens() (both functions are defined in btparse/src/names.c). To determine whether a token starts with a lowercase character, find_lc_tokens() calls islower(token[0]), where islower() is the function from the standard ctype.h.

Since the token has type char*, this slices multibyte UTF-8 characters. For example, Russian characters are two bytes long in UTF-8, so only the first byte of the first character is passed to islower(). For Cyrillic this byte is in the range 0xD0-0xD3, which corresponds to negative signed char values. Moreover, even for single-byte characters a char should be cast to unsigned char before it is passed to islower(), since the C standard requires the argument to be either representable as unsigned char or equal to EOF.

In my case islower() uses the CP-1251 encoding (the default Russian encoding in Windows). Therefore it will be false for all Cyrillic UTF-8 characters, since 0xD0-0xD3 are uppercase letters in CP-1251. Furthermore, if 0xD0-0xD3 are passed to islower() without a prior cast to unsigned char, the behavior is undefined (at least in Windows), since these are negative values.

So, I think islower(token[0]) isn't correct due to:
a) slicing of multibyte characters
b) possibly negative values passed to islower()

I'm not particularly familiar with the code, so maybe I got something wrong.

Thanks in advance,
Kirill Pushkaryov
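Point (b) above can be illustrated with a minimal sketch. The helper name starts_with_ascii_lower() is hypothetical and is not part of btparse; it only shows the unsigned char cast that keeps the argument of islower() in the permitted range, and it deliberately gives up on bytes that belong to a multibyte UTF-8 sequence rather than guessing their case.

```c
#include <ctype.h>

/* Hypothetical helper, not part of btparse. Returns 1 if the first byte of
   token is a lowercase letter according to islower(), 0 otherwise. */
int starts_with_ascii_lower(const char *token)
{
    /* Cast first: a plain char may be signed, and a negative value other
       than EOF is not a valid argument for islower(). */
    unsigned char first = (unsigned char)token[0];

    /* Bytes >= 0x80 are part of a multibyte UTF-8 sequence; one byte alone
       says nothing about the case of the character it belongs to. */
    if (first >= 0x80)
        return 0;

    return islower(first) != 0;
}
```

With this sketch a token such as "von" is reported as lowercase, while a token starting with the bytes 0xD0 0x98 ("И") is simply reported as not lowercase; handling point (a) properly requires decoding the whole character.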

Wed Feb 19 16:41:59 2014 PHILKIME [...] cpan.org - Correspondence added

On Sun Feb 09 13:28:09 2014, broker5@rambler.ru wrote:

> OS: Windows XP SP3
> Perl: ActivePerl 5.16.3.1603
> Text-BibTeX: 0.66
> Biber: 1.8
> Biblatex: 2.8a
> TeX Live 2013

Thanks - I'll look into this. I put some UTF-8 handling into btparse for biber so it could generate initials from names but I didn't cover this. Perhaps you can open a case on the biber github project page with a small example demonstrating the problem?

Wed Feb 19 16:41:59 2014 The RT System itself - Status changed from 'new' to 'open'

Thu Feb 20 06:57:43 2014 PHILKIME [...] cpan.org - Correspondence added

I'll need a MWE for this to test as when I try this with "AUTHOR = {Иванов, И. И.}" I have no problems?

Sat Feb 22 17:52:27 2014 broker5 [...] rambler.ru - Correspondence added

On Thu Feb 20 06:57:43 2014, PHILKIME wrote:

> I'll need a MWE for this to test as when I try this with "AUTHOR =
> {Иванов, И. И.}" I have no problems?

Unfortunately, it's hard to produce a clear MWE, because in my case the problem appears or disappears depending on the state of the environment variables (not particular ones, but arbitrary ones). That is, I add a variable with a random name and value and the problem appears; I add another variable and it disappears.

Debugging revealed that, for invalid arguments, the result returned by islower() to find_lc_tokens() depends on the environment. And that is exactly what happens when btparse processes Cyrillic characters. For example, "И" (Cyrillic capital letter I) is encoded as 0xD0 0x98 in UTF-8. Btparse severs the first byte 0xD0 (represented as a signed char, -0x30) and calls islower(-0x30), which results in undefined behavior. MSDN states (http://msdn.microsoft.com/en-us/library/1z2s6by9.aspx): "The behavior of islower and _islower_l is undefined if c is not EOF or in the range 0 through 0xFF, inclusive. When a debug CRT library is used and c is not one of these values, the functions raise an assertion."

So, this code in btparse just can't work for multibyte characters. For simple names without a von part this fact may be masked by the undefined behavior of islower(), so that the result is occasionally right.

Could you try "author = {фон дер Иванов, И. И.}" and see whether the von part is properly extracted by Text::BibTeX::Name? For me it's not; it's included in the last name instead.
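To make the byte-level picture concrete, here is a minimal sketch of decoding the first UTF-8 sequence of a token into a code point before any case test is applied. The function utf8_decode_first() is a hypothetical name, not btparse code, and it assumes reasonably well-formed input (it does not reject overlong encodings or surrogates).

```c
#include <stddef.h>

/* Hypothetical helper, not part of btparse. Decodes the first UTF-8
   sequence in s into *cp and returns the number of bytes consumed,
   or 0 if the sequence is malformed. */
size_t utf8_decode_first(const char *s, unsigned long *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;

    if (p[0] < 0x80)                { *cp = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; len = 4; }
    else return 0; /* stray continuation byte or invalid lead byte */

    for (i = 1; i < len; i++) {
        /* Every following byte must look like 10xxxxxx; this check also
           stops at the terminating NUL of a truncated string. */
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return len;
}
```

For the bytes 0xD0 0x98 this yields the code point U+0418 and a length of 2, so the lead byte 0xD0 on its own never identifies the letter.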

Wed Feb 26 06:10:45 2014 PHILKIME [...] cpan.org - Correspondence added

I notice you are using Windows XP. I think there are two problems here.

I also have strange UTF-8 issues on Windows XP with btparse which don't occur on any later Windows OS. This shows up mainly in the btparse test suite and the biber test suite, which fail on UTF-8 name tests (sometimes a bit randomly, as you say). There is not much to be done about this - Windows XP UTF-8 handling in general is a bit flaky. All tests pass fine on, for example, Windows 7 and 8.

However, I think you raise a good point about islower - this is a different problem and is related to the lack of complete UTF-8 support in btparse. I added some support for generating initials from names but tackling islower is harder. I will see what I can do. I don't have the time to try to write a new btparse with real UTF-8 support unfortunately ...

Thu Feb 27 15:44:42 2014 PHILKIME [...] cpan.org - Correspondence added 1200 min

I'm happy to say that 0.68 should fix this. I have replaced islower() with a custom isulower() which detects about 1700+ glyphs in Unicode 6.2.0 with the LOWERCASE property. This seems to fix the Windows XP instability and also your example now correctly splits into the prefix parts in my tests (your example is now part of the test suite for the module). It should be on CPAN soonish but if you can't wait, you can pull the "v0.68" tag from github and build it from there.
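For readers who want a feel for what a Unicode-aware lowercase test involves, here is a deliberately tiny sketch. It is not the isulower() shipped in Text-BibTeX 0.69, which according to the message above covers far more characters; the function name is_lower_codepoint() and the choice of ranges (ASCII plus the basic Cyrillic small letters) are assumptions made purely for illustration.

```c
/* Hypothetical, illustration-only stand-in for a Unicode-aware lowercase
   test. It takes an already decoded code point, not a byte. */
int is_lower_codepoint(unsigned long cp)
{
    /* ASCII a..z */
    if (cp >= 0x61 && cp <= 0x7A)
        return 1;

    /* Cyrillic small letters U+0430..U+045F */
    if (cp >= 0x0430 && cp <= 0x045F)
        return 1;

    return 0;
}
```

Under these assumptions the first letter of "фон" (U+0444) counts as lowercase and the first letter of "Иванов" (U+0418) does not, which is the distinction needed to recognise a von part.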

Thu Feb 27 15:44:43 2014 PHILKIME [...] cpan.org - Taken

Thu Feb 27 17:04:19 2014 PHILKIME [...] cpan.org - Correspondence added

Small correction - version 0.68 will probably not exist due to problems with PAUSE. Look for version 0.69 very soon.

Fri Feb 28 04:50:37 2014 PHILKIME [...] cpan.org - Correspondence added

0.69 is now on CPAN and biber 1.9 DEV version on SourceForge is now updated with this version so you can test this.

Sat Mar 01 09:25:37 2014 PHILKIME [...] cpan.org - Broken in 0.67 added

Sat Mar 01 09:25:38 2014 PHILKIME [...] cpan.org - Broken in 0.66 deleted

Sat Mar 01 09:25:38 2014 PHILKIME [...] cpan.org - Fixed in 0.69 added

Sat Mar 01 15:58:23 2014 broker5 [...] rambler.ru - Correspondence added

On Fri Feb 28 04:50:37 2014, PHILKIME wrote:

> 0.69 is now on CPAN and biber 1.9 DEV version on SourceForge is now
> updated with this version so you can test this.

I replaced libbtparse.dll with the new version and it works fine with the non-packed distribution of biber 1.8.

I also tried the new development binary of biber 1.9b. It crashes on bibliography entries with 4 or more authors when the biblatex option style=gost-numeric is used. I presume this is due to an incompatibility between the new biber and the old biblatex and gost-numeric. Otherwise the biber 1.9b binary works, though it complains to the console about "Use of uninitialized value $thislocale in concatenation (.) or string" at Biber.pm line 2924, which is unrelated to Text-BibTeX.

Moreover, I tested isulower() on the Russian alphabet and it's OK.

I'll stay with biber 1.8 and the fixed libbtparse for the time being. The problem seems to be fixed. Thank you very much!

P. S. Maybe the "strange UTF-8 issues on Windows XP" you mentioned are due to bugs in msvcrt.dll on Windows XP, which MinGW links applications to. For example, I previously ran into improper synchronization in setlocale() in msvcrt.dll that caused heap corruption. The bug was fixed in subsequent versions of Windows and in Visual C++'s public versions of the CRT (msvcrt.dll itself isn't intended for public use: http://msdn.microsoft.com/en-us/library/abx4dbyh.aspx), but it won't be fixed in XP.

Sun Mar 02 13:05:57 2014 PHILKIME [...] cpan.org - Correspondence added

On Sat Mar 01 15:58:23 2014, broker5@rambler.ru wrote:

> I also tried the new development binary of biber 1.9b. It crashes on
> bibliography entries with 4 or more authors when biblatex option
> style=gost-numeric is used. I presume, this is due to incompatibility
> between the new biber and old biblatex and gost-numeric.

Yes, the biber and biblatex versions are rather tightly coupled now and it's best to use the right versions together.

> Otherwise biber 1.9b binary works, though it complains to the console
> about "Use of uninitialized value $thislocale in concatenation (.) or
> string" at Biber.pm line 2924, which is unrelated to Text-BibTeX.

Hmm, that shouldn't happen with the latest 2.9 DEV biblatex - is that what you are using?

> P. S. Maybe the "strange UTF-8 issues on Windows XP" you mentioned are
> due to bugs in msvcrt.dll on Windows XP, which MinGW links
> applications to. For example, previously I bumped into improper
> synchronization in setlocale() in msvcrt.dll causing heap corruption.
> The bug was fixed in subsequent versions of Windows and Visual C++'s
> public versions of CRT (msvcrt.dll itself isn't intended for public
> use: http://msdn.microsoft.com/en-us/library/abx4dbyh.aspx), but won't
> be fixed in XP.

Ah, ok. I won't care about the Windows XP issues too much then. However, the new isulower() does seem to make my Windows XP test machines pass all tests without complaint now ...

If you can confirm the '$thislocale' issue is caused by using an older biblatex (pre 2.9), I will close this.

Tue Mar 04 16:06:54 2014 broker5 [...] rambler.ru - Correspondence added

On Sun Mar 02 13:05:57 2014, PHILKIME wrote:

> If you can confirm the '$thislocale' issue is caused by using an
> older biblatex (pre 2.9), I will close this.

I was using biblatex 2.8. With biblatex 2.9 I see no crashes or warnings.

Thu Mar 06 05:41:35 2014 PHILKIME [...] cpan.org - Status changed from 'open' to 'resolved'