Subject: Incorrect application of islower() to UTF-8 characters in btparse
OS: Windows XP SP3
Perl: ActivePerl 5.16.3.1603
Text-BibTeX: 0.66
Biber: 1.8
Biblatex: 2.8a
TeX Live 2013
Biber is a backend bibliography processor for biblatex (a LaTeX bibliography package), and it uses Text::BibTeX to process .bib files. Biber's manual states that UTF-8 is supported in .bib files.
I use Biber with biblatex and have a .bib file with some author names in Russian, encoded in UTF-8.
Occasionally Biber is unable to determine an author's last name from a value of the form "Иванов, И. И." when the name is in Russian, and emits a warning: "Couldn't determine Last Name for name "Иванов, И. И." - ignoring name".
I investigated the problem a bit and the results are as follows.
Biber constructs a Text::BibTeX::Name object to split names into parts (first name, last name, von, jr). Text::BibTeX::Name delegates this task to the btparse library (bt_split_name() function) written in C, which is shipped with Text::BibTeX.
bt_split_name() eventually calls find_lc_tokens() (both functions are defined in btparse/src/names.c). To determine whether a token starts with a lowercase character, find_lc_tokens() calls islower(token[0]). islower() comes from the standard ctype.h.
Because the token has type char*, this slices multibyte UTF-8 characters. For example, Russian letters take two bytes in UTF-8, so only the first byte of the first character is passed to islower(). For Cyrillic this byte is in the range 0xD0-0xD3, which corresponds to a negative value when char is signed.
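To make the slicing concrete, here is a small sketch of what token[0] holds for the name "Иванов". The two helper functions are hypothetical and exist only for illustration; they are not part of btparse. The first one models a platform where plain char is signed.

```c
/* Hypothetical helper: the value islower(token[0]) would receive
 * on a platform where plain char is signed. */
int first_byte_as_signed(const char *token)
{
    return (int)(signed char)token[0];
}

/* The same byte read as unsigned char, which is the form the
 * <ctype.h> functions expect. */
int first_byte_as_unsigned(const char *token)
{
    return (int)(unsigned char)token[0];
}
```

For "Иванов" in UTF-8 (bytes D0 98 D0 B2 ...), the unsigned reading gives 0xD0, which is only the lead byte of the letter "И" (U+0418) and not a character on its own. The signed reading gives a negative number on common platforms.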
Moreover, even for single-byte characters, char should be cast to unsigned char before being passed to islower(), because the C standard requires the argument to be either representable as unsigned char or equal to EOF.
In my case islower() uses the CP-1251 encoding (the default Russian encoding in Windows). Therefore it returns false for all Cyrillic UTF-8 characters, because 0xD0-0xD3 are uppercase letters in CP-1251.
Furthermore, if 0xD0-0xD3 are passed to islower() without a prior cast to unsigned char, the behavior is undefined (at least in Windows), because these are negative values.
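The cast issue alone could be addressed roughly as follows. This is only a sketch: starts_with_lower_byte() is a hypothetical name, not an existing btparse function, and it still looks at a single byte, so it does nothing about multibyte characters.

```c
#include <ctype.h>

/* Hypothetical helper: like islower(token[0]), but with the argument
 * converted to unsigned char first, so it is always in the range
 * that <ctype.h> functions accept. */
int starts_with_lower_byte(const char *token)
{
    return islower((unsigned char)token[0]) != 0;
}
```

With this change the call is well defined for any byte value, but the result for a UTF-8 lead byte such as 0xD0 still depends on the current locale and says nothing about the actual letter.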
So I think islower(token[0]) is incorrect for two reasons:
a) slicing of multibyte characters
b) possibly negative values passed to islower()
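One possible direction that addresses both points is to decode the first UTF-8 code point and classify that instead of a raw byte. The sketch below is not a proposed patch: both function names are hypothetical, the decoder does not reject overlong encodings, and the lowercase test covers only ASCII and the Cyrillic range U+0430..U+045F as an assumption for illustration. A real fix would need proper Unicode case data.

```c
#include <ctype.h>

/* Hypothetical helper: decode the first UTF-8 code point of s.
 * Returns the code point, or -1 on malformed input. */
long utf8_first_codepoint(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    long cp;
    int extra, i;

    if (p[0] < 0x80)
        return p[0];                      /* plain ASCII (or the NUL) */
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else
        return -1;                        /* stray continuation byte etc. */

    for (i = 1; i <= extra; i++) {
        /* A premature NUL fails this test too, so we never read past it. */
        if ((p[i] & 0xC0) != 0x80)
            return -1;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    return cp;
}

/* Hypothetical stand-in for islower(token[0]). Only ASCII and the
 * Cyrillic lowercase range U+0430..U+045F are recognized here. */
int token_starts_lower(const char *token)
{
    long cp = utf8_first_codepoint(token);

    if (cp < 0)
        return 0;
    if (cp < 0x80)
        return islower((int)cp) != 0;     /* cp is 0..127, always valid */
    return cp >= 0x0430 && cp <= 0x045F;
}
```

With this sketch, "von" and "иванов" count as starting with a lowercase letter, while "Von" and "Иванов" do not.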
I'm not particularly familiar with the code, so maybe I got something wrong.
Thanks in advance,
Kirill Pushkaryov