Bug #48056 for BibTeX-Parser:

Tue Jul 21 11:32:51 2009 TIMBRODY [...] cpan.org - Ticket created

Hi,

Thanks for releasing this module. I don't think you want to "_sanitize" field values. BibTeX can contain any (La)TeX sequence to enable complex characters to be constructed, e.g. for u-umlaut you use "M{\"u}ller", so just stripping braces and de-escaping backslashed values won't give the correct result. This also affects parsing names such as "{Barnes and Noble, Inc.}" correctly. Essentially you need to do token-parsing on names as well ...

I think Text::Balanced is a bit of a memory hog (at least for the functionality gained ...). This patch will remove the requirement for that module:

--- Parser.pm.1 2009-07-19 16:44:40.000000000 +0100
+++ Parser.pm 2009-07-21 16:22:08.000000000 +0100
@@ -5,8 +5,6 @@
 our $VERSION = '0.3';
-use Text::Balanced qw(extract_bracketed extract_delimited);
-
 use BibTeX::Parser::Entry;
 =for stopwords jr von
@@ -239,7 +237,7 @@
     { # quoted string with embeded escapes
         $value .= $1;
     } else {
-        my $part = ( extract_bracketed( $_, "{}" ) )[0];
+        my $part = extract_bracketed( $_ );
         $value .= substr $part, 1, length($part) - 2; # strip quotes
     }
@@ -251,4 +249,23 @@
     return $value;
 }
-1; # End of BibTeX::Parser
\ No newline at end of file
+sub _extract_bracketed
+{
+    for($_[0]) # alias to $_
+    {
+        /\G\s+/cg;
+        my $start = pos($_);
+        my $depth = 0;
+        while(1)
+        {
+            /\G\\./cg && next;
+            /\G\{/cg && (++$depth, next);
+            /\G\}/cg && (--$depth > 0 ? next : last);
+            /\G([^\{\}]+)/cg && next;
+            last; # end of string
+        }
+        return substr($_, $start, pos($_)-$start);
+    }
+}
+
+1; # End of BibTeX::Parser

Tue Jul 21 12:09:15 2009 TIMBRODY [...] cpan.org - Correspondence added

That patch should have been this:

--- Parser.pm.1 2009-07-19 16:44:40.000000000 +0100
+++ Parser.pm 2009-07-21 17:07:52.000000000 +0100
@@ -5,8 +5,6 @@
 our $VERSION = '0.3';
-use Text::Balanced qw(extract_bracketed extract_delimited);
-
 use BibTeX::Parser::Entry;
 =for stopwords jr von
@@ -239,7 +237,7 @@
     { # quoted string with embeded escapes
         $value .= $1;
     } else {
-        my $part = ( extract_bracketed( $_, "{}" ) )[0];
+        my $part = _extract_bracketed( $_ );
         $value .= substr $part, 1, length($part) - 2; # strip quotes
     }
@@ -251,4 +249,23 @@
     return $value;
 }
-1; # End of BibTeX::Parser
\ No newline at end of file
+sub _extract_bracketed
+{
+    for($_[0]) # alias to $_
+    {
+        /\G\s+/cg;
+        my $start = pos($_);
+        my $depth = 0;
+        while(1)
+        {
+            /\G\\./cg && next;
+            /\G\{/cg && (++$depth, next);
+            /\G\}/cg && (--$depth > 0 ? next : last);
+            /\G([^\\\{\}]+)/cg && next;
+            last; # end of string
+        }
+        return substr($_, $start, pos($_)-$start);
+    }
+}
+
+1; # End of BibTeX::Parser
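
For readers who want to follow the scanning logic of _extract_bracketed without the Perl \G idiom, here is a rough sketch of the same idea. It is written in Python purely for illustration and is not part of the module. The function name, the (group, end) return value, and the handling of a trailing backslash are assumptions of this sketch.

```python
def extract_bracketed(text, pos=0):
    """Return (group, end) for the brace group starting at or after pos.

    Leading whitespace is skipped, a backslash escapes the following
    character, and nesting depth is tracked. If the braces never
    balance, the group simply runs to the end of the string.
    """
    # Skip leading whitespace, like /\G\s+/cg in the patch.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    start = pos
    depth = 0
    while pos < len(text):
        ch = text[pos]
        if ch == "\\":
            pos += 2          # escaped character: never counts as a brace
            continue
        pos += 1
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth <= 0:    # outermost group closed
                break
    pos = min(pos, len(text))  # a trailing backslash may overshoot
    return text[start:pos], pos


group, end = extract_bracketed('{M{\\"u}ller} rest')
print(group)  # the whole outer group, braces included
```

As in the patch, the caller is still expected to strip the outer pair of braces from the returned group.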

Tue Jul 21 12:09:16 2009 TIMBRODY [...] cpan.org - Status changed from 'new' to 'open'

Sun Jul 26 14:09:34 2009 GERHARD [...] cpan.org - Correspondence added

Subject: De-TeX-ification of values

Thanks for the patch, it is included in version 0.3.2 hitting CPAN now.

Re the sanitize function: I want the parser to extract values useable outside a TeX/BibTeX environment, so I'll try to convert everything to Unicode. A future version of the module will probably use something like TeX::Encode to do that.

Do you need the raw text of a field in your use of the parser? If yes, I could add a method field_raw to return the original text, but that would mean I need to store both the raw and the detexified version.

Mon Jul 27 06:57:03 2009 TIMBRODY [...] cpan.org - Correspondence added

On Sun Jul 26 14:09:34 2009, GERHARD wrote:

> Thanks for the patch, it is included in version 0.3.2 hitting CPAN now.
>
> Re the sanitize function: I want the parser to extract values useable
> outside a TeX/BibTeX environment, so I'll try to convert everything to
> Unicode. A future version of the module will probably use something like
> TeX::Encode to do that.

I have a pile of changes to release for TeX::Encode (including escaping/unescaping BibTeX strings). It's ugly, but the only approach that seems possible short of rewriting TeX is to use big mapping tables. You can't write a parser for TeX, but you need to sort-of parse it to make a reasonable transcoding:

\"e = \"{e} = {\"e} = {\"{e}} = \"e{}

Arg!

> Do you need the raw text of a field in your use of the parser? If yes, I
> could add a method field_raw to return the original text, but that would
> mean I need to store both the raw and the detexified version.

I need the raw TeX/text to throw at TeX::Encode. To process names correctly you will need to do some brace magic though: "Ludwig {van Beethoven}" (lastname="van Beethoven")
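
To make the mapping-table point concrete, here is a small sketch of how the five equivalent spellings of \"e listed above could be folded into one character. It is written in Python for illustration only and does not show how TeX::Encode works. The table, the pattern, and the function name are all assumptions of this sketch, and a realistic table would need far more entries.

```python
import re

# Tiny illustrative table (assumption: only lowercase umlauts are covered).
UMLAUT = {"a": "\u00e4", "e": "\u00eb", "o": "\u00f6", "u": "\u00fc"}

# One alternative per spelling, most heavily braced first.
FORMS = re.compile(
    r'\{\\"\{([a-zA-Z])\}\}'        # {\"{e}}
    r'|\{\\"([a-zA-Z])\}'           # {\"e}
    r'|\\"\{([a-zA-Z])\}'           # \"{e}
    r'|\\"([a-zA-Z])(?:\{\})?'      # \"e and \"e{}
)


def decode_umlauts(text):
    """Replace every recognised umlaut spelling with its Unicode character."""
    def replace(match):
        letter = next(g for g in match.groups() if g)
        # Letters missing from the table are left exactly as written.
        return UMLAUT.get(letter, match.group(0))
    return FORMS.sub(replace, text)


print(decode_umlauts('M{\\"u}ller'))
```

Even this toy version needs four alternatives for a single accent, which is a fair indication of why the real tables get big.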

Tue Apr 27 11:08:12 2010 TIMBRODY [...] cpan.org - Correspondence added

Attached is a patch to:

1) Disable "sanitising" bibtex values
2) Better parsing of author names (respects braces)

This makes use of a nbsp (0xa0) kludge to stop name part splitting happening inside braces.

Subject: bibtex_author_patch.diff

Index: perl_lib/BibTeX/Parser/Entry.pm
===================================================================
--- perl_lib/BibTeX/Parser/Entry.pm (revision 5333)
+++ perl_lib/BibTeX/Parser/Entry.pm (revision 5347)
@@ -183,34 +183,31 @@
     return () if !defined $field || $field eq '';
     my @names;
-
-    my $buffer;
-    while (!defined pos $field || pos $field < length $field) {
-        if ( $field =~ /\G ( .* ) ( \{ | \s+ and \s+ )/xcgi ) {
-            my $match = $1;
-            if ( $2 =~ /and/i ) {
-                $buffer .= $match;
-                push @names, $buffer;
-                $buffer = "";
-            } elsif ( $2 =~ /\{/ ) {
-                $buffer .= "{" . $match;
-                if ( $field =~ /\G (.* \})/cgx ) {
-                    $buffer .= $1;
-                } else {
-                    die "Missing closing brace at " . substr( $field, pos $field, 10 );
+    my $name = '';
+    my $inbrace = 0;
+    for($field)
+    {
+        pos($_) = 0;
+        while(pos $_ < length $_)
+        {
+            /\G(\{)/cg && (($name .= $1), ++$inbrace, next);
+            /\G(\})/cg && (($name .= $1), --$inbrace, next);
+            $inbrace && /\G([^\{\}]+)/cg && (($name .= _nbsp($1)), next);
+            /\G([^\{\}]*?)\sand\s+/cig && (push(@names, $name.$1), $name='', next);
+            /\G([^\{\}]+)/cg && (($name .= $1), next); # last name
         }
-            } else {
-                $buffer .= $match;
-            }
-        } else {
-            $buffer .= substr $field, (pos $field || 0);
-            last;
     }
-    }
-    push @names, $buffer if $buffer;
-    return @names;
+    push @names, $name if length($name);
+    return @names;
 }
+sub _nbsp
+{
+    my( $str ) = @_;
+    $str =~ s/\s/\xa0/g;
+    return $str;
+}
+
 =head2 author([@authors])
 Get or set the authors. Returns an array of L<BibTeX::Author|BibTeX::Author>
@@ -264,6 +261,7 @@
 }
 sub _sanitize_field {
+return shift;
     my $value = shift;
     for ($value) {
         tr/\{\}//d;
Index: perl_lib/BibTeX/Parser/Author.pm
===================================================================
--- perl_lib/BibTeX/Parser/Author.pm (revision 5333)
+++ perl_lib/BibTeX/Parser/Author.pm (revision 5347)
@@ -119,7 +119,7 @@
     $name =~ s/^\s*(.*)\s*$/$1/s;
     if ( $name =~ /^\{\s*(.*)\s*\}$/ ) {
-        return (undef, undef, $1, undef);
+        return _nbsp(undef, undef, $1, undef);
     }
     my @parts = split /\s*,\s*/, $name;
@@ -142,13 +142,13 @@
     }
     if (@name_parts) {
-        return ($first, $von, join(" ", @name_parts), undef);
+        return _nbsp($first, $von, join(" ", @name_parts), undef);
     } else {
-        return (undef, undef, $name, undef);
+        return _nbsp(undef, undef, $name, undef);
     }
 } else {
-    if ($name =~ /^((.*)\s+)?\b(\S+)$/) {
-        return ($2, undef, $3, undef);
+    if ($name =~ /^(?:(.*)\s+)?(\S+)$/) {
+        return _nbsp($1, undef, $2, undef);
     }
 }
@@ -159,7 +159,7 @@
     while ( lc($von_last_parts[0]) eq $von_last_parts[0] ) {
         $von .= $von ? ' ' . shift @von_last_parts : shift @von_last_parts;
     }
-    return ($parts[1], $von, join(" ", @von_last_parts), undef);
+    return _nbsp($parts[1], $von, join(" ", @von_last_parts), undef);
 } else {
     my @von_last_parts = split /\s+/, $parts[0];
     my $von;
@@ -167,9 +167,18 @@
     while ( lc($von_last_parts[0]) eq $von_last_parts[0] ) {
         $von .= $von ? ' ' . shift @von_last_parts : shift @von_last_parts;
     }
-    return ($parts[2], $von, join(" ", @von_last_parts), $parts[1]);
+    return _nbsp($parts[2], $von, join(" ", @von_last_parts), $parts[1]);
 }
+}
+sub _nbsp
+{
+    my @parts = @_;
+    for(@parts)
+    {
+        $_ =~ s/\xa0/ /g if defined $_;
+    }
+    return @parts;
 }
 =head2 to_string
@@ -188,4 +197,4 @@
 }
 }
-1; # End of BibTeX::Entry
\ No newline at end of file
+1; # End of BibTeX::Entry
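
The goal of the name-splitting part of the patch, never breaking on "and" inside braces, can also be sketched without the nbsp substitution by tracking brace depth directly. The following is an illustrative Python sketch only. It uses a different technique from the patch, it is not the module's code, and the function name split_names is an assumption.

```python
import re

# The separator is the word "and" surrounded by whitespace, in any case.
SEPARATOR = re.compile(r"\s+and\s+", re.IGNORECASE)


def split_names(field):
    """Split a BibTeX author field on "and", but only at brace depth 0."""
    names, current, depth, pos = [], "", 0, 0
    while pos < len(field):
        if depth == 0:
            match = SEPARATOR.match(field, pos)
            if match:
                names.append(current)
                current = ""
                pos = match.end()
                continue
        ch = field[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)  # tolerate a stray closing brace
        current += ch
        pos += 1
    if current:
        names.append(current)
    return names


print(split_names("{Barnes and Noble, Inc.} and Ludwig {van Beethoven}"))
```

Braces are kept in the returned names, so a later step can still tell that "van Beethoven" was protected as a single last name.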

Tue Jul 06 08:17:14 2010 gerhard.gossen [...] googlemail.com - Correspondence added

Subject:	Re: [rt.cpan.org #48056]
Date:	Tue, 6 Jul 2010 13:23:37 +0200
To:	bug-BibTeX-Parser [...] rt.cpan.org
From:	Gerhard Gossen <gerhard.gossen [...] googlemail.com>

Thanks for the patch. I just uploaded version 0.5, which contains the patch to disable sanitising the values (the next version will have a separate method to get completely decoded values) and parses author names with braces.

I decided to go a different path for the second part than your patch, because I use the method B::P::Author::split in a different context with the plain string, so this made more sense to me. Also, it makes testing easier.

If you have any names that still break the parser, I'd be glad to have additional tests for those names (they are in t/05-author.t).

Tue Mar 15 18:16:53 2011 GERHARD [...] cpan.org - Status changed from 'open' to 'resolved'