
This queue is for tickets about the Bio-Phylo CPAN distribution.

Report information
The Basics
Id: 18208
Status: resolved
Priority: 0/
Queue: Bio-Phylo

People
Owner: Nobody in particular
Requestors: easmith [...] beatrice.rutgers.edu
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Bio::Phylo::Parsers::Newick regexp bug
Date: Thu, 16 Mar 2006 11:58:10 -0500
To: bug-bio-phylo [...] rt.cpan.org
From: Allen Smith <easmith [...] beatrice.rutgers.edu>
Thanks for Bio::Phylo!

$Id: Newick.pm,v 1.22 2005/09/29 20:31:18 rvosa Exp $

If I try feeding a tree into Bio::Phylo::IO for parsing with newick format, and said tree has a (bracketed) comment in it (which is normally allowed anyplace a newline is allowed - http://evolution.genetics.washington.edu/phylip/newick_doc.html), such as the log likelihood from tree-puzzle-5.2, an error (generally) happens, such as:

    Invalid [] range "=-4" in regex; marked by <-- HERE in m/^.*[,|\)|\(][lh=-4 <-- HERE 464.484953]([,|:|\)|;].*)$/

(The tree in question, which is also not read correctly for the material in '' - I understand parsing quotes and escapes is a headache, having tried to do it myself! - is:

    [ lh=-4464.484953 ](Methanococcus_voltae:0.32692,(('Pyrococcus furiosus (includes Pyrococcus woesei)':0.05887,Pyrococcus_abyssi:0.03869)100:0.36861, (((Sulfolobus_solfataricus:0.08344,Sulfolobus_tokodaii:0.10668)100:0.15268,Aeropyrum_pernix:0.20003) 100:0.09351,Desulfuroccus_amylolyticus:0.18345)100:0.28706)100:0.41157,'Methanococcus jannaschii (aka Methanocaldococcus jannaschii)':0.00001);

If one lists all the names of the nodes retrieved from the above, one gets:

    Methanococcus_voltae
    'Pyrococcusfuriosus
    includesPyrococcuswoesei
    '
    Pyrococcus_abyssi
    100
    Sulfolobus_solfataricus
    Sulfolobus_tokodaii
    100
    Aeropyrum_pernix
    100
    Desulfuroccus_amylolyticus
    100
    100
    'Methanococcusjannaschii
    akaMethanocaldococcusjannaschii
    '
    n1

)

The problem appears to be that Newick.pm has a function, _parse_string, with a bug in it:

    my ( $st, $depth, $name ) = ( $string, 0, $node->get_name );
    $st =~ s/^.*[,|\)|\(]$name([,|:|\)|;].*)$/$1/;

$name in the above should be quotemeta'd, and comments (in []) should, if possible, be eliminated earlier - probably replaced with newlines. (I am also curious as to the reason for the | symbols in the character classes ([]); I can't see what they're doing, unless they're simply to make it easier to read...)
Thanks again for Bio::Phylo,

-Allen

--
Allen Smith http://cesario.rutgers.edu/easmith/
September 11, 2001 A Day That Shall Live In Infamy II
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." - Benjamin Franklin
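The failure described above can be reproduced outside Bio::Phylo with a few lines of core Perl. The sketch below is not the code from Newick.pm: the two helper subs, `strip_to_name_unsafe` and `strip_to_name_quoted`, and the shortened tree string are hypothetical stand-ins built around the substitution quoted in the report. It shows that a name containing regex metacharacters breaks the interpolated pattern, that `\Q...\E` (equivalent to calling `quotemeta`) avoids this, and that `|` inside a character class is a literal pipe and not alternation.

```perl
use strict;
use warnings;

# Hypothetical helper: the substitution as quoted in the report, with the
# node name interpolated into the pattern unescaped.
sub strip_to_name_unsafe {
    my ( $st, $name ) = @_;
    $st =~ s/^.*[,|\)|\(]$name([,|:|\)|;].*)$/$1/;
    return $st;
}

# Hypothetical helper: same substitution with the name protected by \Q...\E.
# The pipes are dropped from the character classes; inside [] they only add a
# literal '|' to the set, so this class is slightly narrower than the original.
sub strip_to_name_quoted {
    my ( $st, $name ) = @_;
    $st =~ s/^.*[,()]\Q$name\E([,:);].*)$/$1/;
    return $st;
}

# Shortened example tree (an assumption, not the tree from the ticket).
my $tree = '(A:0.1,(B:0.2,C:0.3)100:0.4);';

# A "name" like the whitespace-stripped comment from the report makes the
# pattern itself invalid, so the substitution dies when it is compiled.
my $ok = eval { strip_to_name_unsafe( $tree, '[lh=-4464.484953]' ); 1 };
print "unsafe pattern failed: $@" unless $ok;

# With quoting, the same name is treated as literal text and simply fails to
# match, leaving the string untouched.
print strip_to_name_quoted( $tree, '[lh=-4464.484953]' ), "\n";

# An ordinary name still works as before.
print strip_to_name_quoted( $tree, 'B' ), "\n";

# '|' is an ordinary member of the original character class.
print "pipe is literal inside []\n" if '|' =~ /^[,|\)|\(]$/;
```

Quoting the name only fixes the regex error. The bracketed comment would still need to be removed before node names are extracted, as suggested in the report.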
Thank you for bringing this to my attention. I will try to be more accommodating towards quoted strings and such in Newick.

Rutger Vos
Dear Allen (and other readers),

I have changed the newick parser to address this issue. From now on, the test suite includes a test file 'regress_18208.t' that parses the string you provided in the ticket and checks if the tip names match up with those in the tree description (and they do in v.0.16, for which a release candidate will appear on CPAN this week).

Sorry about the wait, hope you are still interested in Bio::Phylo. I now consider this ticket resolved.

Best wishes,

Rutger
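For readers who want to see what such a tip-name check involves, here is a minimal core-Perl sketch. It is not the contents of regress_18208.t and does not use Bio::Phylo: `tip_names` is a hypothetical helper, and the test string is a shortened variant of the tree in the ticket. It assumes comments are not nested, that quoted labels escape a single quote by doubling it, and that the tree has at least one pair of parentheses. Underscores in unquoted labels are left as they are.

```perl
use strict;
use warnings;

# Hypothetical helper: return the tip (leaf) labels of a Newick string.
# The string is tokenized in one pass so that [comments] are skipped and
# 'quoted labels' are kept whole, including any spaces or parentheses.
sub tip_names {
    my ($newick) = @_;
    my @tips;
    my $prev = '';    # last structural token seen

    while (
        $newick =~ /\G\s*(?:
            (\[[^\]]*\])              # 1: bracketed comment
          | '((?:[^']|'')*)'          # 2: quoted label, '' is an escaped quote
          | ([(),:;])                 # 3: structural character
          | ([^\s()\[\]',:;]+)        # 4: unquoted label or branch length
        )/gcx
      )
    {
        if ( defined $1 ) {
            next;                     # comments carry no tree structure
        }
        elsif ( defined $3 ) {
            $prev = $3;
        }
        else {
            my $label = defined $2 ? $2 : $4;
            $label =~ s/''/'/g if defined $2;
            # A label directly after '(' or ',' names a tip; one after ')'
            # names an internal node, and one after ':' is a branch length.
            push @tips, $label if $prev eq '(' || $prev eq ',';
            $prev = 'label';
        }
    }
    return @tips;
}

# Shortened variant of the tree from the ticket (an assumption).
my $tree = "[ lh=-4464.484953 ](Methanococcus_voltae:0.32692,"
  . "('Pyrococcus furiosus (includes Pyrococcus woesei)':0.05887,"
  . "Pyrococcus_abyssi:0.03869)100:0.36861);";

print "$_\n" for tip_names($tree);
```

A regression test could then compare the returned list against the names written in the tree description, which is the kind of check the message above describes.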