Subject: | Tree drawing is slow for large trees |
Hi Rutger,
First of all, thanks for looking at my previous bug reports and
suggestions. Seeing how the last version of Bio::Phylo was from 2007, I
feared that you weren't maintaining the module anymore.
Anyways, I have a big Newick tree with over 2000 nodes that I plot. But
it takes ~20 minutes to do that, which is quite long.
So, I ran a test script (attached) with the above-mentioned tree
(attached) and profiled it with NYTProf (results attached). It's very
clear that for large tree, all the time is spent in the function
Bio::Phylo::Mediators::NodeMediators::get_link. In particular that piece
of code near line 473 that looks for the parent of a given node:
for my $tuple ( @{$function} ) {
if ( $tuple->[0] == $id ) {
return $node_object_for_id{ $tuple->[1] };
}
}
By inspection, I figured out that @{$function} is an array of all (node,
parent) tuples, and the code goes through each pair to find who the
parent of the node is. It seems that method doesn't scale up very well.
The more nodes, the more tuples, the larger the number of iterations
before finding the parent.
I would like to suggest that maybe you use hashes instead of arrays to
store the parent-child information. Doing so might take a little bit
more of memory and time upfront, but you would be able to query
extremely quick who is the parent of a given node (no loop needed!). I
illustrated this approach in this snippet of code:
my %parentof = ( 'node10' => 'node1', 'node123' => 'node0' );
my $node = 'node123';
my $parent = $parentof{$node};
print "The parent of node $node is $parent\n";
I realize that in the module, you would probably have to create not one
but several hashes to save all of the different types of relations
between nodes: parent, sister, ... I think it's totally worth it.
What do you think?
Regards,
Florent
Subject: | nytprof.tar.gz |
Subject: | VPT.tre |
Message body not shown because it is not plain text.
Subject: | tree_drawing_bench.pl |
#! /usr/bin/env perl
print "Starting\n";
use strict;
use warnings;
use Bio::Phylo::IO;
use Bio::Phylo::Treedrawer;
my $in_file = 'VPT.tre';
print "Test\n";
# Function parameters
my $in_format = 'newick'; # [newick]
my $out_format = 'svg'; # [svg]
my $shape = 'diag'; # [rect|diag|curvy]
my $mode = 'phylo'; # [clado|phylo] # clado is slooooow...
my ($width, $height, $padding, $text_vert_offset, $text_horiz_offset,
$text_width, $font_height, $stroke_width, $line_spacing) = (24279.38, 30696.38,
1458.19, 3, 5, 548, 9, 2, 0.5);
# Parse tree
print "Parsing tree...\n";
my $tree = Bio::Phylo::IO->parse('-format' => $in_format, '-file' => $in_file)->first;
print " done!\n";
# Clean / prepare tree
#$tree = remove_ancestor_names($tree);
#$tree = leaves_taxid_to_names($tree, $taxo_names);
# Create tree
print "Preparing SVG...\n";
my $treedrawer = Bio::Phylo::Treedrawer->new(
-width => $width,
-height => $height,
-padding => $padding,
-text_vert_offset => $text_vert_offset,
-text_horiz_offset => $text_horiz_offset,
-text_width => $text_width,
-shape => $shape,
-mode => $mode,
-format => $out_format
);
print " done!\n";
print "Setting tree...\n";
$treedrawer->set_tree($tree);
print " done!\n";
print "Drawing tree...\n";
#my $count = 1000;
my $count = 1;
for (my $i = 1; $i <= $count; $i++) {
my $svg_tree = $treedrawer->draw;
}
print " done!\n";