CC: | samtools-devel <samtools-devel [...] lists.sourceforge.net> |
Subject: | Bio::DB::Sam, patch -> performance improvement |
Date: | Wed, 24 Mar 2010 14:08:37 +0000 |
To: | bug-Bio-SamTools [...] rt.cpan.org |
From: | Keiran Raine <kr2 [...] sanger.ac.uk> |
Hi,
I'd like to suggest the following patch to Bio::DB::Sam.pm:
1853c1853,1856
< return Bio::DB::Bam->index($self->{bam_path});
---
> if(!$self->{bai}) {
> $self->{bai} = Bio::DB::Bam->index($self->{bam_path});
> }
> return $self->{bai};
This caches the index for the current Bio::DB::Sam object rather than
calling down to the C level every time it is requested. We found this
was a major bottleneck when running multiple pileups over the same BAM
file (where one Bio::DB::Sam object persists per BAM file).
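To illustrate the effect without needing samtools, here is a minimal, self-contained sketch. MockSam, open_index and count_opens are hypothetical stand-ins, not part of Bio::DB::Sam; open_index only counts calls in place of the real Bio::DB::Bam->index, and the sizes (3 positions, 2 files) are made up for the illustration. The cached branch uses ||=, which should behave the same as the if(!$self->{bai}) test in the patch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the expensive C-level call
# (Bio::DB::Bam->index); it only counts how often it is invoked.
my $index_opens = 0;

sub open_index {
    my ($bam_path) = @_;
    $index_opens++;
    return { path => "$bam_path.bai" };
}

package MockSam;

# Hypothetical object mimicking the relevant part of Bio::DB::Sam.
sub new {
    my ($class, %args) = @_;
    return bless { bam_path => $args{bam_path}, cache => $args{cache} }, $class;
}

sub bam_index {
    my $self = shift;

    # Original behaviour: open the index on every request.
    return main::open_index($self->{bam_path}) unless $self->{cache};

    # Patched behaviour: open once, then reuse the stored handle.
    $self->{bai} ||= main::open_index($self->{bam_path});
    return $self->{bai};
}

package main;

# Request the index for every position against every (persistent) object
# and report how many times the index was actually opened.
sub count_opens {
    my ($cache) = @_;
    $index_opens = 0;
    my @sams = map { MockSam->new(bam_path => $_, cache => $cache) } qw(bam1 bam2);
    for my $pos (1 .. 3) {
        $_->bam_index for @sams;
    }
    return $index_opens;
}

printf "uncached: %d opens, cached: %d opens\n", count_opens(0), count_opens(1);
```

In this toy setup the uncached variant opens the index once per position per file, while the cached variant opens it once per file, which is the same pattern as the drop in index_open calls in the profile.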
Example:
Perl script checking 31 locations in 28 BAM files
pos1 vs. bam1
pos1 vs. bam2
pos1 vs. bam3
pos1 vs. bam.....
pos2 vs. bam1
pos2 vs. bam2....
Original dprof analysis:
%Time ExclSec CumulS #Calls sec/call Csec/c Name
45.6 29.29 29.290 416 0.0704 0.0704 Bio::DB::Bam::index_open
After change:
%Time ExclSec CumulS #Calls sec/call Csec/c Name
23.7 2.130 2.130 28 0.0761 0.0761 Bio::DB::Bam::index_open
The reduction in calls cuts the run time from ~70 seconds to ~10.
Kind regards,
Keiran Raine
Senior Computer Biologist
The Cancer Genome Project
Ext: 2100
kr2@sanger.ac.uk
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.