Subject: | Issue reading VCF files with PERL library Vcf.pm after complex filtering |
Date: | Fri, 22 Apr 2016 13:09:54 +0200 |
To: | bug-Bio-Pipeline-Comparison [...] rt.cpan.org |
From: | Francesco Musacchia <francescomusacchia [...] gmail.com> |
Hi there,
I want to report this in issue in reading the VCF output format after
GATK-VariantFiltration task.
I was trying to use Vcf.pm to read the output VCF file from that program
but an error came out when parsing the header.
I was just trying the follwoing example code at
http://search.cpan.org/~ajpage/Bio-Pipeline-Comparison-1.123050/lib/Vcf.pm
with my VCF file:
my $vcf = Vcf->new(file=>'example.vcf.gz',region=>'1:1000-2000');
$vcf->parse_header();
# Do some simple parsing. Most thorough but slowest way how to get the data.
while (my $x=$vcf->next_data_hash())
{
for my $gt (keys %{$$x{gtypes}})
{
my ($al1,$sep,$al2) = $vcf->parse_alleles($x,$gt);
print "\t$gt: $al1$sep$al2\n";
}
print "\n";
}
Specifically, I found this is related with the string related to filters
used. The following are the filters as reported in my VCF:
*##FILTER=<ID=HARD_TO_VALIDATE,Description=""QD < 2.0 || FS > 60.0 || MQ <
40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"">
##FILTER=<ID=LowQual,Description=""QUAL > 30.0 && QUAL < 100.0"">*
The double quotes are repeated, as you can see in all the three filters and
this is causing Vcf.pm perl library to fail with:
Could not parse header line: FILTER=<ID=HARD_TO_VALIDATE,Description=""QD <
2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum <
-8.0"">
Stopped at [QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 ||
ReadPosRankSum < -8.0""].
When I go to remove the repeated double-quotes than I solved this. Infact
with ##FILTER=<ID=LowQual,Description="QUAL > 30.0 && QUAL < 100.0"> the
error is not there.
Moreover, one more error comes out because of VariantFiltration. Few lines
later the Command line printed has all the parameters given to
VariantFiltration and this filter again:
*##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.4-46-gbc02625,Date="Sat
Apr 16 12:01:58 CEST 2016",Epoch=1460800918456,CommandLineOptions=.. .. ..
..... filterExpression=["QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum <
-12.5 || ReadPosRankSum < -8.0", "QUAL < 30.0", "QUAL > 30.0 && QUAL <
100.0"] filterName=[HARD_TO_VALIDATE, VeryLowQual, LowQual]
genotypeFilterExpression=[] genotypeFilterName=[] . .. .... >*
And Vcf.pm stops with:
Could not parse header line:
GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.4-46-gbc02625
.....
I solved in a similar way removing the quotes from the filter elements:
*filterExpression=[QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5
|| ReadPosRankSum < -8.0, QUAL < 30.0, QUAL > 30.0 && QUAL < 100.0]*
I do not know if it is a problem in the Vcf.pm library or the VCF format is
not respected in VariantFiltration.
Hope this will be useful
Regards,
Francesco Musacchia
--
Francesco Musacchia - Ph.D.
Computational Biology - Bioinformatics Core
Telethon Institute of Genetics and Medicine
Via Campi Flegrei 34, 80078 Pozzuoli (NA), Italy.
Tel. +39 081 19230692
Mobile: +39 349 6396351