Subject: | Handling namespace prefixes in OpenDocument XML |
Date: | Fri, 13 Nov 2015 17:13:52 +0000 |
To: | bug-Spreadsheet-XLSX [...] rt.cpan.org |
From: | Richard Lewis <richard [...] rjlewis.me.uk> |
Hi there,
I've been trying to import some XLSX spreadsheets and was finding that
Spreadsheet::XLSX (v0.15) couldn't find the worksheets in the file,
and then later that it couldn't retrieve any of the cell values.
I run my script in the debugger and stepped through the
Spreadsheet::XLSX->_load_workbook subroutine, looking especially at
the loop which begins:
foreach ($member_workbook -> contents =~ /\<(.*?)\/?\>/g) {
The first line of this look is a pattern match:
/^(\w+)\s+/;
which is the first word inside the tag, followed by everthing
else. Now, for tags such as:
<sheet name="Sheet 3" sheetId="3" r:id="rId3" />
this works OK, because "sheet" will match. And then the:
$tag eq 'sheet' or next;
test will pass. However, in my XLSX file I found that the
xl/workbook.xml member file was encoded with namespace prefixes for
all the tags; I had, for example:
<x:sheet name="Sheet 3" sheetId="3" r:id="rId3" />
where the x namespace was defined in the root node like this:
<x:workbook [...] xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
Consequently, the /^(\w+)\s+/ pattern did not match "x:sheet", and so
none of the sheets in the workbook were found.
Simply changing the pattern to:
/^x:(\w+)\s+/
would fix the problem for my particular spreadsheet. But it's not a
correct solution as there's no requirement that the workbook.xml file
use namespace prefixes, and there's also no requirement that any
prefix must be called "x".
After experimenting with this, I went on to find that some (but not
all!) of the xl/worksheets/sheet*.xml files used namespace prefixes
and so had problems retrieving the cell values. I started trying some
fix-ups in the region of:
my $parsing_v_tag = 0;
my $s = 0;
my $s2 = 0;
my $sty = 0;
foreach ($member_sheet->contents =~ /(\<.*?\/?\>|.*?(?=\<))/g) {
if (/^\<c\s*.*?\s*r=\"([A-Z])([A-Z]?)(\d+)\"/) {
($row, $col) = __decode_cell_name($1, $2, $3);
but eventually got too confused!
Of course, this comes about as a result of processing XML using text
techniques (i.e. regexes). While I definitely see the performance
advantages of this (over using a library to build a DOM, for example),
we do have these drawbacks of having to account for all the
possibilities of XML serialisation in the wild.
Any thoughts on how we might be able to get this fixed? I guess some
careful re-working of the regexes might be sufficient. Or possibly
re-writing to use an XML parser; maybe one with a SAX API?
Richard
--
http://web.rjlewis.me.uk/