Bug #108901 for Spreadsheet-XLSX: Handling namespace prefixes in OpenDocument XML

Subject:	Handling namespace prefixes in OpenDocument XML
Date:	Fri, 13 Nov 2015 17:13:52 +0000
To:	bug-Spreadsheet-XLSX [...] rt.cpan.org
From:	Richard Lewis <richard [...] rjlewis.me.uk>

Hi there, I've been trying to import some XLSX spreadsheets and was finding that Spreadsheet::XLSX (v0.15) couldn't find the worksheets in the file, and then later that it couldn't retrieve any of the cell values. I run my script in the debugger and stepped through the Spreadsheet::XLSX->_load_workbook subroutine, looking especially at the loop which begins: foreach ($member_workbook -> contents =~ /\<(.*?)\/?\>/g) { The first line of this look is a pattern match: /^(\w+)\s+/; which is the first word inside the tag, followed by everthing else. Now, for tags such as: <sheet name="Sheet 3" sheetId="3" r:id="rId3" /> this works OK, because "sheet" will match. And then the: $tag eq 'sheet' or next; test will pass. However, in my XLSX file I found that the xl/workbook.xml member file was encoded with namespace prefixes for all the tags; I had, for example: <x:sheet name="Sheet 3" sheetId="3" r:id="rId3" /> where the x namespace was defined in the root node like this: <x:workbook [...] xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main"> Consequently, the /^(\w+)\s+/ pattern did not match "x:sheet", and so none of the sheets in the workbook were found. Simply changing the pattern to: /^x:(\w+)\s+/ would fix the problem for my particular spreadsheet. But it's not a correct solution as there's no requirement that the workbook.xml file use namespace prefixes, and there's also no requirement that any prefix must be called "x". After experimenting with this, I went on to find that some (but not all!) of the xl/worksheets/sheet*.xml files used namespace prefixes and so had problems retrieving the cell values. I started trying some fix-ups in the region of: my $parsing_v_tag = 0; my $s = 0; my $s2 = 0; my $sty = 0; foreach ($member_sheet->contents =~ /(\<.*?\/?\>|.*?(?=\<))/g) { if (/^\<c\s*.*?\s*r=\"([A-Z])([A-Z]?)(\d+)\"/) { ($row, $col) = __decode_cell_name($1, $2, $3); but eventually got too confused! Of course, this comes about as a result of processing XML using text techniques (i.e. regexes). While I definitely see the performance advantages of this (over using a library to build a DOM, for example), we do have these drawbacks of having to account for all the possibilities of XML serialisation in the wild. Any thoughts on how we might be able to get this fixed? I guess some careful re-working of the regexes might be sufficient. Or possibly re-writing to use an XML parser; maybe one with a SAX API? Richard -- http://web.rjlewis.me.uk/