Bug #43247 for Spreadsheet-XLSX: two problems in treating shared string table

Subject:	two problems in treating shared string table
Date:	Fri, 13 Feb 2009 11:24:28 +0900
To:	bug-Spreadsheet-XLSX [...] rt.cpan.org
From:	okina [...] is.s.u-tokyo.ac.jp

Hi,

While trying to load an Excel 2007 file, I encountered the two problems below, so I am sending you a patch.

My environment is:
* Spreadsheet-XLSX-0.09
* Linux version 2.6.9-023stab040.1-enterprise (root@rhel4-32) (gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)) #1 SMP Mon Jan 15 22:56:55 MSK 2007
* perl, v5.8.5 built for i386-linux-thread-multi

1) The loaded strings contain character entity references literally (for example, '&amp;' instead of '&'); they are not decoded.

2) Because Japanese Excel files contain 'Phonetic Properties' items, Spreadsheet::XLSX misaligns the indices of items in the shared string table.

Phonetic items represent pronunciation hints for some East Asian languages. In the file 'xl/sharedStrings.xml', the phonetic properties appear like:

    <si>
      <t>(a japanese text in KANJI)</t>
      <rPh sb="0" eb="1">
        <t>(its pronunciation in KATAKANA)</t>
      </rPh>
    </si>

Then the routine in Spreadsheet::XLSX::new(),

    foreach my $t ($mstr =~ /<t.*?>(.*?)<\/t/gsm)

wrongly extracts the phonetic items as normal string items, because it searches only for '<t>' tags. Each phonetic item therefore occupies an extra slot in the list, and the indices of all following strings are shifted.

This problem is not a special case. It can occur in many XLSX files created by the Japanese version of Excel, because phonetic properties are inserted automatically by Excel (and the IME).

* For details of the OOXML file formats, see:
  http://www.ecma-international.org/publications/standards/Ecma-376.htm
  The section '1st edition Part 4' is the markup language reference.

According to the reference, this problem can be caused only by '<rPh>' tags. Therefore, I wrote a simple patch that fixes both bugs. Note that I think it is acceptable to ignore such phonetic items in your simple implementation.
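The misalignment in 2) and the literal entities in 1) can be reproduced without an Excel file. The following is only a minimal sketch: the sample XML uses ASCII placeholders modelled on the <si> structure quoted above, and decode_xml_entities() is a hypothetical stand-in that handles just the five predefined XML entities so that the script does not depend on CGI being installed (the patch itself uses CGI::unescapeHTML).

```perl
use strict;
use warnings;

# Hypothetical sample modelled on the <si> structure quoted above.
# The strings are ASCII placeholders, not taken from a real workbook.
my $mstr = join '',
    '<sst>',
    '<si><t>KANJI</t><rPh sb="0" eb="1"><t>KATAKANA</t></rPh></si>',
    '<si><t>A &amp; B</t></si>',
    '</sst>';

# Extraction as done in Spreadsheet::XLSX 0.09: every <t> element is
# collected, including the one nested inside <rPh>.
my @before = ($mstr =~ /<t.*?>(.*?)<\/t/gsm);

# Same extraction after removing the phonetic runs first (as in the patch).
(my $stripped = $mstr) =~ s%<rPh.*?>(.*?)</rPh>%%gsm;
my @after = ($stripped =~ /<t.*?>(.*?)<\/t/gsm);

# Hypothetical helper (an assumption for this sketch, not part of any
# module): decodes only the five predefined XML entities in one pass.
sub decode_xml_entities {
    my ($s) = @_;
    my %map = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $s =~ s/&(lt|gt|quot|apos|amp);/$map{$1}/g;
    return $s;
}

print "before: ", join(' | ', @before), "\n";
print "after:  ", join(' | ', map { decode_xml_entities($_) } @after), "\n";
```

With this sample, the unpatched extraction yields three items, so the string that should be at index 1 ('A &amp; B') ends up at index 2. After the <rPh> elements are stripped, two items remain, and decoding turns the second one into 'A & B'.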
===============
--- XLSX.pm.orig 2009-01-26 16:02:19.000000000 +0900
+++ XLSX.pm 2009-02-13 01:52:19.000000000 +0900
@@ -12,6 +12,7 @@
 use Spreadsheet::XLSX::Fmt2007;
 use Data::Dumper;
 use Spreadsheet::ParseExcel;
+use CGI;
 ################################################################################
@@ -31,9 +32,11 @@
 my $mstr = $member_shared_strings->contents;
 $mstr =~ s/<t\/>/<t><\/t>/gsm; # this handles an empty t tag in the xml <t/>
+$mstr =~ s%<rPh.*?>(.*?)</rPh>%%gsm; # ignores phonetic properties
 #foreach my $t ($member_shared_strings -> contents =~ /t\>([^\<]*)\<\/t/gsm) {
 foreach my $t ($mstr =~ /<t.*?>(.*?)<\/t/gsm) {
+    $t = CGI::unescapeHTML($t);
     $t = $converter -> convert ($t) if $converter;
     push @shared_strings, $t;
===============

Regards,
//----
Kazumasa Kotani
e-mail: okina@is.s.u-tokyo.ac.jp