Subject: two problems in treating shared string table
Date: Fri, 13 Feb 2009 11:24:28 +0900
To: bug-Spreadsheet-XLSX [...] rt.cpan.org
From: okina [...] is.s.u-tokyo.ac.jp
Hi,
While loading an Excel 2007 file, I encountered the two problems below,
so I am sending you a patch.
My environment is:
* Spreadsheet-XLSX-0.09
* Linux version 2.6.9-023stab040.1-enterprise (root@rhel4-32)
(gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)) #1 SMP
Mon Jan 15 22:56:55 MSK 2007
* perl, v5.8.5 built for i386-linux-thread-multi.
1)
The loaded content contains character entity references (such as '&amp;') literally, without decoding them.
2)
Because of the 'Phonetic Properties' items in Japanese Excel files,
Spreadsheet::XLSX misaligns the indices of items in the shared string table.
Phonetic items represent pronunciation hints for some East Asian languages.
In the file 'xl/sharedStrings.xml', the phonetic properties look like this:
<si>
<t>(a japanese text in KANJI)</t>
<rPh sb="0" eb="1">
<t>(its pronunciation in KATAKANA)</t>
</rPh>
</si>
Then the loop in Spreadsheet::XLSX::new(),
foreach my $t ($mstr =~ /<t.*?>(.*?)<\/t/gsm),
wrongly extracts the phonetic items as normal string items,
because it searches only for '<t>' tags.
This problem is not a special case: it may occur in many XLSX files
created by the Japanese version of Excel, because phonetic properties
are inserted automatically by Excel (and the IME).
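The misalignment can be reproduced outside the module with a few lines
of Perl. This is only a minimal sketch: the sample XML string and the
variable names @naive, $stripped and @fixed are made up for illustration;
only the two regular expressions correspond to the ones discussed in
this report.

```perl
use strict;
use warnings;

# Hypothetical sample of xl/sharedStrings.xml content: two shared strings,
# the first of which carries a phonetic run (<rPh>).
my $mstr = join '',
    '<sst>',
    '<si><t>KANJI</t><rPh sb="0" eb="1"><t>KATAKANA</t></rPh></si>',
    '<si><t>second</t></si>',
    '</sst>';

# Extraction as quoted above: every <t> matches, including the one
# nested inside <rPh>, so the list gets one entry too many.
my @naive = $mstr =~ /<t.*?>(.*?)<\/t/gsm;

# Removing the phonetic runs first leaves one <t> per <si> in this sample.
(my $stripped = $mstr) =~ s{<rPh.*?>.*?</rPh>}{}gsm;
my @fixed = $stripped =~ /<t.*?>(.*?)<\/t/gsm;

print "naive: @naive\n";
print "fixed: @fixed\n";
```

With the naive extraction, index 1 holds the phonetic text instead of
the second shared string, so every cell that refers to a later index
receives the wrong value.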
* For details of the OOXML file formats, see:
http://www.ecma-international.org/publications/standards/Ecma-376.htm
The section '1st edition Part 4' contains the markup language reference.
According to the reference, only '<rPh>' tags can cause this problem.
Therefore, I wrote a simple patch that fixes both bugs.
Note that I think it is acceptable to simply ignore such phonetic items
in your simple implementation.
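For problem 1), the idea is just to decode entity references in each
extracted string. The sketch below illustrates that idea without any
module dependency; it is not what the patch does (the patch calls
CGI::unescapeHTML). The helper name decode_xml_entities is hypothetical,
and it handles only the five predefined XML entities, not numeric
character references.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Spreadsheet::XLSX): decodes only the
# five predefined XML entities. '&amp;' is handled last so that the input
# '&amp;lt;' becomes the literal text '&lt;' rather than '<'.
sub decode_xml_entities {
    my ($s) = @_;
    $s =~ s/&lt;/</g;
    $s =~ s/&gt;/>/g;
    $s =~ s/&quot;/"/g;
    $s =~ s/&apos;/'/g;
    $s =~ s/&amp;/&/g;
    return $s;
}

# A shared string as it appears in the XML, before decoding.
my $raw = 'R&amp;D &lt;draft&gt;';
print decode_xml_entities($raw), "\n";   # prints: R&D <draft>
```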
===============
--- XLSX.pm.orig 2009-01-26 16:02:19.000000000 +0900
+++ XLSX.pm 2009-02-13 01:52:19.000000000 +0900
@@ -12,6 +12,7 @@
use Spreadsheet::XLSX::Fmt2007;
use Data::Dumper;
use Spreadsheet::ParseExcel;
+use CGI;
################################################################################
@@ -31,9 +32,11 @@
my $mstr = $member_shared_strings->contents;
$mstr =~ s/<t\/>/<t><\/t>/gsm; # this handles an empty t tag in the xml <t/>
+ $mstr =~ s%<rPh.*?>(.*?)</rPh>%%gsm; # ignores phonetic properties
#foreach my $t ($member_shared_strings -> contents =~ /t\>([^\<]*)\<\/t/gsm) {
foreach my $t ($mstr =~ /<t.*?>(.*?)<\/t/gsm) {
+ $t = CGI::unescapeHTML($t);
$t = $converter -> convert ($t) if $converter;
push @shared_strings, $t;
===============
Regards,
//----
Kazumasa Kotani
e-mail: okina@is.s.u-tokyo.ac.jp