Subject: | Unicode text prints text on top of text before it |
Here's a demonstration of the bug: the test script (bug-demo.pl)
produces a PDF file with the Polish text (part of an address) "Centrum
Uslug Ksiegowych", with diacritics on the "l" (l slash) and on the "e"
(e ogonek) in the third word. See bug-demo.release.pdf for the PDF
produced by the release version of PDF::API2 (0.71.001), and
bug-demo.patched.pdf for the PDF produced after my patch is applied.
#!/usr/bin/perl -w
use PDF::API2;
use strict;

gen_pdf("$0.pdf");

sub gen_pdf {
    my($save_as) = @_;
    my $api = PDF::API2->new();
    my $uf = unifont($api, 'Times', 1);
    $api->mediabox(595,842);
    my $page = $api->page;
    my $text = $page->text;
    $text->font( $uf, 18 );
    $text->translate( 190, 400 );
    $text->paragraph("Centrum Us\x{0142}ug Ksi\x{0119}gowych", 220, 25);
    $api->saveas($save_as);
    $api->end;
}

sub unifont {
    my($api, $fontname, @blk) = @_;
    return $api->unifont(
        $api->corefont($fontname, -encode=>'latin1'),
        map([ $api->corefont($fontname, -encode=>"uni$_"), [$_] ],
            @blk ),
        -encode => 'latin1'
    );
}
The patch (PDF-API2-Resource-Font.pm.patch) adds one line to the file
PDF/API2/Resource/Font.pm:
    $data->{firstchar} = 0;
This sets the value to zero when $encoding matches /^uni\d+$/.
Alternatively, you can simply replace the existing module file with the
one I attached (PDF-API2-Resource-Font.pm.tar.gz), which is for
PDF::API2 0.71.001.
Some background:
PDF::API2::Resource::UniFont uses a faked font for character sets with
more than 256 characters (actually 224, when ignoring control
characters). It works by mapping each block of 256 code points in
Unicode ("block", "page", "plane") to a single-byte font that contains
just the characters for this block. For example, the Unicode range
0x100 to 0x1FF is remapped to the single-byte range 0x00 to 0xFF, in the
pseudo-font associated with block 1.
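
To make the remapping concrete, here is a minimal sketch of the arithmetic as I understand it. The helper split_codepoint is hypothetical and not part of PDF::API2; it only assumes that the block number is the code point divided by 256 and the single-byte code is the remainder, which matches the e2u table in the patch context ($blk*256+$_).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not a PDF::API2 function): split a Unicode code
# point into the block number and the byte inside that block's
# single-byte pseudo-font.
sub split_codepoint {
    my ($cp) = @_;
    return ( $cp >> 8, $cp & 0xFF );
}

# First, middle and last code point of block 1.
for my $cp ( 0x100, 0x142, 0x1FF ) {
    my ( $block, $byte ) = split_codepoint($cp);
    printf "U+%04X -> block %d, byte 0x%02X\n", $cp, $block, $byte;
}
```

All three code points land in block 1, at bytes 0x00, 0x42 and 0xFF respectively.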
The problem is that for the first 32 characters in each of these blocks,
the print width is not stored, and as a result, the PDF rendering engine
treats the widths of these characters as zero. That is the case for the
"e" ("e ogonek"), which is chr(281) in Unicode and gets remapped to
chr(25) in the single-byte font, and which (as 25 < 32) gets a zero
width. That's why the following "g" is printed on top of it.
The "l" ("l slash") is chr(322) and gets remapped to chr(66), so it
behaves normally, as it has its proper width stored.
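
The same check can be run over the whole demo string. This is only a sketch: remapped_chars is a hypothetical helper, and the "byte < 32" test is my reading of the behaviour described above, not code taken from PDF::API2.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: for every character outside Latin-1, return
# [ code point, remapped byte, 1 if the byte falls below 32 else 0 ].
sub remapped_chars {
    my ($str) = @_;
    return map {
        my $cp   = ord($_);
        my $byte = $cp & 0xFF;
        [ $cp, $byte, $byte < 32 ? 1 : 0 ];
    } grep { ord($_) > 0xFF } split //, $str;
}

for my $r ( remapped_chars("Centrum Us\x{0142}ug Ksi\x{0119}gowych") ) {
    my ( $cp, $byte, $affected ) = @$r;
    printf "U+%04X -> byte %d: %s\n", $cp, $byte,
        $affected ? "below 32, no width stored" : "width stored";
}
```

Only U+0119 (byte 25) is reported as affected; U+0142 (byte 66) is not.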
The patch simply tells PDF::API2 that for these remapped fonts, it
should treat *every* character code from 0 to 255 as a normal
character, instead of just the default limited range 32 to 255.
As a result, the *complete* character width table, with 256 entries,
is now stored in the PDF file. And that fixes it.
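
The effect on the width lookup can be modelled roughly as follows. This is a simplified sketch of how a PDF viewer resolves widths for a simple font (a width array starting at a first character code, with a default of 0 for codes outside it), not code from PDF::API2. The function glyph_width and the width value 500 are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified model: the width array covers codes starting at $firstchar;
# any code outside the array gets the missing width.
sub glyph_width {
    my ( $code, $firstchar, $widths, $missing ) = @_;
    my $idx = $code - $firstchar;
    return $missing if $idx < 0 || $idx > $#$widths;
    return $widths->[$idx];
}

my @full    = (500) x 256;          # placeholder widths, not real metrics
my @release = @full[ 32 .. 255 ];   # release version: table starts at 32
my @patched = @full;                # patched version: table starts at 0

printf "release: code 25 -> %d, code 66 -> %d\n",
    glyph_width( 25, 32, \@release, 0 ), glyph_width( 66, 32, \@release, 0 );
printf "patched: code 25 -> %d, code 66 -> %d\n",
    glyph_width( 25, 0, \@patched, 0 ), glyph_width( 66, 0, \@patched, 0 );
```

With the table starting at 32, code 25 falls outside it and gets width 0; with the table starting at 0, it gets its stored width like any other code.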
Subject: | bug-demo.release.pdf |
Message body not shown because it is not plain text.
Subject: | PDF-API2-Resource-Font.pm.tar.gz |
Message body not shown because it is not plain text.
Subject: | PDF-API2-Resource-Font.pm.patch |
--- old/PDF/API2/Resource/Font.pm Sat Mar 10 14:05:42 2007
+++ PDF/API2/Resource/Font.pm Fri Oct 31 13:48:28 2008
@@ -73,6 +73,7 @@
my $blk=$1;
$data->{e2u}=[ map { $blk*256+$_ } (0..255) ];
$data->{e2n}=[ map { nameByUni($_) || '.notdef' } @{$data->{e2u}} ];
+ $data->{firstchar} = 0;
}
elsif(defined $encoding)
{
Subject: | bug-demo.pl |
#!/usr/bin/perl -w
use PDF::API2;
use strict;

gen_pdf("$0.pdf");

sub gen_pdf {
    my($save_as) = @_;
    my $api = PDF::API2->new();
    my $uf = unifont($api, 'Times', 1);
    $api->mediabox(595,842);
    my $page = $api->page;
    my $text = $page->text;
    $text->font( $uf, 18 );
    $text->translate( 190, 400 );
    $text->paragraph("Centrum Us\x{0142}ug Ksi\x{0119}gowych", 220, 25);
    $api->saveas($save_as);
    $api->end;
}

sub unifont {
    my($api, $fontname, @blk) = @_;
    return $api->unifont(
        $api->corefont($fontname, -encode=>'latin1'),
        map([ $api->corefont($fontname, -encode=>"uni$_"), [$_] ], @blk ),
        -encode => 'latin1'
    );
}
Subject: | bug-demo.patched.pdf |
Message body not shown because it is not plain text.