Skip Menu |

This queue is for tickets about the Text-PDF CPAN distribution.

Report information
The Basics
Id: 123562
Status: open
Priority: 0/
Queue: Text-PDF

People
Owner: Nobody in particular
Requestors: TPLA [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: TTFont0 generates broken CMap (patch incl.)
When the ToUnicode option is supplied, Text::PDF::TTFont0::new() will generate a /ToUnicode CMap object that is broken in a few ways. This results in PDF files whose text can't be searched, copied or extracted. Here's the broken output: 1 stream 2 /CIDInit /ProcSet findresource being 12 dict begin begincmap 3 /CIDSystemInfo << /Registry (BXCJIM+ArialRegular+0) /Ordering (XYZ) 4 /Supplement 0 >> def 5 /CMapName /BXCJIM+ArialRegular+0 def 6 1 begincodespacerange <0001> <0d57> endcodespacerange 7 <0001> <0001> <0000> 8 9 ... 10 11 <0d57> <0d57> <20b8> 12 endbfrange 13 endcmap CMapName currendict /CMap defineresource pop end endendstream Notes: Line 5: " /CMapType 2 def" missing Line 7: "NNN beginbfrange" missing Line 13: "endendstream" missing newline between "end" and "endstream" Best regards, Thomas --- c:/usr/local/perl5/site/lib/Text/PDF/TTFont0.pm 2017-11-08 13:11:45.729893000 +0100 +++ c:/temp/Text-PDF-0.31/lib/Text/PDF/TTFont0.pm 2016-08-04 18:49:53.000000000 +0200 @@ -109,11 +109,9 @@ $unistr = '/CIDInit /ProcSet findresource being 12 dict begin begincmap /CIDSystemInfo << /Registry (' . $self->{'BaseFont'}->val . '+0) /Ordering (XYZ) /Supplement 0 >> def -/CMapName /' . $self->{'BaseFont'}->val . '+0 def /CMapType 2 def +/CMapName /' . $self->{'BaseFont'}->val . '+0 def 1 begincodespacerange <'; $unistr .= sprintf("%04x> <%04x> endcodespacerange\n", 1, $num - 1); - $unistr .= $num - $i > 100 ? 100 : $num - $i; - $unistr .= " beginbfrange\n"; for ($i = 1; $i < $num; $i++) { if ($i % 100 == 0) @@ -124,7 +122,7 @@ } $unistr .= sprintf("<%04x> <%04x> <%04x>\n", $i, $i, $rev[$i]); } - $unistr .= "endbfrange\nendcmap CMapName currendict /CMap defineresource pop end end\n"; + $unistr .= "endbfrange\nendcmap CMapName currendict /CMap defineresource pop end end"; $touni = PDFDict(); $parent->new_obj($touni); $touni->{' stream'} = $unistr;
diff above is wrong headed. --- c:/temp/Text-PDF-0.31/lib/Text/PDF/TTFont0.pm 2016-08-04 18:49:53.000000000 +0200 +++ c:/usr/local/perl5/site/lib/Text/PDF/TTFont0.pm 2017-11-08 13:11:45.729893000 +0100 @@ -109,9 +109,11 @@ $unistr = '/CIDInit /ProcSet findresource being 12 dict begin begincmap /CIDSystemInfo << /Registry (' . $self->{'BaseFont'}->val . '+0) /Ordering (XYZ) /Supplement 0 >> def -/CMapName /' . $self->{'BaseFont'}->val . '+0 def +/CMapName /' . $self->{'BaseFont'}->val . '+0 def /CMapType 2 def 1 begincodespacerange <'; $unistr .= sprintf("%04x> <%04x> endcodespacerange\n", 1, $num - 1); + $unistr .= $num - $i > 100 ? 100 : $num - $i; + $unistr .= " beginbfrange\n"; for ($i = 1; $i < $num; $i++) { if ($i % 100 == 0) @@ -122,7 +124,7 @@ } $unistr .= sprintf("<%04x> <%04x> <%04x>\n", $i, $i, $rev[$i]); } - $unistr .= "endbfrange\nendcmap CMapName currendict /CMap defineresource pop end end"; + $unistr .= "endbfrange\nendcmap CMapName currendict /CMap defineresource pop end end\n"; $touni = PDFDict(); $parent->new_obj($touni); $touni->{' stream'} = $unistr;
I came here looking to fix PDF::Reuse (see https://rt.cpan.org/Public/Bug/Display.html?id=123564). Apparently PDF viewers differ in how strictly they follow the spec w.r.t. text extraction, some are OK without patches, for some the above patch works. But it still is not good enough for Adobe Reader, either because of missed typo ('being') or off-by-one errors in 1st and last ranges. Patch below fixes that, it also makes things easier by making "codespacerange" all-inclusive (OK for Adobe - OK for everyone, see example in PDF Reference), replaces "bfrange" with "bfchar" (less verbose), and skips glyph-id's for which Unicode is not defined. These should have been "<FFFD>" instead of "<0000>" to begin with, but because they would be useless they better be omitted to reduce bloat. Ideally, the "ToUnicode" should be created at the time font is being subset, but that fix may be in future plans.
Subject: TTFont0.pm.diff.txt
--- Text\PDF\TTFont0.pm.old Thu Aug 4 19:49:53 2016 +++ Text\PDF\TTFont0.pm Wed Jul 17 08:25:04 2019 @@ -106,23 +106,29 @@ if ($opt{'ToUnicode'}) { @rev = $font->{'cmap'}->read->reverse; - $unistr = '/CIDInit /ProcSet findresource being 12 dict begin begincmap + my $num = grep defined, @rev; + $unistr = '/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo << /Registry (' . $self->{'BaseFont'}->val . '+0) /Ordering (XYZ) /Supplement 0 >> def -/CMapName /' . $self->{'BaseFont'}->val . '+0 def -1 begincodespacerange <'; - $unistr .= sprintf("%04x> <%04x> endcodespacerange\n", 1, $num - 1); - for ($i = 1; $i < $num; $i++) +/CMapName /' . $self->{'BaseFont'}->val . '+0 def /CMapType 2 def +1 begincodespacerange <0000> <FFFF> endcodespacerange'."\n"; + for (my $i = my $j = 0; $i < @rev; $i++) { - if ($i % 100 == 0) + next unless defined $rev[$i]; + my $s = $num - $j > 100 ? 100 : $num - $j; + if ($j == 0) { - $unistr .= "endbfrange\n"; - $unistr .= $num - $i > 100 ? 100 : $num - $i; - $unistr .= " beginbfrange\n"; + $unistr .= "$s beginbfchar\n"; } - $unistr .= sprintf("<%04x> <%04x> <%04x>\n", $i, $i, $rev[$i]); + elsif ($j % 100 == 0) + { + $unistr .= "endbfchar\n"; + $unistr .= "$s beginbfchar\n"; + } + $unistr .= sprintf("<%04x> <%04x>\n", $i, $rev[$i]); + $j++; } - $unistr .= "endbfrange\nendcmap CMapName currendict /CMap defineresource pop end end"; + $unistr .= "endbfchar\nendcmap CMapName currendict /CMap defineresource pop end end\n"; $touni = PDFDict(); $parent->new_obj($touni); $touni->{' stream'} = $unistr;