Bug #123562 for Text-PDF: TTFont0 generates broken CMap (patch incl.)

Wed Nov 08 08:26:35 2017 TPLA [...] cpan.org - Ticket created

Subject:

TTFont0 generates broken CMap (patch incl.)

When the ToUnicode option is supplied, Text::PDF::TTFont0::new() will generate a /ToUnicode CMap object that is broken in a few ways. This results in PDF files whose text can't be searched, copied or extracted. Here's the broken output: 1 stream 2 /CIDInit /ProcSet findresource being 12 dict begin begincmap 3 /CIDSystemInfo << /Registry (BXCJIM+ArialRegular+0) /Ordering (XYZ) 4 /Supplement 0 >> def 5 /CMapName /BXCJIM+ArialRegular+0 def 6 1 begincodespacerange <0001> <0d57> endcodespacerange 7 <0001> <0001> <0000> 8 9 ... 10 11 <0d57> <0d57> <20b8> 12 endbfrange 13 endcmap CMapName currendict /CMap defineresource pop end endendstream Notes: Line 5: " /CMapType 2 def" missing Line 7: "NNN beginbfrange" missing Line 13: "endendstream" missing newline between "end" and "endstream" Best regards, Thomas --- c:/usr/local/perl5/site/lib/Text/PDF/TTFont0.pm 2017-11-08 13:11:45.729893000 +0100 +++ c:/temp/Text-PDF-0.31/lib/Text/PDF/TTFont0.pm 2016-08-04 18:49:53.000000000 +0200 @@ -109,11 +109,9 @@ $unistr = '/CIDInit /ProcSet findresource being 12 dict begin begincmap /CIDSystemInfo << /Registry (' . $self->{'BaseFont'}->val . '+0) /Ordering (XYZ) /Supplement 0 >> def -/CMapName /' . $self->{'BaseFont'}->val . '+0 def /CMapType 2 def +/CMapName /' . $self->{'BaseFont'}->val . '+0 def 1 begincodespacerange <'; $unistr .= sprintf("%04x> <%04x> endcodespacerange\n", 1, $num - 1); - $unistr .= $num - $i > 100 ? 100 : $num - $i; - $unistr .= " beginbfrange\n"; for ($i = 1; $i < $num; $i++) { if ($i % 100 == 0) @@ -124,7 +122,7 @@ } $unistr .= sprintf("<%04x> <%04x> <%04x>\n", $i, $i, $rev[$i]); } - $unistr .= "endbfrange\nendcmap CMapName currendict /CMap defineresource pop end end\n"; + $unistr .= "endbfrange\nendcmap CMapName currendict /CMap defineresource pop end end"; $touni = PDFDict(); $parent->new_obj($touni); $touni->{' stream'} = $unistr;

Wed Nov 08 08:34:51 2017 TPLA [...] cpan.org - Correspondence added

diff above is wrong headed. --- c:/temp/Text-PDF-0.31/lib/Text/PDF/TTFont0.pm 2016-08-04 18:49:53.000000000 +0200 +++ c:/usr/local/perl5/site/lib/Text/PDF/TTFont0.pm 2017-11-08 13:11:45.729893000 +0100 @@ -109,9 +109,11 @@ $unistr = '/CIDInit /ProcSet findresource being 12 dict begin begincmap /CIDSystemInfo << /Registry (' . $self->{'BaseFont'}->val . '+0) /Ordering (XYZ) /Supplement 0 >> def -/CMapName /' . $self->{'BaseFont'}->val . '+0 def +/CMapName /' . $self->{'BaseFont'}->val . '+0 def /CMapType 2 def 1 begincodespacerange <'; $unistr .= sprintf("%04x> <%04x> endcodespacerange\n", 1, $num - 1); + $unistr .= $num - $i > 100 ? 100 : $num - $i; + $unistr .= " beginbfrange\n"; for ($i = 1; $i < $num; $i++) { if ($i % 100 == 0) @@ -122,7 +124,7 @@ } $unistr .= sprintf("<%04x> <%04x> <%04x>\n", $i, $i, $rev[$i]); } - $unistr .= "endbfrange\nendcmap CMapName currendict /CMap defineresource pop end end"; + $unistr .= "endbfrange\nendcmap CMapName currendict /CMap defineresource pop end end\n"; $touni = PDFDict(); $parent->new_obj($touni); $touni->{' stream'} = $unistr;

Wed Jul 17 02:23:59 2019 futuramedium [...] yandex.ru - Correspondence added

I came here looking to fix PDF::Reuse (see https://rt.cpan.org/Public/Bug/Display.html?id=123564). Apparently PDF viewers differ in how strictly they follow the spec w.r.t. text extraction, some are OK without patches, for some the above patch works. But it still is not good enough for Adobe Reader, either because of missed typo ('being') or off-by-one errors in 1st and last ranges. Patch below fixes that, it also makes things easier by making "codespacerange" all-inclusive (OK for Adobe - OK for everyone, see example in PDF Reference), replaces "bfrange" with "bfchar" (less verbose), and skips glyph-id's for which Unicode is not defined. These should have been "<FFFD>" instead of "<0000>" to begin with, but because they would be useless they better be omitted to reduce bloat. Ideally, the "ToUnicode" should be created at the time font is being subset, but that fix may be in future plans.

Subject:

TTFont0.pm.diff.txt

--- Text\PDF\TTFont0.pm.old Thu Aug 4 19:49:53 2016 +++ Text\PDF\TTFont0.pm Wed Jul 17 08:25:04 2019 @@ -106,23 +106,29 @@ if ($opt{'ToUnicode'}) { @rev = $font->{'cmap'}->read->reverse; - $unistr = '/CIDInit /ProcSet findresource being 12 dict begin begincmap + my $num = grep defined, @rev; + $unistr = '/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo << /Registry (' . $self->{'BaseFont'}->val . '+0) /Ordering (XYZ) /Supplement 0 >> def -/CMapName /' . $self->{'BaseFont'}->val . '+0 def -1 begincodespacerange <'; - $unistr .= sprintf("%04x> <%04x> endcodespacerange\n", 1, $num - 1); - for ($i = 1; $i < $num; $i++) +/CMapName /' . $self->{'BaseFont'}->val . '+0 def /CMapType 2 def +1 begincodespacerange <0000> <FFFF> endcodespacerange'."\n"; + for (my $i = my $j = 0; $i < @rev; $i++) { - if ($i % 100 == 0) + next unless defined $rev[$i]; + my $s = $num - $j > 100 ? 100 : $num - $j; + if ($j == 0) { - $unistr .= "endbfrange\n"; - $unistr .= $num - $i > 100 ? 100 : $num - $i; - $unistr .= " beginbfrange\n"; + $unistr .= "$s beginbfchar\n"; } - $unistr .= sprintf("<%04x> <%04x> <%04x>\n", $i, $i, $rev[$i]); + elsif ($j % 100 == 0) + { + $unistr .= "endbfchar\n"; + $unistr .= "$s beginbfchar\n"; + } + $unistr .= sprintf("<%04x> <%04x>\n", $i, $rev[$i]); + $j++; } - $unistr .= "endbfrange\nendcmap CMapName currendict /CMap defineresource pop end end"; + $unistr .= "endbfchar\nendcmap CMapName currendict /CMap defineresource pop end end\n"; $touni = PDFDict(); $parent->new_obj($touni); $touni->{' stream'} = $unistr;

Wed Jul 17 02:24:00 2019 The RT System itself - Status changed from 'new' to 'open'