Bug #98548 for PDF-API2: hooks for line-splitting

Tue Sep 02 12:15:31 2014 philperry [...] hvc.rr.com - Ticket created

Subject:	hooks for line-splitting
Date:	Tue, 02 Sep 2014 12:13:47 -0400
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Phil M Perry <philperry [...] hvc.rr.com>

PDF::API2 v2.022
Perl 5.16.3
Windows 7
Severity: Wishlist

Content.pm's text_fill_*() methods can currently split a line only at a space (x20) character. It would be good to be able to hyphenate words properly, to better fill a line. Splitting is easy enough at camelCase transitions, at internal non-letters (hard hyphens, digits, punctuation), and at soft hyphens (&shy;). Splitting complete words properly is fairly involved, and different languages have different rules. I think the first three cases could be implemented in the text_fill_*() methods themselves, but we might have to pass control to a user-supplied routine for splitting complete words.
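A minimal sketch of the splitting cases described in this request, for illustration only. The function name split_points and its optional hook argument are hypothetical and are not part of PDF::API2. The sketch assumes the word is already a decoded Perl character string, and of the "internal non-letter" cases it handles only the hard hyphen.

```perl
use strict;
use warnings;

# Hypothetical helper (not a PDF::API2 method): return the fragments at
# whose boundaries a word may be broken across lines.
# $word is assumed to be a decoded character string, not raw bytes.
# $hook is an optional user-supplied code ref for whole-word hyphenation.
sub split_points {
    my ($word, $hook) = @_;

    # Built-in cases: at a soft hyphen (U+00AD, which is consumed),
    # after a hard hyphen, and at a lower-to-upper camelCase transition.
    my @parts = split /\x{AD}|(?<=-)(?=.)|(?<=[a-z])(?=[A-Z])/, $word;
    return @parts if @parts > 1;

    # No built-in break found: defer to the user's routine, if any.
    return $hook ? $hook->($word) : ($word);
}

print join('|', split_points("well-known")), "\n";        # well-|known
print join('|', split_points("co\x{AD}operate")), "\n";   # co|operate
print join('|', split_points("camelCase")), "\n";         # camel|Case

# A stand-in for a language-aware hyphenator supplied by the caller.
my $hook = sub { return ("hy", "phen", "ation") };
print join('|', split_points("hyphenation", $hook)), "\n"; # hy|phen|ation
```

A caller breaking at a soft hyphen would still need to emit a visible hyphen at the end of the line. This sketch drops that information, so a fuller version would have to report which kind of break each boundary is.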

Mon May 04 00:09:59 2015 steve [...] deefs.net - Correspondence added

econtrario contributed a patch to implement part of this a few months ago:
https://bitbucket.org/ssimms/pdfapi2/pull-request/2/_text_fill_line-with-space-hyphen-and-soft/diff

It needs some tests to be added.

Mon May 04 00:10:00 2015 The RT System itself - Status changed from 'new' to 'open'

Mon May 04 16:10:53 2015 philperry [...] hvc.rr.com - Correspondence added

Subject:	Re: [rt.cpan.org #98548] hooks for line-splitting
Date:	Mon, 4 May 2015 16:10:44 -0400
To:	bug-PDF-API2 [...] rt.cpan.org
From:	<philperry [...] hvc.rr.com>

Steve,

I'm a bit concerned that the new code is using a hard-coded single-byte encoding for SHY and (?) xC2. xAD is SHY in Latin-1, but xC2 appears to be A with circumflex (Â), so I'm not sure what encoding this is. At any rate, before committing any new non-ASCII character handling code, I think we should decide how we want to handle various encodings. Splitting words will require knowing whether we're in the middle of a single multibyte character.

-- Phil

P.S. Have you given any thought to the other items I opened 8 months ago? They're still tagged as 'new'.

---- Steve Simms via RT <bug-PDF-API2@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=98548 >
>
> econtrario contributed a patch to implement part of this a few months ago:
> https://bitbucket.org/ssimms/pdfapi2/pull-request/2/_text_fill_line-with-space-hyphen-and-soft/diff
>
> It needs some tests to be added.

Sun Jan 24 16:41:01 2016 philperry [...] hvc.rr.com - Correspondence added

Subject:	[rt.cpan.org #98548]
Date:	Sun, 24 Jan 2016 16:40:58 -0500
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Phil M Perry <philperry [...] hvc.rr.com>

Ah, I see I misread the code. It's apparently not two Latin-1 characters xC2 and xAD (one of which is SHY), but the UTF-8 representation of a single SHY. A few comment lines in the code would have helped.

Anyway, we still have the issue of what character encoding we're working in -- can we count on one particular encoding, or should we be able to handle a variety of encodings? Many, if not all, of the font sets supplied for PDF appear to be in something close to Windows-1252 (more or less Latin-1), so can we even work with UTF-8 text? Before we embark on changes hard-coded for one encoding or another, let's be clear about which character encodings are even possible to use.

To add a comment to this thread, just email bug-PDF-API2 [at] rt.cpan.org with the subject line [rt.cpan.org #98548]. Note the single space between org and #, and the [ ] around the whole subject. Nothing else. If you don't follow this format carefully, you will end up creating a new bug report! HTML formatting within the body of the comment does not work.
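The byte values discussed in this message can be checked with Perl's core Encode module. The following is only an illustrative sketch, with variable names chosen for this example. It shows that SHY (U+00AD) is the single byte xAD in Latin-1 but the two bytes xC2 xAD in UTF-8, and that reading those UTF-8 bytes as Latin-1 yields two characters, which is the misreading described above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# SOFT HYPHEN (U+00AD) as a Perl character string
my $shy = "\x{AD}";

my $utf8   = encode('UTF-8', $shy);        # two bytes: C2 AD
my $latin1 = encode('ISO-8859-1', $shy);   # one byte:  AD

# Render a byte string as space-separated hex pairs
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

print "UTF-8:   ", hex_bytes($utf8), "\n";     # UTF-8:   C2 AD
print "Latin-1: ", hex_bytes($latin1), "\n";   # Latin-1: AD

# Interpreting the UTF-8 bytes as Latin-1 gives two characters,
# U+00C2 (A with circumflex) followed by U+00AD (SHY).
my $misread = decode('ISO-8859-1', $utf8);
print "characters when misread as Latin-1: ", length($misread), "\n";  # 2
```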

Wed Feb 17 16:49:59 2016 steve [...] deefs.net - Correspondence added

Given encoding issues and the complications of implementing hyphenation rules for multiple languages, this is something that's better left to an add-on module.

Wed Feb 17 16:50:00 2016 steve [...] deefs.net - Status changed from 'open' to 'rejected'

Thu Feb 18 15:34:33 2016 philperry [...] hvc.rr.com - Correspondence added

Subject:	[rt.cpan.org #98548]
Date:	Thu, 18 Feb 2016 15:34:39 -0500
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Phil M Perry <philperry [...] hvc.rr.com>


> Given encoding issues and the complications of implementing
> hyphenation rules for multiple languages, this is something that's
> better left to an add-on module.

True, but should we think about building in some simple word-splitting scenarios? I would really like to split words after hyphens, but beyond that, it could get messy with non-ASCII characters. You don't want arbitrary (non-language-sensitive) word splitting between accented Latin characters and ASCII letters without being fully aware of the encoding used. You also don't want to end up accidentally splitting within a UTF-8 multibyte character. Em and en dashes, non-breaking spaces, soft hyphens, and space characters of various widths will depend on the encoding. ASCII characters and text are easy enough, but what to do about anything not ASCII?

Perhaps allow splitting only between ASCII characters (0xxxxxxx byte) for now? It should be safe for multibyte UTF-8 characters, as all bytes of a non-ASCII character start with a 1 bit (1xxxxxxx). I think we could safely break between ASCII characters at hyphens and other non-letters in the range x21..x7E, and at letters (letter to non-letter, or non-letter to letter transition, as well as lower-to-upper and upper-to-lower camelCase). Would that be useful? A dummy hook might be put in for future calling of user-supplied hyphenation routines for various encodings and languages, or we could just mark the spot in the code for now.

For English, at least, a minimum of two characters must be left on each line, and we must be careful not to split something like O'Mallory into O'-Mallory, or to think Ma is camelCase and split it as O'M-allory.

I'd sure like to get some other people participating in this discussion, to get some more viewpoints and algorithm experience. Perhaps we should just go ahead with starting the add-on module with the above simple algorithm, and flesh it out over time?
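A rough sketch of the ASCII-only rule proposed above, operating on an encoded byte string. The name ascii_break_offsets is hypothetical, and the sketch makes several simplifying assumptions: it implements only the break after a hard hyphen and the lower-to-upper camelCase break, it counts the two-character minimum in bytes rather than characters, and it refuses any break next to an apostrophe. Upper-to-lower transitions are deliberately left out, which avoids the O'M-allory case.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offsets at which $bytes (one encoded
# word) may be broken. A break is considered only where the bytes on both
# sides are ASCII, so a UTF-8 multibyte character is never split.
sub ascii_break_offsets {
    my ($bytes) = @_;
    my @offsets;

    # Keep at least two bytes on each side of the break (a byte-based
    # stand-in for the "two characters per line" rule).
    for my $i (2 .. length($bytes) - 2) {
        my $left  = substr($bytes, $i - 1, 1);
        my $right = substr($bytes, $i, 1);

        # Both neighbours must be ASCII (high bit clear).
        next if ord($left) > 0x7F or ord($right) > 0x7F;

        # Never break beside an apostrophe (O'Mallory stays whole).
        next if $left eq "'" or $right eq "'";

        push @offsets, $i
            if $left eq '-'                                # after a hyphen
            or ($left =~ /[a-z]/ and $right =~ /[A-Z]/);   # camelCase
    }
    return @offsets;
}

print join(',', ascii_break_offsets("well-known")), "\n";         # 5
print join(',', ascii_break_offsets("camelCase")), "\n";          # 5
print join(',', ascii_break_offsets("O'Mallory")), "\n";          # (empty line)
print join(',', ascii_break_offsets("caf\xC3\xA9-bar")), "\n";    # 6
```

The last call passes the UTF-8 bytes of "café-bar". The only offset returned is the one after the hyphen, and no offset falls between the two bytes of the accented letter.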