Bug #84549 for Unicode-LineBreak: String trimming

Wed Apr 10 07:05:40 2013 MARKOV [...] cpan.org - Ticket created

Subject: String trimming

Hi Nezumi,

I really like your complex module ;-)

What I have is a sprintf() extension in Log::Report::Message. I will base its %s printing on Unicode::GCString. It is not difficult to support "%10s", however how do I implement "%.10s" fast?

A slow way would be:

    chop $string while Unicode::GCString->new($string)->columns > 10;

Better (untested as well):

    my $r = Unicode::GCString->new($string);
    $r->substr(-1) while $r->columns > 10;
    "$r";

Do you have a fast solution for me which implements it in 1 call? Something like:

    $r->trim(10);

Wed Apr 10 23:20:31 2013 hatuka [...] nezumi.nu - Correspondence added

Hi Mark,

A straightforward way may be:

    sub trim {
        my $self = shift;
        my $len = shift;

        my $ret = '';
        foreach my $gc (@$self) {
            last if $len < ($ret . $gc)->columns;
            $ret .= $gc;
        }
        return "$ret";
    }

'@', '.' and '""' are overloaded operators. It may be a bit faster. That's all I can come up with at present.

Thanks,
--- nezumi

Wed Apr 10 23:20:31 2013 The RT System itself - Status changed from 'new' to 'open'

Thu Apr 11 03:52:59 2013 secretaris [...] nluug.nl - Correspondence added

Subject:	Re: [rt.cpan.org #84549] String trimming
Date:	Thu, 11 Apr 2013 09:52:38 +0200
To:	Hatuka*nezumi - IKEDA Soji via RT <bug-Unicode-LineBreak [...] rt.cpan.org>
From:	Mark Overmeer <secretaris [...] nluug.nl>

It looks to me like an algorithm close to line-folding, but only its first step, because there is no need for word boundaries.

I have the impression that the above algorithm could be quite slow, especially when strings get large. On websites, you often see the first few lines of an article, say 500 columns worth. So, the function I suggest could be more useful than just my implementation of %10.10s.

Am I right to suspect 'columns' to be quite expensive, O(n)? Efficient would be something along the lines of:

    sub trim($) {
        my ($self, $max) = @_;
        my $pos = $self->char_on($max);
        $self->substr($pos) = '' if defined $pos;
        $self;
    }

    # probably off-by-one errors :( Do we start counting column by 0 or 1?
    # return undef when outside string.
    sub char_on($) {
        my ($self, $col) = @_;

        my $taken_col = 0;
        my $pos;
        for($pos = 0; $pos <= $#{$self} && $taken_col < $col; $pos++)
        {   $taken_col += $self->[$pos]->columns;
        }
        $taken_col < $col ? undef : $pos;
    }

It seems to me, this is much more efficient in C than via Perl's tied interface. That's why I put it on the wishlist of your nice module.
--
Regards, MarkOv

------------------------------------------------------------------------
Mark Overmeer MSc    MARKOV Solutions
Mark@Overmeer.net    solutions@overmeer.net
http://Mark.Overmeer.net    http://solutions.overmeer.net
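For readers following the thread, the running-total idea discussed here can be sketched in plain Perl without the module. This is only an illustration under stated assumptions: `cluster_columns` and `trim_columns` are hypothetical names, and the width rule (2 columns for East Asian Wide or Fullwidth, 1 otherwise) is a simplification, not the rule set Unicode::GCString applies.

```perl
use strict;
use warnings;

# ASSUMPTION: a deliberately simplified width model, not the module's rules.
# A grapheme cluster counts as 2 columns when its first character is
# East Asian Wide or Fullwidth, and as 1 column otherwise.
sub cluster_columns {
    my ($cluster) = @_;
    return $cluster =~ /\A[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/
        ? 2
        : 1;
}

# Hypothetical helper: longest prefix of $string that fits in $max columns.
# Each cluster is measured once and added to a running total, so the prefix
# is never re-measured and the loop stops as soon as the limit is reached.
sub trim_columns {
    my ($string, $max) = @_;
    my ($taken, $out) = (0, '');
    while ($string =~ /\G(\X)/g) {
        my $cluster = $1;
        my $width   = cluster_columns($cluster);
        last if $taken + $width > $max;
        $taken += $width;
        $out   .= $cluster;
    }
    return $out;
}

print trim_columns("Hello, world", 5), "\n";    # prints "Hello"
```

The cost depends on the number of clusters kept rather than on the length of the whole string, which is the property the "first 500 columns of an article" use case needs.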

Thu Apr 11 08:36:38 2013 hatuka [...] nezumi.nu - Correspondence added

How about the next() method? It seems approximately O(1) per call.

    sub trim {
        my $self = shift;
        my $len = shift;
        return '' if $len <= 0;

        # remember the current iterator position, then rewind
        my $pos = $self->pos;
        $self->pos(0);

        my $cols = 0;
        my $gc;
        while (defined($gc = $self->next)) {
            if ($len < ($cols += $gc->columns)) {
                my $ret = $self->substr(0, $self->pos - 1);
                $self->pos($pos);
                return $ret;
            }
        }

        # the whole string fits: restore the position and return it unchanged
        $self->pos($pos);
        return $self;
    }

Thu Apr 11 10:09:00 2013 secretaris [...] nluug.nl - Correspondence added

Subject:	Re: [rt.cpan.org #84549] String trimming
Date:	Thu, 11 Apr 2013 16:08:09 +0200
To:	Hatuka*nezumi - IKEDA Soji via RT <bug-Unicode-LineBreak [...] rt.cpan.org>
From:	Mark Overmeer <secretaris [...] nluug.nl>

That's the iterative version of my last example. You do not think that a pure XS implementation for this is useful?
--
Regards, MarkOv

Thu Apr 11 21:57:55 2013 hatuka [...] nezumi.nu - Correspondence added

Subject:	Re: [rt.cpan.org #84549] String trimming
Date:	Fri, 12 Apr 2013 10:57:21 +0900
To:	bug-Unicode-LineBreak [...] rt.cpan.org
From:	Hatuka*nezumi - IKEDA Soji <hatuka [...] nezumi.nu>

On Thu, 11 Apr 2013 10:09:01 -0400 "Mark Overmeer via RT" <bug-Unicode-LineBreak@rt.cpan.org> wrote:

> That's the iterative version of my last example.

Yes, and it doesn't extract the GCString into an array.

> You do not think that a pure XS implementation for this is useful?

I had thought of a similar thing. I believe the XS version should have a slightly more generalized function: it should support left-side trim along with right-side trim, and so on.

Since it will not be in time for the next stable release (this May to June), it would be better to use the Perl version for the present.

Regards,
--- nezumi

Fri Apr 12 02:17:05 2013 secretaris [...] nluug.nl - Correspondence added

Subject:	Re: [rt.cpan.org #84549] String trimming
Date:	Fri, 12 Apr 2013 08:16:27 +0200
To:	Hatuka*nezumi - IKEDA Soji via RT <bug-Unicode-LineBreak [...] rt.cpan.org>
From:	Mark Overmeer <secretaris [...] nluug.nl>

* Hatuka*nezumi - IKEDA Soji via RT (bug-Unicode-LineBreak@rt.cpan.org) [130412 01:58]:

> > You do not think that a pure XS implementation for this is useful?

> I had thought of a similar thing. I believe the XS version should have a
> slightly more generalized function: it should support left-side
> trim along with right-side trim, and so on.

Start with the generic "char_on_column()". Then any kind of trim is easy.

> Since it will not be in time for the next stable release (this May to
> June), it would be better to use the Perl version for the present.

Sure. Thanks!
--
MarkOv
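The closing suggestion, one generic column search from which both trims follow, might look roughly like the sketch below. It is plain Perl and does not use the module: `clusters_within`, `trim_right` and `trim_left` are hypothetical names, and the width rule is the same simplified assumption (2 columns for East Asian Wide or Fullwidth, 1 otherwise) rather than the module's own.

```perl
use strict;
use warnings;

# ASSUMPTION: simplified width model, not the rules used by Unicode::GCString.
sub cluster_columns {
    my ($cluster) = @_;
    return $cluster =~ /\A[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/
        ? 2
        : 1;
}

# Hypothetical stand-in for the "char_on_column()" idea: given an array ref of
# grapheme clusters, return how many leading clusters fit in $max columns.
sub clusters_within {
    my ($clusters, $max) = @_;
    my ($taken, $count) = (0, 0);
    for my $cluster (@$clusters) {
        my $width = cluster_columns($cluster);
        last if $taken + $width > $max;
        $taken += $width;
        $count++;
    }
    return $count;
}

# Right-side trim: drop clusters from the end, keep the leading ones that fit.
sub trim_right {
    my ($string, $max) = @_;
    my @clusters = $string =~ /\X/g;
    my $keep = clusters_within(\@clusters, $max);
    return join '', @clusters[0 .. $keep - 1];
}

# Left-side trim: the same search run over the clusters in reverse order,
# so the trailing clusters that fit are kept.
sub trim_left {
    my ($string, $max) = @_;
    my @clusters = $string =~ /\X/g;
    my $keep = clusters_within([reverse @clusters], $max);
    return join '', @clusters[scalar(@clusters) - $keep .. $#clusters];
}

print trim_right("abcdef", 2), " ", trim_left("abcdef", 2), "\n";    # prints "ab ef"
```

Both trims share the single counting routine, so any change to the width rule only has to be made in one place.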