Bug #79561 for Sereal-Decoder: Chunked decoding support

RT for rt.cpan.org

Preferred bug tracker

Please visit the preferred bug tracker to report your issue.

This queue is for tickets about the Sereal-Decoder CPAN distribution.

Report information

The Basics

Id:	79561
Status:	open
Priority:	0/
Queue:	Sereal-Decoder

People

Owner:	Nobody in particular
Requestors:	dolmen [...] cpan.org
Cc:
AdminCc:

Bug Information

Severity:	Wishlist
Broken in:	(no value)
Fixed in:	(no value)

History Show all quoted text

Tue Sep 11 03:59:24 2012 dolmen [...] cpan.org - Ticket created

Subject:

Streaming support

For use in network protocol it would be helpful to have a chunked decoder. Requirements: - receive a buffer, consume data up to a single complete packet, and report if the packet is complete or not. - do not consume more that one packet - either: - consume (remove) the used data from the original buffer but keep the non used bytes - do not consume, but report the count of used bytes That could be useful to build an AnyEvent read_type (look for ::anyevent_read_type in AnyEvent::Handle). -- Olivier Mengué - http://perlresume.org/DOLMEN

Tue Sep 11 04:03:52 2012 demerphq [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #79561] Streaming support
Date:	Tue, 11 Sep 2012 10:03:43 +0200
To:	bug-Sereal-Decoder [...] rt.cpan.org
From:	demerphq <demerphq [...] gmail.com>

On 11 September 2012 09:59, Olivier Mengué via RT <bug-Sereal-Decoder@rt.cpan.org> wrote: Show quoted text

> Tue Sep 11 03:59:24 2012: Request 79561 was acted upon. > Transaction: Ticket created by DOLMEN > Queue: Sereal-Decoder > Subject: Streaming support > Broken in: (no value) > Severity: Wishlist > Owner: Nobody > Requestors: dolmen@cpan.org > Status: new > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=79561 > > > > > For use in network protocol it would be helpful to have a chunked decoder. > > Requirements: > - receive a buffer, consume data up to a single complete packet, and > report if the packet is complete or not. > - do not consume more that one packet > - either: > - consume (remove) the used data from the original buffer but keep > the non used bytes > - do not consume, but report the count of used bytes > > > That could be useful to build an AnyEvent read_type (look for > ::anyevent_read_type in AnyEvent::Handle).

Hi, thanks, interesting. We will think about this. Some support for this is built in, but not all of it. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

Tue Sep 11 04:03:53 2012 The RT System itself - Status changed from 'new' to 'open'

Tue Sep 11 13:18:09 2012 smueller [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #79561] Streaming support
Date:	Tue, 11 Sep 2012 19:17:53 +0200
To:	bug-Sereal-Decoder [...] rt.cpan.org
From:	Steffen Mueller <smueller [...] cpan.org>

On 09/11/2012 09:59 AM, Olivier Mengué via RT wrote: Show quoted text

> Tue Sep 11 03:59:24 2012: Request 79561 was acted upon. > Transaction: Ticket created by DOLMEN > Queue: Sereal-Decoder

Show quoted text

> For use in network protocol it would be helpful to have a chunked decoder. > > Requirements: > - receive a buffer, consume data up to a single complete packet, and > report if the packet is complete or not. > - do not consume more that one packet > - either: > - consume (remove) the used data from the original buffer but keep > the non used bytes > - do not consume, but report the count of used bytes > > > That could be useful to build an AnyEvent read_type (look for > ::anyevent_read_type in AnyEvent::Handle).

Interesting. Well, within a given Sereal packet, we will never achieve "proper" streaming. Our back-reference logic may requires being able to jump forward and backward within the string buffer that holds the Sereal packet. But what's possible (already now) is concatenating multiple Sereal packets and then while(){} ing over them. To wit from the docs (with a bug fix): my @out; my $pos = 0; eval { while (1) { push @out, $decoder->decode_with_offset($sereal_string, 0); $pos += $decoder->bytes_consumed; last if $pos >= length($sereal_string) or not $decoder->bytes_consumed; } }; This is untested and obviously needs proper exception handling, etc. But the point is that you can provide an offset and determine the number of bytes consumed. --Steffen

Tue Sep 11 13:20:41 2012 demerphq [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #79561] Streaming support
Date:	Tue, 11 Sep 2012 19:20:30 +0200
To:	bug-Sereal-Decoder [...] rt.cpan.org
From:	demerphq <demerphq [...] gmail.com>

On 11 September 2012 19:18, Steffen Mueller via RT <bug-Sereal-Decoder@rt.cpan.org> wrote: Show quoted text

> Queue: Sereal-Decoder > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=79561 > > > On 09/11/2012 09:59 AM, Olivier Mengué via RT wrote:

>> Tue Sep 11 03:59:24 2012: Request 79561 was acted upon. >> Transaction: Ticket created by DOLMEN >> Queue: Sereal-Decoder

>> For use in network protocol it would be helpful to have a chunked decoder. >> >> Requirements: >> - receive a buffer, consume data up to a single complete packet, and >> report if the packet is complete or not. >> - do not consume more that one packet >> - either: >> - consume (remove) the used data from the original buffer but keep >> the non used bytes >> - do not consume, but report the count of used bytes >> >> >> That could be useful to build an AnyEvent read_type (look for >> ::anyevent_read_type in AnyEvent::Handle).

> > Interesting. Well, within a given Sereal packet, we will never achieve > "proper" streaming. Our back-reference logic may requires being able to > jump forward and backward within the string buffer that holds the Sereal > packet. But what's possible (already now) is concatenating multiple > Sereal packets and then while(){} ing over them. To wit from the docs > (with a bug fix): > > my @out; > my $pos = 0; > eval { > while (1) { > push @out, $decoder->decode_with_offset($sereal_string, 0); > $pos += $decoder->bytes_consumed; > last if $pos >= length($sereal_string) > or not $decoder->bytes_consumed; > } > }; > > This is untested and obviously needs proper exception handling, etc. But > the point is that you can provide an offset and determine the number of > bytes consumed.

We could provide a way to passthrough length data in the header, which would allow us to know when a packet has been read. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

Wed Sep 12 02:01:11 2012 dolmen [...] cpan.org - Correspondence added

Subject:	Chunked decoding support
RT-Send-CC:	smueller [...] cpan.org, demerphq [...] gmail.com

Le 2012-09-11 19:18:09, SMUELLER a écrit : Show quoted text

> Interesting. Well, within a given Sereal packet, we will never achieve > "proper" streaming.

That's not what I'm asking for. Show quoted text

> Our back-reference logic may requires being able to > jump forward and backward within the string buffer that holds the Sereal > packet.

That is fine. At least the serialisation protocol should allows a decoder to be able to detect packet boundaries without relying only on the size of the input buffer. Is Sereal clean on that matter? Show quoted text

> But what's possible (already now) is concatenating multiple > Sereal packets and then while(){} ing over them. To wit from the docs > (with a bug fix): > > my @out; > my $pos = 0; > eval { > while (1) { > push @out, $decoder->decode_with_offset($sereal_string, 0); > $pos += $decoder->bytes_consumed; > last if $pos >= length($sereal_string) > or not $decoder->bytes_consumed; > } > };

What I'm asking for is for a stateful decoder that works when $sereal_string is not a complete Sereal buffer and that, when fed, can report if it got too few or too much data. When it has a full packet it can return it decoded. This could allow to put in a binary stream various packets encoded using different encoders (the application writing/reading the stream is responsible for calling successively each decoder as appropriate). AnyEvent supports this with JSON (thanks to JSON::XS) has this capability. MessagePack has some partial support (Data::MessagePack::Stream), but it is incomplete (it doesn't report the consumed bytes, so MessagePack packets can't be mixed with other packets). -- Olivier Mengué - http://perlresume.org/DOLMEN

Wed Sep 12 02:01:41 2012 dolmen [...] cpan.org - Subject changed from 'Streaming support' to 'Chunked decoding support'

Wed Sep 12 02:39:39 2012 smueller [...] cpan.org - Correspondence added

I believe that the Sereal decoder should mostly do what you want if I understand you correctly. Given the example code I pasted, you'll get an exception from Sereal::Decoder on invalid input data/premature end of packet. Now, we could try to at least record the approximate position in the buffer when an exception occurs. But I doubt that we will be able to make the decoder signal to you that there's data missing and then restart from where it left off with the next chunk. That would be exceedingly difficult. What we could do, however, is to embed the length of the packet in the header at a fixed position (but optional, based on a flag which is in a fixed position after the magic string) or something like that. This would allow you to assert that you read enough data to deserialize. The practical problem that this poses to the encoder is that it has to go back to the beginning of the buffer and insert the length of the output (which it only knows after it's done serializing). This is easy to do if we reserve a fixed four or eight bytes for the length, but hard (or slow) if we want a variable-length (ie. compact) integer representation for the packet length.

Wed Sep 12 02:47:13 2012 smueller [...] cpan.org - Correspondence added

Oh, as for embedding packet lengths: Doing that is really easy in your application/protocol. It's simply doing very basic framing (which you probably need to do anyway).

Wed Sep 12 02:48:11 2012 demerphq [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #79561] Chunked decoding support
Date:	Wed, 12 Sep 2012 08:48:01 +0200
To:	bug-Sereal-Decoder [...] rt.cpan.org
From:	demerphq <demerphq [...] gmail.com>

On 12 September 2012 08:39, Steffen Mueller via RT <bug-Sereal-Decoder@rt.cpan.org> wrote: Show quoted text

> Queue: Sereal-Decoder > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=79561 > > > I believe that the Sereal decoder should mostly do what you want if I > understand you correctly. Given the example code I pasted, you'll get an > exception from Sereal::Decoder on invalid input data/premature end of > packet. Now, we could try to at least record the approximate position in > the buffer when an exception occurs. But I doubt that we will be able to > make the decoder signal to you that there's data missing and then > restart from where it left off with the next chunk. That would be > exceedingly difficult.

Actually I don't know that that is really the case. It would be time consuming to do in that it would require a complete rewrite of the current decoder but I think "exceedingly difficult" is a bit strong. Anyway, it definitely is not on MY todo list. Show quoted text

> What we could do, however, is to embed the length of the packet in the > header at a fixed position (but optional, based on a flag which is in a > fixed position after the magic string) or something like that. This > would allow you to assert that you read enough data to deserialize. > > The practical problem that this poses to the encoder is that it has to > go back to the beginning of the buffer and insert the length of the > output (which it only knows after it's done serializing). This is easy > to do if we reserve a fixed four or eight bytes for the length, but hard > (or slow) if we want a variable-length (ie. compact) integer > representation for the packet length.

Considering that the current encoder does not stream to secondary media its actually not that difficult to make this reasonably fast. Another option would be to support a validation mode in the decoder. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

Wed Sep 12 03:05:46 2012 smueller [...] cpan.org - Correspondence added

On Wed Sep 12 02:48:11 2012, demerphq@gmail.com wrote: Show quoted text

> Actually I don't know that that is really the case. It would be time > consuming to do in that it would require a complete rewrite of the > current decoder but I think "exceedingly difficult" is a bit strong. > Anyway, it definitely is not on MY todo list.

Fair point. And it's not high on my list either. Show quoted text

> Considering that the current encoder does not stream to secondary > media its actually not that difficult to make this reasonably fast. > > Another option would be to support a validation mode in the decoder.

Yeah, I guess "slow" and "fast" are rather relative terms here. How slow is it to do a single memcpy of a medium-size string? Probably not really all that slow. Doing any better than that is tricky, though: In Perl, there's the offset hack that can help avoid the memcpy, but we'll be damning tentative other encoder implementations that don't have this trick.

Wed Sep 12 03:17:31 2012 demerphq [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #79561] Chunked decoding support
Date:	Wed, 12 Sep 2012 09:17:21 +0200
To:	bug-Sereal-Decoder [...] rt.cpan.org
From:	demerphq <demerphq [...] gmail.com>

On 12 September 2012 09:05, Steffen Mueller via RT <bug-Sereal-Decoder@rt.cpan.org> wrote: Show quoted text

> Queue: Sereal-Decoder > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=79561 > > > On Wed Sep 12 02:48:11 2012, demerphq@gmail.com wrote:

>> Actually I don't know that that is really the case. It would be time >> consuming to do in that it would require a complete rewrite of the >> current decoder but I think "exceedingly difficult" is a bit strong. >> Anyway, it definitely is not on MY todo list.

> > Fair point. And it's not high on my list either. >

>> Considering that the current encoder does not stream to secondary >> media its actually not that difficult to make this reasonably fast. >> >> Another option would be to support a validation mode in the decoder.

> > Yeah, I guess "slow" and "fast" are rather relative terms here. How slow > is it to do a single memcpy of a medium-size string? Probably not really > all that slow. Doing any better than that is tricky, though: In Perl, > there's the offset hack that can help avoid the memcpy, but we'll be > damning tentative other encoder implementations that don't have this > trick.

IMO we aren't really damming them. Most languages will have a way to do the same kind of "pointer into a buffer" trick. At its simplest form you wrap a larger buffer in an object with an offset that use with substr() (think about how you would solve this in Perl without exploiting OOK.) But I dont think its up to us to solve this problem in any language other than C and Perl. Those implementing this in other languages will figure stuff out I am very sure. cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

Fri Sep 14 03:51:32 2012 MAXS [...] cpan.org - Correspondence added

Le Mer 12 Sep 2012 02:39:39, SMUELLER a écrit : Show quoted text

> I believe that the Sereal decoder should mostly do what you want if I > understand you correctly. Given the example code I pasted, you'll get an > exception from Sereal::Decoder on invalid input data/premature end of > packet. Now, we could try to at least record the approximate position in > the buffer when an exception occurs. But I doubt that we will be able to > make the decoder signal to you that there's data missing and then > restart from where it left off with the next chunk. That would be > exceedingly difficult. > > What we could do, however, is to embed the length of the packet in the > header at a fixed position (but optional, based on a flag which is in a > fixed position after the magic string) or something like that. This > would allow you to assert that you read enough data to deserialize. > > The practical problem that this poses to the encoder is that it has to > go back to the beginning of the buffer and insert the length of the > output (which it only knows after it's done serializing). This is easy > to do if we reserve a fixed four or eight bytes for the length, but hard > (or slow) if we want a variable-length (ie. compact) integer > representation for the packet length.

After reading your explanations, if we can't distinguish the premature end of data from the encoding error, I think it would be better to add the length at the AnyEvent (or any other "streamer") level, like it does for the Storable case with a pack('w', $length). Note that POE does the same but using "$length\0...". It would be easier for everyone, streamers and packers... :) Max.

Fri Sep 14 07:21:54 2012 MAXS [...] cpan.org - Correspondence added

Implementation of AnyEvent::Sereal : https://github.com/maxatome/p5-anyevent-sereal using the same way as Storable. Regards, Max.

Fri Sep 14 09:58:38 2012 smueller [...] cpan.org - Correspondence added

RT-Send-CC:

demerphq [...] gmail.com

On Fri Sep 14 07:21:54 2012, MAXS wrote: Show quoted text

> Implementation of AnyEvent::Sereal : > https://github.com/maxatome/p5-anyevent-sereal using the same way as > Storable.

That's awesome, thank you very much! Minor comments: You'll improve performance if you use the OO interface since that can reuse various structures and options. Furthermore, you might want to give users a way to provide options for the serializer and deserializer. For example, Sereal supports Snappy compression and whether or not you need it depends on your particular application. Thanks again for using our software. If you have any feedback, we'd be very interested in hearing it. Best regards, Steffen

Mon Sep 17 03:05:53 2012 MAXS [...] cpan.org - Correspondence added

RT-Send-CC:

smueller [...] cpan.org, demerphq [...] gmail.com

Le Ven 14 Sep 2012 09:58:38, SMUELLER a écrit : Show quoted text

> On Fri Sep 14 07:21:54 2012, MAXS wrote:

> > Implementation of AnyEvent::Sereal : > > https://github.com/maxatome/p5-anyevent-sereal using the same way as > > Storable.

> > That's awesome, thank you very much! > > Minor comments: You'll improve performance if you use the OO interface > since that can reuse various structures and options. Furthermore, you > might want to give users a way to provide options for the serializer and > deserializer. For example, Sereal supports Snappy compression and > whether or not you need it depends on your particular application.

I did it, and you are right it is particularly faster. :) https://metacpan.org/module/AnyEvent::Sereal Show quoted text

> > Thanks again for using our software. If you have any feedback, we'd be > very interested in hearing it.

One thing at this time: it could be cool to have croak_on_bless option on the decoder side. If the source of the stream can not be trusted, it could be fine to avoid instantiation of objects when decoding it... Best regards, Max.

Mon Sep 17 03:23:55 2012 demerphq [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #79561] Chunked decoding support
Date:	Mon, 17 Sep 2012 09:23:41 +0200
To:	bug-Sereal-Decoder [...] rt.cpan.org
From:	demerphq <demerphq [...] gmail.com>

On 17 September 2012 09:05, Maxime Soulé via RT <bug-Sereal-Decoder@rt.cpan.org> wrote: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=79561 > > > Le Ven 14 Sep 2012 09:58:38, SMUELLER a écrit :

>> On Fri Sep 14 07:21:54 2012, MAXS wrote:

>> > Implementation of AnyEvent::Sereal : >> > https://github.com/maxatome/p5-anyevent-sereal using the same way as >> > Storable.

>> >> That's awesome, thank you very much! >> >> Minor comments: You'll improve performance if you use the OO interface >> since that can reuse various structures and options. Furthermore, you >> might want to give users a way to provide options for the serializer and >> deserializer. For example, Sereal supports Snappy compression and >> whether or not you need it depends on your particular application.

> > I did it, and you are right it is particularly faster. :) > > https://metacpan.org/module/AnyEvent::Sereal >

>> >> Thanks again for using our software. If you have any feedback, we'd be >> very interested in hearing it.

> > One thing at this time: it could be cool to have croak_on_bless option > on the decoder side. If the source of the stream can not be trusted, it > could be fine to avoid instantiation of objects when decoding it...

I thought we already had this? If not it is definitely on our todo list. BTW, we will NOT bless ANY objects until we have completely deserialized the data structure with no errors.. IOW, we more or less guarantee that no methods, such as DESTROY, will be called on the object during deserialization. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

Mon Sep 17 07:21:53 2012 smueller [...] cpan.org - Correspondence added

On Mon Sep 17 03:23:55 2012, demerphq@gmail.com wrote: Show quoted text

> On 17 September 2012 09:05, Maxime Soulé via RT > <bug-Sereal-Decoder@rt.cpan.org> wrote:

> > <URL: https://rt.cpan.org/Ticket/Display.html?id=79561 > > > > > One thing at this time: it could be cool to have croak_on_bless

option Show quoted text

> > on the decoder side. If the source of the stream can not be trusted,

it Show quoted text

> > could be fine to avoid instantiation of objects when decoding it...

> > I thought we already had this? If not it is definitely on our todo

list. Show quoted text

> > BTW, we will NOT bless ANY objects until we have completely > deserialized the data structure with no errors.. IOW, we more or less > guarantee that no methods, such as DESTROY, will be called on the > object during deserialization.

We didn't. Ooops. Oversight. This check was available only on the encoder side since that's what WE need at work. I just pushed a change to implement such an option in the decoder. Will be in the next release. Related note: Please report issues in a separate bug report/whatever. Having a giant unrelated thread in the "Chunked decoding support" ticket doesn't help. :) Best regards, Steffen