Bug #101881 for Sereal-Encoder: floating-point formats not according to spec

Mon Feb 02 10:15:07 2015 zefram [...] fysh.org - Ticket created

Subject:	floating-point formats not according to spec
Date:	Mon, 2 Feb 2015 15:14:53 +0000
To:	bug-Sereal-Encoder [...] rt.cpan.org
From:	Zefram <zefram [...] fysh.org>

The Sereal format spec clearly says that floating-point values are to be stored in IEEE formats. However, the spec does poorly at specifying which formats these are, and the implementation actually uses not-necessarily-IEEE native formats.

First, the spec. The tag table documents tags named "FLOAT", "DOUBLE", and "LONG_DOUBLE", describing the type-specific data for each as "<IEEE-FLOAT>", "<IEEE-DOUBLE>", and "<IEEE-LONG-DOUBLE>" respectively. This, along with the general note about using IEEE formats and the other general note about numeric quantities being little-endian, is the entire specification for the floating-point representations. This is a problem, because the terms "float" and "long double" don't have any defined meaning in the context of IEEE 754, so the intended referents are unclear. The terms "float", "double", and "long double" are actually C type names, and IEEE 754 does not specify any particular mapping of its formats onto C types.

IEEE 754 defines four binary floating-point formats for data interchange. Each has an explicit name based on its bit length, and a common name based on multiples of the historical default size of 32 bits. The four formats are binary16 ("half precision", 5 exponent bits, 10 fractional significand bits), binary32 ("single precision", 8 exponent bits, 23 fractional significand bits), binary64 ("double precision", 11 exponent bits, 52 fractional significand bits), and binary128 ("quadruple precision", 15 exponent bits, 112 fractional significand bits). The spec should refer to these formats by either of their names, not by C types.

Next, the implementation. srl_encoder.c (and matching srl_decoder.c of Sereal::Decoder) does not make any effort to use IEEE formats in serialised data. Instead it uses the native C floating-point types, associating float, double, and long double each with the similarly-named tag. It doesn't even canonicalise endianness: so even where the native types are IEEE, a big-endian system serialises contrary to the spec's statement that numeric data are little-endian. (With the varint clause explicitly specifying little-endian, and all other numeric quantities being single bytes, the floating-point formats seem to be the only data to which the general endianness statement applies.)

Hosts with different endianness or otherwise differing floating-point formats of the same size will see corrupted numeric data when they try to exchange serialised NVs. Hosts whose floating-point formats for particular C types have different sizes will see worse corruption, by virtue of losing synchronisation between encoder and decoder.

I'm not presently affected by this, but foreseeably might become affected. We configure our perls to use long double for NV, and currently for us that's the decidedly non-IEEE x87 80-bit format. (Not only is the format not one of the lengths specified by IEEE, but by using an explicit integral significand bit it doesn't even follow IEEE's rules for constructing floating-point formats.) It is foreseeable that this format will eventually be supplanted, one way or another, by the IEEE quad-precision format, which is already used for the long double type on some platforms. In switching over we would face a compatibility problem.

-zefram
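To illustrate the endianness point in the report above, here is a minimal sketch of writing a double as little-endian binary64 regardless of host byte order. It is not taken from srl_encoder.c; the helper names put_binary64_le and get_binary64_le are hypothetical. It assumes the native double already is IEEE 754 binary64 (checked through the <float.h> parameters rather than through sizeof alone) and that doubles and 64-bit integers share the host byte order.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a double as 8 little-endian bytes.
 * Assumption: the native double is IEEE 754 binary64. */
static void put_binary64_le(double value, unsigned char out[8])
{
    uint64_t bits;
    int i;

    /* Check the format parameters, not just the size: two types of the
     * same size can still be different formats. */
    assert(FLT_RADIX == 2 && DBL_MANT_DIG == 53 && DBL_MAX_EXP == 1024);
    assert(sizeof(double) == sizeof(uint64_t));

    memcpy(&bits, &value, sizeof bits);   /* copy out the raw bit pattern */
    for (i = 0; i < 8; i++)
        out[i] = (unsigned char)(bits >> (8 * i));  /* least significant byte first */
}

/* Hypothetical helper: the reverse of put_binary64_le. */
static double get_binary64_le(const unsigned char in[8])
{
    uint64_t bits = 0;
    double value;
    int i;

    for (i = 0; i < 8; i++)
        bits |= (uint64_t)in[i] << (8 * i);
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Because the bytes are produced by shifting an integer rather than by copying memory directly into the output, the result is the same on big-endian and little-endian hosts.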

Mon Feb 02 10:21:51 2015 demerphq [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #101881] floating-point formats not according to spec
Date:	Mon, 2 Feb 2015 16:21:41 +0100
To:	bug-Sereal-Encoder [...] rt.cpan.org
From:	demerphq <demerphq [...] gmail.com>

We have patches in the branch proto4 to fix some of this. Could you review that branch and tell me what you think?

Yves

ps: Sometimes you file tickets with RT, sometimes with the Sereal tracker. Could we please make the Sereal tracker primary? I keep having to copy your tickets over. (I don't mind if you file RT tickets as well, but I keep stealing your tickets in the main tracker.)

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Mon Feb 02 10:21:51 2015 The RT System itself - Status changed from 'new' to 'open'

Mon Feb 02 10:33:15 2015 zefram [...] fysh.org - Correspondence added

Subject:	Re: [rt.cpan.org #101881] floating-point formats not according to spec
Date:	Mon, 2 Feb 2015 15:33:01 +0000
To:	demerphq via RT <bug-Sereal-Encoder [...] rt.cpan.org>
From:	Zefram <zefram [...] fysh.org>

demerphq via RT wrote:

>We have patches in the branch proto4 to fix some of this. Could you
>review that branch and tell me what you think?

Will look.

>ps: Sometimes you file tickets with RT sometimes with the Sereal
>tracker, could we please make the sereal tracker primary?

I haven't filed Sereal-related issues anywhere other than RT. If by "the Sereal tracker" you mean the thing on github that RT has a link to, then no, I can't file tickets there. Last I checked, github doesn't offer a way to submit bug reports to its issue trackers without creating a github account and thus signing away one's firstborn. I suggest that you take this up with github, or else select as your primary bug tracker something that places fewer barriers in the way of bug reporters.

-zefram

Mon Feb 02 11:54:21 2015 zefram [...] fysh.org - Correspondence added

Subject:	Re: [rt.cpan.org #101881] floating-point formats not according to spec
Date:	Mon, 2 Feb 2015 16:54:08 +0000
To:	demerphq via RT <bug-Sereal-Encoder [...] rt.cpan.org>
From:	Zefram <zefram [...] fysh.org>

demerphq via RT wrote:

>We have patches in the branch proto4 to fix some of this. Could you
>review that branch and tell me what you think?

This branch still uses native formats, including native endianness. The only improvements are that you now retain encoder/decoder synchronisation, by asserting exact sizes for float and double and by padding long double. All the other failure modes relating to native formats still apply. You actually make things worse in another respect by making the support for long double NV non-default. The new sizeof(NV) == sizeof(float) checks that are trying to determine which type NV is are flawed: one could have two floating-point types that are the same size but behave differently.

The spec in this branch is self-contradictory about the formats. It still has the general notes about IEEE formats and little-endian, but now contradicts that by describing the LONG_DOUBLE type as platform specific. The descriptions for FLOAT and DOUBLE, by giving specific sizes, now succeed in identifying specific formats (though it would still be better to use the formats' correct names), but they still don't match the code's de facto use of native formats.

To fix this you need some format conversion code. It's not difficult to pull a floating-point value apart bit-by-bit to convert it to a known format, nor to perform the reverse conversion. The difficult part is to recognise the common case where some of the IEEE formats are implemented natively, in order to avoid slow bit twiddling. (More generally, if a specific native format is known then conversion between it and IEEE can be done more efficiently than would be achieved by handling it opaquely.) I recommend that, as an initial step, you should implement format conversion with no attempt at this optimisation. Correctness first. Then you can look into how to detect details of the native formats.

-zefram
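The unoptimised conversion recommended above could be sketched roughly as follows. This is only an illustration, not code from the proto4 branch; the function names encode_binary64 and decode_binary64 are hypothetical. It takes a double apart with the standard frexp() and ldexp() functions instead of inspecting its in-memory layout, so it does not depend on the native representation. Stated limitations: NaN payloads are not preserved, and if the native double were wider than binary64 the significand would be truncated rather than correctly rounded.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: build the IEEE 754 binary64 bit pattern for v
 * without looking at how v is stored natively. */
static uint64_t encode_binary64(double v)
{
    uint64_t sign = signbit(v) ? (UINT64_C(1) << 63) : 0;
    uint64_t frac;
    double m;
    int e, biased;

    if (isnan(v))   /* quiet NaN; the payload is not preserved */
        return sign | (UINT64_C(0x7FF) << 52) | (UINT64_C(1) << 51);
    if (isinf(v))
        return sign | (UINT64_C(0x7FF) << 52);
    if (v == 0.0)
        return sign;                    /* keeps the sign of -0.0 */

    m = frexp(fabs(v), &e);             /* fabs(v) == m * 2^e, 0.5 <= m < 1 */
    biased = e + 1022;                  /* exponent of 2m, plus the bias 1023 */

    if (biased >= 0x7FF)                /* too large for binary64: infinity */
        return sign | (UINT64_C(0x7FF) << 52);
    if (biased <= 0)                    /* subnormal: value == frac * 2^-1074 */
        return sign | (uint64_t)ldexp(m, e + 1074);

    /* Normal: m * 2^53 lies in [2^52, 2^53); drop the implicit leading bit. */
    frac = (uint64_t)ldexp(m, 53) - (UINT64_C(1) << 52);
    assert(frac < (UINT64_C(1) << 52));
    return sign | ((uint64_t)biased << 52) | frac;
}

/* Hypothetical helper: the reverse conversion. */
static double decode_binary64(uint64_t bits)
{
    int negative = (int)(bits >> 63);
    int biased = (int)((bits >> 52) & 0x7FF);
    uint64_t frac = bits & ((UINT64_C(1) << 52) - 1);
    double mag;

    if (biased == 0x7FF)
        mag = frac ? NAN : INFINITY;
    else if (biased == 0)               /* zero or subnormal */
        mag = ldexp((double)frac, -1074);
    else                                /* restore the implicit leading bit */
        mag = ldexp((double)(frac | (UINT64_C(1) << 52)), biased - 1075);
    return negative ? -mag : mag;
}
```

The resulting 64-bit integer can then be written out least significant byte first to satisfy the little-endian rule. The same approach would extend to long double via frexpl() and ldexpl(), with a wider target format. On some platforms the math functions require linking with -lm.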
