Bug #94347 for Net-IDN-Encode: Encoding should return exact copy of labels consisting entirely of ascii characters

Mon Mar 31 14:37:25 2014 rsimoes [...] cpan.org - Ticket created

Subject:

Encoding should return exact copy of labels consisting entirely of ascii characters

Passing an all-ASCII hostname label to the encode function returns the label with a trailing hyphen:

    $ perl -MNet::IDN::Punycode=encode_punycode -E 'say encode_punycode("foobar")'
    foobar-

The RFC explicitly guarantees against such a case: "Therefore IDNA using Punycode conforms to the RFC 952 rule that host name labels neither begin nor end with a hyphen-minus [RFC952]." In such a case, encoding should return an exact copy of the input.

Mon Mar 31 19:43:25 2014 CFAERBER [...] cpan.org - Severity Unimportant added

Mon Mar 31 19:43:25 2014 CFAERBER [...] cpan.org - Correspondence added

Hi,

Thanks for this report. The behavior of Net::IDN::Punycode is correct, although it seems surprising at first glance.

The full quote from RFC 3492 (section 5) reads:

    Using hyphen-minus as the delimiter implies that the encoded string
    can end with a hyphen-minus only if the Unicode string consists
    entirely of basic code points, but IDNA forbids such strings from
    being encoded. The encoded string can begin with a hyphen-minus, but
    IDNA prepends a prefix. Therefore IDNA using Punycode conforms to
    the RFC 952 rule that host name labels neither begin nor end with a
    hyphen-minus [RFC952].

The text explicitly says "that the encoded string can end with a hyphen-minus only if the Unicode string consists entirely of basic code points". The string "foobar" does consist entirely of basic code points, and thus the encoding is allowed to end with a hyphen-minus.

While adding a hyphen-minus seems stupid, it is the correct thing to do. Punycode encoding always produces a string in the format "<basic>-<punycode>" or just "<punycode>". It can never produce a string in the format (wrong:)"<basic>" without a trailing delimiter. Otherwise, decoding would be impossible because you could never know whether a string should be interpreted literally as basic code points ("<basic>") or as encoded characters ("<punycode>").

This is also confirmed by the following test vector included in RFC 3492 (section 7.1):

    (S) -> $1.00 <-
        u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
        u+003C u+002D
        Punycode: -> $1.00 <--

(Note that the input ends with a single "-", whereas the Punycode ends in "--".)

The guarantee that the string never ends in "-" is only true together with higher-level IDNA handling (which Net::IDN::Punycode, being the pure Punycode codec, does not provide). The RFC says that IDNA forbids ASCII-only strings from being encoded.

In other words, in IDNA, "foobar" will just be copied to "foobar" but it will not be Punycode-encoded to "foobar-" and then prefixed with the IDNA prefix to form (wrong:)"xn--foobar-".
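
The distinction between the bare Punycode codec and IDNA handling can be reproduced outside Perl. The following is a minimal sketch that assumes Python's standard-library "punycode" and "idna" codecs as stand-ins; it does not exercise Net::IDN::Punycode, and the helper names (raw_punycode, idna_label, decode_or_none) are hypothetical, chosen only for this illustration.

```python
# Illustration only: Python's built-in "punycode" and "idna" codecs are an
# independent implementation, used here as a stand-in for the Perl module.


def raw_punycode(label: str) -> str:
    """Apply the bare RFC 3492 Punycode encoder, with no IDNA rules."""
    return label.encode("punycode").decode("ascii")


def idna_label(label: str) -> str:
    """Apply IDNA ToASCII, which copies ASCII-only labels unchanged."""
    return label.encode("idna").decode("ascii")


def decode_or_none(encoded: str):
    """Punycode-decode a string, returning None if it is not decodable."""
    try:
        return encoded.encode("ascii").decode("punycode")
    except UnicodeError:
        return None


# Bare Punycode keeps the delimiter even when nothing follows it.
print(raw_punycode("foobar"))        # foobar-
print(raw_punycode("-> $1.00 <-"))   # -> $1.00 <--  (RFC 3492, section 7.1, vector S)
print(raw_punycode("bücher"))        # bcher-kva

# The trailing delimiter is what lets the decoder recover the literal label.
print(decode_or_none("foobar-"))     # foobar

# Without the delimiter, every character is read as an encoded digit,
# so the result (if any) is not the literal string "foobar".
print(decode_or_none("foobar") == "foobar")  # False

# IDNA never Punycode-encodes an ASCII-only label; it only prefixes
# labels that actually needed encoding.
print(idna_label("foobar"))          # foobar
print(idna_label("bücher"))          # xn--bcher-kva
```

The point of the sketch is the layering: the trailing hyphen-minus appears only at the codec level, and the "never ends with a hyphen-minus" guarantee holds once the IDNA layer decides which labels get encoded at all.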

Mon Mar 31 19:43:25 2014 The RT System itself - Status changed from 'new' to 'open'

Mon Mar 31 19:43:26 2014 CFAERBER [...] cpan.org - Status changed from 'open' to 'rejected'

Mon Mar 31 19:43:26 2014 CFAERBER [...] cpan.org - Taken

Wed Apr 02 15:37:52 2014 rsimoes [...] cpan.org - Correspondence added

Okay, I understand now. Sorry for the report, and thanks for your patient explanation!