
This queue is for tickets about the Net-IDN-Encode CPAN distribution.

Report information
The Basics
Id: 94347
Status: rejected
Priority: 0/
Queue: Net-IDN-Encode

People
Owner: CFAERBER [...] cpan.org
Requestors: rsimoes [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Unimportant
Broken in: (no value)
Fixed in: (no value)



Subject: Encoding should return exact copy of labels consisting entirely of ascii characters
Passing an all-ascii hostname label to the encode function returns the domain with the trailing hyphen:

    $ perl -MNet::IDN::Punycode=encode_punycode -E 'say encode_punycode("foobar")'
    foobar-

The RFC explicitly guarantees against such a case:

    "Therefore IDNA using Punycode conforms to the RFC 952 rule that host name labels neither begin nor end with a hyphen-minus [RFC952]."

In such a case, encoding should return an exact copy of the input.
Hi,

Thanks for this report. The behavior of Net::IDN::Punycode is correct, although it seems surprising at first glance.

The full quote from RFC 3492 (section 5) reads:

    Using hyphen-minus as the delimiter implies that the encoded string
    can end with a hyphen-minus only if the Unicode string consists
    entirely of basic code points, but IDNA forbids such strings from
    being encoded. The encoded string can begin with a hyphen-minus, but
    IDNA prepends a prefix. Therefore IDNA using Punycode conforms to
    the RFC 952 rule that host name labels neither begin nor end with a
    hyphen-minus [RFC952].

The text explicitly says "that the encoded string can end with a hyphen-minus only if the Unicode string consists entirely of basic code points". The string "foobar" does consist entirely of basic code points, and thus the encoding is allowed to end with a hyphen-minus.

While adding a hyphen-minus seems stupid, it is the correct thing to do. Punycode encoding always produces a string in the format "<basic>-<punycode>" or just "<punycode>". It can never produce a string in the format (wrong:)"<basic>" without a trailing delimiter. Otherwise, decoding would be impossible because you could never know whether a string should be interpreted literally as basic code points ("<basic>") or as encoded characters ("<punycode>").

This is also confirmed by the following test vector included in RFC 3492 (section 7.1):

    (S) -> $1.00 <-
    u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
    u+003C u+002D
    Punycode: -> $1.00 <--

(Note that the input ends with a single "-", whereas the Punycode ends in "--".)

The guarantee that the string never ends in "-" is only true together with higher-level IDNA handling (which Net::IDN::Punycode, being the pure Punycode codec, does not provide). The RFC says that IDNA forbids ASCII-only strings from being encoded.

In other words, in IDNA, "foobar" will just be copied to "foobar" but it will not be Punycode-encoded to "foobar-" and then prefixed with the IDNA prefix to form (wrong:)"xn--foobar-".
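
For anyone who wants to reproduce the distinction between the pure codec and the IDNA layer without the Perl module at hand, here is a minimal sketch. It uses Python's built-in "punycode" codec as a stand-in, not Net::IDN::Punycode itself. The helper names raw_punycode, to_ascii_label and raw_decode are made up for illustration. to_ascii_label is a simplified assumption about the IDNA step: it skips Nameprep/normalization and length checks, so it is not a complete ToASCII implementation.

```python
# Hypothetical helper names; only the built-in "punycode" codec is real API.

def raw_punycode(label: str) -> str:
    """Pure Punycode (RFC 3492) with no IDNA rules applied."""
    return label.encode("punycode").decode("ascii")


def to_ascii_label(label: str) -> str:
    """Simplified IDNA-style step: copy ASCII-only labels unchanged,
    otherwise Punycode-encode and prepend the "xn--" prefix.
    Assumption: Nameprep/normalization and length checks are skipped."""
    if label.isascii():
        return label
    return "xn--" + raw_punycode(label)


def raw_decode(encoded: str):
    """Decode pure Punycode; return None if the input is not valid."""
    try:
        return encoded.encode("ascii").decode("punycode")
    except UnicodeError:
        return None


print(raw_punycode("foobar"))            # foobar-
print(raw_punycode("-> $1.00 <-"))       # -> $1.00 <--  (RFC 3492 vector S)
print(to_ascii_label("foobar"))          # foobar
print(to_ascii_label("b\u00fccher"))     # xn--bcher-kva  (input is "bücher")
print(raw_decode("foobar-"))             # foobar
print(raw_decode("foobar") == "foobar")  # False: no delimiter, so not read as literal text
```

The last two lines illustrate the decoding argument: with the trailing delimiter, the decoder recovers the literal basic code points. Without it, the same letters are interpreted as encoded deltas and do not come back as "foobar".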
Okay, I understand now. Sorry for the report, and thanks for your patient explanation!

(In reply to CFAERBER's message of Mon Mar 31 17:43:25 2014.)