CC: Ivan Shmakov <oneingray [...] gmail.com>
Subject: use UTF-32 for UniversalString and UTF-16 for BMPString (as per X.690)
Date: Thu, 04 Oct 2012 12:27:06 +0700
To: bug-Convert-ASN1 [...] rt.cpan.org
From: Ivan Shmakov <oneingray [...] gmail.com>
X.680 [1] reads:
37.16
UTF8String is synonymous with UniversalString at the abstract level
and can be used wherever UniversalString is used (subject to rules
requiring distinct tags) but has a different tag and is a distinct
type. NOTE — The encoding of UTF8String used by BER and PER is
different from that of UniversalString, and for most text will be
less verbose.
The X.690 [2] specification (covering BER and DER) states:
8.21.7
For the UniversalString type, the octet string shall contain the
octets specified in ISO/IEC 10646-1, using the 4-octet canonical
form (see 13.2 of ISO/IEC 10646-1). [...]
8.21.8
For the BMPString type, the octet string shall contain the octets
specified in ISO/IEC 10646-1, using the 2-octet BMP form (see 13.1
of ISO/IEC 10646-1). [...]
[...]
8.21.10
For the UTF8String type, the octet string shall contain the octets
specified in ISO/IEC 10646-1, Annex D. Announcers and escape
sequences shall not be used, and each character shall be encoded in
the smallest number of octets available for that character.
Thus, it's my understanding that the encodings used for
UniversalString, BMPString and UTF8String shall be UTF-32
(UCS-4?), UTF-16 (UCS-2?), and UTF-8, respectively (see, e.g.,
[3]).
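
To make the expected difference concrete, here is a minimal sketch that encodes the sample text from the test script below in each of the three encodings, using the core Encode module. It assumes big-endian byte order and no byte order mark, which is my reading of the "canonical form" wording quoted above; the %octets hash is just an illustrative name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same sample text as in the test script: U+0401, U+0436, LF.
my $s = "\x{0401}\x{0436}\n";

# Assumption: big-endian, no BOM (hence the explicit "BE" encodings).
my %octets = (
    UniversalString => encode('UTF-32BE', $s),  # 4 octets per character
    BMPString       => encode('UTF-16BE', $s),  # 2 octets per BMP character
    UTF8String      => encode('UTF-8',    $s),  # 1 to 4 octets per character
);

for my $type (qw(UniversalString BMPString UTF8String)) {
    # Print the type, the content length, and the content octets in hex.
    printf "%-15s %2d  %s\n", $type, length $octets{$type},
        join ' ', map { sprintf '%02x', ord } split //, $octets{$type};
}
```

With this input the content octets should come out as 12, 6 and 5 octets long, respectively, so only the UTF8String case would match what Convert::ASN1 emits today.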
Contrary to the above, Convert::ASN1 currently (as of 0.26)
encodes all of those using UTF-8.
Consider, e. g.:
$ cat < j14gcqstwotsbqsjauytzxyitn.pl
### j14gcqstwotsbqsjauytzxyitn.pl -*- Perl -*-
use strict;
use warnings;
require Convert::ASN1;
require Data::Dump;
require IO::Handle;
my $asn = Convert::ASN1->new (qw (encoding BER));
$asn->prepare (q {
    Foo ::= UniversalString
    Bar ::= BMPString
    Baz ::= UTF8String
})
    or die ($asn->error ());
my $s
    = "\x{0401}\x{0436}\n";
binmode (\*STDOUT)
    or die ($!);
foreach my $t (qw (Foo Bar Baz)) {
    my $co
        = $asn->find ($t)
        or die ($asn->error ());
    my $enc = $co->encode ($s)
        or die ($co->error ());
    print STDOUT ($enc);
    print STDERR (Data::Dump::dump ($t, length ($enc), $enc),
                  "\n");
}
### j14gcqstwotsbqsjauytzxyitn.pl ends here
$ perl -w -- j14gcqstwotsbqsjauytzxyitn.pl \
| od -t x1 -w7
("Foo", 7, "\34\5\xD0\x81\xD0\xB6\n")
("Bar", 7, "\36\5\xD0\x81\xD0\xB6\n")
("Baz", 7, "\f\5\xD0\x81\xD0\xB6\n")
0000000 1c 05 d0 81 d0 b6 0a
0000007 1e 05 d0 81 d0 b6 0a
0000016 0c 05 d0 81 d0 b6 0a
0000025
$
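
The following sketch takes the "Foo" (UniversalString) octets exactly as dumped above and checks them by hand. It is only an illustration: it assumes the short definite-length form (a single length octet), which holds for this 5-octet value, and the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The "Foo" encoding as observed in the od output above.
my $observed = "\x1c\x05\xd0\x81\xd0\xb6\x0a";

# Assumption: single tag octet and short-form length, as in the dump.
my ($tag, $len) = unpack 'C C', $observed;
my $content = substr $observed, 2, $len;

# A 4-octet-per-character encoding needs a content length divisible by 4.
printf "tag 0x%02x, length %d, multiple of 4: %s\n",
    $tag, $len, ($len % 4 == 0 ? 'yes' : 'no');

# Decoding the content as UTF-8 gives back the original text.
my $text = decode('UTF-8', $content);
print $text eq "\x{0401}\x{0436}\n"
    ? "content is the UTF-8 form of the input\n"
    : "content is something else\n";
```

A content length of 5 cannot hold a whole number of 4-octet characters, so a decoder that follows 8.21.7 would have to reject this value.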
My guess is that in order to fix the issue, distinct op* types
(opUTF32STRING, opUTF16STRING?) should be introduced for the
UniversalString and BMPString ASN.1 types to be mapped to (via
%base_type):
$ nl -ba < Convert/ASN1/parser.pm
…
23
24 my %base_type = (
25 BOOLEAN => [ asn_encode_tag(ASN_BOOLEAN), opBOOLEAN ],
…
56 UniversalString => [ asn_encode_tag(ASN_UNIVERSAL | 28), opSTRING ],
57 BMPString => [ asn_encode_tag(ASN_UNIVERSAL | 30), opSTRING ],
…
$
… Along with the respective _enc_* (Convert/ASN1/_encode.pm)
functions.
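
To illustrate the idea, here is a standalone sketch of what such encoders would have to do. None of it is Convert::ASN1's actual internal API: %string_encoding, enc_string_as and ber_length are hypothetical names, the real _enc_* functions take different arguments, and the tag octet is supplied by hand. UTF-16BE is used for BMPString to match the subject line, although a real fix would probably also have to reject characters outside the BMP.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical mapping from ASN.1 string type to an Encode encoding name.
my %string_encoding = (
    UniversalString => 'UTF-32BE',
    BMPString       => 'UTF-16BE',
    UTF8String      => 'UTF-8',
);

# BER definite length: short form below 128, long form otherwise.
# Limited to 32-bit lengths, which is enough for a sketch.
sub ber_length {
    my ($n) = @_;
    return pack 'C', $n if $n < 0x80;
    my $bytes = pack 'N', $n;
    $bytes =~ s/^\x00+//;               # drop leading zero octets
    return pack('C', 0x80 | length $bytes) . $bytes;
}

# Hypothetical helper: transcode the text first, then append the
# length and content octets to the buffer referenced by $buf_ref.
sub enc_string_as {
    my ($encoding, $text, $buf_ref) = @_;
    my $octets = encode($encoding, $text);
    $$buf_ref .= ber_length(length $octets) . $octets;
    return;
}

# Usage: start with the BMPString tag octet (0x1e, as in the dump above).
my $buf = "\x1e";
enc_string_as($string_encoding{BMPString}, "\x{0401}\x{0436}\n", \$buf);
print unpack('H*', $buf), "\n";         # prints 1e0604010436000a
```

The point of the sketch is only that the length octets have to be computed after transcoding, since the octet count differs per encoding.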
TIA.
[1] http://www.itu.int/ITU-T/studygroups/com17/languages/X.680-0207.pdf
[2] http://www.itu.int/ITU-T/studygroups/com17/languages/X.690-0207.pdf
[3] http://perldoc.perl.org/perlunicode.html
--
FSF associate member #7257