Bug #88964 for RDF-Trine: don't lowercase lang tag in RDF::Trine::Node::Literal

Wed Sep 25 10:56:04 2013 http://openid-provider.appspot.com/vladimir@sirma.bg - Ticket created

Subject: don't lowercase lang tag in RDF::Trine::Node::Literal

Many parts of a language tag are case-sensitive, eg region (BG), script (Latn). You can see this at http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry. In fact, case is used to recognize these parts. Therefore you must NOT lowercase the language tag. (This bug was found with RDB2RDF 0.008, and I assume that Trine 1.007 is used, but I am not certain)

Wed Sep 25 11:29:49 2013 gwilliams [...] cpan.org - Correspondence added

On Wed Sep 25 10:56:04 2013, http://openid-provider.appspot.com/vladimir@sirma.bg wrote:

> Many parts of a language tag are case-sensitive, eg region (BG),
> script (Latn). You can see this at
> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry.
> In fact, case is used to recognize these parts. Therefore you must NOT
> lowercase the language tag.
>
> (This bug was found with RDB2RDF 0.008, and I assume that Trine 1.007
> is used, but I am not certain)

My reading of RFC 5646 (which obsoletes RFC 3066, which the RDF semantics are based on) is that this isn't the case. While the language tag registry is case preserving, section 2.1.1 of RFC 5646 says:

"""
At all times, language tags and their subtags, including private use and extensions, are to be treated as case insensitive: there exist conventions for the capitalization of some of the subtags, but these MUST NOT be taken to carry meaning.
"""

I'd be happy to consider suggestions for an API in RDF::Trine that might preserve the case of language tags as an opt-in feature, but I believe the current approach of normalizing by lowercasing is in line with the standard.
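The case-insensitive treatment quoted from section 2.1.1 can be illustrated with a minimal Perl sketch. The helper name `lang_tags_equal` is hypothetical and is not part of RDF::Trine; it only shows that two tags differing in case should compare as the same tag.

```perl
use strict;
use warnings;

# Hypothetical helper (not an RDF::Trine API): compare two language tags
# without regard to case, as RFC 5646 section 2.1.1 requires.
sub lang_tags_equal {
    my ($x, $y) = @_;
    return lc($x) eq lc($y);
}

print lang_tags_equal('en-US', 'EN-us')     ? "same\n" : "different\n";
print lang_tags_equal('zh-Hant', 'zh-Hans') ? "same\n" : "different\n";
```

The comparison folds both sides to lowercase, so it works whether the stored tag was lowercased, case-preserved, or normalized in some other way.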

Wed Sep 25 11:29:50 2013 The RT System itself - Status changed from 'new' to 'open'

Wed Sep 25 19:11:47 2013 http://openid-provider.appspot.com/vladimir@sirma.bg - Correspondence added

Subject:	don't lowercase lang tag in RDF::Trine::Node::Literal
From:	vladimir.alexiev [...] ontotext.com

I'm wrong: the subtags can be recognized regardless of case, as described at "two exceptions: two-letter and four-letter subtags...".

But RFC 5646 (BCP 47) still recommends: "consistent formatting and presentation of language tags will aid users. The format of subtags in the registry is RECOMMENDED as the form to use in language tags. This format generally corresponds to the common conventions for the various ISO standards from which the subtags are derived."

In other words, RFC 5646 demands that implementations are case-insensitive, and recommends that they are case-preserving (like the Windows file system is) or case-normalizing.

The RFC describes a case normalization algorithm: "An implementation can reproduce this format without accessing the registry as follows...". So if you want to normalize the case, please use that algorithm instead of lowercasing. Else, preserve the case, and rely on consumers of the tag to treat it case-insensitively.

Wed Sep 25 19:44:27 2013 http://openid-provider.appspot.com/vladimir@sirma.bg - Correspondence added

From: vladimir.alexiev [...] ontotext.com

Here is a sub _lang_normalize and some tests.

Subject: lang_normalize.pl

sub _lang_normalize($) {
  # http://tools.ietf.org/html/bcp47#section-2.1.1
  # All subtags use lowercase letters
  my $lang = lc(shift);
  # with 2 exceptions: subtags that neither appear at the start of the tag nor occur after singletons
  # i.e. there's a subtag of length at least 2 preceding the exception; and a following subtag or end-of-tag
  # 1. two-letter subtags are all uppercase
  $lang =~ s{(?<=\w\w-)(\w\w)(?=($|-))}{\U$1}g;
  # 2. four-letter subtags are titlecase
  $lang =~ s{(?<=\w\w-)(\w\w\w\w)(?=($|-))}{\u\L$1}g;
  $lang
}

sub test {
  my $x = _lang_normalize(shift);
  my $y = shift;
  $x eq $y or print STDERR "$x\n"
}

# BCP47 tests
test("en-ca-x-ca","en-CA-x-ca");
test("EN-ca-X-Ca","en-CA-x-ca");
test("En-Ca-X-Ca","en-CA-x-ca");
test("SGN-BE-FR","sgn-BE-FR");
test("sgn-be-fr","sgn-BE-FR");
test("AZ-latn-x-LATN","az-Latn-x-latn");
test("Az-latn-X-Latn","az-Latn-x-latn");

# More tests
test("zh-Hant","zh-Hant");
test("zh-Latn-wadegile","zh-Latn-wadegile");
test("zh-Latn-pinyin","zh-Latn-pinyin");
test("en-US","en-US");
test("en-GB","en-GB");
test("qqq-002","qqq-002");
test("ja-Latn","ja-Latn");
test("x-local","x-local");
test("he-Latn","he-Latn");
test("und","und");
test("nn","nn");
test("ko-Latn","ko-Latn");
test("ar-Latn","ar-Latn");
test("la-x-liturgic","la-x-liturgic");
test("fa-x-middle","fa-x-middle");
test("qqq-142","qqq-142");
test("bnt","bnt");
test("grc-x-liturgic","grc-x-liturgic");
test("egy-Latn","egy-Latn");
test("la-x-medieval","la-x-medieval");
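For comparison, the same BCP 47 section 2.1.1 formatting rule can also be sketched by walking the subtags one at a time rather than using regex lookbehind. This is an illustrative alternative under the same reading of the rule, not the code that was attached or merged; the name `lang_normalize_subtags` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical alternative (not the attached or merged code): lowercase
# everything, then re-capitalize two- and four-letter subtags that are not
# the first subtag and do not directly follow a singleton such as "x".
sub lang_normalize_subtags {
    my ($tag) = @_;
    my @subtags = split /-/, lc $tag;
    for my $i (1 .. $#subtags) {
        # Subtags directly after a singleton stay lowercase.
        next if length($subtags[$i - 1]) == 1;
        if (length($subtags[$i]) == 2) {
            $subtags[$i] = uc $subtags[$i];        # "ca" becomes "CA"
        }
        elsif (length($subtags[$i]) == 4) {
            $subtags[$i] = ucfirst $subtags[$i];   # "latn" becomes "Latn"
        }
    }
    return join '-', @subtags;
}

print lang_normalize_subtags("EN-ca-X-Ca"), "\n";   # prints "en-CA-x-ca"
```

Splitting on hyphens makes the "not after a singleton" condition an explicit length check on the previous subtag, which some readers may find easier to follow than the lookbehind form.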

Wed Oct 16 22:38:37 2013 gwilliams [...] cpan.org - Correspondence added

On Wed Sep 25 19:44:27 2013, http://openid-provider.appspot.com/vladimir@sirma.bg wrote:

> Here is a sub _lang_normalize and some tests.

After having some time to think this over and try it out with some existing code, I think I've changed my mind about this. I've created a branch in git to look at doing normalizing as you requested in Literal object construction. I'm going to solicit feedback from other perlrdf users before making a decision about merging the code. Any discussion will happen on the github issue page:

https://github.com/kasei/perlrdf/issues/91

thanks,
.greg

Mon Jan 20 11:01:20 2014 vladimir.alexiev [...] ontotext.com - Correspondence added

From: vladimir.alexiev [...] ontotext.com

Did you get any feedback from the Perl community?

A similar discussion is going on re Sesame RIO, eg see https://openrdf.atlassian.net/browse/SES-1659?focusedCommentId=15100&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15100

And I posted to public-rdf-comments: http://lists.w3.org/Archives/Public/public-rdf-comments/2014Jan/0011.html :

http://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
"Lexical representations of language tags may be converted to lower case. The value space of language tags is always in lower case."

CHANGE TO:
"Lexical representations of language tags MAY be normalized, according to BCP47 section 2.1.1. "Formatting of Language Tags" (country codes in upper case, script codes capitalized, the rest in lower case). Language tags MAY also be normalized by converting all to lower case, but BCP47 normalization is preferred. No matter which method is chosen, the semantics of language tags MUST NOT depend on case. In particular, implementations MUST NOT store as separate statements, two statements that differ only by the case of language tags."
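The last requirement in the proposed wording (no separate statements that differ only by language tag case) could look roughly like the following in Perl. This is a sketch under assumptions: statements are modelled as plain array references of subject, predicate, lexical value and language tag, and `dedup_statements` is a hypothetical name, not an RDF::Trine or Sesame API.

```perl
use strict;
use warnings;

# Hypothetical sketch: each statement is [subject, predicate, value, lang].
# The dedup key lowercases only the language tag, so the case of the
# first-seen statement is preserved in the output.
sub dedup_statements {
    my @statements = @_;
    my (%seen, @unique);
    for my $st (@statements) {
        my ($s, $p, $value, $lang) = @$st;
        my $key = join "\0", $s, $p, $value, lc $lang;
        push @unique, $st unless $seen{$key}++;
    }
    return @unique;
}

my @unique = dedup_statements(
    [ 'ex:book', 'dc:title', 'Colour', 'en-GB' ],
    [ 'ex:book', 'dc:title', 'Colour', 'en-gb' ],   # same statement, different tag case
    [ 'ex:book', 'dc:title', 'Color',  'en-US' ],
);
print scalar(@unique), " distinct statements\n";    # prints "2 distinct statements"
```

Keying on the lowercased tag keeps the comparison independent of whichever normalization (BCP47 formatting or plain lowercasing) is used for the stored form.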

Mon Jan 20 15:14:03 2014 gwilliams [...] cpan.org - Correspondence added

On Mon Jan 20 11:01:20 2014, vladimir.alexiev@ontotext.com wrote:

> Did you get any feedback from the Perl community?

The people that I spoke with seemed not to have strong opinions, so I think I'll go ahead and merge it and it'll appear in the next release.

Mon Jan 20 15:14:04 2014 gwilliams [...] cpan.org - Status changed from 'open' to 'patched'

Tue Jan 21 01:43:54 2014 gwilliams [...] cpan.org - Correspondence added

I've merged the language tag normalization code: https://github.com/kasei/perlrdf/commit/01f070befce4cbddf985527059bedaaa0518017b

Tue Jan 21 01:43:55 2014 gwilliams [...] cpan.org - Status changed from 'patched' to 'resolved'
