Vint Cerf letter to Unicode

From IUCG - Internet Users Contributing Group


WG/IDNABIS Chair dialog with UNICODE before finalizing IDNA2008 text and sending it to IESG.




Letter to Unicode

Ms. Lisa Moore
Chairman, Unicode Technical Committee
via email: lisam@us.ibm.com
CC:
Eric Muller
Vice Chairman, Unicode Technical Committee
via email: emuller@adobe.com
Mark Davis
President, Unicode Consortium
via email: markdavis@googlle.com

28 November 2010


Dear Ms. Moore:

I am writing to you in my role as chairman of the IDNABIS working group, addressing this request to you as president of the Unicode Consortium. As you know, treatment of the two characters Greek Small Letter Final Sigma (U+03C2) and Latin Small Letter Sharp S (U+00DF) has been the source of considerable discussion during the IDNABIS Working Group effort on specifying the IDNA2008 proposed replacement of the IDNA2003 standard for the use of Unicode in Internationalized Domain Names. Latin Capital Letter Sharp S (U+1E9E) was added in Unicode version 5.1.0, and recommended rules for its use were provided as shown below:


Begin quote from Unicode Version 5.1.0

Tailored Casing Operations

The Unicode Standard provides default casing operations. There are circumstances in which the default operations need to be tailored for specific locales or environments. Some of these tailorings have data that is in the standard, in the SpecialCasing.txt file, notably for the Turkish dotted capital I and dotless small i. In other cases, more specialized tailored casing operations may be appropriate. These include:

Titlecasing of IJ at the start of words in Dutch
Removal of accents when uppercasing letters in Greek
Uppercasing U+00DF (ß) LATIN SMALL LETTER SHARP S to the new U+1E9E LATIN CAPITAL LETTER SHARP S

However, these tailorings may or may not be desired, depending on the implementation in question.

In particular, capital sharp s is intended for typographical representations of signage and uppercase titles, and other environments where users require the sharp s to be preserved in uppercase. Overall, such usage is rare. In contrast, standard German orthography uses the string "SS" as uppercase mapping for small sharp s. Thus, with the default Unicode casing operations, capital sharp s will lowercase to small sharp s, but not the reverse: small sharp s uppercases to "SS". In those instances where the reverse casing operation is needed, a tailored operation would be required.

End quote from Unicode Version 5.1.0


In IDNA2003, Sharp S was mapped to "ss" by means of a casing operation that mapped lower case Sharp S to uppercase "SS" and then down to lowercase "ss". Registrations and lookups using the IDNA2003 rules applied this mechanism.

During the discussions in the IDNABIS Working Group on IDNA2008, a strong consensus developed around not mapping characters (for example, for registration purposes) and around preserving the property that the IDNA2008-defined A-Label and U-Label forms be fully symmetric (i.e., convertible into one another without change or loss).
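
The symmetry property can be illustrated with a minimal sketch using Python's standard-library `punycode` and `idna` codecs. The helper names `to_a_label`, `to_u_label` and `idna2003_lookup_form` are hypothetical, and the sketch performs none of the IDNA2008 validity checks. The stdlib `idna` codec implements the older IDNA2003 rules (RFC 3490), so it serves here only as the contrasting, lossy case.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Encode a label to its ASCII form without mapping any character."""
    if u_label.isascii():
        return u_label
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Decode an A-label back to Unicode; plain ASCII labels pass through."""
    if not a_label.startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def idna2003_lookup_form(label: str) -> str:
    """Apply the IDNA2003 mapping-plus-encoding step via the stdlib codec."""
    return label.encode("idna").decode("ascii")


label = "fußball"

# Without mapping, the conversion is symmetric: nothing is lost.
a_label = to_a_label(label)
assert to_u_label(a_label) == label

# With IDNA2003 mapping, the sharp s is replaced before encoding,
# so the original spelling cannot be recovered from the result.
assert idna2003_lookup_form(label) == "fussball"
```

In this sketch the unmapped form of `fußball` encodes to `xn--fuball-cta`, while the IDNA2003 codec yields the plain ASCII label `fussball`.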


During these same discussions, a consensus seemed to develop to permit (i.e., make "PVALID" in IDNA2008 parlance) Latin Small Letter Sharp S (U+00DF) and Greek Small Letter Final Sigma (U+03C2). The recommended casing actions of Unicode (i.e., toCaseFold) on Sharp S and Final Sigma produce "ss" in the case of Sharp S and Greek Small Letter Sigma (U+03C3) in the case of Final Sigma.
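
The casing behaviour described here can be checked with Python's built-in string methods, which follow the default Unicode case mappings. This is only an illustration: `two_step_mapping` is a hypothetical helper that mimics the upper-then-lower description given for Sharp S, not the actual IDNA2003 Nameprep implementation.

```python
SHARP_S = "\u00df"          # ß LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S
FINAL_SIGMA = "\u03c2"      # ς GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"            # σ GREEK SMALL LETTER SIGMA


def two_step_mapping(text: str) -> str:
    """Uppercase, then lowercase, as described for Sharp S in IDNA2003."""
    return text.upper().lower()


# Default casing: small sharp s uppercases to "SS", not to U+1E9E.
assert SHARP_S.upper() == "SS"
assert two_step_mapping(SHARP_S) == "ss"

# Capital sharp s lowercases to small sharp s, but not the reverse.
assert CAPITAL_SHARP_S.lower() == SHARP_S

# Case folding (toCaseFold) removes both characters under discussion.
assert SHARP_S.casefold() == "ss"
assert FINAL_SIGMA.casefold() == SIGMA
```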

To make the lowercase forms PVALID using the functional rules of IDNA2008, exceptions were required to overcome the recommended casing mechanics of Unicode (i.e., application of CaseFolding).

Note that IDNA2008 explicitly permits mapping for User Interface purposes:

a) draft-ietf-idnabis-protocol-17#section-5.2
c) draft-ietf-idnabis-rationale-14#section-4.4
d) draft-ietf-idnabis-rationale-14#section-6
e) draft-ietf-idnabis-rationale-14#section-7.3
f) draft-ietf-idnabis-mappings-05

If Small Letter Sharp S and Small Letter Final Sigma were to be made DISALLOWED, these mapping provisions would permit these characters to be handled as a User Interface matter prior to lookup.

Because the practices of IDNA2003 are in conflict with the proposed practices of IDNA2008, and because the Last Call discussions have surfaced controversy over the incorporation of the two lowercase forms in question, I request an organizational recommendation from UTC as to the treatment of these characters. Taking into account the prohibition of mapping on registration, which I take to be firm, and the requirement that A-Label and U-Label forms must be unambiguously convertible into each other, would the UTC recommend excluding Small Letter Sharp S and Small Letter Final Sigma from IDNA2008 by removing their exceptions and making each DISALLOWED?


A prompt response would be much appreciated considering we have delayed reporting the results of the IETF LAST CALL to the Internet Engineering Steering Group while this matter is debated.

Sincerely,

Vinton Cerf
Chairman, IDNABIS Working Group of the Internet Engineering Task Force


Unicode response

Dear Vint,

The UTC appreciates the difficulty for users of IDNs, the registries, and all involved if lowercase sharp s (Latin Small Letter Sharp S (U+00DF)) and small final sigma (Greek Small Letter Final Sigma (U+03C2)), in particular, are mapped to other characters. As you know, our concerns are compatibility and potential security issues. However, based on the many ongoing discussions and much thought, the UTC would not be opposed to having lowercase sharp s, final sigma, and even joiner and non-joiner be valid and not mapped, as long as there can be policies in place for a transition period (of say 5 years) that will manage the expected compatibility issues.

The key for us is having policies for a well-managed transition with sufficient time for browser and other application upgrades. Without such policies in place, we would favor continuing the IDNA2003 treatment of the four above-mentioned characters.

Best regards,
Lisa Moore
Chair, Unicode Technical Committee


Mark Davis, Unicode President

Problem

We would like to have the 4 deviation characters be valid, at some point. The key problem is that we don't want current URLs in web pages, etc. to go to two different locations depending on the browser, nor do we want joe@fußball.com to go sometimes to joe@fußball.com and sometimes to joe@fussball.com. Even once IDNA2008 is approved, for a long time a majority of the implementations will still be IDNA2003, so this also goes for new label registrations during the transition period.

Proposal

IDNA2008 changes as follows:

The 4 deviation characters get the property PVALID_AFTER_2015

The requirements are:

On registration, PVALID_AFTER_2015 is equivalent to PVALID
On lookup, PVALID_AFTER_2015 is treated as DISALLOWED up until 2016 Jan 1, 00:00:00 GMT, and treated as PVALID thereafter.
Implementations must not map the characters after the switchover date.
Implementations that map the characters before that date must map as in IDNA2003.
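
The requirements above can be sketched as a small date-dependent lookup table. This is a hedged illustration, not part of the proposal: the names `DEVIATION_CHARS`, `deviation_status` and `mapping_permitted` are hypothetical, the four characters are taken to be sharp s, final sigma, zero width non-joiner and zero width joiner, and the sketch assumes the switchover instant itself already counts as "thereafter".

```python
from datetime import datetime, timezone

# The 4 deviation characters (assumed code points: ß, ς, ZWNJ, ZWJ).
DEVIATION_CHARS = {"\u00df", "\u03c2", "\u200c", "\u200d"}

# 2016 Jan 1, 00:00:00 GMT, as stated in the proposal.
SWITCHOVER = datetime(2016, 1, 1, 0, 0, 0, tzinfo=timezone.utc)


def deviation_status(ch: str, operation: str, now: datetime) -> str:
    """Resolve PVALID_AFTER_2015 for one character, operation and time."""
    if ch not in DEVIATION_CHARS:
        raise ValueError("not a PVALID_AFTER_2015 character")
    if operation == "registration":
        return "PVALID"
    if operation == "lookup":
        # Assumption: the boundary instant is treated as PVALID.
        return "DISALLOWED" if now < SWITCHOVER else "PVALID"
    raise ValueError("operation must be 'registration' or 'lookup'")


def mapping_permitted(now: datetime) -> bool:
    """Mapping (as in IDNA2003) is only allowed before the switchover."""
    return now < SWITCHOVER


early_2010 = datetime(2010, 2, 1, tzinfo=timezone.utc)
feb_2016 = datetime(2016, 2, 1, tzinfo=timezone.utc)

assert deviation_status("\u00df", "registration", early_2010) == "PVALID"
assert deviation_status("\u00df", "lookup", early_2010) == "DISALLOWED"
assert deviation_status("\u00df", "lookup", feb_2016) == "PVALID"
```

Using timezone-aware datetimes keeps the comparison against the GMT switchover unambiguous regardless of the local clock.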

The goal is to

allow the 4 characters to become valid, as soon as possible
avoid the 'nightmare' scenario of the same URL going to two different locations, as much as possible.

Scenarios

Let's see what happens with fußball.xxx over time, where xxx is some registry (e.g., .de, .blogspot.com, or others). Background: essentially all browsers and other major implementations are planning to map for compatibility. We'll look at browsers, but this also applies to email, etc.


Early 2010 (just as IDNA2008 is approved)
At this time the world's browsers are 100% IDNA2003
  • browsers map fußball.xxx to fussball.xxx.
  • registries can start accepting eszett, and should bundle with ss.
  • fußball shows up as fussball in the address bar
  • note: it is only by convention that fussball is seen in the address bar in this case; a browser could also display fußball, as in UTS46.
  • results:
  • if the registry bundles, both fußball.xxx and fussball.xxx go to the same owner.
  • if the registry doesn't bundle, both fußball.xxx and fussball.xxx go to the same owner.
  • The odd IDNA2008 browser that doesn't map just fails, because ß is not PVALID; it doesn't take fußball.xxx to a different location than the vast majority of browsers.


In 2013
At this time the world's browsers are 50% IDNA2003, 50% IDNA2008

Same as above: no ambiguity in results.


In 2016 Feb
At this time the world's browsers are 1% IDNA2003, 99% IDNA2008
  • 99% of browsers switch to not mapping fußball.xxx.
  • Registries no longer need to bundle; they can have different owners for fußball.xxx and fussball.xxx.
  • fußball shows up as fußball in the address bar
  • results:
  • if the registry bundles, both fußball.xxx and fussball.xxx go to the same owner.
  • if the registry doesn't bundle, fußball.xxx and fussball.xxx go to different owners.
  • The odd IDNA2003 browser that is left goes to the wrong location for the affected languages; people who use such browsers need to upgrade.
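
The scenarios above can be replayed with a small simulation of which name each kind of browser actually looks up. This is a rough sketch under stated assumptions: `name_looked_up` and `idna2003_map` are hypothetical helpers, only the eszett mapping is modelled, and `None` stands for a lookup that simply fails.

```python
from datetime import datetime, timezone
from typing import Optional

SWITCHOVER = datetime(2016, 1, 1, tzinfo=timezone.utc)


def idna2003_map(label: str) -> str:
    """Model only the IDNA2003 eszett mapping (ß -> ss)."""
    return label.replace("\u00df", "ss")


def name_looked_up(label: str, browser: str, now: datetime,
                   maps_for_compat: bool = True) -> Optional[str]:
    """Return the label a browser resolves, or None if the lookup fails."""
    if browser == "IDNA2003":
        return idna2003_map(label)  # always maps, whatever the date
    if browser != "IDNA2008":
        raise ValueError("browser must be 'IDNA2003' or 'IDNA2008'")
    if now >= SWITCHOVER or "\u00df" not in label:
        return label  # no mapping after the switchover
    # Before the switchover ß is DISALLOWED on lookup: map as IDNA2003 or fail.
    return idna2003_map(label) if maps_for_compat else None


early_2010 = datetime(2010, 2, 1, tzinfo=timezone.utc)
feb_2016 = datetime(2016, 2, 1, tzinfo=timezone.utc)

# Early 2010: everyone ends up at fussball, or fails; never two locations.
assert name_looked_up("fußball", "IDNA2003", early_2010) == "fussball"
assert name_looked_up("fußball", "IDNA2008", early_2010) == "fussball"
assert name_looked_up("fußball", "IDNA2008", early_2010, False) is None

# Feb 2016: IDNA2008 browsers stop mapping; leftover IDNA2003 ones diverge.
assert name_looked_up("fußball", "IDNA2008", feb_2016) == "fußball"
assert name_looked_up("fußball", "IDNA2003", feb_2016) == "fussball"
```

The divergence in the last two lines only matters when the registry does not bundle, since bundled names go to the same owner either way.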


Acknowledgment of the WG/IDNABIS Chair

Dear Lisa,

Thank you and the UTC for its rapid response.

I believe that the discussions of the past week have confirmed a general consensus on the preference that Final Sigma and Sharp-S be PVALID. We did not poll for the joiner/non-joiner question because a consensus already existed, in my opinion, as chair, for these to be contextually valid (CONTEXTJ).

The method of introduction of IDNA2008 is important to all of us, to promote its utility. At the close of the day, I will review all of the comments received and attempt to synthesize what I believe is a plan around which consensus can be obtained.

Vint Cerf


WG/IDNABIS Chair assessment
