Newprep Debate

From IUCG - Internet Users Contributing Group

Proposed Charter for NewPrep WG

Version: 0.5
Last Updated: 2010-04-07

Problem Statement

The use of non-ASCII strings in Internet protocols requires additional processing to be handled properly. As part of the Internationalized Domain Names (IDN) work in 2003, a method for preparation and comparison of internationalized strings was defined and generalized to be re-used by other protocols. This "stringprep" method [RFC 3454] defines the overall framework whereas specific protocols define their own profiles. Known existing IETF profiles are:

  • The Nameprep profile [RFC 3491] for use in Internationalized Domain Names in Applications (IDNA)
  • The iSCSI profile [RFC 3722] for use in Internet Small Computer Systems Interface (iSCSI) Names
  • The Nodeprep and Resourceprep profiles [RFC 3920] for use in the Extensible Messaging and Presence Protocol (XMPP)
  • The Policy MIB profile [RFC 4011] for use in the Simple Network Management Protocol (SNMP)
  • The SASLprep profile [RFC 4013] for use in the Simple Authentication and Security Layer (SASL)
  • The trace profile [RFC 4505] for use with the SASL ANONYMOUS mechanism
  • The LDAP profile [RFC 4518] for use with LDAP

The IAB completed a review of IDN and made recommendations for changes [RFC 4690], which triggered a new version of the IDNA protocol called IDNA2008. Whereas IDNA2003 was tied to Unicode 3.2 via stringprep, IDNA2008 does not use the stringprep method, but instead uses an algorithm based on the properties of Unicode characters, which makes it agile to the Unicode database version. The protocols using stringprep need Unicode version agility and therefore need to investigate how to move from the current stringprep approach, with the associated challenges of backward compatibility and migration.

Objectives

The goal of this group is to assess whether a new method based on the new IDNA2008 algorithmic approach is the appropriate path forward for existing stringprep protocols as well as for other application protocols requiring internationalized strings.

The group will evaluate if a new generalized framework based on the algorithmic approach is appropriate and, if so, define it.

The group will analyze existing stringprep profiles and will do one of the following with regard to each profile:

1. Develop a replacement for the profile in close collaboration with the related protocol working group.

2. Collaborate with another active working group which will be developing the new profile as part of its charter.

3. Advise the authors of profiles that were produced outside the context of any working group regarding how to proceed.

The group will also define a set of best current practices for preparation and comparison of internationalized strings.

In completing its tasks, the working group should collaborate with other teams involved in internationalized identifiers, such as the IETF's IRI and EAI working groups as well as other relevant standards development organizations (e.g., the Unicode Consortium).

Deliverables

1. Problem statement / analysis of existing stringprep profiles (Informational).

2. Possible new framework to replace stringprep (Standards Track).

3. Possible replacements for existing stringprep profiles (Standards Track).

4. Internationalization guidelines (BCP).

Milestones

  • Aug 2010 - Accept problem statement document as a WG item
  • Nov 2010 - Accept framework document as a WG item
  • Nov 2010 - Accept new profile documents as WG items
  • Dec 2010 - Start Working Group Last Call on problem statement document
  • Jan 2011 - Submit problem statement document to the IESG
  • Jan 2011 - Accept guidelines document as a WG item
  • May 2011 - Start Working Group Last Call on framework document
  • May 2011 - Start Working Group Last Call on new profile documents
  • Jun 2011 - Submit framework document to the IESG
  • Jun 2011 - Submit new profile documents to the IESG
  • Jun 2011 - Start Working Group Last Call on guidelines document
  • Aug 2011 - Submit guidelines document to the IESG

At 22:22 03/02/2010, Peter Saint-Andre wrote

The following is a proposed charter for a "NewPrep" WG to work on replacements for stringprep profiles in application and security protocols such as XMPP and SASL. We plan to hold a BoF at IETF 77 to explore whether it makes sense to form a working group on this topic. If you are interested, please join the newprep@ietf.org list:

https://www.ietf.org/mailman/listinfo/newprep

My apologies for cross-posting; please send replies to newprep@ietf.org.

Thanks!

Peter


At 22:17 02/04/2010, Peter Saint-Andre wrote

Marc and I have uploaded draft minutes of the Newprep BoF here:

http://www.ietf.org/proceedings/10mar/minutes/newprep.txt

Corrections are welcome!

Peter


Newprep BoF Meeting Minutes

Wednesday, March 24th, 2010
IETF 77, Anaheim, CA, USA

BoF Co-chairs: Marc Blanchet, Peter Saint-Andre

Mailing list: newprep@ietf.org

Peter Saint-Andre described the rationale, goals, and expected outcome for this BoF.

Patrik Fältström described the work of IDNAbis and the rationale behind moving beyond Stringprep. Discussions with the floor touched on compatibility between IDNA2003 and IDNAbis and how this relates to protocols using Stringprep.

David Black, Peter Saint-Andre, and Kurt Zeilenga presented about the use of Stringprep in iSCSI, LDAP, SASL, and XMPP.

XMPP uses 3 different Stringprep profiles (one via IDNA2003). SASL uses 1 Stringprep profile. LDAP uses a family of profiles. iSCSI uses one Stringprep profile, where the identifier contains a domain name.

Most speakers stated that their use of Stringprep must be updated for various reasons such as:

  • update to latest Unicode standard
  • gain Unicode version agility, as in IDNA2008
  • better bidi support

Most speakers care about backward compatibility issues, such as the characters that were removed in IDNAbis, the change in the normalization form, and case-mapping. These would need to be addressed for protocols replacing Stringprep.

Most speakers stated that this new work should not be done in their respective working groups, since the expertise in i18n is not there. They would strongly prefer that it be done by a group of i18n experts, in close discussion with the respective working groups.

The BoF was polled on the following questions:

  • Do people think there is a problem to solve? Unanimously yes.
  • Do people think the problem should be solved by the related working groups or by forming a specific working group? Unanimously: forming a specific working group.

END


At 21:54 07/04/2010, Peter Saint-Andre wrote:


As a follow-up to IETF 77, Marc and I have been working on a revised charter. Feedback is welcome.

At 12:51 08/04/2010, Alexey Melnikov wrote

This looks good. A couple of minor comments below: [...]


The group will analyze existing stringprep profiles and will do one of the following with regard to each profile:

1. Develop a replacement for the profile in close collaboration with the related protocol working group.

2. Collaborate with another active working group which will be developing the new profile as part of its charter.

3. Advise the authors of profiles that were produced outside the context of any working group regarding how to proceed.

I think it doesn't matter whether a stringprep profile was produced in a WG; what matters is whether there is a current WG related to the profile. [...]


Deliverables

1. Problem statement / analysis of existing stringprep profiles (Informational).

2. Possible new framework to replace stringprep (Standards Track).

3. Possible replacements for existing stringprep profiles (Standards Track).

IESG would probably like to see a specific list of profiles, not an open-ended list. This can be used to explicitly exclude some profiles, if desired. But either way the charter should explicitly state either that it will only work on a fixed list of profiles (and working on new ones would require rechartering), or that any new profile can be added at any time.


4. Internationalization guidelines (BCP).

At 05:13 09/04/2010, YAO Jiankang wrote:

>4. Internationalization guidelines (BCP).

How about renaming it to "Internationalized string guidelines"?

"Internationalization guidelines" could mean many things: internationalization of whatever it is.

If we focus on stringprep, limiting the scope to internationalized strings seems better.

At 06:23 09/04/2010, Peter Saint-Andre wrote

On 4/8/10 4:51 AM, Alexey Melnikov wrote: ‘’I think it doesn't matter whether a stringprep profile was produced in a WG; what matters is whether there is a current WG related to the profile.’’

True.

‘’IESG would probably like to see a specific list of profiles, not an open-ended list. This can be used to explicitly exclude some profiles, if desired. But either way the charter should explicitly state either that it will only work on a fixed list of profiles (and working on new ones would require rechartering), or that any new profile can be added at any time.’’

That's a good point, but on the other hand there aren't many existing profiles. They're listed at the beginning of the charter. Now, perhaps it makes sense to prioritize them, but the tasks are limited.

At 14:38 09/04/2010, Marc Blanchet wrote

‘’4. Internationalization guidelines (BCP). How about renaming it to "Internationalized string guidelines"?’’

Right. I agree.

Marc.

‘’perhaps it makes sense to prioritize them, but the tasks are limited.’’

I guess there is no harm in re-listing them, but at the same time, the sentence in 3. includes "existing" which is a finite, well-known set. And I don't think we should exclude any profile at this point.

Therefore, I'm neutral either way, but I think the current text is OK, given the word "existing".

Marc.

At 00:08 10/04/2010, Mark Lentczner wrote:

I thought you all might be interested in a report on my current work in this area:

Both as part of my work in the VWRAP working group, and for an internal design problem at the company I work for (Linden Lab), I have been deeply exploring the issues of defining acceptable identifiers from the range of Unicode strings. Alas, I find myself in a similar boat as the protocols discussed at the BoF: There is no longer a definitive spec to build on!

I started by finding guidance from three major sources:

  • StringPrep (RFC 3454)
  • IDNAbis 2008 (specifically the tables document)
  • Unicode 5.2 UAX #15 & UAX #31 (normalization and identifier syntax)

From each of these, I extracted the core algorithm for validating characters in identifiers. In the case of UAX #31, and StringPrep, I had to develop profiles of each to match my intended use (user names). I tried to make the simplest choices here and follow the guidance that each document had.

For StringPrep I also had to "rebase" the work: It is defined in terms of Unicode 3.2. What I did was take the guidance from the text, investigate the Unicode Character Database (UCD), and develop what I think the intent of StringPrep is, now defined in terms of properties from the UCD rather than just lists of code points.

In the cases of IDNAbis 2008 and UAX #31, I also, for now, ignored the contextual checks that a very few characters in those specs require.

I rendered each as a function in Python that took a character and returned one of: ALLOWED, DISALLOWED, STRIPPED, UNASSIGNED. Since StringPrep and UAX #31 don't distinguish UNASSIGNED from DISALLOWED, I added that discrimination logic into each of those for purposes of comparison. I recognize that these functions do not represent a precise implementation of the relevant specs, but they are close enough for the purpose of trying to gain an insight into how these specs are designed and how I might proceed with building the specs I will need.
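To illustrate the shape of such a per-character function, here is a minimal sketch in Python 3. It is not the code described in this message, and the rule set (a handful of General_Category values) is a hypothetical simplification chosen only for illustration; it does not implement IDNA2008, StringPrep, or UAX #31.

```python
import unicodedata

ALLOWED, DISALLOWED, STRIPPED, UNASSIGNED = (
    "ALLOWED", "DISALLOWED", "STRIPPED", "UNASSIGNED")

# Hypothetical, deliberately simplified rule set keyed on General_Category.
# It is NOT the IDNA2008, StringPrep, or UAX #31 rule set.
_ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def classify(ch):
    """Return one of the four verdicts for a single character."""
    category = unicodedata.category(ch)
    if category == "Cn":   # not assigned in this Python's Unicode version
        return UNASSIGNED
    if category == "Cf":   # format controls, e.g. zero-width characters
        return STRIPPED
    if category in _ALLOWED_CATEGORIES:
        return ALLOWED
    return DISALLOWED
```

Because the verdict is derived from character properties rather than a fixed list of code points, the result tracks whichever Unicode version the running interpreter ships with (see `unicodedata.unidata_version`).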

I then ran every Unicode code point from U+0000 through U+10FFFF through these three functions and generated lists of blocks of code points where they differ. There are far fewer blocks than I imagined, only 282 blocks that break down as:

  1 IDNA:  ALLOWED      ||  PREP:  DISALLOWED   ||  UAX31:  ALLOWED
 50 IDNA:  DISALLOWED   ||  PREP:  ALLOWED      ||  UAX31:  ALLOWED
230 IDNA:  DISALLOWED   ||  PREP:  ALLOWED      ||  UAX31:  DISALLOWED
 11 IDNA:  DISALLOWED   ||  PREP:  STRIPPED     ||  UAX31:  DISALLOWED

Looking at the details, these four categories of blocks are:

  • 1) U+0340 & U+0341, combining tone marks
  • 2) Various punctuation, inter-character fillers, letter-like numbers, symbol combining forms, and variation selectors
  • 3) Various symbols, and punctuation (mostly script specific punctuation)
  • 4) Non-semantic zero-width, deprecated, language tags
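The comparison described above (run every code point through several classifiers, then collapse the disagreements into blocks) can be sketched as follows. This is an assumption about how such a comparison might be written, not the code used to produce the counts above; `differing_blocks` and the two toy classifiers are hypothetical names introduced for illustration.

```python
import unicodedata
from itertools import groupby


def differing_blocks(classifiers, start=0x0000, stop=0x10FFFF):
    """Yield (first, last, verdicts) for each maximal run of consecutive
    code points on which the classifiers disagree in the same way."""
    def verdicts(cp):
        ch = chr(cp)
        return tuple(f(ch) for f in classifiers)

    scored = ((cp, verdicts(cp)) for cp in range(start, stop + 1))
    # groupby only merges adjacent items, so each group is a contiguous run.
    for key, run in groupby(scored, key=lambda item: item[1]):
        run = list(run)
        if len(set(key)) > 1:  # at least two classifiers disagree
            yield run[0][0], run[-1][0], key


# Two toy classifiers (hypothetical stand-ins for the real rule sets).
def letters_only(ch):
    ok = unicodedata.category(ch).startswith("L")
    return "ALLOWED" if ok else "DISALLOWED"


def letters_and_digits(ch):
    ok = unicodedata.category(ch)[0] in "LN"
    return "ALLOWED" if ok else "DISALLOWED"


ascii_blocks = list(
    differing_blocks([letters_only, letters_and_digits], 0x00, 0x7F))
```

Over the ASCII range the two toy classifiers disagree only on the digits, so `ascii_blocks` holds a single block covering U+0030 through U+0039.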

It seems clear to me that StringPrep differs primarily in being more liberal with symbols, punctuation and numbers. I will choose the more conservative path. StringPrep also strips some things that the others disallow. This could be considered part of a "pre-sanitization" step (like normalization), and so is essentially an independent design choice.

The remaining differences between IDNAbis 2008 and UAX #31 are that the latter is more liberal with punctuation, numbers and a few other things. Since UAX #31 is expected to be "augmented" by adjusting its base character classes based on what is legal punctuation in a given scenario, these differences may not ultimately be that significant. For my projects at hand, I will lean toward IDNAbis' more restrictive set.

Hence, my current design for work will be essentially IDNAbis 2008 with a pre-sanitization step involving some stripping and NFKC normalization. However, implementing IDNAbis 2008 directly and practically has some significant problems:
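A pre-sanitization step of this kind might look like the sketch below, written in Python 3. Which characters get stripped is an assumption made here for illustration (all format characters, General_Category Cf); the message does not specify the actual set, and `presanitize` is a hypothetical name.

```python
import unicodedata


def presanitize(text):
    """Strip format characters (General_Category Cf), then apply NFKC.

    The choice of Cf as the stripped set is an illustrative assumption.
    """
    kept = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return unicodedata.normalize("NFKC", kept)
```

For example, `presanitize("a\u200bb")` drops the zero-width space, and `presanitize("\ufb01le")` folds the "fi" ligature into the two letters "f" and "i". Validation against the allowed-character rules would then run on the sanitized result.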

  • Since I want the design to be implementable in readily available environments, I need to ensure that the required tests will be possible. As a strawman, I took Python 2.4.4 as a baseline. This presents problems for IDNAbis 2008: many of its tests are based on character properties in the Unicode Character Database, but Python's unicodedata module offers only a few of the properties from the UCD. This leaves the implementer the choice of either hard-coding lists of characters or writing something that preprocesses the UCD. This isn't very satisfying.
  • Further, IDNAbis 2008's registry of contextual exceptions (not unlike UAX #31's list of restricted contexts) is a poser: Do we define something like identifiers that relies on an external registry of algorithmic fragments? That feels awkward for developers and for code that may be deployed with a rather long shelf life.
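The "preprocess the UCD" option can be sketched roughly as below, in Python 3. The parser assumes the usual UCD data-file layout (`range ; property # comment`); the function names are hypothetical, and the sample lines are a small illustrative excerpt in the style of DerivedCoreProperties.txt rather than a complete data file.

```python
def parse_ucd_properties(lines):
    """Parse UCD-style 'range ; property # comment' lines into a dict
    mapping property name -> list of (first, last) code point ranges."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments, blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        code_range, prop = fields[0], fields[1]
        first, _, last = code_range.partition("..")
        table.setdefault(prop, []).append(
            (int(first, 16), int(last or first, 16)))
    return table


def has_property(table, prop, ch):
    """Linear scan over the parsed ranges; fine for a sketch."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in table.get(prop, ()))


# Illustrative excerpt only, not a full UCD file.
SAMPLE = """\
# Excerpt in the style of DerivedCoreProperties.txt
0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Alphabetic # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Alphabetic # Lo       FEMININE ORDINAL INDICATOR
""".splitlines()

table = parse_ucd_properties(SAMPLE)
```

A deployed implementation would read the data file for a specific Unicode version and would likely use a sorted range table with binary search instead of a linear scan.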

Next week I'll be designing a proposal (and Python implementation) for our internal use, working around these issues. I'm happy to share that work here if the group finds it useful. I hope that my experience with this will be able to inform the work here, and the work with identifiers in the VWRAP working group. Also, if anyone wants more details of the work listed above (including the Python code), just ask.

- Mark


At 09:47 13/04/2010, Alexey Melnikov wrote:

‘’how about rename it to "Internationalized string guidelines "?’’

‘’right. I agree.’’

I like this better as well. I think it would even be better to say "Internationalized string normalization guidelines" (change "normalization" to "preparation", if you think this is better.)

At 21:00 15/04/2010, Marc Blanchet wrote:

At 23:40 16/04/2010, Mark Lentczner wrote

‘’On Apr 7, 2010, at 12:54 PM, Peter Saint-Andre wrote: Proposed Charter for NewPrep WG Version: 0.5 Last Updated: 2010-04-07’’

Looks good in general to me!


‘’4. Internationalization guidelines (BCP). ...’’

‘’However, I think we are getting way too much into the deep details of document titles...’’

Indeed! For the charter, I think the above is fine. For general discussion, I have two thoughts:

1) I agree we can't use "normalization" because that has a well-defined meaning in Unicode

2) I dislike "Internationalization" in this context: It makes me feel like the "other" kind of string is normal, and these are "the special" ones you need to do once you're "a big boy"! Therefore, I'd tend toward titles like "Unicode String Preparation Guidelines" or just "String Preparation Guidelines".

‘’Milestones’’

I'm a little concerned that the new profile documents generally precede the more general guidelines document. I admit that there is a chicken-and-egg problem here, but I'd worry that we're signing up to deliver profile recommendations for the specific protocols before we've come to closure on the general framework, and so run the risk that those profiles may not conform to the final guidelines.

Mind you, the dates are close, perhaps we are only expecting the dotting of 'ı's and crossing of 't's in the guidelines document after the profiles go to Last Call.

- Mark

At 22:25 20/04/2010, Peter Saint-Andre wrote

‘’ Looks good in general to me!’’

Super, thanks for the feedback.

‘’ 4. Internationalization guidelines (BCP).’’

‘’ String Preparation Guidelines’’

I like "String Preparation Guidelines".

‘’ and so run the risk that those profiles may not conform to the final guidelines’’.

Yes, I think those efforts need to proceed in parallel.

‘’Mind you, the dates are close, perhaps we are only expecting the dotting of 'ı's and crossing of 't's in the guidelines document after the profiles go to Last Call.’’

That's how I see it.

Nice use of U+0131 there.

At 16:40 22/04/2010, Peter Saint-Andre wrote

‘’This document is probably of interest here...’’

A New Internet-Draft is available from the on-line Internet-Drafts directories.

Title  : Mapping Characters in IDNA2008
Author(s)  : P. Resnick, P. Hoffman
Filename  : draft-resman-idna2008-mappings-01.txt
Pages  : 7
Date  : 2010-04-19

In the original version of the Internationalized Domain Names in Applications (IDNA) protocol, any Unicode code points taken from user input were mapped into a set of Unicode code points that "made sense", and then encoded and passed to the domain name system (DNS). The IDNA2008 protocol presumes that the input to the protocol comes from a set of "permitted" code points, which it then encodes and passes to the DNS, but does not specify what to do with the result of user input. This document describes the actions that can be taken by an implementation between user input and passing permitted code points to the new IDNA protocol.
