IDNA and getnameinfo() and getaddrinfo()
From IUCG - Internet Users Contributing Group
1406101926 - Nicolas Williams <Nicolas.Williams@oracle.com>
Hello, I'm not subscribed to this list, so please Cc me on replies.
Over in the NFSv4 WG we're discussing how to fix NFSv4.1 to properly handle IDNA. In the process of doing so I ran into draft-iab-idn-encoding, which has a cogent discussion of name service switches (pictured in figure 2).
draft-iab-idn-encoding aims for Informational status. I'm wondering if we could publish a Standards-Track document describing how getnameinfo() and getaddrinfo() should handle IDNA.
For example, one could say that when using DNS getnameinfo() should:
- perform the DNS lookup
- apply ToUnicode() to the resulting domainname
- attempt to convert the address' name to the caller's locale's codeset
  if that codeset is not UTF-8
- if failure, then return the A-label as the canonical hostname
- if success return the U-label (in the caller's locale's codeset)
  as the canonical hostname and the A-label as an alias
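The post-lookup steps listed above can be sketched as follows. This is only an illustration, not the behaviour of any shipping getnameinfo(): the function name choose_canonical is hypothetical, Python's built-in "idna" codec (which implements IDNA2003) stands in for ToUnicode(), and the DNS lookup itself is assumed to have already produced an A-label name.

```python
def choose_canonical(dns_name: str, locale_codeset: str):
    """Return (canonical, aliases) for a name that came back from DNS.

    dns_name is assumed to be in A-label (ASCII) form.
    """
    try:
        # ToUnicode(), here via Python's built-in IDNA2003 "idna" codec.
        u_name = dns_name.encode("ascii").decode("idna")
        # Check that the U-label form is representable in the caller's codeset.
        u_name.encode(locale_codeset)
    except UnicodeError:
        # Conversion failed: return the A-label as canonical, with no alias.
        return dns_name, []
    if u_name == dns_name:
        # Plain ASCII name: there is no separate A-label to report.
        return dns_name, []
    # Conversion succeeded: U-label is canonical, A-label is the alias.
    return u_name, [dns_name]


print(choose_canonical("xn--bcher-kva.example", "utf-8"))
print(choose_canonical("xn--bcher-kva.example", "ascii"))
```

In a UTF-8 or Latin-1 locale the caller would see the U-label first; in an ASCII-only locale the A-label is all that can be returned.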
And that when using DNS getaddrinfo() should:
- convert the given host/domainname from the caller's locale's codeset
  to UTF-8 if necessary
- apply ToASCII(), perform DNS lookups
- if success, return the IP address(es) found, the given name as the
  canonical hostname, the A-label form of the hostname as an alias,
  and the U-label form (converted to the caller's locale's codeset)
  as an alias if different from the given hostname.
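The input side of that proposal might look roughly like this. It is a sketch under assumptions: prepare_lookup is a hypothetical name, the "idna" codec (IDNA2003) stands in for ToASCII()/ToUnicode(), a Python str stands in for the intermediate UTF-8 form, and no DNS query is actually sent.

```python
def prepare_lookup(given: bytes, locale_codeset: str):
    """Return (dns_query_name, canonical, aliases) for a getaddrinfo()-like call."""
    # Locale codeset -> Unicode (a Python str stands in for the UTF-8 step).
    given_text = given.decode(locale_codeset)
    # ToASCII(): the form that would be used in the DNS query.
    a_name = given_text.encode("idna").decode("ascii")
    # ToUnicode() of the A-label form gives the normalized U-label form.
    u_name = a_name.encode("ascii").decode("idna")
    # The A-label form is always offered as an alias ...
    aliases = [a_name]
    # ... and the U-label form only if it differs from what the caller gave.
    if u_name != given_text:
        aliases.append(u_name)
    # The given name itself is reported as canonical.
    return a_name, given_text, aliases


query, canonical, aliases = prepare_lookup("Bücher.example".encode("latin-1"), "latin-1")
print(query, canonical, aliases)
```

Note that the mixed-case input is case-folded by ToASCII(), so the U-label alias differs from the given name here.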
Would you agree? This would greatly simplify the application of IDNA to various application protocols, such as, for example, NFSv4. NFSv4 has several domainname slots, and several more coming from ancillary protocols in current development. Being able to send un-pre-processed Unicode in NFS because the peer's getaddrinfo() must handle that correctly seems like a very good approach -- this way IDNA does not have to interfere with non-DNS name services.
Unfortunately we probably cannot rely on getnameinfo()/getaddrinfo() doing the Right Thing. A Standards-Track RFC on this would probably help.
Nico
1406102114 - Dave Thaler <dthaler@microsoft.com>
> Hello, I'm not subscribed to this list, so please Cc me on replies.
>
> Over in the NFSv4 WG we're discussing how to fix NFSv4.1 to properly handle IDNA. In the process of doing so I ran into draft-iab-idn-encoding, which has a cogent discussion of name service switches (pictured in figure 2).
>
> draft-iab-idn-encoding aims for Informational status. I'm wondering if we could publish a Standards-Track document describing how getnameinfo() and getaddrinfo() should handle IDNA.
>
> For example, one could say that when using DNS getnameinfo() should:
Be careful not to confuse getnameinfo() with DNS. As noted in draft-iab-idn-encoding and RFC 3493, DNS is just one of a number of mechanisms used under getnameinfo().
> - perform the DNS lookup
> - apply ToUnicode() to the resulting domainname
> - attempt to convert the address' name to the caller's locale's codeset
>   if that codeset is not UTF-8
> - if failure, then return the A-label as the canonical hostname
> - if success return the U-label (in the caller's locale's codeset)
>   as the canonical hostname and the A-label as an alias
>
> And that when using DNS getaddrinfo() should:
>
> - convert the given host/domainname from the caller's locale's codeset
>   to UTF-8 if necessary
> - apply ToASCII(), perform DNS lookups
As discussed in draft-iab-idn-encoding section 3, it's not that simple. The ACE form applies in the public DNS but does not apply in many private DNS clouds.
> - if success, return the IP address(es) found, the given name as the
>   canonical hostname, the A-label form of the hostname as an alias,
>   and the U-label form (converted to the caller's locale's codeset)
>   as an alias if different from the given hostname.
The addrinfo structure returned by getaddrinfo() does not return "aliases" per se. It can return a single string which is:
    char *ai_canonname; /* canonical name for nodename */
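The single-canonical-name shape of the result is also visible through Python's socket.getaddrinfo(), which wraps the C call and returns 5-tuples with one canonname field and no alias list. The helper name canon_names is hypothetical; a numeric host with AI_NUMERICHOST is used so the sketch does not depend on any name service being reachable.

```python
import socket


def canon_names(host: str, flags: int = 0) -> list:
    """Collect the canonname field from each getaddrinfo() result."""
    infos = socket.getaddrinfo(
        host, None, type=socket.SOCK_STREAM,
        flags=flags | socket.AI_CANONNAME,
    )
    # Each entry is (family, type, proto, canonname, sockaddr): there is
    # exactly one name slot per entry and nothing resembling h_aliases.
    return [canonname for (_family, _type, _proto, canonname, _sockaddr) in infos]


print(canon_names("127.0.0.1", socket.AI_NUMERICHOST))
```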
> Would you agree? This would greatly simplify the application of IDNA to various application protocols, such as, for example, NFSv4. NFSv4 has several domainname slots, and several more coming from ancillary protocols in current development. Being able to send un-pre-processed Unicode in NFS because the peer's getaddrinfo() must handle that correctly seems like a very good approach -- this way IDNA does not have to interfere with non-DNS name services.
In my view, yes you're on the right track in having NFSv4 not want to do encoding conversion itself for name resolution but in expecting it to be done under getaddrinfo/getnameinfo.
> Unfortunately we probably cannot rely on getnameinfo()/getaddrinfo() doing the Right Thing. A Standards-Track RFC on this would probably help.
Well, API RFCs (like RFC 3493 for getnameinfo/getaddrinfo) are Informational, not Standards-Track. But yes, an RFC would probably help.
-Dave
1406102220 - Dave Thaler <dthaler@microsoft.com>
> On Mon, Jun 14, 2010 at 07:14:12PM +0000, Dave Thaler wrote:
>>> Over in the NFSv4 WG we're discussing how to fix NFSv4.1 to properly handle IDNA. In the process of doing so I ran into draft-iab-idn-encoding, which has a cogent discussion of name service switches (pictured in figure 2).
>>>
>>> draft-iab-idn-encoding aims for Informational status. I'm wondering if we could publish a Standards-Track document describing how getnameinfo() and getaddrinfo() should handle IDNA.
>>>
>>> For example, one could say that when using DNS getnameinfo() should:
>>
>> Be careful not to confuse getnameinfo() with DNS.
>
> I explicitly pointed out the name service switch architecture usually implemented. I thought that'd suffice to clarify that I really meant "the DNS plug-in to the getnameinfo() entry point in the name service switch" -- I just didn't want to be too redundant.
>
>> As noted in draft-iab-idn-encoding and RFC 3493, DNS is just one of a number of mechanisms used under getnameinfo().
>
> Right, and I believe the failure to acknowledge this in the original IDNA architecture was a significant failure. I'm disappointed that though this is being acknowledged now, it's not in a standards-track document.
>
>>> - perform the DNS lookup
>>> - apply ToUnicode() to the resulting domainname
>>> - attempt to convert the address' name to the caller's locale's codeset
>>>   if that codeset is not UTF-8
>>> - if failure, then return the A-label as the canonical hostname
>>> - if success return the U-label (in the caller's locale's codeset)
>>>   as the canonical hostname and the A-label as an alias
>>>
>>> And that when using DNS getaddrinfo() should:
>>>
>>> - convert the given host/domainname from the caller's locale's codeset
>>>   to UTF-8 if necessary
>>> - apply ToASCII(), perform DNS lookups
>>
>> As discussed in draft-iab-idn-encoding section 3, it's not that simple. The ACE form applies in the public DNS but does not apply in many private DNS clouds.
>
> I'm not sure I care about those, but one could always implement lists of domains below which to apply alternative algorithms.
You may not care about them but unfortunately people who provide getaddrinfo/getnameinfo libraries for applications in general need to care about them.
> I was specifically interested in what name should be returned as canonical and what name should be returned as an alias, if any.
>
>>> - if success, return the IP address(es) found, the given name as the
>>>   canonical hostname, the A-label form of the hostname as an alias,
>>>   and the U-label form (converted to the caller's locale's codeset)
>>>   as an alias if different from the given hostname.
>>
>> The addrinfo structure returned by getaddrinfo() does not return "aliases" per se. It can return a single string which is:
>>     char *ai_canonname; /* canonical name for nodename */
>
> Oh, right. How depressing. I'd for some reason thought them similar enough to gethostbyname/gethostbyaddr().
Unfortunately they're not.
> So remove all references to aliases from my previous post; instead these functions should return the A-label as the canon name only when the U-label cannot be converted to the caller's locale's codeset losslessly, else they should return the U-label (in the caller's locale's codeset) as the canon name.
I'd argue that the "canon name" should be the form in which it was resolved over the wire. So the A-label form if it was resolved in the public DNS, and another form (typically the U-label form) if it was resolved via something else (e.g., mDNS or DNS in a private namespace using UTF-8 or whatever else). Also note that Windows treats "char *" as ANSI (which has no guarantee of interoperability) and hence deprecates getaddrinfo/getnameinfo, and defines UTF-16 versions (GetAddrInfoW/GetNameInfoW). MacOS on the other hand treats "char *" as UTF-8.
RFC 3493 doesn't say either way whether "char *" is ANSI or UTF-8 or whatever else, and as far as I know, neither does POSIX (http://www.opengroup.org/onlinepubs/9699919799/functions/getaddrinfo.html).
Hence this is an issue for anyone proposing to make a standards-track RFC for getaddrinfo/getnameinfo.
>> In my view, yes you're on the right track in having NFSv4 not want to do encoding conversion itself for name resolution but in expecting it to be done under getaddrinfo/getnameinfo.
>
> Would more advice to protocol designers be appropriate then? When should application protocols (ignoring domainname registration related protocols) care to specify A-labels-only, U-labels-only, both, or un-pre-processed Unicode?
>
> If we could assume IDNA-aware getnameinfo()/getaddrinfo() then is there any reason for application protocols [that don't involve domainname registration] to do anything other than allow all three forms (A-label, U-label and un-pre-processed Unicode) on the wire?
I'd argue any new application protocol ought to specify the encoding rather than allowing multiple. Specifying UTF-8 would be good :-)
-Dave
1406102233 - Simon Josefsson <simon@josefsson.org>
> RFC 3493 doesn't say either way whether "char *" is ANSI or UTF-8 or whatever else, and as far as I know, neither does POSIX (http://www.opengroup.org/onlinepubs/9699919799/functions/getaddrinfo.html).
Normally in POSIX, strings are encoded in the locale coding system. If you are in a UTF-8 locale, the string can be assumed (by the getaddrinfo implementation) to be UTF-8. Otherwise it needs to be transcoded. This is how GNU Libc's IDN support works, see:
http://git.savannah.gnu.org/cgit/libidn.git/tree/libc/getaddrinfo-idn.txt
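The locale rule described above (use the string as-is in a UTF-8 locale, transcode otherwise) can be illustrated like this. It is a sketch only, not glibc's code: to_utf8 is a hypothetical helper, and when no codeset is passed it falls back to whatever locale.getpreferredencoding() reports for the running process.

```python
import codecs
import locale
from typing import Optional


def to_utf8(name: bytes, codeset: Optional[str] = None) -> bytes:
    """Return the host name as UTF-8 bytes, transcoding only when needed."""
    if codeset is None:
        # Assumption: the process locale tells us how the caller's strings are encoded.
        codeset = locale.getpreferredencoding(False)
    if codecs.lookup(codeset).name == "utf-8":
        # UTF-8 locale: the string can be assumed to be UTF-8 already.
        return name
    # Any other codeset: decode from it and re-encode as UTF-8.
    return name.decode(codeset).encode("utf-8")


print(to_utf8(b"b\xfccher.example", "iso-8859-1"))
```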
/Simon
1506100142 - Nicolas Williams <Nicolas.Williams@oracle.com>
> Normally in POSIX, strings are encoded in the locale coding system. If you are in a UTF-8 locale, the string can be assumed (by the getaddrinfo implementation) to be UTF-8. Otherwise it needs to be transcoded. This is how GNU Libc's IDN support works, see:
>
> http://git.savannah.gnu.org/cgit/libidn.git/tree/libc/getaddrinfo-idn.txt
I like that approach.
>> I'm not sure I care about those, but one could always implement lists of domains below which to apply alternative algorithms.
>
> You may not care about them but unfortunately people who provide getaddrinfo/getnameinfo libraries for applications in general need to care about them.
For the matter of this discussion, I don't care. If I were implementing
I'd consider providing a local administrative configuration interface by
which to provide lists of private cloud domains that use alternative IDN
schemes. (Actually, I'd probably want a distributed configuration
method for that, preferably using DNS itself, but really, that's a
tangent I don't want to go on because it's a distraction from the
purpose of this thread.)
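A locally configured list of that kind could be consulted along these lines. Everything here is hypothetical (the wire_form name, the shape of the table, the corp.example suffix); it only shows suffix matching on whole labels, with the public-DNS ACE form as the fallback.

```python
def wire_form(name: str, private_suffixes: dict) -> bytes:
    """Encode a host name for the wire according to a local suffix policy.

    private_suffixes maps a domain suffix to the encoding used below it,
    e.g. {"corp.example": "utf-8"}. Names outside every listed suffix get
    the ACE (A-label) form via Python's IDNA2003 "idna" codec.
    """
    labels = name.rstrip(".").lower().split(".")
    # Try the longest suffix first, matching on whole labels only.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        encoding = private_suffixes.get(suffix)
        if encoding is not None:
            return name.encode(encoding)
    return name.encode("idna")


policy = {"corp.example": "utf-8"}  # hypothetical local configuration
print(wire_form("bücher.corp.example", policy))
print(wire_form("bücher.example", policy))
```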
>> So remove all references to aliases from my previous post; instead these functions should return the A-label as the canon name only when the U-label cannot be converted to the caller's locale's codeset losslessly, else they should return the U-label (in the caller's locale's codeset) as the canon name.
>
> I'd argue that the "canon name" should be the form in which it was resolved over the wire. So the A-label form if it was resolved in the public DNS, and another form (typically the U-label form) if it was resolved via something else (e.g., mDNS or DNS in a private namespace using UTF-8 or whatever else).
> Also note that Windows treats "char *" as ANSI (which has no guarantee of interoperability) and hence deprecates getaddrinfo/getnameinfo, and defines UTF-16 versions (GetAddrInfoW/GetNameInfoW).
> MacOS on the other hand treats "char *" as UTF-8.
Better yet, Simon's proposal allows the caller to decide which name should be returned as canonical. That works for me.
> RFC 3493 doesn't say either way whether "char *" is ANSI or UTF-8 or whatever else, and as far as I know, neither does POSIX (http://www.opengroup.org/onlinepubs/9699919799/functions/getaddrinfo.html).
See Simon's reply.
> Hence this is an issue for anyone proposing to make a standards-track RFC for getaddrinfo/getnameinfo.
I'd be willing to specify new functions with different names if need be. But it seems to me that between getaddrinfo()'s hints and getnameinfo()'s flags arguments we have enough room for extensibility without resorting to new function names.
>> If we could assume IDNA-aware getnameinfo()/getaddrinfo() then is there any reason for application protocols [that don't involve domainname registration] to do anything other than allow all three forms (A-label, U-label and un-pre-processed Unicode) on the wire?
>
> I'd argue any new application protocol ought to specify the encoding rather than allowing multiple. Specifying UTF-8 would be good :-)
Just UTF-8, un-pre-processed, raw user input? Or did you mean U-labels?
Also, with respect to deployed protocols that have protocol elements for carrying domainnames, where those protocol elements are defined as carrying UTF-8, but where in practice most implementors did not actually code those slots as IDN-aware, wouldn't it be a strong presumption that the slots are IDN-unaware?
1506100930 - Internet-Drafts@ietf.org <Internet-Drafts@ietf.org>
Title: Internationalized Domain Names support in POSIX getaddrinfo
- Author(s): S. Josefsson
Filename: draft-josefsson-getaddrinfo-idn-00.txt
This document describes an extension for Internationalized Domain Names support in the POSIX getaddrinfo function.
A URL for this Internet-Draft is: http://www.ietf.org/internet-drafts/draft-josefsson-getaddrinfo-idn-00.txt
1506100940 - Simon Josefsson <simon@josefsson.org>
I've published my getaddrinfo extension writeup as an IETF document, to facilitate discussions. See announcement below.
It isn't clear from the writeup, but this code has been shipping in GNU Libc for many years, so if you are running GNU/Linux chances are you have had this functionality for quite some time. You can test it through a code snippet like this:
http://git.savannah.gnu.org/cgit/libidn.git/tree/libc/example.c
/Simon
1506101735 - jefsey <jefsey@jefsey.com>
I am afraid my Franglish may have got you wrong. IMHO IDNA2008 is perfect (it would not have reached consensus otherwise). IDNA2003 was not, because its basic concept of IDNA is flawed. The IAB I_D is a correct entry point to correct IDNA. The WG/IDNABIS "Mapping" consensual document is a correct entry point/example of the work to be achieved to properly encapsulate IDNA2008 on the use(r)side (what we call IDNA2010 [so it is not confused, as the ccNSO started doing it, with a new version of IDNA2008]). Hence IDNA2012 is the coordination between IDNA2008 on the netside and IDNA2010 on the use(r)side, until someone comes with a better and complete Internet Domain Name (IDN) framework.
On Tue, Jun 15, 2010 at 03:16:08AM +0200, JFC Morfin wrote:
> if this may help:
>
> At 01:42 15/06/2010, Nicolas Williams wrote:
>> (Actually, I'd probably want a distributed configuration
>> method for that, preferably using DNS itself, but really, that's a
>> tangent I don't want to go on because it's a distraction from the
>> purpose of this thread.)
>
> I think this is a good position at this time. I have an appeal at
> the IAB level now, over the lack of documentation provided on the
> IDNA2008 use side. The target is to get an IAB guidance on the
> fundamental architectural issues involved, who should document them,
> and address the questions like the one you raise (I will include
> this thread among the IDNA2010 matters to be addressed in order to
> interoperate IDNA2008 in the userside).
If you wish. I think the only piece of IDNAbis that is missing from the Standards-Track, and which really ought to be there, is the getnameinfo()/getaddrinfo() advice (specifications, preferably).
This is precisely the entire matter of the work to be engaged over the diversity of their specifications.
I'm not interested in holding up IDNAbis at this time for that work, and I don't see any reason why it wouldn't get done. In other words, I don't think your appeal is necessary.
The purpose of the appeal (which was explained from an Internet User Point of View in a document the IESG reviewed prior to approving IDNA2008) is to make:
- (1) the Mapping document and I_D demands a published and accepted pre-requisite to the consideration of IDNA2008.
- (2) IAB state where and by who IDNA2010 and IDNA2012 should be discussed because they concern topics which may be considered as foreign to the IETF Internet as the W3C's web.
- (3) IAB start criticizing the IUCG architectural positions, as per the IUCG charter, prior to the description of their RFC1958/RFC439/IDNA2008 Internet model and its IUI (Internet Use Interface) testing based on users' architecture (Interplus) and an operational prototype.
> Once IAB has clarified "who" is and "where" to coordinate the work, in responding or in disregarding this appeal, we (internet users of IUCG) will know how to document a non conflicting counterpart of IDNA2008 on the userside. This extension includes what we call the "ML-DNS", standing for a multi-layer DNS encapsulation front-end. This might be a simple and robust place to address your need along the basic IAB I_D recommendation: "conversion to A-label form, UTF-8, or any other encoding, should be done only by an entity that knows which protocol will be used".
The fundamental problem of I18N is keeping track of the codeset that a string is encoded in, and since we don't tag strings with this information we can only do codeset conversions at boundaries where we know the codesets used on either side -- if ever there's any ambiguity, we lose (to some degree; more on this some other time). You're quite correct that IDNAbis really needed to do more here to help developers identify these boundaries and determine what to do at each boundary.
However, I don't think that IDNAbis is in trouble for not doing more with regards to this: we can still identify such boundaries, and we can analyze the various problems and determine whether IDNAbis is fundamentally flawed in the sense that there may be areas where design choices behind IDNAbis necessarily cause ambiguities which cannot be resolved. I would have preferred to see this analysis done in the core of IDNAbis, or, really, and much better, the original IDNA. However, because we already have IDNA and I don't see how IDNAbis makes matters worse, I'm willing to give IDNAbis a pass on this. Had I been involved from the get-go with IDNAbis, instead of this late, I'd probably take a different point of view.
I believe that draft-iab-idn-encoding gets close to the heart of what's missing from the IDNAbis effort: APIs that take into account the Real World. In the Real World we have name service switches, and multiple name services with different approaches to I18N. The low-level interfaces produced by IDNA (ToASCII() and ToUnicode()) are not sufficient for making IDNA work in the Real World -- operating system help is needed, or else we will see applications abandoning name service switches, which means abandoning name services other than DNS (which, while one might want to see that happen, I wouldn't want IDNA to be the agent that forces such a change). Now, the operating system can help, but if we want speedy adoption, then we need interface specifications that have a chance of getting implemented and to be _portable_, which in turn requires API standards, starting with C programming language bindings.
Looking at Simon's proposal for getnameinfo()/getaddrinfo(), I like it a lot. Between getnameinfo()/getaddrinfo() IDNA extensions and ToASCII() and ToUnicode() (C bindings please!) I think developers have enough to cover most common boundaries. Where other boundaries are obscured, as by the name service switch in the case of hostname/IP address resolution, we'll need additional interfaces, and we can tackle those as we come to them.
For example, NFSv4 uses domainnames to identify user and group namespaces. These could be DNS domainnames from which non-DNS domains are identified, such as LDAP DITs. LDAP's DC name attribute component is IDN-unaware, therefore some part of an NFSv4 stack may have to apply ToASCII()/ToUnicode() judiciously in that context. Or maybe an LDAP library would take care of such issues (sadly there's no standard for deriving or matching base DNs from DNS domainnames, though there are common conventions). This case is not a very good one because there's no ambiguity -- at worst we have the same ambiguity as in the name service switch case (since user/group names/IDs, qualified with domainnames, might be passed to the name service switch for resolution). I'd be particularly concerned with other, more difficult cases.
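As an illustration of applying ToASCII() at that boundary, the sketch below maps a DNS domainname to a base DN using the common one-dc-per-label convention. As noted above this is a convention rather than a standard, domain_to_base_dn is a hypothetical helper, and Python's "idna" codec (IDNA2003) stands in for ToASCII().

```python
def domain_to_base_dn(domain: str) -> str:
    """Derive an LDAP base DN from a DNS domain, one dc= component per label."""
    # The DC attribute is IDN-unaware, so convert to A-labels first.
    ascii_name = domain.rstrip(".").encode("idna").decode("ascii")
    # A-labels contain only letters, digits and hyphens, so no DN escaping
    # is needed for the values produced here.
    return ",".join("dc=" + label for label in ascii_name.split("."))


print(domain_to_base_dn("bücher.example"))  # dc=xn--bcher-kva,dc=example
```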
My point was only to support your priority order: once all this is clarified and documented, we will be in better position to discuss it.
Cheers.
jfc
1506101809 - Shawn Steele <Shawn.Steele@microsoft.com>
FWIW: I don't think that applications should need to understand how DNS works. (Something of a separation of business logic concept as probably taught in, like, CS101 - "Don't make your app know more than it has to.") IMO it'd be nice if app developers that need to open a connection to a server had all the Punycode ugliness layered away by some nice set of DNS APIs, or even higher level at the open connection APIs or whatever.
Unfortunately, Punycode means that some apps will want to decode the string anyway because they'd like pretty names. Some sort of getcanonicalname() or something could help there. I realize there's an "A" in "IDNA", but if every app has to do punycode conversion themselves there's going to be tons of odd inconsistencies in what they're doing. It could also mean "tweaking" thousands of apps if the IDNA20xx rules change a little. (Eg: like the bidi rules did this go around).
The good thing about Punycode/IDN is that it enabled DNS. The bad thing is that suddenly any network app needs to become a DNS expert.
-Shawn
1506101851 - Nicolas Williams <Nicolas.Williams@oracle.com>
> FWIW: I don't think that applications should need to understand how DNS works. (Something of a separation of business logic concept as probably taught in, like, CS101 - "Don't make your app know more than it has to.") IMO it'd be nice if app developers that need to open a connection to a server had all the Punycode ugliness layered away by some nice set of DNS APIs, or even higher level at the open connection APIs or whatever.
Simon's extensions for getname/addrinfo() are the kind of APIs I'm looking for. I'm not sure if that's what you have in mind, but I'd love to hear about it.
> Unfortunately, Punycode means that some apps will want to decode the string anyway because they'd like pretty names. Some sort of getcanonicalname() or something could help there. I realize there's an "A" in "IDNA", but if every app has to do punycode conversion themselves there's going to be tons of odd inconsistencies in what they're doing. It could also mean "tweaking" thousands of apps if the IDNA20xx rules change a little. (Eg: like the bidi rules did this go around).
If an application is just dealing with hostnames then it's easy: ToUnicode(). If the application needs hostname<->IP address resolution then it needs something more like getname/addrinfo() with suitable enhancements. If the application needs to deal with e-mail addresses, IRIs, etcetera, then the application is going to need APIs that are specific to those. The alternative is that the application must implement all the relevant rules, all the time, and as draft-iab-idn-encoding explains, that isn't always possible (the example being the name service switch in typical operating systems). It will be better to have context-specific APIs.
> The good thing about Punycode/IDN is that it enabled DNS. The bad thing is that suddenly any network app needs to become a DNS expert.
To some degree that's unavoidable, but with context-specific APIs we can minimize the burden on applications. We should do just that if we want IDNA to succeed beyond the web browser (do we?).
Nico
1506101854 - Shawn Steele <Shawn.Steele@microsoft.com>
We're in agreement, I think. I'd rather have IDN work for getname, etc. by default though. (Then maybe it'd "just work"?) Instead of ToUnicode(), which is very specific, I'd prefer a more general "GetPrettyDNSName()." Then, if more steps than just ToUnicode() are ever interesting, the gory details are hidden from the app. Specifically, on machines hosting both UTF-8 (or other code page) and Punycode DNS, the actual form of "pretty" could differ depending on how a name was looked up.
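A wrapper in the spirit of the "GetPrettyDNSName()" idea might look like the sketch below. No such function exists in any platform API as far as this thread shows; the name get_pretty_dns_name and the wire_encoding parameter (with "ace" meaning Punycode/A-label form) are inventions for illustration, and ToUnicode() is again approximated by Python's IDNA2003 "idna" codec.

```python
def get_pretty_dns_name(wire_name: bytes, wire_encoding: str) -> str:
    """Return a display form of a name, given how it was looked up."""
    if wire_encoding == "ace":
        try:
            # Public-DNS style: undo the Punycode encoding label by label.
            return wire_name.decode("idna")
        except UnicodeError:
            # Not valid ACE after all: show what we have rather than fail.
            return wire_name.decode("ascii", "replace")
    # Name services that carry UTF-8 (or another code page) directly.
    return wire_name.decode(wire_encoding)


print(get_pretty_dns_name(b"xn--bcher-kva.example", "ace"))
print(get_pretty_dns_name("bücher.local".encode("utf-8"), "utf-8"))
```

The application only sees the pretty string; which decoding applies stays behind the call.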
For email, at least, UTF8SMTP is much smarter. The DNS layer is muddy still, but at least the addresses are "just utf-8", and don't get to this state of having strange encodings leaking all over the place.
-Shawn
1506101940 - Nicolas Williams <Nicolas.Williams@oracle.com>
Shawn Steele wrote:
> We're in agreement, I think. I'd rather have IDN work for getname,
Probably.
> etc. by default though. (Then maybe it'd "just work"?) Instead of
Indeed, I think I might even want to reverse Simon's flags' semantics: the default behavior should be to use U-labels as canonical and to support searches by un-prepared text.
However, if Simon has already deployed his extensions... Simon?
I think Simon's playing it safe, and perhaps a conservative approach that results in A-labels by default in UIs is better. I'll have to think about this, but my gut feeling is that there's no reason that we couldn't reverse the default sense of Simon's extensions.
> ToUnicode(), which is very specific, I'd prefer a more general > "GetPrettyDNSName()." Then, if more steps than just ToUnicode() are > ever interesting, the gory details are hidden from the app.
At the abstract API level I think ToUnicode() is perfectly fine. For actual programming language bindings something else is needed to provide the context. For a language with "package" names that something else could be a package name -- "DNS::IDNA", say. For a language like C, with a flat namespace we'd need the function name to be more indicative of what it does, as in your suggestion.
1506102018 - Shawn Steele <Shawn.Steele@microsoft.com>
> I think Simon's playing it safe, and perhaps a conservative approach that results in A-labels by default in UIs is better.
Depends, perhaps, on your perspective. I think international users would be happiest if the Unicode forms worked by default. It makes sense if some app didn't want that behavior, but adding a flag means everyone has to recompile. Hopefully (perhaps naively), some apps might work without that step if it handled IDN by default. My personal preference anyway.
Shawn
1506102025 - Nicolas Williams <Nicolas.Williams@oracle.com>
> I've published my getaddrinfo extension writeup as an IETF document, to facilitate discussions. See announcement below.
Thank you Simon. This is very useful.
> It isn't clear from the writeup, but this code has been shipping in GNU Libc for many years, so if you are running GNU/Linux chances are you have had this functionality for quite some time. You can test it through a code snippet like this:
Ah, it's good to know that this is shipping.
I think it might be a good idea to add AI_NO_IDN and AI_CANON_NO_IDN flags. That way in the future we could change these functions to behave by default as if the application had used AI_IDN/AI_CANONIDN.
I can see the default behavior of getaddrinfo()/getnameinfo() being configurable system-wide, at least initially until we obtain enough deployment experience. Specifically, I'd like to see what, if anything, breaks if getaddrinfo()/getnameinfo() act as if the application had used AI_IDN/AI_CANONIDN; I suspect there will be very little breakage, and if that's so then I think we can stand to get much more benefit from having that be the default behavior than from it being optional.
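How opt-in flags, the proposed opt-out flags and a system-wide default might combine is sketched below. The numeric flag values are arbitrary placeholders chosen for this example, AI_NO_IDN and AI_CANON_NO_IDN exist only as a proposal in this thread, and letting the opt-out win when both are set is an assumption, not something the thread decides.

```python
# Placeholder bit values for illustration only; not taken from any header.
AI_IDN = 0x01
AI_CANONIDN = 0x02
AI_NO_IDN = 0x04        # proposed in this thread, hypothetical
AI_CANON_NO_IDN = 0x08  # proposed in this thread, hypothetical


def effective_idn(flags: int, system_default: bool) -> dict:
    """Decide whether IDN input handling and U-label canonical names apply."""
    def resolve(opt_in: int, opt_out: int) -> bool:
        if flags & opt_out:
            return False           # explicit opt-out wins (assumption)
        if flags & opt_in:
            return True            # explicit opt-in
        return system_default      # otherwise the local configuration decides
    return {
        "idn_input": resolve(AI_IDN, AI_NO_IDN),
        "idn_canon": resolve(AI_CANONIDN, AI_CANON_NO_IDN),
    }


print(effective_idn(0, system_default=True))
print(effective_idn(AI_NO_IDN, system_default=True))
```

With this shape an unmodified application picks up whatever the system default is, while an application that sets either flag keeps the behaviour it asked for.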
Comments?
Nico
1506102029 - Nicolas Williams <Nicolas.Williams@oracle.com>
>> I think Simon's playing it safe, and perhaps a conservative approach that results in A-labels by default in UIs is better.
>
> Depends, perhaps, on your perspective. I think international users would be happiest if the Unicode forms worked by default. It makes sense if some app didn't want that behavior, but adding a flag means everyone has to recompile. Hopefully (perhaps naively), some apps might work without that step if it handled IDN by default. My personal preference anyway.
I don't have enough experience here, but intuitively I believe that defaults which result in less A-label UI leakage should be better (as long as we don't then end up with U-label leakage into IDN-unaware domainname slots -- that's the risk).
Since Simon's extensions have shipped, we cannot reverse their sense. Therefore I propose that we add a pair of flags with the reverse sense, and then allow the default behavior to be implementation-specific and/or locally configurable. Then we'll be able to see what, if anything, breaks by having getnameinfo() return U-labels by default and getaddrinfo() supporting unprepared inputs.
Nico
1506102345 - Simon Josefsson <simon@josefsson.org>
> Ah, it's good to know that this is shipping.
>
> I think it might be a good idea to add AI_NO_IDN and AI_CANON_NO_IDN flags. That way in the future we could change these functions to behave by default as if the application had used AI_IDN/AI_CANONIDN.
>
> I can see the default behavior of getaddrinfo()/getnameinfo() being configurable system-wide, at least initially until we obtain enough deployment experience. Specifically, I'd like to see what, if anything, breaks if getaddrinfo()/getnameinfo() act as if the application had used AI_IDN/AI_CANONIDN; I suspect there will be very little breakage, and if that's so then I think we can stand to get much more benefit from having that be the default behavior than from it being optional.
That could be tried in an experiment, but I believe any effort to make that the default, or even any effort to standardize anything here, should be co-ordinated with the Austin group. That is also really the place where the POSIX experts hang out and can tell whether this is a good idea or not, from a POSIX standardization point of view.
I recall raising this with them in the past, but it was premature at that point. Perhaps now is a better time.
/Simon
1506102348 - Anthony Jones <ajones@microsoft.com>
I would prefer the default behavior to be as if AI_IDN/AI_CANONIDN were specified. As noted in other responses, requiring applications to know about the different encodings will likely result in some apps still getting it wrong. In my opinion, having the platform automatically perform IDN is preferable and then provide flags for opting out. I think having other configurable knobs is a good idea as well. For example, being able to opt either the entire system out of the IDN behavior or on a per application basis. There may also be instances where a given domain has both Punycode and UTF-8 resources such that a policy mechanism for giving a specific encoding for a FQDN or domain would also be useful.
1506102348 - Nicolas Williams <Nicolas.Williams@oracle.com>
Simon Josefsson wrote: > That could be tried in an experiment, but I believe any effort to make > that the default, or even any effort to standardize anything here, > should be co-ordinated with the Austin group. That is also really the > place where the POSIX experts hang out and can tell whether this is a > good idea or not, from a POSIX standardization point of view.
Certainly making it a default should be, but making it configurable, probably not.
> I recall raising this with them in the past, but it was premature at > that point. Perhaps now is a better time.
It's getting painfully necessary, IMO.
Nico
Hmmm, actually, POSIX/SUS cannot really control what is a canonical hostname here. That's for the name services standards to decide, not for the API. And since these functions do not list aliases, and since aliases are clearly supported by a variety of name services, it's clear that:
- making getaddrinfo() support lookups by IDNs encoded in the caller's locale's codeset, converting to Unicode and applying IDNA as necessary, is perfectly legitimate
- making getaddrinfo() return either A-labels or U-labels (converted to the caller's locale's codeset) as ai_canonname when AI_CANONNAME is requested is also perfectly legitimate
However, returning anything other than A-labels risks leaking of IDNs into IDN-unaware slots by IDN-unaware applications.
Therefore I think that getaddrinfo() should support lookups by IDNs encoded in the caller's locale's codeset as described above, but should return A-labels unless a hint is given that U-labels are preferred.
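The policy proposed above (accept IDNs on input, return A-labels unless the caller hints otherwise) can be sketched roughly as follows. This is only an illustration: it does no DNS lookup, PREFER_ULABEL and resolve_names() are hypothetical names rather than any real getaddrinfo() flag or function, and Python's built-in "idna" codec implements IDNA2003, not IDNA2008. Python strings are already Unicode, so the conversion from the caller's locale's codeset is assumed to have happened.

```python
# Hypothetical hint flag, loosely modelled on the AI_CANONIDN idea in this thread.
PREFER_ULABEL = 0x1


def resolve_names(host, flags=0):
    """Return (query_name, canonical_name) for a host given as Unicode text.

    No lookup is performed; this only shows which form of the name would be
    sent to DNS and which form would be handed back to the caller.
    """
    # ToASCII on every label (stdlib codec, IDNA2003 rules).
    query_name = host.encode("idna").decode("ascii")
    if flags & PREFER_ULABEL:
        # The caller asked for U-labels: apply ToUnicode to the A-label form.
        canonical = query_name.encode("ascii").decode("idna")
    else:
        # Default: only A-labels leak back to possibly IDN-unaware callers.
        canonical = query_name
    return query_name, canonical


print(resolve_names("bücher.example"))
print(resolve_names("bücher.example", PREFER_ULABEL))
```

With this shape an IDN-unaware caller never sees anything other than ASCII, which is the point of making the U-label form opt-in.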
1606102354 - Nicolas Williams <Nicolas.Williams@oracle.com>
> For the matter of this discussion, I don't care. If I were implementing > I'd consider providing a local administrative configuration interface by > which to provide lists of private cloud domains that use alternative IDN > schemes. (Actually, I'd probably want a distributed configuration > method for that, preferably using DNS itself, but really, that's a > tangent I don't want to go on because it's a distraction from the > purpose of this thread.)
Alright, let's explore that. What would that look like? It could look like a simple NS-like RR -- let's call it the IDNRULES RR. The RRset name would be a domainname for which the same zone also has NS RRs. The RDATA would indicate what IDN rules apply.
A contrived example for foó.example., foóbar.example. and óther.example.:
xn--fo-6ja.example. IN IDNRULES "IDNA2008"
xn--fo-6ja.example. IN IDNRULES "UTF-8;tolower;NFC"
xn--fobar-1ta.example. IN IDNRULES "UTF-8"
xn--ther-pqa.example. IN IDNRULES "ISO8859-1"
These RRsets would mean that:
- foó.example. supports IDNs encoded as A-labels per IDNA2008
- foó.example. _also_ supports IDNs encoded in UTF-8 provided that names are first case-folded to lower case, then normalized to NFC
- foóbar.example. supports IDNs encoded in UTF-8 without having to casefold or normalize; presumably the server can do normalization- and case-insensitive matching (preserving too) -- pretty advanced, and probably no server on the market supports this.
- óther.example. supports only IDNs encoded in ISO8859-1 (the server may or may not know how to do case-insensitive matching for non-ASCII ISO8859-1 characters)
So, to resolve tést.{foó, foóbar, óther}.example. the _resolver_ would first have to split the input string into labels using whatever fullstops are legal in the current locale, then lookup each of those domains' IDNA rules in the example. TLD zone, do whatever codeset conversions and pre-processing may be required to meet the rules found, then do the next query. And so on.
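The per-zone pre-processing step described above might look something like the sketch below. Everything here is hypothetical: IDNRULES is the strawman RR from this post, encode_label() is an invented helper, and the rule strings are the ones from the contrived example. Python's "idna" codec (IDNA2003) stands in for the "IDNA2008" rule because the standard library has no IDNA2008 implementation.

```python
import unicodedata


def encode_label(label, rule):
    """Encode one label as the octets a zone advertising `rule` would expect."""
    parts = rule.split(";")
    scheme, steps = parts[0], parts[1:]

    if scheme == "IDNA2008":
        # Stand-in only: the stdlib codec applies IDNA2003 ToASCII.
        return label.encode("idna")

    # Apply the advertised pre-processing steps in order.
    for step in steps:
        if step == "tolower":
            label = label.lower()
        elif step == "NFC":
            label = unicodedata.normalize("NFC", label)
        else:
            raise ValueError("unknown pre-processing step: " + step)

    if scheme == "UTF-8":
        return label.encode("utf-8")
    if scheme == "ISO8859-1":
        return label.encode("latin-1")
    raise ValueError("unknown IDN ruleset: " + scheme)


# The same user input produces different octets under each ruleset.
for rule in ("IDNA2008", "UTF-8;tolower;NFC", "UTF-8", "ISO8859-1"):
    print(rule, encode_label("t\u00e9st", rule))
```

A resolver would have to repeat this for every label, fetching the rules for each containing zone before it can even form the next query.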
Sounds good, BUT there are issues w.r.t. stub resolvers and caching: stub resolvers suddenly have to get pretty fancy, even if they are using caching servers, because suddenly recursive caching servers are not useful for looking up IDNs!
Makes you think that private DNS clouds with IDN rules other than IETF Standards-Track IDNA rules are not desirable. And I'd agree.
What's the point of this post? First: to note that private DNS clouds with non-standard IDN rules are a big PITA since right now they can only be supported by nodes that either happen to implement those rules (and not IDNA) or which have local configuration partitioning the DNS namespace by IDN rulesets, and distributed configuration, though it could be possible, would also be a PITA since stub resolvers would have to get pretty smart. Second: to outline a meta-IDN system that could work if IDNA2008 should founder (but let's hope not). Third: I had to write this down :)
1706100328 - John C Klensin <klensin@jck.com>
> So, to resolve tést.{foó, foóbar, óther}.example. the > _resolver_ would first have to split the input string into > labels using whatever fullstops are legal in the current > locale, then lookup each of those domains' IDNA rules in the > example. TLD zone, do whatever codeset conversions and > pre-processing may be required to meet the rules found, then > do the next query. And so on.
Well, remember that, if fullstops are not global, one needs to be very careful to keep local ones from leaking. If they do leak, a parser that tries to separate an FQDN into labels will end up with a high error rate. That would make the bad guys, who have lots of fun with URLs that trick users into believing that third- or fourth-level names are really second-level ones, very happy. I trust their happiness is not our goal.
> Sounds good, BUT there's issues w.r.t. stub resolvers and > caching: stub resolvers suddenly have to get pretty fancy, > even if the are using caching servers, because suddenly > recursive caching servers are not useful for looking up IDNs!
Right. And, if you start thinking about DNAME and other things that prevent you from knowing definitively which tree someone thinks that a name/label is in, the difficulties with caching servers start looking easy. Remember that there is not even an inherent DNS restriction that would prevent having a label in a private namespace for a DNAME RR whose data points into the public DNS.
> Makes you think that private DNS clouds with IDN rules other > than IETF Standards-Track IDNA rules are not desirable. And > I'd agree. > > What's the point of this post? First: to note that private > DNS clouds with non-standard IDN rules are a big PITA since > right now they can only be supported by nodes that either > happen to implement those rules (and not IDNA) or which have > local configuration partitioning the DNS namespace by IDN > rulesets, and distributed configuration, though it could be > possible, would also be a PITA since stub resolvers would have > to get pretty smart. Second: to outline a meta-IDN system > that could work if IDNA2008 should founder (but let's hope > not). Third: I had to write this down :)
I think there may be a fundamental misunderstanding here. If your point is that we have a mess on our hands, we already know that... and that is the starting point for this document.
Could the mess have been avoided if the implications of the native UTF-8 (and other native encodings, such as direct use of 8859-1) had been known and analyzed when the IDNA work was being done? Well, perhaps, but actually I have serious doubts. The public-DNS TLDs that were selling 8859-1 names prior to IDNA2003 really didn't care -- they were in the name-selling business and, if some of those names weren't able to be used in applications... well, buyer beware. The decision to wrap IDNA around an ACE was made fairly consciously and with a moderately good understanding of what we were getting into. If we had understood that better, or made different tradeoffs, the answers might have come out a little different but I don't think very much. And, while the Punycode algorithm and encoding takes the heat in the current draft, it is difficult to understand how any other ACE encoding would have been much better.
Now, this particular mess could have been avoided almost entirely had the IDN WG decided to use UTF-8 in the DNS instead of going through Nameprep and an ACE. The WG decided to not do that, partially because it, perhaps unlike some of the private implementations that are now using UTF-8 directly, understood that user expectations and matching issues required normalization and careful attention to matching procedures and that getting the DNS to do that and applications to accept it would result in a _very_ long implementation and deployment curve. And the WG decided that deployment time was important and that a long time before general availability was intolerable. Real tradeoff there.
Note that one of the advantages private namespaces have over public ones is that they are typically fairly homogeneous wrt software, management, or both. If I know that all names will be canonicalized in the same way and that they will be used within a single, homogeneous community, I may not need fancy normalization and matching rules built into a protocol. Indeed, I may never notice the absence of that machinery.
But, if we had a situation in which the public namespaces were using IDNA2003 UTF-8 strings, and the private ones were using unmodified/ unmapped UTF-8 strings, we would still have a problem because we could get false matches in both environments depending on the assumptions made. That problem gets a lot better if everyone is using U-labels. Of course, that was a key reason why IDNA2008 doesn't have mapping in the protocol but, as I trust everyone reading this knows, that decision is not without problems in practice... and it is precisely where these traditionally-different approaches interact that the collision between theory and practice gets most severe.
One more recent set of decisions is reminiscent of the IDNA ACE/Punycode one. If there were no IDN TLDs and, preferably, a very small and infrequently-changing number of TLDs total, then it would be fairly easy to devise ways to distinguish between UTF-8-using private namespaces and A-label-using public ones. ICANN has not seemed to be very interested in that issue and the tradeoffs it implies.
In this context, Shawn wrote:
> The good thing about Punycode/IDN is that it enabled DNS. The > bad thing is that suddenly any network app needs to become a > DNS expert.
Borrowing a theme from another discussion that has been going on in parallel, the good thing about getnameinfo and getaddrinfo is that they enable IPv6. The bad thing is that suddenly any network app needs to become a routing preferences expert. As Ned Freed pointed out in that context, if you really want this to be transparent to the application, the relevant interface is some flavor of "SetupConnectionByName" with which the application starts with an opaque name and then, subject to some parameters or function-name variations, ends up with a connection. Sadly, taking away the need for expert knowledge of the DNS alone really doesn't help a lot. I suggest that, ultimately, the main purpose of the encoding document is to identify the problem(s), warn people to exercise caution, and to make a few suggestions that may help a bit.
1706100339 - Paul Hoffman <phoffman@imc.org>
>Borrowing a theme from another discussion that has been going on >in parallel, the good thing about getnameinfo and getaddrinfo >are that they enable IPv6. The bad thing is that suddenly any >network app needs to become a routing preferences expert.
+1. Some complexity can be added along a smooth curve; other complexity takes a changed mindset (and, in these cases, a changed API). Wishing it weren't so is fine, but by doing so you are rapidly reduced to "get off my lawn, you kids!".
1706101012 - Simon Josefsson <simon@josefsson.org>
> - making getaddrinfo() support lookups by IDNs encoded in the caller's >locale's codeset, converting to Unicode and applying IDNA as >necessary, is perfectly legitimate
Certainly.
> - making getaddrinfo() return either A-labels or U-labels (in the >converted to the caller's locale's codeset) as ai_canonname when >AI_CANONNAME is requested is also perfectly legitimate
Yes.
> However, returning anything other than A-labels risks leaking of IDNs > into IDN-unaware slots by IDN-unaware applications.
True.
> Therefore I think that getaddrinfo() should support lookups by IDNs > encoded in the caller's locale's codeset as described above, but should > return A-labels unless a hint is given that U-labels are preferred.
In other words, you want AI_IDN to be the default behaviour, but not AI_CANONIDN. There likely needs to be an AI_NO_IDN or similar if an application for some reason wants to disable IDN processing of inputs. I'm not sure what a use-case for AI_NO_IDN would be, but I suspect there may be some.
1706101448 - JFC Morfin <jefsey@jefsey.com>
+1. This is why we need to carefully consider where/how to apply the RFC 3439 principle of simplicity. This is implied in the way IDNA2008 addresses IDNs. However, before engaging in any new construct that will consistently impact the nets in many other situations (as Nicolas shows it), one has to carefully agree on what IDNA2008 actually implies or what IAB wants the IETF to consider it implies. This means an appropriate IETF Internet simplicity model at the proper common layer.
1706101811 - Nicolas Williams <Nicolas.Williams@oracle.com>
> I think there may be a fundamental misunderstanding here. If > your point is that we have a mess on our hands, we already know > that... and that is starting point for this document.
See "First: to note that...". That is: I don't think DNS, much less _application_ implementors can be expected to support private DNS clouds with non-standard IDN rules. It's just too big a PITA.
My point was definitely not that we have a mess on our hands, UNLESS we want implementors to support private DNS clouds with non-standard IDN rules. But how can we? If such rules are non-standard...
The only obvious non-standard rule that could "trivially" be supported is "raw IDNs without regard to codeset; every node must run in the same locale". And even then, how would an implementation know when to apply that rule versus IDNA?
> Could the mess have been avoided if the implications of the > native UTF-8 (and other native encodings, such as direct use of > 8859-1) had been known and analyzed when the IDNA work was being > done? Well, perhaps, but actually I have serious doubts. The > ...
That was not my point. It's possible that my little straw man multi-IDNA proposal could have been useful a decade ago. Right now it's just a device for reasoning around the exhortation I received earlier to consider private cloud DNS with non-standard IDN rules.
> I suggest that, ultimately, the main purpose of the encoding > document is to identify the problem(s), warn people to exercise > caution, and to make a few suggestions that may help a bit.
Yes, and it's done that job. I think it's time to standardize getaddrinfo()/getnameinfo() behavior.
1706101815 - Nicolas Williams <Nicolas.Williams@oracle.com>
> In other words, you want AI_IDN to be the default behaviour, but not > AI_CANONIDN.
Right.
>There likely needs to be an AI_NO_IDN or similar if an > application for some reason wants to disable IDN processing of inputs.
Yes.
> I'm not sure what a use-case for AI_NO_IDN would be, but I suspect there > may be some.
There's no harm in having such a flag.
I also want an AI_NO_CANONIDN, in preparation for a future day when getaddrinfo() returns U-labels as the canonical hostname by default.
1706101955 - Nicolas Williams <Nicolas.Williams@oracle.com>
> Well, remember that, if fullstops are not global, one needs to > be very careful to keep local ones from leaking. If they do
Since I was concerning myself with the DNS protocol in particular, there is no such concern (full stops don't appear in DNS the protocol).
> leak, a parser that tries to separate an FQDN into labels will > end up with a high error rate. That would make the bad guys, > who have lots of fun with URLs that trick users into believing > that third- or fourth-level names are really second-level ones, > very happy. I trust their happiness is not our goal.
Very good point. Full stops need to be globally defined for all locales.
Of course, my proposal was a strawman, intended primarily to show that we cannot be expected to support private DNS clouds with non-standard IDN rules.
<< Sounds good, BUT there's issues w.r.t. stub resolvers and << caching: stub resolvers suddenly have to get pretty fancy, << even if the are using caching servers, because suddenly << recursive caching servers are not useful for looking up IDNs! > > Right. And, if you start thinking about DNAME and other things > that prevent you from knowing definitively which tree someone > thinks that a name/label is in, the difficulties with caching > servers start looking easy.Remember that there is not even an > inherent DNS restriction that would prevent having a label in a > private namespace for a DNAME RR whose Data points into the > public one DNS.
I've not thought about that enough, but I suspect that one could set up the IDN meta-rules so that this is not a problem: each label in any (and all) FQDNs needs to be handled according to the IDN rules advertised for the containing zone, and every FQDN that appears in any one zone must be encoded according to that zone's advertised IDN rules.
> Could the mess have been avoided if the implications of the > native UTF-8 (and other native encodings, such as direct use of > 8859-1) had been known and analyzed when the IDNA work was being > done? Well, perhaps, but actually I have serious doubts. The > public-DNS TLDs that were selling 8859-1 names prior to IDNA2003 > really didn't care -- they were in the name-selling business > and, if some of those names weren't able to be used in > applications... well, buyer beware. The decision to wrap IDNA > around an ACE was made fairly consciously and with a moderately > good understanding of what we were getting into. If we had > understood that better, or made different tradeoffs, the answers > might have come out a little different but I don't think very > much. And, while the Punycode algorithm and encoding takes the > heat in the current draft, it is difficult to understand how any > other ACE encoding would have been much better.
I think we could have insisted on using UTF-8 on the wire in DNS. Yes, that would have taken time to adopt as plenty of legacy deployments might have needed upgrading. However, it's taken a very long time for IDNA to be adopted as well (particularly outside web browsers), and DNS security vulnerabilities have meant that many/most legacy DNS deployments did get updated. Then we could have avoided ACE and Punycode altogether.
However, I'm not proposing that we cry over spilled milk. If you've read my posts on this list in the past week then you know that I'm trying to make IDNA easier on applications by promoting better APIs (see my comments regarding getaddrinfo() and getnameinfo()).
> Now, this particular mess could have been avoided almost > entirely had the IDN WG decided to use UTF-8 in the DNS instead > of going through Nameprep and an ACE. The WG decided to not do > that, partially because it, perhaps unlike some of the private > implementations that are now using UTF-8 directly, understood > that user expectations and matching issues required > normalization and careful attention to matching procedures and > that getting the DNS to do that and applications to accept it > would result in a _very_ long implementation and deployment > curve. And the WG decided that deployment time was important > and that a long time before general availability was > intolerable. Real tradeoff there.
_That_ understanding that normalization is required was sorely _mistaken_.
Don't get me wrong: of course Unicode normalization matters.
But what we all missed for so long was that normalization-insensitive matching is possible, just as case-insensitive matching is. I know this because that's exactly what we implemented in ZFS in OpenSolaris for filename lookups.
The traditional DNS was case-insensitive/case-preserving. Making a Unicode-aware DNS be normalization-insensitive/preserving was, in _retrospect_, equally feasible.
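As a rough illustration of the idea, here is a minimal sketch of normalization- and case-insensitive comparison that leaves the stored names untouched. names_match() is a hypothetical helper, and this simple form builds temporary normalized copies; it is not the ZFS implementation, and a production version would presumably compare incrementally to avoid that allocation.

```python
import unicodedata


def names_match(a, b):
    """Compare two names ignoring case and Unicode normalization form.

    The inputs are not modified, so whatever form was stored is preserved;
    only the comparison is insensitive.
    """
    def key(s):
        # Case-fold first, then bring both strings to one normalization form.
        return unicodedata.normalize("NFC", s.casefold())

    return key(a) == key(b)


# Precomposed and decomposed spellings of the same name compare equal.
print(names_match("caf\u00e9", "cafe\u0301"))
print(names_match("cafe", "caf\u00e9"))
```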
However, we're now _stuck_ with ACE, which means: we're stuck with _clients_ (not servers) having to casefold and normalize IDNs, which results in different semantics than the traditional DNS. I think we can live with this, if nothing else because now we must :/
This mistake was really the result of the Unicode Consortium focusing on the process of normalization and not on how it should be used by _developers_. A straightforward implementation of normalization as described by the UC requires allocating memory in many cases, and when it doesn't it may still result in a destructive operation on the input string -- both of these being extremely undesirable side-effects when all one wants to do is compare strings.
I keep coming back to this: we need to consider APIs, we need to consider the real world as it looks to _developers_, the people who write the bloody code. By "we" I mean: standards-setting organizations.
If the UC had described normalization-insensitive string comparison a decade ago, then more lightbulbs might have gone off a decade ago.
In retrospect, to me, normalization-insensitive string comparison is a blindingly obvious idea. Of course, having been in the thick of it when we decided to go that way in ZFS, I know it wasn't really that obvious. But I believe it likely would have been if the UC had considered things from developers' points of view.
> Note that one of the advantages private namespaces have over > public ones is that they are typically fairly homogeneous wrt > software, management, or both.[...]
Are they really homogeneous? I doubt it. Or at least I doubt that they'll stay that way indefinitely. Deploying private namespaces with alternative IDN rules seems like a terrible idea to me, something we should discourage.
> But, if we had a situation in which the public namespaces were > using IDNA2003 UTF-8 strings, and the private ones were using > unmodified/ unmapped UTF-8 strings, we would still have a > problem because we could get false matches in both environments > depending on the assumptions made. [...]
Not at all! We'd have invented normalization-insensitiveness sooner to deal with that.
(Other differences involving various mappings and codepoint prohibitions would have been few and far between, and also best handled by having clients send _raw_ UTF-8, with servers implementing whatever mappings/prohibitions might be needed.)
> One more recent set of decisions is reminiscent of the IDNA > ACE/Punycode one. If there were no IDN TLDs and, preferably, a > very small and infrequently-changing number of TLDs total, then > it would be fairly easy to devise ways to distinguish between > UTF-8-using private namespaces and A-label-using public ones.
If we ever consider that, then my proposal in this sub-thread should get serious consideration. However, I hope we don't.
> ICANN has not seemed to be very interested in that issue and the > tradeoffs it implies.
Good.
> In this context, Shawn wrote: > << The good thing about Punycode/IDN is that it enabled DNS. The << bad thing is that suddenly any network app needs to become a << DNS expert.
Again, and again, I'll keep coming back to this: better APIs can help avoid this. The whole reason I subscribed and started posting to this list last week was that we need improved APIs. Your I-D describes the problem and hints at solutions, but it targets Informational status. Instead I propose that we pursue some Standards-Track APIs, as in Simon's IDNA-extensions-for-getaddrinfo()/getnameinfo() I-D.
> Borrowing a theme from another discussion that has been going on > in parallel, the good thing about getnameinfo and getaddrinfo > are that they enable IPv6. The bad thing is that suddenly any > network app needs to become a routing preferences expert.As
Really? I don't see the analogy.
> Ned Freed pointed out in that context, if you really want this > to be transparent to the application, the relevant interface is > some flavor of "SetupConnectionByName" with which the > application starts with an opaque name and then, subject to some > parameters or function-name variations, ends up with a > connection. Sadly, taking away the need for expert knowledge of > the DNS alone really doesn't help a lot.
Exactly! Ned's "SetupConnectionByName" is an example of "better APIs".
1706102202 - Shawn Steele <Shawn.Steele@microsoft.com>
> Just UTF-8, un-pre-processed, raw user input? Or did you mean U-labels?
I meant, in non-DNS cases, it doesn't really matter. If they aren't U-labels, they won't work (just like (*&$(*&.com won't work)), but other protocols shouldn't have to know how DNS behaves.
> Also, with respect to deployed protocols that have protocol elements for carrying > domainnames, where those protocol elements are defined as carrying UTF-8, but > where in practice most implementors did not actually code those slots as IDN- > aware, wouldn't it be a strong presumption that the slots are IDN-unaware?
My assertion is that applications should use Unicode to enable globalization. My app doesn't have to be IDN aware or unaware, so long as it uses system APIs that "do the right thing." The problem is that punycode leaks into everything, then suddenly anyone handling a name has to know how ACE works, instead of just treating it as an opaque string.
It's reasonably easy to build a network enabled app. You can call system APIs on most systems to open connections or resolve names. If you're handling a protocol, you may need to know some protocol specific stuff, but that's the app's domain (as in area/field, not name). Apps may need to know how to parse their protocol to get a host name, and then pass that to the system APIs, but why should they have to know how to convert to ACE, compare ACE vs Unicode, etc.? Presuming that those operations are interesting to apps, then there should be things like "CompareHostName()" functions so that apps don't have to worry about IDN or what the various forms a name can take.
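A function in the spirit of the "CompareHostName()" mentioned above could be sketched like this. compare_host_name() is a hypothetical name, not an existing system API, and the sketch relies on Python's built-in "idna" codec, which follows IDNA2003; names the codec rejects raise UnicodeError.

```python
def compare_host_name(a, b):
    """Return True if two host names refer to the same name, whether each
    is written with U-labels or with A-labels."""
    def to_alabels(name):
        # Convert every label to its ASCII (A-label) form, then ignore
        # ASCII case and a trailing root dot.
        return name.encode("idna").decode("ascii").lower().rstrip(".")

    return to_alabels(a) == to_alabels(b)


# The application never needs to know which form it was handed.
print(compare_host_name("bücher.example", "XN--BCHER-KVA.Example"))
```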
EAI is a good example of layering. The protocol doesn't have to know anything about Punycode or details of DNS, it just uses UTF-8. At some point an EAI app will have to connect to a name server, and, hopefully, it can do so by calling a UTF-16 or UTF-8 aware API (or native code page), that does the right conversions, using UTF-8 or whatever on Intranet requests, and ACE on Internet requests as necessary. EAI never has to worry about different names.
And, FWIW, if I were building a name server, I'd let it accept UTF-8 requests (They'd have to be U-labels, so the server'd have to use the UTS#46 mappings like any client would, however it wouldn't matter as long as the rules were consistent).
1706102214 - Andrew Sullivan <ajs@shinkuro.com>
Shawn Steele wrote: > > And, FWIW, if I were building a name server, I'd let it accept UTF-8 requests (They'd have to be U-labels, so the server'd have to use the UTS#46 mappings like any client would, however it wouldn't matter as long as the rules were consistent).
If you were building a nameserver that way, you'd be doing it wrong. DNS is _already_ 8-bit clean, and always was. It's right there in the definition in RFC 1034 and 1035. _Any_ octet is allowed in DNS labels.
The problem is that those aren't allowed in registerable domain names, which are subject to hostname restrictions defined outside the DNS. These are really policy matters, and not protocol matters, but owing to a long history, the distinction was not always understood by implementers and so we ended up with a lot of rules that were in fact policy matters getting enshrined in "protocol" broadly (mis)understood.
Indeed, the reason so-called UTF-8 "native" labels and other such stuff all sort of works in a lot of places is exactly _because_ the DNS was designed with the possibility in mind that the world would leave 7-bit ASCII restrictions behind.
1706102222 - Andrew Sullivan <ajs@shinkuro.com>
> See "First: to note that...". That is: I don't think DNS, much less > _application_ implementors can be expected to support private DNS clouds > with non-standard IDN rules. It's just too big a PITA.
Hold on, there. The DNS allows _octets_ in domain name labels. That is, you can put "*&^_+é" to your heart's content in a DNS label, and it all oughta be legal. STD13 is perfectly clear on that:
Each node has a label, which is zero to 63 octets in length. Brother nodes may not have the same label, although the same label can be used for nodes which are not brothers. One label is reserved, and that is the null (i.e., zero length) label used for the root.
[…]
The rationale for this [different-context] choice is that we may someday need to add full binary domain names for new services; existing services would not be changed.
The actual facts of the matter, and those facts' interaction with other conventions, restrictions, and the myriad deployed stuff, are rather different, which is how we got to IDNA2008. But claiming that "DNS can't be expected to support private DNS clouds with non-standard IDN rules" misses the boat by almost 25 years. It always did.
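To make the "labels are octets" point concrete, here is a small sketch of the wire encoding of a domain name: each label is a length octet followed by up to 63 arbitrary octets, and the zero-length root label ends the name. encode_wire_name() is an illustrative helper, not part of any DNS library; the 255-octet limit on the whole name also comes from STD13.

```python
def encode_wire_name(labels):
    """Encode a list of labels (bytes objects) as a DNS wire-format name."""
    out = bytearray()
    for label in labels:
        # A non-root label is 1 to 63 octets; any octet value is allowed.
        if not 1 <= len(label) <= 63:
            raise ValueError("label must be 1 to 63 octets")
        out.append(len(label))
        out += label
    out.append(0)  # the null label reserved for the root
    if len(out) > 255:
        raise ValueError("name exceeds 255 octets")
    return bytes(out)


# Raw UTF-8 octets are as legal on the wire as ASCII ones.
print(encode_wire_name(["tést".encode("utf-8"), b"example"]))
```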
1706102239 - Nicolas Williams <Nicolas.Williams@oracle.com>
> Hold on, there. The DNS allows _octets_ in domain name labels. That > ...
It does. However, there's no way that anyone will bother making getaddrinfo(), DNS resolver, and application implementations that actually know when to send A-labels versus when to send something else, much less what that something else ought to be.
> The actual facts of the matter, and those facts' interaction with > other conventions, restrictions, and the myriad deployed stuff, is > rather different, which is how we got to IDNA2008. But claiming that > "DNS can't be expected to support private DNS clouds with non-standard > IDN rules" misses the boat by almost 25 years. It always did.
DNS can't work interoperably with multiple IDN rulesets for the simple reason that to do so would require code to decide amongst IDN rules to apply in context-specific manners. The necessary guidance for implementors to do that is missing to begin with. And there almost certainly would be more than one set of IDN rules to choose from (ISO8859-* encoded labels, UTF-8 encoded labels, with either un-pre-processed or normalized-and/or-case-folded Unicode, IDNA2008 ACE encoded labels, and so on).
At best one can narrow this to two sets of rules: IDNA2008 and just-send-8-with-all-nodes-using-the-same-locale (and input methods). And even then there's no standard way to know when to use one or the other, and the latter isn't really realistic except for the tiniest deployments.
If you really, really want this to work, then start thinking about solutions along the lines of my strawman proposal for an NS-like RR that indicates what IDN rules apply to delegated zones. I'd rather help make IDNA2008 better by working on the APIs aspect of the problem.
1706102247 - Nicolas Williams <Nicolas.Williams@oracle.com>
> I meant, in non-DNS cases, it doesn't really matter. If they aren't > U-labels, they won't work (just like (*&$(*&.com won't work)), but > other protocols shouldn't have to know how DNS behaves.
That's not really a useful answer.
However, I believe it'd be fine in NFSv4 to send un-pre-processed, raw user input in UTF-8 and let the receiver apply ToASCII() or ToUnicode(ToASCII()) as necessary. Note that non-U-label UTF-8 would work with this approach.
<< Also, with respect to deployed protocols that have protocol elements << for carrying domainnames, where those protocol elements are defined << as carrying UTF-8, but where in practice most implementors did not << actually code those slots as IDN- aware, wouldn't it be a strong << presumption that the slots are IDN-unaware? > > My assertion is that applications should use Unicode to enable > globalization. My app doesn't have to be IDN aware or unaware, so > long as it uses system APIs that "do the right thing." The problem is > that punycode leaks into everything, then suddenly anyone handling a > name has to know how ACE works, instead of just treating it as an > opaque string.
Indeed, it's all about those system APIs. If the application sends raw user input encoded in UTF-8 and the peer passes that to getaddrinfo(), and getaddrinfo() does the Right Thing, then everything works. And the application is left pretty darned simple.
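A minimal sketch of the receiver-side step described here, for the case where the local getaddrinfo() cannot be trusted to do the conversion itself. It assumes Python's built-in "idna" codec as a stand-in for ToASCII(); that codec implements IDNA2003 (nameprep), not IDNA2008, and the function name is hypothetical.

```python
def peer_name_to_ascii(raw_utf8: bytes) -> str:
    """Convert a domain name received as raw UTF-8 into its ASCII (ACE) form.

    Assumption: the built-in "idna" codec (IDNA2003) stands in for ToASCII().
    Non-ASCII labels are nameprepped (case-folded, NFKC) and punycoded;
    all-ASCII names pass through unchanged.
    """
    name = raw_utf8.decode("utf-8")              # raw user input from the peer
    return name.encode("idna").decode("ascii")   # per-label ToASCII()


# The result is what would be handed to the resolver, e.g.
# socket.getaddrinfo(peer_name_to_ascii(data), None)
```

With this in place the sender does no pre-processing at all; any case or normalization differences in the user's input are absorbed on the receiving side.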
Achieving that level of simplicity has been my goal in engaging this list.
It's not necessarily that simple in all cases. For example, if the application needs to format LDAP DNs using the DC name attribute to hold domainname labels (e.g., DC=foo,DC=example) then the application has to make sure to use A-labels. However, if getaddrinfo() by default returns A-labels as canonical names, then the application still has nothing special to do. The point is that there are going to be a variety of cases all of which have to be handled on a case-by-case basis.
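The LDAP case can be sketched as follows, again assuming Python's IDNA2003 "idna" codec in place of a real ToASCII() and a hypothetical helper name. It only illustrates that the DC values end up as A-labels; general DN escaping is not handled, on the assumption that A-labels contain only letters, digits, and hyphens.

```python
def domain_to_dc_dn(domain: str) -> str:
    """Format a domain name as an LDAP DN made of DC= components.

    Each label is converted to its A-label form first, so the DN carries
    only ASCII. Assumption: the "idna" codec (IDNA2003) stands in for
    ToASCII(); no DN escaping is applied.
    """
    ace = domain.rstrip(".").encode("idna").decode("ascii")
    return ",".join("DC=" + label for label in ace.split("."))
```

If getaddrinfo() already returned A-labels as the canonical name, the conversion line would be a no-op and the application would only do the formatting.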
1706102255 - Nicolas Williams <Nicolas.Williams@oracle.com>
> It does. However, there's no way that anyone will bother making > getaddrinfo(), DNS resolver, and application implementations that > actually know when to send A-labels versus when to send something else, > much less what that something else ought to be.
I should qualify this: someone might do that, but more than one such implementation, with IDN interoperability between them? I seriously doubt it.
1706102257 - Andrew Sullivan <ajs@shinkuro.com>
> It does. However, there's no way that anyone will bother making > getaddrinfo(), DNS resolver, and application implementations that > actually know when to send A-labels versus when to send something else, > much less what that something else ought to be.
I think this is probably right.
> DNS can't work interoperably with multiple IDN rulesets for the simple > reason that to do so would require code to decide amongst IDN rules to > apply in context-specific manners.
Right. See John Klensin's previous remarks about this: in small communities of well-known behaviour, your favourite encoding as octets in the zone work fine. But given that we have multiple different encodings, we surely do have a problem. It's nevertheless simply too late to say that the only thing anyone is allowed to put in a DNS zone is an A-label. We don't get to reformat the Internet like that. The DNS rules were established a long time ago, so there _is_ non-A-label data in zone files already.
> If you really, really want this to work, then start thinking about > solutions along the lines of my strawman proposal for an NS-like RR that > indicates what IDN rules apply to delegated zones. I'd rather help make > IDNA2008 better by working on the APIs aspect of the problem.
I suggested similar things more than once over the past couple years, and people told me every time that I might be running for the position of "Bad Idea Fairy".
1706102302 - Shawn Steele <Shawn.Steele@microsoft.com>
> The point is that there are going to be a variety of cases all of which have to be handled on a case-by-case basis.
Yes, but I'd like to encourage the case-by-case to try to avoid punycode when possible. IMO it's better to say "let's use UTF-8 in this 8 bit slot" rather than "let's jam in punycode because it's easy." Both require updates to the system.
1706102303 - Nicolas Williams <Nicolas.Williams@oracle.com>
<< It does. However, there's no way that anyone will bother making << getaddrinfo(), DNS resolver, and application implementations that << actually know when to send A-labels versus when to send something else, << much less what that something else ought to be. > > I think this is probably right.
Good, then we can focus on moving forward with IDNA2008 :)
<< DNS can't work interoperably with multiple IDN rulesets for the simple << reason that to do so would require code to decide amongst IDN rules to << apply in context-specific manners. > > Right. See John Klensin's previous remarks about this: in small > communities of well-known behaviour, your favourite encoding as octets > in the zone work fine. But given that we have multiple different > encodings, we surely do have a problem. It's nevertheless simply too > late to say that the only thing anyone is allowed to put in a DNS zone > is an A-label. We don't get to reformat the Internet like that. The > DNS rules were established a long time ago, so there _is_ non-A-label > data in zone files already.
I'm not sure you can even get this to work in tiny environments, since soon enough most operating systems and applications will implement IDNA...
<< If you really, really want this to work, then start thinking about << solutions along the lines of my strawman proposal for an NS-like RR that << indicates what IDN rules apply to delegated zones. I'd rather help make << IDNA2008 better by working on the APIs aspect of the problem. > > I suggested similar things more than once over the past couple years, > and people told me every time that I might be running for the position > of "Bad Idea Fairy".
I don't think it's necessarily a bad idea, just a decade or so late.
1706102307 - Nicolas Williams <Nicolas.Williams@oracle.com>
> Yes, but I'd like to encourage the case-by-case to try to avoid > punycode when possible. IMO it's better to say "let's use UTF-8 in > this 8 bit slot" rather than "let's jam in punycode because it's > easy." Both require updates to the system..
On the one hand, I agree: ACE leakage into UIs is bad, therefore ACE avoidance is good.
On the other hand I disagree: non-A-label leakage into IDN-unaware domainname slots (in APIs, protocols, on-disk formats) is a bad thing.
In the long-term I think the latter is less bad than the former, but in the short-term I think the latter is worse than the former.
In terms of protocol specifications, what really matters is that we provide the correct guidance and that implementors heed it. If implementors don't heed the guidance we provide then things break anyways, in which case which is the lesser evil: ACE leakage into UIs or non-ASCII leakage into IDN-unaware domainname slots?
1706102328 - John C Klensin <klensin@jck.com>
> Right. See John Klensin's previous remarks about this: in > small communities of well-known behaviour, your favourite > encoding as octets in the zone work fine. But given that we > have multiple different encodings, we surely do have a > problem. It's nevertheless simply too late to say that the > only thing anyone is allowed to put in a DNS zone is an > A-label. We don't get to reformat the Internet like that. The > DNS rules were established a long time ago, so there _is_ > non-A-label data in zone files already.
And that clearly applies to server-side application of UTR46 or any other trick matching as well. It isn't just that it violates the spec (since the server-side matching rules for octets are extremely clear), it is that some servers would be extended to handle the special mapping, some would not, and one couldn't tell the difference. Even then, one would have to assume that every server that did any mapping did it the same way. Despite a lot of interesting ideas and no matter how many standards were approved by whatever bodies approved them, that is profoundly unrealistic. With or without different mapping variations, "some map and some don't" could, in turn, easily yield false positives, false negatives, and a collection of "interesting" attack vectors.
<< If you really, really want this to work, then start thinking << about solutions along the lines of my strawman proposal for << an NS-like RR that indicates what IDN rules apply to << delegated zones. I'd rather help make IDNA2008 better by << working on the APIs aspect of the problem. > > I suggested similar things more than once over the past couple > years, and people told me every time that I might be running > for the position of "Bad Idea Fairy".
What causes the BIF problem is the combination of the slightly-odd relationship between NS records and the RR sets to which one wants the data to be bound ("slightly-odd" not because the behavior isn't well defined but because it doesn't do what one wants for this purpose). If nothing else, the possible error states when the NS and interpretation records contained different information in the parent and child zones would be a challenge -- well-defined, if sometimes surprising to the naive for the NS case, but an interesting design challenge for the "label interpretation" case, especially if one remembers that a cache would have to retrieve and maintain the interpretation data on a zone by zone basis (probably not apparent-label by apparent-label).
The DNS works as well as it does partially because, while caches have to follow a few specific rules (including those for octet-level matching of labels in length-label pair form), caches can be pretty dumb. Asking caches to be smart and able to reflect whatever matching rules the authoritative servers (and/or their authoritative parents) think appropriate means _really_ smart caches.
And then there is the DNAME possibility and the consequent need for new primitives that authoritatively identify the tree in which an FQDN target is really located.
If one wanted it to work, I suggest that one would want to start by deprecating DNAME and maybe CNAME so that there was exactly one way to access a particular DNS node. Then one would need to think about at least one of a new Label Type (my current favorite), a new Class (probably not good enough, my early proposal to that effect notwithstanding), or an EDNS0 option to permit a client to differentiate among servers applying different rules (as far as I know, not yet comprehensively evaluated by anyone). The three of those options have two things in common:
- (i) Good luck getting them deployed soon enough and widely enough to do anyone any good. Think in decades.
- (ii) We would still be stuck with legacy A-labels in zones and the need to sort them out in applications. Some zones could be expected to at least stop adding more of them but those that were driven by either market or compatibility considerations would probably discover that they had to deploy every name according to both the old (IDNA A-labels) and new (e.g., UTF8 with UTR46-2025) conventions. Synchronized domains anyone? :-(
While you are at it, I'd like a pony. Actually, I'd like a whole corral full of ponies.
1706102336 - Dave Thaler <dthaler@microsoft.com>
> It does. However, there's no way that anyone will bother making getaddrinfo(), > DNS resolver, and application implementations that actually know when to send > A-labels versus when to send something else, much less what that something > else ought to be.
Not true. We already have many widely deployed applications that attempt to do exactly that (IE is one of them, and there are a number of others). And of course they get it wrong in corner cases, so few people notice.
I talked about some of them near the end of the IETF plenary talk.
1706102340 - Nicolas Williams <Nicolas.Williams@oracle.com>
> Not true. We already have many widely deployed applications that attempt to > do exactly that (IE is one of them, and there's a number of others). And of > course they get it wrong in corner cases, so few people notice. > > I talked about some of them near the end of the IETF plenary talk.
Details? Which apps, and how do they know when to do IDNA versus something else, and what is that something else? Does this interop with other implementations?
1706102342 - Dave Thaler <dthaler@microsoft.com>
> Exactly! Ned's "SetupConnectionByName" is an example of "better APIs".
As Stuart mentioned in the IPv6 panel plenary, such apis already exist on both Windows and MacOS.
1706102345 - Dave Thaler <dthaler@microsoft.com>
> Details? Which apps, and how do they know when to do IDNA versus something > else, and what is that something else? Does this interop with other > implementations?
The something else is UTF-8.
See the plenary slides. IE tries to guess based on its application configuration for intranet vs internet sites. Other apps like Outlook and Windows Media Player try one (ACE form vs UTF-8) first and then try the other.
1706102346 - Nicolas Williams <Nicolas.Williams@oracle.com>
John C Klensin wrote: > The DNS works as well as it does partially because, while caches > have to follow a few specific rules (including those for > octet-level matching of labels in length-label pair form), > caches can be pretty dumb. Asking caches to be smart and able > to reflect whatever matching rules the authoritative servers > (and/or their authoritative parents) think appropriate means > _really_ smart caches.
Yes. That might be realistic if correct implementations existed that were BSD-licensed, portable and very self-contained. As it is this idea is not realistic.
> And then there is the DNAME possibility and the consequent need > for new primitives that authoritatively identify the tree in > which an FQDN target is really located. > > If one wanted it to work, I suggest that one would want to start > by deprecating DNAME and maybe CNAME so that there was exactly
I don't think you'd have to deprecate CNAME. In any case, CNAME can't really be deprecated -- it's too useful and too widely in use. I can see operators coming at us with pitchforks if we tried.
> one way to access a particular DNS node. Then one would need to > think about at least one of a new Label Type (my current > favorite), a new Class (probably not good enough, my early > proposal to that effect notwithstanding), or an EDNS0 option to > permit a client to differentiate among servers applying > different rules (as far as I know, not yet comprehensively > evaluated by anyone). The three of those options have two > things in common: > > (i) Good luck getting them deployed soon enough and > widely enough to do anyone any good. Think in decades. > > (ii) We would still be stuck with legacy A-labels in > zones and the need to sort them out in applications. > Some zones could be expected to at least stop adding > more of them but those that were driven by either market > or compatibility considerations would probably discover > that they had to deploy every name according to both the > old (IDNA A-labels) and new (e.g., UTF8 with UTR46-2025) > conventions. Synchronized domains anyone? :-(
(ii) is a huge problem. IDNA is a reality now. Which is why we must work with IDNA rather than on alternatives.
1706102349 - Nicolas Williams <Nicolas.Williams@oracle.com>
> The something else is UTF-8. > > See the plenary slides. IE tries to guess based on its application configuration > for intranet vs internet sites. Other apps like Outlook and Windows Media Player > try one (ACE form vs UTF-8) first and then try the other.
There have been many plenaries... Which one? IETF76?
Does this interop with Firefox?
1806100000 - Shawn Steele <Shawn.Steele@microsoft.com>
> On the other hand I disagree: non-A-label leakage into IDN-unaware > domainname slots (in APIs, protocols, on-disk formats) is a bad thing.
That's nearly impossible to guarantee one way or the other. http for example. In http, shoving UTF-8 where it wasn't expected in an http request might not "work", however shoving punycode into the slot pretty much requires that someone be able to compare the punycode with the U-label and see if it’s the same. Both approaches are likely broken in some cases, both might work in some cases.
> in which case which is the lesser evil: ACE leakage into UIs or non-ASCII leakage into IDN-unaware domainname slots?
It's not just UI, ACE is a cascading reaction, and then it leaks into places that were UTF-8/Unicode aware, so some place that already worked just fine with Unicode names has to make a change to realize that ACE is the same form, even though they may not have needed a change.
ACE has "broken" almost everything here, even though ACE nominally shouldn't be a problem. Those breaks are more ironic as most of those broken pieces already worked with Unicode.
For example, RFC 5280. It had to be updated to support ACE, which was convenient, but now what do you do about the email local parts? There's no punycode for email, so the ACE workaround in RFC5280 is temporary at best. It'll either have to: A) allow UTF-8, B) Allow some special variant of punycode that works for email, or C) use or invent some other encoding. So now everything that uses 5280 has to be updated twice :(
1806100005 - Dave Thaler <dthaler@microsoft.com>
> There have been many plenaries... Which one? IETF76?
Yes: http://www.ietf.org/proceedings/76/slides/plenaryt-1.pdf (see especially slides 55-57).
> > Does this interop with Firefox?
Not sure what you mean by "interop" as it's purely a local algorithm. Can you rephrase your question?
1806100020 - Nicolas Williams <Nicolas.Williams@oracle.com>
Shawn Steele wrote: > ACE has "broken" almost everything here, even though ACE nominally > shouldn't be a problem. Those breaks are more ironic as most of those > broken pieces already worked with Unicode. > > For example, RFC 5280. It had to be updated to support ACE, which was > convenient, but now what do you do about the email local parts? > There's no punycode for email, so the ACE workaround in RFC5280 is > temporary at best. It'll either have to: A) allow UTF-8, B) Allow > some special variant of punycode that works for email, or C) use or > invent some other encoding. So now everything that uses 5280 has to > be updated twice :(
RFC3280 (which 5280 obsoletes) used IA5String for dNSName, which means that UTF-8 couldn't have been relied upon to work in that field.
I understand your sentiment, but I don't think the problem is IDNA, nor ACE. I think the problem is that we didn't have Unicode and UTF-8 widely deployed in the early 1980s, or better, the late 1970s.
1806100022 - Nicolas Williams <Nicolas.Williams@oracle.com>
> Not sure what you mean by "interop" as it's purely a local algorithm. > Can you rephrase your question?
If a user e-mails another a URL using an IDN encoded in UTF-8, and the recipient tries to load it with FF, and the namespace in question uses UTF-8 instead of IDNA ACE encoding, will the FF user be able to load that URL?
What about URLs with such IDNs embedded in HTML served by servers in that namespace?
1806100025 - Shawn Steele <Shawn.Steele@microsoft.com>
> I understand your sentiment, but I don't think the problem is IDNA, nor > ACE. I think the problem is that we didn't have Unicode and UTF-8 and > widely deployed in the early 1980s, or better, late 1970s.
Certainly :) But the standards have to be revised to support globalization regardless of the mechanism. There are many options and ACE seems like an easy fix in many cases. I fear that the short-term gain though isn't worth the longer-term pain.
1806100056 - Shawn Steele <Shawn.Steele@microsoft.com>
> If a user e-mails another a URL using an IDN encoded in UTF-8, > and the receipient tries to load it with FF, and the namespace in > question uses UTF-8 instead of IDNA ACE encoding, will the FF > user be able to load that URL?
It should, the OS sends the request to whatever your default browser is in Unicode, the browser better be able to correctly handle the link. AFAIK there's no big problems in this area in Windows (maybe some edge cases). Since we use *W APIs for everything, most of these scenarios work. Problem is only if the client doesn't know how to punycode, or if punycode gets injected unexpectedly into the *W API calls, or if there's some other protocol dependency that hasn't been updated beyond ASCII.
> What about URLs with such IDNs embedded in HTML served by > servers in that namespace?
Href="http://non-ascii" generally works in most browsers, HTML has a code page, so this isn't ambiguous (unless the code page is wrong or missing).
1806100114 - Nicolas Williams <Nicolas.Williams@oracle.com>
> It should, the OS sends the request to whatever your default browser > is in Unicode, the browser better be able to correctly handle the > link. AFAIK there's no big problems in this area in Windows (maybe
So the OS knows about the intranet/internet distinction? Dave Thaler's comments had led me to believe it was IE that knew this. What if parts of the intranet are not hosted on AD? Where's the intranet/internet distinction configured?
> some edge cases). Since we use *W APIs for everything, most of these > scenarios work. Problem is only if the client doesn't know how to > punycode, or if punycode gets injected unexpectedly into the *W API > calls, or if there's some other protocol dependency that hasn't been > updated beyond ASCII.
And if the recipient is running FF on something other than Windows? (e.g., MacOS X, Linux, Solaris, *BSD.)
I'm guessing the answer is "no", "I don't know", or even "good luck" :)
The point is: your private-namespace-with-UTF-8-instead-of-IDNA solution is almost certainly not interoperable in heterogeneous deployments.
(Pardon the redundancy. I shouldn't have to say "in heterogeneous deployments", because that's implied, to me, by the word "interoperable", but I do so anyways to be extra clear.)
1806100130 - Shawn Steele <Shawn.Steele@microsoft.com>
> So the OS knows about the intranet/internet distinction?
No, if you type Windows+R and http://somewhere.com, that URL is passed to the target app in Unicode, it doesn't matter where the target is. The intranet (utf-8) / internet (ACE, actually, I did get utf-8 across the 'net years ago) distinction is currently handled by the application. Basically it's a "Try UTF-8, if that doesn't work, try ACE" approach, but some apps do it the other way around. Dave's looking at how to make the APIs smarter so the apps don't have to make silly guesses.
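The "try UTF-8, if that doesn't work, try ACE" behaviour can be sketched as below. This is not the Windows implementation; the resolver callback and its signature are hypothetical, the UTF-8 form is sent without any mapping, and Python's IDNA2003 "idna" codec is assumed for the ACE form.

```python
from typing import Callable, Sequence

# Hypothetical resolver signature: takes the octets of the query name and
# returns a sequence of addresses, empty when the name does not resolve.
Resolver = Callable[[bytes], Sequence[str]]


def resolve_utf8_then_ace(name: str, resolve: Resolver) -> Sequence[str]:
    """Query with raw UTF-8 labels first, then fall back to the ACE form."""
    utf8_query = name.encode("utf-8")   # sent as-is: no case folding, no NFC
    ace_query = name.encode("idna")     # assumption: IDNA2003 codec as ToASCII()
    addrs = resolve(utf8_query)
    if addrs or ace_query == utf8_query:
        # Either the UTF-8 lookup worked or the name is plain ASCII,
        # in which case a second query would be identical.
        return addrs
    return resolve(ace_query)
```

Apps that "do it the other way around" would simply swap the two queries. Either order costs an extra lookup whenever the first guess is wrong.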
> And if the receipient is running FF on something other than Windows? > (e.g., MacOS X, Linux, Solaris, *BSD.)
I don't have a clue how those work :)
My understanding is that most browsers are happy to convert URLs to Punycode as necessary, I can't imagine why they'd have different logic on other OS's. Certainly as a web author, I am NOT going to write href="http://xn--punycode". Content authors are going to use Unicode (because they can read it). So, unless you "fix" all the blog tools, etc. to convert Unicode to punycode, there's lots of hrefs in the wild that are outside of the ASCII space.
I've been led to believe that non-punycode hrefs have been seen "in the wild". (Indeed I think they outnumber punycode ones).
I cannot imagine why you'd want to force a Unicode string to punycode to pass it between applications, unless you are doing a DNS query. The canonical form should be U-label and that's what apps should exchange.
1806100139 - Nicolas Williams <Nicolas.Williams@oracle.com>
<< So the OS knows about the intranet/internet distinction? > > No, if you type Windows+R and http://somewhere.com, that URL is passed > to the target app in Unicode, it doesn't matter where the target is. > The intranet (utf-8) / internet (ACE, actually, I did get utf-8 across > the 'net years ago) distinction is currently handled by the > application. Basically it's a "Try UTF-8, if that doesn't work, try > ACE" approach, but some apps do it the other way around. Dave's > looking at how to make the APIs smarter so the apps don't have to make > silly guesses.
OK, that (try UTF-8 first, then ACE) could work. Do you prepare the UTF-8 in any way? E.g., normalize it? Case-fold it? Or do you rely on user input being in NFC already due to how input methods work?
<< And if the receipient is running FF on something other than Windows? << (e.g., MacOS X, Linux, Solaris, *BSD.) > > I don't have a clue how those work :) > > My understanding is that most browsers are happy to convert URLs to > Punycode as necessary, I can't imagine why they'd have different logic > on other OS's. Certainly as a web author, I am NOT going to write > href="http://xn--punycode". Content authors are going to use Unicode > (because they can read it). So, unless you "fix" all the blog tools, > etc. to convert Unicode to punycode, there's lots of hrefs in the wild > that are outside of the ASCII space. > > I've been led to believe that non-punycode hrefs have been seen "in > the wild". (Indeed I think they outnumber punycode ones). > > I cannot imagine why you'd want to force a Unicode string to punycode > to pass it between applications, unless you are doing a DNS query. > The canonical form should be U-label and that's what apps should > exchange.
My concern was: how do you do hostname resolution in an environment with private sub-namespaces with non-IDNA IDN rules.
1806100153 - Shawn Steele <Shawn.Steele@microsoft.com>
> OK, that (try UTF-8 first, then ACE) could work. Do you prepare the > UTF-8 in any way? E.g., normalize it? Case-fold it?
That's an area that is really not good right now. DNS is obviously case-insensitive in ASCII, however the UTF-8 that we've allowed is pretty dumb and doesn't do any mapping/filtering. It is "obvious" that we should only allow UTS#46 behavior type U-Labels in the future, however getting to that point might be tricky since internal machines could already have names that conflict with that mapping. (We could apply those rules to the UTF-8 string when doing lookup). Assuming someone's using the canonical U-Label form, it's not a problem. Dave's working with some folks looking at that.
> My concern was: how do you do hostname resolution in an environment with private sub-namespaces with non-IDNA IDN rules.
Well, that's where "try utf-8 1st, then ACE" happens. Of course, if they don't both use UTS#46 mappings, then you've got a problem since one may resolve one way and not the other :( I think the "right" answer would be to use UTS#46 conformant U-Labels when doing UTF-8 lookup, however that's breaking in some environments.
Historically, it seems to have worked due to coincidences like the IME's generally enter data in the same form, etc., so we're unlikely to see NFD requests for NFC names, but it's obviously a point of weakness. I'd expect other environments got away with similar things in native code page requests, etc.
1806100204 - Nicolas Williams <Nicolas.Williams@oracle.com>
> That's an area that is really not good right now. DNS is obviously > case-insensitive in ASCII, however the UTF-8 that we've allowed is > pretty dumb and doesn't do any mapping/filtering. It is "obvious" > that we should only allow UTS#46 behavior type U-Labels in the future, > however getting to that point might be tricky since internal machines > could already have names that conflict with that mapping.. (We could > apply those rules to the UTF-8 string when doing lookup). Assuming > someone's using the canonical U-Label form, it's not a problem. > Dave's working with some folks looking at that.
Thanks, this is useful information. Particularly if you want anyone else to interop with this.
<< My concern was: how do you do hostname resolution in an environment << with private sub-namespaces with non-IDNA IDN rules. > > Well, that's where "try utf-8 1st, then ACE" happens. Of course, if > they don't both use UTS#46 mappings, then you've got a problem since > one may resolve one way and not the other :(I think the "right" > answer would be to use UTS#46 conformant U-Labels when doing UTF-8 > lookup, however that's breaking in some environments. > > Historically, it seems to have worked due to coincidences like the > IME's generally enter data in the same form, etc., so we're unlikely > to see NFD requests for NFC names, but it's obviously a point of > weakness. I'd expect other environments got away with similar things > in native code page requests, etc.
NFD can and has leaked. MacOS X's HFS+ normalizes filenames to NFD on create, and when you list directories the names appear in NFD. (This is why we chose to implement normalization-insensitivity in ZFS.) If someone were to cut-and-paste from a MacOS X HFS+ filesystem into an IDN-unaware domainname slot, particularly a domainname registration slot... you'd have a problem.
So I definitely recommend normalizing to NFC first, and you might as well apply UTS#46. (Names that break because of differences in IDNA2003 and 2008 will be few and far between, and not that big a deal. Affected users may be annoyed at first, but also presumably overjoyed as well at the aesthetic/semantic improvement.)
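A short illustration of why the NFC step matters, using Python's standard unicodedata module. The label and the helper name are made up for the example; the UTS#46 mapping is not shown.

```python
import unicodedata


def prepare_label(label: str) -> str:
    """Normalize a label to NFC before it is sent or registered."""
    return unicodedata.normalize("NFC", label)


# The same visible name in decomposed form (as a filesystem that stores
# NFD might hand it back) and after normalization.
nfd_label = "foo\u0301"            # 'o' followed by COMBINING ACUTE ACCENT
nfc_label = prepare_label(nfd_label)

# On the wire these are different octet strings, so an octet-matching
# server treats them as different names unless one side normalizes.
nfd_octets = nfd_label.encode("utf-8")
nfc_octets = nfc_label.encode("utf-8")
```

Here nfd_octets is four bytes long after "fo" is counted once more than nfc_octets, which is exactly the mismatch a cut-and-paste from an NFD source would introduce.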
1806100257 - Shawn Steele <Shawn.Steele@microsoft.com>
> NFD can and has leaked. MacOS X's HFS+ normalizes filenames to > NFD on create, and when you list directories the names appear in NFD.
We're aware of that issue.
> (Names that breaks because of differences in IDNA2003 and 2008 will > be few and far between, and not that big a deal.)
Actually IDNA2008 doesn't provide any standard mapping form, so users expecting reasonable IDNA2003 mappings will break quite often :( Fortunately UTS#46 helps with that.
> Affected users may be annoyed at first, but also presumably overjoyed as well at the aesthetic/semantic improvement.)
Primarily we hear when they're annoyed :) When something "breaks" we are bound to have tons of feedback about the change. Obviously we want to "do the right thing." Getting there from the current installed base is sometimes tricky.
1806101501 - Simon Josefsson <simon@josefsson.org>
> There's no harm in having such a flag.
Great. I'm thinking of updating my draft to specify this, and to make it a bit less internal-memo-like.
> I also want a AI_NO_CANONIDN, in preparation for a future day when > getaddrinfo() returns U-labels as the canonical hostname by default.
I'm not sure this day will come. There will always be software components that are written simplistically without IDNA support. But I'm not strongly against the flag. The downside is that it adds complexity with uncertain gain, something I generally prefer to avoid.
The AI_NO_CANONIDN flag could also be introduced later on, when we actually consider making the change you suggest, and we may be in a more informed situation to specify the flag more precisely at that point too.
1806101937 - Nicolas Williams <Nicolas.Williams@oracle.com>
> If you were building a nameserver that way, you'd be doing it wrong. > DNS is _already_ 8-bit clean, and always was. It's right there in the > definition in RFC 1034 and 1035. _Any_ octet is allowed in DNS > labels.
I don't think you're being creative enough :)
A DNS zone could easily have N copies of the same RRset, all with different but equivalent IDNs:
foÓ.example. IN A 1.1.1.1
foó.example. IN A 1.1.1.1
xn--fo-6ja.example. IN A 1.1.1.1
...
A server serving such a zone could look as though it knew how to do pretty smart matching. AS LONG AS the names of the RRs in the reply match those in the query.
It would be perfectly fine to produce a tool that generates all the possible aliases of a label given some set of matching rules. For example, if we want normalization- and case-insensitive matching then there'd be four variants of 'foó'. Apply such a tool to your zone files and you can make any dumb DNS server suddenly seem pretty smart!
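Such a tool could look roughly like this. It is a sketch under the assumption that the matching rules are exactly "case-insensitive and NFC/NFD-insensitive for non-ASCII characters", with ASCII left alone because DNS already matches it case-insensitively; the function name is hypothetical.

```python
import itertools
import unicodedata


def label_variants(label: str) -> list:
    """List the spellings of a label that should all match each other.

    Each non-ASCII character contributes its lower- and upper-case forms,
    each in NFC and NFD; ASCII characters are kept as they are.
    """
    per_char = []
    for ch in unicodedata.normalize("NFC", label):
        if ch.isascii():
            per_char.append({ch})
            continue
        forms = set()
        for cased in (ch.lower(), ch.upper()):
            for form in ("NFC", "NFD"):
                forms.add(unicodedata.normalize(form, cased))
        per_char.append(forms)
    # Cartesian product over the per-character choices.
    return sorted("".join(parts) for parts in itertools.product(*per_char))
```

For 'foó' this yields the four strings mentioned above. Two of them, the decomposed lower- and upper-case forms, differ only in the ASCII case of the base letter, so an ordinary DNS server would already treat that pair as one name.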
Of course no one would do _that_ (for every character there could be a dozen or more equivalent combinations, so the zone file size would explode). But a DNS server that implemented case- and normalization-insensitive UTF-8 matching would be indistinguishable from a dumb server serving such zones.
I.e., DNS servers can, in fact, do what Shawn proposes and still be compliant as long as all servers for a given zone behave the same way.
Indeed, if _I_ were developing a DNS server I'd provide an option to treat A-labels, U-labels and raw UTF-8 equivalent names as equivalent. I think that's probably the ideal thing to do in a private namespace using UTF-8 labels because IDNA-compliant nodes will be able to find equivalent A-labels, with happy interoperating users as a result.
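One way such an option might be implemented is to index the zone by a single canonical key and reduce every query name to the same key before lookup. The sketch below assumes Python's IDNA2003 "idna" codec as the canonicalizer and uses a plain dict as a toy zone; a real server would still have to answer with the owner name exactly as it appeared in the query.

```python
def match_key(name: str) -> bytes:
    """Reduce A-label, U-label, and un-normalized UTF-8 spellings to one key.

    Assumption: the "idna" codec (IDNA2003 nameprep plus punycode) defines
    the equivalence; the final lower() covers ASCII case.
    """
    return name.encode("idna").lower()


# Toy zone data indexed by the canonical key.
zone = {match_key("fo\u00f3.example"): "1.1.1.1"}


def lookup(qname: str):
    """Return the address for qname, or None if no equivalent name exists."""
    return zone.get(match_key(qname))
```

IDNA-compliant clients that send the A-label and private-namespace clients that send UTF-8 then land on the same record.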
> The problem is that those aren't allowed in registerable domain names, > which are subject to hostname restrictions defined outside the DNS.
That's another, separate issue, one that doesn't really apply to private namespaces. Shawn's still on solid ground.
> These are really policy matters, and not protocol matters, but owing
Right.
> to a long history, the distinction was not always understood by > implementers and so we ended up with a lot of rules that were in fact > policy matters getting enshrined in "protocol" broadly > (mis)understood.
I assume you mean middle-boxes (caching servers) that aren't 8-bit clean.
But again, for a private namespace that's probably not a problem. And it's probably not a problem at all, whether in private or public namespaces.
1806101940 - Andrew Sullivan <ajs@shinkuro.com>
>Shawn Steele wrote: > I cannot imagine why you'd want to force a Unicode string to punycode to pass it between applications, unless you are doing a DNS query. The canonical form should be U-label and that's what apps should exchange. >
It seems to me that if you're going to require the exchange of only U-labels, and you're going to validate your input, then it actually doesn't matter whether you exchange U-labels or A-labels: they're freely convertible back and forth, and if you want to know whether something is a valid U-label (for instance), the easiest way might well be just to convert it to an A-label, and back into a U-label, and see if you get the binary equivalent as output.
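The round-trip check described above might look roughly like this in Python. Note the assumption: Python's built-in "idna" codec implements IDNA2003 (nameprep plus Punycode), not IDNA2008, so this only illustrates the shape of the test. The function name is invented for the example.

```python
def is_roundtrip_stable(label):
    """Return True if label survives U-label -> A-label -> U-label unchanged.

    Uses the standard library "idna" codec, which follows IDNA2003;
    an IDNA2008 library would apply different validity rules.
    """
    try:
        a_label = label.encode("idna")    # ToASCII
        u_label = a_label.decode("idna")  # ToUnicode
    except UnicodeError:
        return False
    return u_label == label


if __name__ == "__main__":
    for candidate in ("fo\u00f3", "fo\u00d3", "foo\u0301"):
        print(ascii(candidate), is_roundtrip_stable(candidate))
```

With this codec the lowercase precomposed 'foó' comes back unchanged, while the uppercase and decomposed spellings are mapped to it and so fail the comparison.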
So an application that expects to hand around DNS label slots as U-labels actually needs to be able to cope with A-labels, and conversely, it seems to me.
1806101940 - Nicolas Williams <Nicolas.Williams@oracle.com>
<< There's no harm in having such a flag. > > Great. I'm thinking of updating my draft to specify this, and to make > it a bit less internal-memo-like.
Sure, please do!
<< I also want an AI_NO_CANONIDN, in preparation for a future day when << getaddrinfo() returns U-labels as the canonical hostname by default. > > I'm not sure this day will come. There will always be software > components that are written simplistically without IDNA support. But > I'm not strongly against the flag. The downside is that it adds > complexity with uncertain gain, something I generally prefer to avoid.
Does it add complexity? Where? At most you have to check for mutually exclusive flag uses.
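To show how little is involved, here is a sketch of the mutual-exclusion check in Python. The flag values are placeholders chosen for this example, not the constants of any real libc, and check_idn_flags() is a hypothetical helper.

```python
# Placeholder values for illustration only; not taken from any real header.
AI_CANONIDN = 0x1
AI_NO_CANONIDN = 0x2


def check_idn_flags(flags):
    """Reject flag sets that ask for both U-label and A-label canonical names."""
    both = AI_CANONIDN | AI_NO_CANONIDN
    if (flags & both) == both:
        raise ValueError("AI_CANONIDN and AI_NO_CANONIDN are mutually exclusive")
    return flags
```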
> The AI_NO_CANONIDN flag could also be introduced later on, when we > actually consider making the change you suggest, and we may be in a more > informed situation to specify the flag more precisely at that point too.
I can certainly see a use for it now: if you want a canonical name that's safe to place in an IDN-unaware domainname slot and want to allow for a future when getaddrinfo() returns U-labels by default.
If we don't provide AI_NO_CANONIDN now then we will never be able to have getaddrinfo() return U-labels by default. That's painting ourselves into a corner. Why?
1806102006 - Andrew Sullivan <ajs@shinkuro.com>
> I don't think you're being creative enough :) > > A DNS zone could easily have N copies of the same RRset, all with > different but equivalent IDNs: > > foÓ.example.IN A 1.1.1.1 > foó.example.IN A 1.1.1.1 > xn--fo-6ja.example. IN A 1.1.1.1 > ... > It would be perfectly fine to produce a tool that generates all the > possible aliases of a label given some set of matching rules.
Yes. Over in DNSEXT we appear to be sliding down the slippery slope of attempting to solve this problem in a more generic way -- to solve not just these kinds of examples but other sorts of aliasing too. If you want to help, we can use it (though I warn you that the cost is getting a good handle on all the twisty passages that are part of the DNS both as deployed and as specified).
Some have argued very strongly that the only right thing to do here is solve it entirely with provisioning tools, and to stop trying to make the DNS provide the information to allow inferences.
(While I'm at it, I also want to point out that I've requested a special hour for DNSEXT that is aimed squarely at non-DNS weenies, who want the DNS to do things and are unhappy that "aliases" don't work as they want. This is a plain requirements-gathering exercise: you talk, and we write down. Then we'll at least be able to say, "Yes, can do that," to some things and, "Nope, can't, and here's why," to others.)
> exploding zone file size). But a DNS server that implemented case- and > normalization-insensitive UTF-8 matching would be indistinguishable from > a dumb server serving such zones.
It'd also be violating the matching rules in STD13, as far as I can tell.
> Indeed, if _I_ were developing a DNS server I'd provide an option to > treat A-labels, U-labels and raw UTF-8 equivalent names as equivalent.
I don't know what this means. How would you know that something was a raw UTF-8 label? All you get is a bitstream. You can't tell what encoding it was in. So what would it mean for these to be equivalent? You might get this to work much of the time for much of the Internet, but we're still not quite the Internet Bodge-Up Task Force, and I'd hate for us to become one.
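A small Python illustration of the point about the bitstream: the same octets decode without error under more than one encoding, so the server cannot tell from the bytes alone which one the sender meant. The byte string is just an example value.

```python
# The octets a client might send for the label 'foó' encoded as UTF-8.
wire = b"fo\xc3\xb3"

as_utf8 = wire.decode("utf-8")      # three characters
as_latin1 = wire.decode("latin-1")  # four characters, also a valid decoding

if __name__ == "__main__":
    print(ascii(as_utf8), len(as_utf8))
    print(ascii(as_latin1), len(as_latin1))
```

Both decodings succeed, and nothing on the wire says which was intended.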
<< to a long history, the distinction was not always understood by << implementers and so we ended up with a lot of rules that were in fact << policy matters getting enshrined in "protocol" broadly << (mis)understood. > > I assume you mean middle-boxes (caching servers) that aren't 8-bit > clean.
And the fact that even longtime IETF participants don't always make the careful distinction between hostname and domain name, never mind people who weren't around when the distinction was one you could actually see.
> But again, for a private namespace that's probably not a problem. And > it's probably not a problem at all, whether in private or public > namespaces.
Ah, yes. Because we all know that them gardens stay behind their walls.
1806102100 - Patrik Fältström <patrik@frobbit.se>
> U-labels and raw UTF-8 equivalent
Please explain the differences between these two.
Patrik
1806102108 - Shawn Steele <Shawn.Steele@microsoft.com>
> It seems to me that if you're going to require the exchange of only > U-labels, and you're going to validate your input
How often is data actually validated? Often hrefs aren't (at least not when initially entered). Applications just assume a domain name will resolve, and, if it doesn't, it fails then.
1806102122 - Patrik Fältström <patrik@frobbit.se>
<< It seems to me that if you're going to require the exchange of only << U-labels, and you're going to validate your input > > How often is data actually validated? Often hrefs aren't (at least not when initially entered). Applications just assume a domain name will resolve, and, if it doesn't, it fails then.
This is because programmers think they are working with ascii, and comparison algorithm with ascii is so simple. It is not for "unicode stuff". It is for U-labels, but not for unicode strings in general.
1806102133 - Andrew Sullivan <ajs@shinkuro.com>
> How often is data actually validated? Often hrefs aren't (at least not when initially entered). Applications just assume a domain name will resolve, and, if it doesn't, it fails then. >
And of course, this blind acceptance of any data from any random place in the Net including random evil humans has caused no trouble? This can't, surely, be the plan, even if it is in fact how things are done. If you're going to insist on U-labels for interchange, you have _no choice_ but to validate them as actually being U-labels, or they are all but guaranteed to have crap in them that will never make it through the IDNA2008 algorithms when it is finally time to do this.
I completely agree with you that it would be insane to require every application to "do DNS". But they can't handle domain name slots in an "internationalized" way, and expect a standard interchange format (with all its restrictions), but take whatever binary data they get (or anyway, not reliably).
1806102150 - Shawn Steele <Shawn.Steele@microsoft.com>
Not "crapping out" is simple: the app can make a DNS request and see if something comes back. It hardly matters if it matches all the DNS rules and dots its i's and crosses its t's if there's no record to back it up. It seems like it should be up to the app to decide if it's worth validating up-front, or allowing the failure to happen rather late.
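The "fail late" approach could be as small as the following Python sketch, which just asks the resolver and reports whether anything came back. The helper name resolves() is made up for this example. Python's socket.getaddrinfo() encodes non-ASCII hostnames with its IDNA2003 codec, so a malformed name can fail before any lookup is attempted.

```python
import socket


def resolves(name):
    """Return True if the platform resolver finds any address for name."""
    try:
        socket.getaddrinfo(name, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the lookup failed; UnicodeError: the name could not
        # even be encoded for the lookup (for example, an empty label).
        return False
    return True
```

This tells the app nothing about why a name failed, which is the trade-off being discussed.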
If the application needs to do validation, then there should be a "CheckValidDomainName" API or something. Applications shouldn't be doing those things themselves. (If they did, then IDNA2003->IDNA2008 would require all apps be touched, not just whatever APIs they call.) Email for example constantly gets complaints about apps inconsistently validating "valid" (but unusual) addresses.
Apps (& even other standards) should NOT be required to know how the bidi rules work and all that. Instead they should point to an SDK to handle that (or standards should point to the standard), instead of rebuilding everything themselves.
1806102200 - Shawn Steele <Shawn.Steele@microsoft.com>
> This is because programmers think they are working with ascii, > and comparison algorithm with ascii is so simple. It is not for > "unicode stuff". It is for U-labels, but not for unicode strings in > general.
Well, yea, people don't handle Unicode very well, which is why it's a good idea to use globalized comparison APIs from whatever OS you're using to compare strings, so you don't have to rebuild it yourselves. (Though you do need to know what flags to use and what APIs to call when :)
If an app needs to compare domain names, they should call the domain name SDK's "CompareNames()" function. Then that can ensure they're in canonical form or whatnot and do the right thing. Of course that requires that someone make such a method for apps to call.
Abstraction is important. Any app that's validating the IDNA2003 bidi rules by themselves now has to change because the bidi rules changed for IDNA2008. Nothing about that app's space changed, only the rules for DNS it depends on.
All I'm trying to get at is that the abstraction is important. People shouldn't embed detailed knowledge in layers where it doesn't belong. Suppose people embed ACE and then "we (the WG/IETF)" decide that we'll allow UTF-8 DNS after all. What happens then? If they just embed Unicode and call "IsValidDomainName()", "CompareDomainName()", "MakeCanonicalDomainName()" or whatever's necessary, then it doesn't matter what happens to IDN in the future: when their library gets updated with any new behavior, they'll still work.
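To make the layering concrete, here is a hypothetical Python sketch of such a helper layer. The three functions are stand-ins for the API names above, not an existing SDK, and the backend is assumed to be Python's built-in "idna" codec (IDNA2003). Swapping in an IDNA2008 library would only change make_canonical_domain_name().

```python
def make_canonical_domain_name(name):
    """Return a lowercase ASCII (A-label) form of name for storage or comparison."""
    # The only place that knows about the encoding rules in use.
    return name.encode("idna").decode("ascii").lower()


def is_valid_domain_name(name):
    """Return True if the backend accepts name; it does no DNS lookup."""
    try:
        make_canonical_domain_name(name)
    except UnicodeError:
        return False
    return True


def compare_domain_names(a, b):
    """Return True if a and b name the same domain, ignoring a trailing dot."""
    canon_a = make_canonical_domain_name(a).rstrip(".")
    canon_b = make_canonical_domain_name(b).rstrip(".")
    return canon_a == canon_b


if __name__ == "__main__":
    print(compare_domain_names("fo\u00d3.example", "xn--fo-6ja.example"))
```

Callers pass around whatever form they were given and never touch Punycode themselves, so a change in the rules is absorbed by one function.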
