IDNA2010 Common Objectives

From IUCG - Internet Users Contributing Group


Note introduced by JFC Morfin, IUCG Facilitator, in regard to copyrights in this IUWW

The Internet belongs to everyone. Technically, the Internet is defined by the IETF documentation, which influences those who build, use, and manage the Internet so that it works and is used better. It is therefore of the utmost importance that:

  • the terms of this documentation belong to everyone.
  • their meaning and definitions freely span the entire Internet technology space, are the same for everyone, may be adjusted by consensus, and can be made available as an IANA table.
  • the terms, word lists, and glossary drafts proposed by several groups, and their discussions of post-IDNA2008 needs, are taken into account. If a definition they gave is adequate, it is better to keep it. If it must be adjusted, everyone must be free to discuss it. In this process the priority is for everyone to understand the same term in the same way, whatever the wording and whoever its author.

This wiki page has the same legal nature as a mail on an IETF mailing list. It extensively quotes propositions from IETF and non-IETF people and as such is subject (as is every IUCG contribution) to the IETF rules and practices that apply, unless adapted and specified otherwise. The IUCG uses working wiki pages precisely because its members are NOT IETF engineers but volunteer Internet/IETF users who are unfamiliar with the I-D/RFC format and working method. They try to help prevent misinterpretations between the IETF and lead users, who might otherwise develop, document, and deploy uncoordinated technical solutions. The target is to:

1. list all the issues and terms that some have found to belong to the (Internet Use and) Multilinguistic Internet by origin.
2. sort out the issues and terms in order to condense the needed descriptions and address the conflicting definitions.
3. gather terms by the kind of problem resolution, and address any possible conflicts.
4. copy every I-D author, WG, and external architect so they can then decide whether these contributions help them or not.




This glossary strives to gather, from the IAB, IETF/WG/IDNA2008, IETF/WG/PRECIS, ICANN/WG/VIP, IUCG, IUTF, ALFA, etc.:

  • evaluations of the consequences of IDNA2008 on the general Internet architecture, which the IUCG, for clarification purposes, calls:
    • the IDNA2010 problem: the adaptation of the Internet technology (protocols, way of reading RFCs, IETF paradigm) to IDNA2008.
    • the further IDNA2012 problem: the global review and validation of the IDNA2008/IDNA2010 architectural integration, and of the technical and cultural integration within the digital convergence and its operance (short-term), governance (mid-term), and adminance (long-term) concerns.
  • the terms, words, concepts, notions, and positions they use in discussing, building, and operating the resulting multilingual internet and its support of the semiotic and semantic Intersem evolution.


The DRY (Don't Repeat Yourself) Principle states:
Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.
Andrew Hunt
Clarke's Three Laws:
  • When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
  • The only way of discovering the limits of the possible is to venture a little way past them into the impossible.
  • Any sufficiently advanced technology is indistinguishable from magic.
Sir Arthur Charles Clarke
Principle of complication:
In all things complication comes from mutual incompatibilities at the point of complexity, that is to say the point from which the polylogue replaced the dialogue generated by the creative monologue. It will disappear in the osmotic completion. This incompatibility often results from false polynymies, which is why a common glossary is always a prerequisite to a solution.
ALFA (Architecture Libre/Free Architecture - The 7 relational planes).



The technical problems to be solved (IDNA2010)


IAB Internationalization Program

July 2010

Excerpts.

Traditionally the IAB has taken an interest in a number of architectural areas. Among the architectural areas ...: Internationalization of the Internet and balance with localization and retention of a global network.

There are some areas that require a long-term perspective and may involve various activities and deliverables. For instance, such a complex area may require a separate activity for scoping the work (BOFs, presentations, position papers), progressing the work, or stimulating the charter development of new work in the IETF. Such an effort may involve collaboration with other organisations.

Work in such areas is organized in the form of a program. Programs can be thought of as IAB directorates, small task forces, or ad-hoc bodies of (independent) technical experts (see RFC 2850, Section 2.1).

Introduction

Internationalization: continuing the development of guidelines with respect to the trade-offs that need to be made when internationalizing user facing protocol parameters.

The position of the IAB

The IAB has taken several initiatives to further Internationalization (i18n) work within the IETF. Choices in architecture and protocol design may affect a large set of Internet users and the lessons learned from earlier experiences [...]. Those experiences include the development of protocols to permit internationalized content in electronic mail and on the web, policies for character set usage and coding, and the development, evaluation, and evolution of internationalized domain names. That work can and should be subject to ongoing review and generalized and extended into additional areas. With this program the IAB intends to create a longer-term effort that not only involves ongoing evaluation and the development of guidance but also working with other organizations to expand both our understanding of the issues and theirs.

Goals

Develop and provide guidance for architectural and strategic efforts surrounding internationalization. Generate working documents, organize workshops, and propose and develop relationships with other organizations as needed.

July 2011

Description

Work in the IETF and other areas has exposed the general topic of Internationalization (i18n) as a very complex one, with almost all decisions involving complex tradeoffs along multiple dimensions rather than “right” or “wrong” answers. The IAB intends to try to bring these issues together to reduce the number of decisions that are made on an isolated topic basis and, where appropriate, to review prior IAB and IETF work that may require updating to reflect accumulated experience.

Purpose

The primary purposes of this program are to:

  • monitor the state of internationalization-related items in the industry including the IETF and related organizations such as ICANN, Unicode, and ISO/IEC JTC1 SC2,
  • provide recommendations (via e.g. email or IAB techchat or document) to the IAB,
  • discuss/review any IAB documents or liaison statements,
  • plan any other internationalization efforts desired by the IAB (e.g., workshop or plenary), and
  • maintain historical knowledge within a committee of the IAB.

Members

IAB Members:

  • Dave Thaler (Program Lead), Olaf Kolkman, Dow Street (Exec. Dir., list admin)

Non-IAB Members:

  • Stuart Cheshire, Leslie Daigle, Patrik Fältström, John Klensin, Pete Resnick, Peter Saint-Andre, Andrew Sullivan


IAB review (RFC 6055)

(The quoted text comes from the RFC. More compact criteria for the evaluation of IDNA2010-compliant solutions are necessary.)

On many platforms, the name resolution library will automatically use a variety of protocols to search a variety of namespaces, which might be using UTF-8 or other encodings. In addition, even when only the DNS protocol is used, in many operational environments, a private DNS namespace using UTF-8 is also deployed and is automatically searched by the name resolution library.
Using multiple canonical formats and multiple encodings in different protocols, or even in different places in the same namespace, creates problems. Because of this, and because both IDNA A-labels and UTF-8 are in use as encoding mechanisms for domain names today, RFC 6055 Sections 4 (Recommendations) and 5 (Security Considerations) make the recommendations described below.

  • It is inappropriate for an application that calls a general-purpose name resolution library to convert a name to an A-label unless the application is absolutely certain that, in all environments where the application might be used, only the global DNS that uses IDNA A-labels actually will be used to resolve the name.
  • Instead, conversion to A-label form, or any other special encoding required by a particular name-lookup protocol, should be done only by an entity that knows which protocol will be used (e.g., the DNS resolver, or getaddrinfo() upon deciding to pass the name to DNS), rather than by general applications that call protocol-independent name resolution APIs. (Of course, applications that store strings internally in a different format than that required by those APIs, need to convert strings from their own internal format to the format required by the API.) Similarly, even if an application can know that DNS is to be used, the conversion to A-labels should be done only by an entity that knows which part of the DNS namespace will be used.
  • That is, a more intelligent DNS resolver would be more liberal in what it would accept from an application and be able to query for both a name in A-label form (e.g., over the Internet) and a UTF-8 name (e.g., over a corporate network with a private namespace) in case the server only recognizes one. However, we might also take into account that the various resolution behaviors discussed earlier could also occur with record updates (e.g., with Dynamic Update [RFC2136]), resulting in some names being registered in a local network's private namespace by applications doing conversion to A-labels, and other names being registered using UTF-8. Hence, a name might have to be queried with both encodings to be sure to succeed without changes to DNS servers.
  • Similarly, a more intelligent stub resolver would also be more liberal in what it would accept from a response as the value of a record (e.g., PTR) in that it would accept either UTF-8 (U-labels in the case of IDNA) or A-labels and convert them to whatever encoding is used by the application APIs to return strings to applications.
  • Indeed the choice of conversion within the resolver libraries is consistent with the quote from Section 6.2 of the original IDNA specification [RFC3490] stating that conversion using the Punycode algorithm (i.e., to A-labels) "might be performed inside these new versions of the resolver libraries".
    That said, some application-layer protocols (e.g., EPP Domain Name Mapping [RFC5731]) are defined to use A-labels rather than simply using UTF-8 as recommended by the IETF character sets and languages policy [RFC2277]. In this case, an application may receive a string containing A-labels and want to pass it to name resolution APIs.
    Again the recommendation that a resolver library be more liberal in what it would accept from an application would mean that such a name would be accepted and re-encoded as needed, rather than requiring the application to do so.
  • It is important that any APIs used by applications to pass names specify what encoding(s) the API uses. For example, GetAddrInfoW() on Windows specifies that it accepts UTF-16 and only UTF-16. In contrast, the original specification of getaddrinfo() [RFC3493] does not, and hence platforms vary in what they use (e.g., Mac OS uses UTF-8 whereas Windows uses Windows code pages).
  • The question remains about what, if anything, a DNS server should do to handle cases where some existing applications or hosts do IDNA queries using A-labels within the local network using a private namespace, and other existing applications or hosts send UTF-8 queries. It is undesirable to store different records for different encodings of the same name, since this introduces the possibility for inconsistency between them. Instead, a new DNS server serving a private namespace using UTF-8 could potentially treat encoding-conversion in the same way as case-insensitive comparison which a DNS server is already required to do, as long as the DNS server has some way to know what the encoding is. Two encodings are, in this sense, two representations of the same name, just as two case-different strings are. However, whereas case comparison of non-ASCII characters is complicated by ambiguities (as explained in the IAB's Review and Recommendations for Internationalized Domain Names [RFC4690]), encoding conversion between A-labels and U-labels is unambiguous.
  • Having applications convert names to prefixed ACE format (A-labels) before calling name resolution can result in security vulnerabilities. If the name is resolved by protocols or in zones for which records are registered using other encoding schemes, an attacker can claim the A-label version of the same name and hence trick the victim into accessing a different destination. This can be done for any non-ASCII name, even when there is no possible confusion due to case, language, or other issues. Other types of confusion beyond those resulting simply from the choice of encoding scheme are discussed in "Review and Recommendations for IDNs" [RFC4690].
  • Designers and users of encodings that represent Unicode strings in terms of ASCII should also consider whether trademark protection or phishing are issues, e.g., if one name would be encoded in a way that would be naturally associated with another organization or product.
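
The "query with both encodings" behavior recommended above can be illustrated with a minimal Python sketch. It is not taken from RFC 6055: the function names `to_a_label`, `to_u_label`, and `candidate_queries` are hypothetical, and the raw Punycode codec is used without any of the IDNA2008 validity checks, so this only shows the shape of the idea.

```python
ACE_PREFIX = "xn--"

def to_a_label(label):
    """Return the A-label form of one label (ASCII labels pass through)."""
    if label.isascii():
        return label.lower()
    # Assumption: plain lowercasing stands in for a real mapping step.
    return ACE_PREFIX + label.lower().encode("punycode").decode("ascii")

def to_u_label(label):
    """Return the Unicode form of one label (non-ACE labels pass through)."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label

def candidate_queries(name):
    """List the encodings a liberal resolver might try, A-label form first."""
    labels = name.split(".")
    a_form = ".".join(to_a_label(label) for label in labels)
    u_form = ".".join(to_u_label(label) for label in labels)
    queries = [a_form]
    if u_form != a_form:
        queries.append(u_form)  # e.g. for a private UTF-8 namespace
    return queries

print(candidate_queries("bücher.example"))
```

In this sketch the application passes the name as it has it, and the component that knows which namespace is being queried picks the encoding, which is the division of labor the recommendations describe.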


Issues in Identifier Comparison for Security Purposes

IAB (Dave Thaler)

The IETF policy on character sets and languages [RFC2277] requires support for UTF-8 in protocols, and as a result many protocols now do support non-ASCII characters. When a hostname is sent in a UTF-8 field, there are a number of ways it may be encoded. For example, labels might be encoded directly in UTF-8, or might first be Punycode-encoded or percent-encoded and then encoded in UTF-8.

For example, in URIs, [RFC3986] section 3.2.2 specifically allows for the use of percent-encoded UTF-8 characters in the hostname, as well as the use of IDNA encoding using the Punycode algorithm.

Percent-encoding is unambiguous for hostnames since the percent character cannot appear in the strict definition of a "hostname", though it can appear in a DNS name.

Punycode-encoded labels (or "A-labels") on the other hand can be ambiguous if hosts are actually allowed to be named with a name starting with "xn--", and false positives can result. While this may be extremely unlikely for normal scenarios, it nevertheless provides a possible vector for an attacker.

A hostname comparator used with non-ASCII strings thus needs to decide whether a Punycode-encoded string should or should not be considered a valid hostname label, and if so, then whether it should match the equivalent Unicode string ("U-label").
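
The decision such a comparator has to make can be sketched as follows. This is an illustration under stated assumptions, not a recommended algorithm: `canonical_host` and `hosts_match` are hypothetical names, the `decode_a_labels` flag stands for the policy choice described above, and NFC plus lowercasing is only a stand-in for a real preparation profile.

```python
import unicodedata
from urllib.parse import unquote

def canonical_host(host, decode_a_labels=True):
    """Reduce a hostname to one Unicode form for comparison purposes."""
    host = unquote(host)  # percent-encoded UTF-8 -> Unicode characters
    labels = []
    for label in host.split("."):
        if decode_a_labels and label.lower().startswith("xn--"):
            try:
                label = label[4:].encode("ascii").decode("punycode")
            except UnicodeError:
                pass  # not valid Punycode: keep the label as a literal string
        labels.append(unicodedata.normalize("NFC", label).lower())
    return ".".join(labels)

def hosts_match(a, b, decode_a_labels=True):
    """Compare two hostnames after canonicalization."""
    return canonical_host(a, decode_a_labels) == canonical_host(b, decode_a_labels)

print(hosts_match("b%C3%BCcher.example", "xn--bcher-kva.example"))
```

With `decode_a_labels=False` the comparator treats an "xn--" label as an ordinary ASCII label, so the two names in the example no longer match. Which setting is right depends on whether hosts in the namespace may really be named with an "xn--" prefix.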

For example, Section 3.1 of "Transport Layer Security (TLS) Extensions" [RFC3546], states:

If the hostname labels contain only US-ASCII characters, then the client MUST ensure that labels are separated only by the byte 0x2E, representing the dot character U+002E (requirement 1 in section 3.1 of [IDNA] notwithstanding). If the server needs to match the HostName against names that contain non-US-ASCII characters, it MUST perform the conversion operation described in section 4 of [IDNA], treating the HostName as a "query string" (i.e. the AllowUnassigned flag MUST be set). Note that IDNA allows labels to be separated by any of the Unicode characters U+002E, U+3002, U+FF0E, and U+FF61, therefore servers MUST accept any of these characters as a label separator. If the server only needs to match the HostName against names containing exclusively ASCII characters, it MUST compare ASCII names case-insensitively.

[TODO: add observations based on the above text. The text in RFC 3546 is now obsolete, since IDNA2008 is much more restrictive about the use of dot-oids in IDNs. In addition, conversion between A-labels and Unicode strings that claim to be labels (but not vice versa) turns slightly ambiguous if mapping is permitted and pre-mapping strings may appear.]
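
For reference, the label-separator rule quoted from RFC 3546 can be sketched in a few lines of Python. The name `split_labels` is hypothetical, and the sketch reflects only the older rule quoted above (four accepted separators), which the TODO notes is more permissive than IDNA2008.

```python
import re

# The four separators named in the quoted text:
# U+002E, U+3002, U+FF0E, and U+FF61.
_LABEL_SEPARATORS = re.compile("[\u002e\u3002\uff0e\uff61]")

def split_labels(hostname):
    """Split a hostname into labels on any of the four dot characters."""
    return _LABEL_SEPARATORS.split(hostname)

print(split_labels("www\u3002example\uff0ecom"))
```

A server following that rule would split first and then convert and compare label by label.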

ISO 10646

For some additional discussion of security issues that arise with internationalization, see TR36. A thorough study of this document should be a prerequisite for any work on a "graphcode" project. Unfortunately, Unicode would not permit us to carry it in public.


IUCG Problem summary

IPv6 is to give everyone millions of addresses and sub-addresses that we call IDv6 (globally addressable IIDs). From then on, people will allocate one, several, or even thousands of domain names to themselves, in the same way that they used to have or use an address, a name, a nickname, a mobile directory of addresses, TV channels, file names, etc. Therefore, we are talking about a worldwide digital ecosystem naming infrastructure of hundreds of billions of domain names, plus mail names, login IDs, passwords, keys, codes, etc. This is a change of several orders of magnitude in the DNS and in its use.

In addition, we the users want this to be simple, sure, and secure for everyone in every language and every orthotypographic variation (what ICANN calls "variants" on the user side). Machines and systems need orthotypographic algorithms if they are to support orthotypography.

These algorithms will be used, directly or indirectly, in cases such as variants, Stringprep replacement, IDNA support on the user side, extended services naming, and IUsers' expectations. They MUST lead to results as precise as those of mathematical algorithms; however, they obey an entirely different logic, as they also depend on brain-to-brain level relations.

We know from IDNA2008, as exemplified in RFC 5895, that such a semantic logic is to be supported on a fringe to fringe layer (i.e. outside and above the DNS itself). Then, we need to understand it and how it can be best supported. To that end, we need to get everyone who shares this burden to try to understand this logic in common and check that it applies in all of their cases.


ICANN position on Experimentation (ICP-3)

Experimentation has always been an essential component of the Internet's vitality. Working within the system does not preclude experimentation, including experimentation with alternate DNS roots. But these activities must be done responsibly, in a manner that does not disrupt the ongoing activities of others and that is managed according to experimental protocols.

DNS experiments should be encouraged. Experiments, however, almost by definition have certain characteristics to avoid harm:

  • (a) they are clearly labeled as experiments,
  • (b) it is well understood that these experiments may end without establishing any prior claims on future directions,
  • (c) they are appropriately coordinated within a community-based framework (such as the IETF), and
  • (d) the experimenters commit to adapt to consensus-based standards when they emerge through the ICANN and other community-based processes.

[] Moreover, it is essential that experimental operations involving alternate DNS roots be conducted in a controlled manner, so that they do not adversely affect those who have not consented to participate in them. Given the design of the DNS, and particularly the intermediate-host and cache poisoning issues [], special care must be taken to insulate the DNS from the alternate root's effects. For example, alternate roots are commonly operated by large organizations within their private networks without harmful effects, since care is taken to prevent the flow of the alternate resource records onto the public Internet.

It should be noted that the original design of the DNS provides a facility for future extensions that accommodates the possibility of safely deploying multiple roots on the public Internet for experimental and other purposes. As noted in RFC 1034, the DNS includes a "class" tag on each resource record, which allows resource records of different classes to be distinguished even though they are commingled on the public Internet. For resource records within the authoritative root-server system, this class tag is set to "IN"; other values have been standardized for particular uses, including 255 possible values designated for "private use" that are particularly suited to experimentation.
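
The "class" tag arithmetic can be checked with a short Python sketch. The numeric values are not given in the ICANN text; they are taken here from the IANA DNS parameters registry (IN = 1, CH = 3, HS = 4, private use 0xFF00 to 0xFFFE), and the helper name `describe_class` is hypothetical.

```python
# Well-known DNS CLASS values (assumption: as listed in the IANA registry).
DNS_CLASSES = {1: "IN", 3: "CH", 4: "HS"}

# Private-use CLASS values: 0xFF00 through 0xFFFE inclusive.
PRIVATE_USE_CLASSES = range(0xFF00, 0xFFFF)

def describe_class(value):
    """Name a 16-bit DNS CLASS value for illustration purposes."""
    if value in DNS_CLASSES:
        return DNS_CLASSES[value]
    if value in PRIVATE_USE_CLASSES:
        return "private use"
    return "other"

print(len(PRIVATE_USE_CLASSES), describe_class(1), describe_class(0xFF00))
```

The private-use range holds 255 values, which matches the figure quoted above.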

As described in a recent proposal within the IETF this "class" facility allows an alternate DNS namespace to be operated from different root servers in a manner that does not interfere with the stable operation of the existing authoritative root-server system. To take advantage of this facility, it should be noted, requires the use of client or applications software developed for the alternate namespace (presumably deployed after responsible testing), rather than the existing software that has been developed to interoperate with the authoritative root.[]

In an ever-evolving Internet, ultimately there may be better architectures for getting the job done where the need for a single, authoritative root will not be an issue. But that is not the case today (9 July 2001). And the transition to such an architecture, should it emerge, would require community-based approaches. In the interim, responsible experimentation should be encouraged, but it should not be done in a manner that affects those who do not consent after being informed of the character of the experiment.


RFC 5892 to be edited in regard to IANA support

This is an excerpt from John Klensin's draft (Clarified IANA Considerations for IDNA) about RFC 5892.


As part of the IDNA package [RFC5890], the IANA Considerations Section of RFC 5892 [RFC5892] specified an "IDNA Derived Properties" registry. Experience with that specification demonstrated it to be insufficiently clear on several details. This document respecifies that registry to eliminate any confusion. In particular, clarifications are required for the following:

  • Preservation of tables
    The registry actually consists of one table per Unicode version starting with the Unicode 5.2 table that was included in RFC 5892.
  • Identification of tables
    Each table in the registry will be identified with a reference to the particular Unicode version from which it was generated and the date on which it was installed.
  • Timing and Mechanism for Generating Tables
    The text in RFC 5892 is not clear about who is responsible for generating tables and when new tables are to be installed, and this requires clarification.


NOTE of the author

As of the time of posting the first draft of this document, there had been no real discussion in the community of the mechanisms and timing for getting new versions posted. The suggestion made below is very tentative and intended to encourage discussion. It tries to preserve the intent and discussions leading up to RFC 5892 to the extent possible by putting the burden of deciding when new table versions are appropriate on the expert reviewer. It also makes IANA responsible for the actual work of producing and installing new tables, but identifies mechanisms by which they can get help with tables and with identifying the existence of new versions of Unicode as needed.
The procedure outlined below delays any posting of a table by IANA until either:
(i) the expert reviewer signs off that no changes to RFC 5892 rules are needed
or (ii) the IESG, after being notified by the expert reviewer that there is a problem, figures out how to handle things in terms of documentation and community consensus, and approves the result.
If the community would prefer posting of a table for comparison or checking purposes as soon as possible after a new Unicode version is completed, it would be possible to devise a model for "Preliminary" (just after Unicode gets through) and "Final" (after IETF is sure it is through) versions of a table with procedures for getting from one to the other.


Proposed Specific Updates to RFC 5892:

The following subsections [could be] added to the specification of RFC 5892 Section 5.1 "IDNA-Derived Property Value Registry". The numbers in parentheses are the subsection numbers the paragraphs would have if actually inserted in RFC 5892.

  • (5.1.1) The registry consists of a set of tables, one table per version of Unicode starting with Unicode 5.2 and continuing with each new version of Unicode as specified below.
  • (5.1.2) Each table in the registry shall be identified with the Unicode version, a reference to the specification of that version, and the date the table was last updated.
  • (5.1.3) As soon as feasible after a new version of Unicode is finalized and published, IANA or the expert reviewer (as they mutually agree) will generate a new version of the table. The expert reviewer will then conduct a review. If issues are found, they are brought to IESG attention as discussed in RFC 5892. If not, the expert reviewer will advise IANA to install the new table version in the registry, identifying it as described above.

Members of the community are encouraged to call new versions of Unicode to the attention of IANA and the expert reviewer. IANA is encouraged to call on the expert reviewer or others for assistance in compiling and verifying Preliminary tables as needed.
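
The proposed procedure can be summarized as a small Python model. This is only a reading aid under stated assumptions: the class and method names (`DerivedPropertyTable`, `Registry`, `review_and_install`) and the returned status strings are hypothetical and do not come from the draft or from IANA.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DerivedPropertyTable:
    unicode_version: str   # (5.1.2) identifies the table
    reference: str         # (5.1.2) reference to that Unicode version's spec
    last_updated: date     # (5.1.2) date the table was last updated

@dataclass
class Registry:
    # (5.1.1) one table per Unicode version; earlier tables are preserved
    tables: dict = field(default_factory=dict)

    def review_and_install(self, table, issues_found):
        """(5.1.3) install only after the expert reviewer signs off."""
        if issues_found:
            return "brought to IESG attention"
        self.tables[table.unicode_version] = table
        return "installed"

registry = Registry()
first = DerivedPropertyTable("5.2", "Unicode 5.2 specification", date.today())
print(registry.review_and_install(first, issues_found=False))
```

The model makes one property of the proposal visible: nothing reaches the registry until the review step has produced a result.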


IUCG comments:

  • it would be advisable for the IESG experts on linguistic issues (langtags [RFC 5646], IDNA-Derived Property Value Registry, etc.) to follow the same modus operandi, and for them not to be influential members of the Unicode Consortium, so as to permit a real dialog between the IETF and Unicode.
  • there is a need for a user-based filtering, encoding, and anti-homograph protection algorithm. The IUCG suggests that this algorithm should be semiotic and be derived from the non-confusability of the sign symbols related to code points. It would consist of a graph-code table of all the non-confusable graphic symbols used to write texts. Graph-codes would be qualified by the acknowledged scripts in order to correspond to one or several code points. Registration comparisons could then be carried out at the graph-code level.


The PRECIS Handling Internationalized Strings in Protocols

Application protocols that make use of Unicode code points in protocol strings need to prepare such strings in order to perform comparison operations (e.g., for purposes of authentication or authorization). In general, this problem has been labeled the "preparation and comparison of internationalized strings" or "PRECIS".


PRECIS Problem statement

Problem Statement

  • Blobclass is useful for strings that are not to be processed at all; this distinguishes it from freeclass.
  • The table in Appendix A is useful.
  • The next step is to include the stringprep reference reviews.

Framework

  • SecretClass should allow symbols and punctuation, and clear guidance will be helpful. Suggestion to have an IETF BCP on i18n passwords. Given that the base class might be used by many IETF protocols, the allowed set of code points should be defined by the security community.
  • The Bidi rule refers to RFC 5893.
  • The usefulness of subclassing is up to PRECIS customers.
  • The chairs should assign someone to research mappings and get some general answers.
  • For confusables, the document should explicitly mention registration policy.
  • The issue of multiple versions of Unicode in different clients, and what to do about it, should be written down somewhere.
  • The charter talks about guidance. Text might need to be moved between the framework and guidance documents.
  • Consensus on adopting the framework document as a WG document.
  • The next step is to do a new revision as draft-ietf-precis and then forward it to the security community for input on SecretClass.


The PRECIS Framework

A specification that uses this framework either can directly use the base string classes defined in this document or can subclass the base string classes as needed. This framework uses an approach similar to that of the revised internationalized domain names in applications (IDNA) technology (RFC 5890, RFC 5891, RFC 5892, RFC 5893, RFC 5894) and thus adheres to the high-level design goals described in RFC 4690, albeit for PRECIS technologies.

1. Define a small set of base string classes appropriate for common application protocol constructs such as usernames, passwords, and free-form identifiers.

2. Define each base string class in terms of Unicode code points and their properties, specifying whether each code point or character category is valid, disallowed, or unassigned.

3. Enable application protocols to subclass the base string classes, mainly to disallow particular code points that are currently disallowed in the protocol (e.g., characters with special or reserved meaning, such as "@" and "/" when used as separators within identifiers).

4. Leave various mapping operations (e.g., case preservation or lowercasing, Unicode normalization, right-to-left characters) as the responsibility of application protocols, as was done for IDNA2008 via [RFC5895].

It is expected that this framework will yield the following benefits:

  • Application protocols will be more version-agile with regard to the Unicode database.
  • Implementers will be able to share code point tables and software code across application protocols, most likely by means of code libraries.
  • End users will be able to acquire more accurate expectations about the code points that are acceptable in various contexts. Given this more uniform set of string classes, it is also expected that copy/paste operations between software implementing different application protocols will be more predictable and coherent.

Although this framework is similar to IDNA2008 and borrows some of the character categories defined in [RFC5892], it defines additional string classes and character categories to meet the needs of common application protocols.

String Classes

IDNA2008 essentially defines a base string class of "internationalized domain name" to prepare domain names and hostnames, although it does not use the term "string class".

The following additional base string classes [are proposed] for use in application protocols:

NameClass

a sequence of letters, numbers, and symbols that is used to identify or address a network entity such as a user, an account, a venue (e.g., a chatroom), an information source (e.g., a data feed), or a collection of data (e.g., a file).

SecretClass

a sequence of letters, numbers, and symbols that is used as a secret for access to some resource on a network (e.g., a password or passphrase).

FreeClass

a sequence of letters, numbers, symbols, spaces, and other code points that is used for more expressive purposes in an application protocol (e.g., a free-form identifier such as a human-friendly nickname in a chatroom).


Each string class is defined by the following behavioral rules:

  • Valid: defines which code points and character categories are treated as valid input to preparation of the string.
  • Disallowed: defines which code points and character categories are treated as disallowed during preparation of the string.
  • Unassigned: defines application behavior in the presence of code points that are unassigned, i.e. unknown for the version of Unicode the application is built upon.
  • Directionality: defines application behavior in the presence of code points that have directionality, in particular right-to-left code points as defined in the Unicode database (see [UAX9]).
  • Casemapping: defines if case mapping is used for this class, and how the mapping is done.
  • Normalization: defines which Unicode normalization form (D, KD, C, or KC) is to be applied (see [UAX15]).

[The suggestion is that valid, disallowed, and unassigned rules are common]. Application protocols that use [these] string classes [would have to] define the directionality, casemapping, and normalization rules.
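As a minimal sketch of how these rules could fit together, the Python fragment below models one string class as a small profile object. The class name `StringClassProfile`, the `letters_and_digits` rule, the choice to reject unassigned code points, and the order of case folding before normalization are all assumptions made for illustration; none of them is taken from the framework. The directionality rule is left out for brevity.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StringClassProfile:
    """Hypothetical profile: the per-protocol choices for one string class."""
    name: str
    normalization: str               # "NFD", "NFKD", "NFC", or "NFKC"
    casemap: bool                    # True: fold case during preparation
    is_valid: Callable[[str], bool]  # Valid/Disallowed rule for one code point

    def prepare(self, text: str) -> str:
        for ch in text:
            # Unassigned rule (assumed here: reject). Category "Cn" means the
            # code point is unknown to the Unicode version of this Python build.
            if unicodedata.category(ch) == "Cn":
                raise ValueError(f"unassigned code point U+{ord(ch):04X}")
            # Valid/Disallowed rule, delegated to the profile.
            if not self.is_valid(ch):
                raise ValueError(f"disallowed code point U+{ord(ch):04X}")
        # Casemapping rule, then Normalization rule (order is an assumption).
        if self.casemap:
            text = text.casefold()
        return unicodedata.normalize(self.normalization, text)


def letters_and_digits(ch: str) -> bool:
    """Illustrative validity rule: Unicode letters (L*) and numbers (N*) only."""
    return unicodedata.category(ch)[0] in ("L", "N")


# A hypothetical NameClass profile for some application protocol.
name_profile = StringClassProfile("NameClass", "NFKC", True, letters_and_digits)
print(name_profile.prepare("Juliet"))
```

A protocol that wanted case-preserving secrets could, under the same assumptions, declare a second profile with `casemap=False` and a broader validity rule.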


Andrew Sullivan's wise suggestion (091104 16:48)

"I think you're right that it'd be nice to put together a set of guidelines for what people ought to do when operating IDNA-aware zones. There are several sets of considerations that need to be taken into account, and such advice ought indeed to be offered."

"The IDNABIS WG is not the forum that should generate that advice: hardly anybody here is a DNS operator. I in fact previously offered (in a fit of insanity) to put together a -00 draft outlining such advice. I think it should probably be taken to DNSOP to see if there's any interest. But the response to me (one with which I have considerable sympathy) was that it'd be crazy to try to produce operational suggestions for a protocol that hasn't made it out of the IETF's process yet."


Inter-label/Cross-label testing - Vint Cerf: 091104 11:11

the question of inter-label or cross-label testing was extensively discussed on the WG list and rejected as overly complex at the protocol level. As with a number of cases, the WG concluded that the registry or registrar had to be cognizant of this kind of anomaly and reject problematic registration requests.


Is there a DYN DNS related issue?

  • John Klensin 20091025: If dynamic update is configured to require that an RRSET (from the viewpoint of IDNA, a label) is already present, then one has a lookup situation. If it is configured for the "RRSET does not exist" or "Name not in use" cases, then one has a registration situation. That said, my personal recommendation would be to use the more conservative Registration rules any time one is going to start modifying DNS zones rather than simply looking something up. But the WG has not discussed this topic. If people are convinced that something must be said on the subject, we will need to have that discussion.
  • Bernard Aboba 20091026: I do think that something needs to be said about this, since the issue has come up in implementation. For example, based on the distinction above, a client handling a dynamic update on its own using TKEY would implement the lookup protocol, whereas a DHCP server handling a dynamic update on behalf of the client might implement the registration protocol.
  • Vint Cerf 20091026: another alternative would be for you to issue an informational RFC about interpretation of IDNA 2008 in light of DYN DNS, would it not?
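The distinction drawn in this exchange can be written down as a small decision function. This is only a sketch of John Klensin's description, not an agreed rule (the WG had not discussed the topic); the names `Prerequisite` and `idna_rules_for_update`, and the `conservative` flag that models his personal recommendation, are hypothetical.

```python
from enum import Enum


class Prerequisite(Enum):
    """Dynamic-update prerequisites named in the discussion above."""
    RRSET_EXISTS = "RRSET exists"
    RRSET_DOES_NOT_EXIST = "RRSET does not exist"
    NAME_NOT_IN_USE = "Name not in use"


def idna_rules_for_update(prereq: Prerequisite, conservative: bool = False) -> str:
    """Return which IDNA rule set would apply: "lookup" or "registration"."""
    if conservative:
        # Personal recommendation quoted above: use the Registration rules
        # whenever a zone is going to be modified.
        return "registration"
    if prereq is Prerequisite.RRSET_EXISTS:
        # The label is already present, so this is a lookup situation.
        return "lookup"
    # The label is about to be created, so this is a registration situation.
    return "registration"


for p in Prerequisite:
    print(p.value, "->", idna_rules_for_update(p))
```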


Bidi related issues

(under preparation, specialised content welcome).

For information: Bidi related issues discussed at the IETF/WG/IDNAbis LC.


Variants

Variants merit an action section of their own because the matter is common to every human interface of digital technology. This is a mail from JFC Morfin on the ICANN/WG/VIP mailing list. It introduces the post-IDNA2008 variants multilinguistics issues to address.

I wish we would first clearly and commonly understand where we are and what we are currently doing before we try doing anything else. The Internet is a communication process that aims at permitting every human and machine to digitally relate together. As such, it is the most complex system ever built by humans, and the first one that has attained a universal nature.

The target

VIP wants to create some new order in the use of this system by replacing supposed bijective resolution/registration relations (one name -> one IP) with surjective relations (several variant names -> one IP). I say "supposed" because the DNS system is already surjective (the same IP can support several hosts, as with HTTP/1.1). This means that the bijection is today "one name -> one IP + one name -> one host". If we want to be complete, the communication of multicast addressing is intensely injective (one name -> multiple hosts).

IDNA issues

We also have an additional problem, which is IDNA. IDNA introduces four major issues:

  • punycode does not transport the characteristics of a variance (what makes the variant equivalent) into ASCII. The impact of this has not been studied yet, to my knowledge, in terms of security and of the certain identification of the destination.
  • punycode is not complete. This is due to the lack of a definition at this time of the metadata injection method. This method is necessary for supporting, for example, French majuscules, which may or may not lead to a transliteration in uppercase.
  • IDNA is an incorrect architecture on the user side that has to be changed. This is because it is defined as being supported at the application level. On the client side, several applications with different versions or parameters may, therefore, resolve different "address+domain-name"s. On the host side one becomes dependent on the distant application architecture and one does not know for sure (otherwise, this is a VPN) what may happen on the User side. Anyway, the relation becomes: "one out of several names->client punycode -> server-punycode -> IP + one out of several names -> host -> application". Sometimes the dichotomy host/application will be reduced but we have to live with it for now and be sure that it does not introduce too many discrepancies or security risks.
  • IDNA is based upon Unicode. IDNA2008 has reduced the impact of the use of Unicode and of its versioning. However, it has not eliminated the noise and limitations and constraints introduced by the use of a middle foreign system. "Foreign" in the sense that ISO 10646 was not designed to support IDNA. This means that the relation now actually becomes: "one out of several names->unicode->client punycode -> server-punycode->unicode -> IP + one out of several names -> host -> application"
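The point about majuscules can be observed with a short round trip. The sketch below relies on Python's built-in `idna` codec, which implements the earlier IDNA2003 rules (RFC 3490) rather than IDNA2008; it is used here only because it ships with the language. Under those rules the capital letter is mapped to lowercase before Punycode is applied, so nothing in the ASCII form records that the user typed a majuscule. The helper name `to_ascii_and_back` is hypothetical.

```python
from typing import Tuple


def to_ascii_and_back(name: str) -> Tuple[bytes, str]:
    """Encode a name to its ASCII-compatible form, then decode it again."""
    ace = name.encode("idna")          # IDNA2003 ToASCII, label by label
    return ace, ace.decode("idna")     # IDNA2003 ToUnicode


ace, restored = to_ascii_and_back("École.fr")
print(ace)        # ASCII form carrying the "xn--" prefix
print(restored)   # what comes back after the round trip

# The capitalized and lowercase spellings collapse to the same ASCII form.
print(ace == "école.fr".encode("idna"))
```

Any information about the original case would therefore have to travel outside the encoded label, which is the metadata question raised above.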

IPv6

We have another important problem, which is IPv6. IPv6 provides each Internet user with:

  • a way to be independently called.
  • more IIDs (second part of the IPv6 address, that for clarity I name IDv6) than the whole existing Internet number of IP addresses. It is, therefore, possible that every user scales his/her naming scheme accordingly. There is no technical restriction to that; it is just a matter of the database size on his/her PC. Plug and Play will most probably result in such weird local name-spaces populated by different SDOs with their own possible support of variants. This should lead ICANN to publish variant support rules in a way that other SDOs can use and adapt-- and adopt a strategy that supports the transition to such a brave new naming world.

The information theory

All this is obviously subject to the information theory and to the algorithmic information theory http://en.wikipedia.org/wiki/Algorithmic_information_theory that takes into account that domain-names are information to processes and people. Let’s look at the issue as a general issue for the general DNS family of systems: DDDS. http://en.wikipedia.org/wiki/Dynamic_Delegation_Discovery_System.

The DDDS should be reversible, like the DNS. Do we want, and how do we make, such systems to be transparently reversible to variants? This means, if a variant is entered and results in an IP+host+application, how do we make sure that the reversion (reverse process operation) may not result in another variant? This calls for some additional implicit, passive, referent or active metadata (i.e. in the copper, in the header, in the context, or in the system intelligent dynamic). Our chain architecturally becomes:

"one out of several names->metadata->unicode->client punycode -> server-punycode->unicode -> metadata -> IP + one out of several names -> host -> application"

Linguistic and semiotic issues

Then, there are morphological, semantic, and pragmatic issues to be considered by the linguists (e.g. cf. Raymond). This is not a small task, but it has to be carried out within the framework I describe.

Multilinguistics issues

Then, we have the multilinguistic problem of homography, i.e. finding a canonicalization algorithm to prevent the signs of a script used by a language to be confusable with signs used by the script of another language. We started from linguistic diversity and its implications and we have to control what we decide against the consequences on linguistic mutuality in the linguistic ecosystem.
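To make the homography problem concrete, the sketch below flags a label whose letters come from more than one script. It is a crude heuristic, stated as an assumption: it takes the first word of each character's Unicode name as the script, whereas a real implementation would use the Unicode Script property and the confusables data. The function name `scripts_used` is hypothetical.

```python
import unicodedata


def scripts_used(label: str) -> set:
    """Approximate the set of scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN" (heuristic, not the
            # real Script property).
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


latin = "paypal"
mixed = "p\u0430ypal"   # second letter is CYRILLIC SMALL LETTER A

print(latin == mixed)          # the strings differ although they look alike
print(scripts_used(latin))
print(scripts_used(mixed))
```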

Reasoning methodology

Now, what do we have that will enable us to discuss this? We have seven fundamental concepts that we can define "à la" Gregory Bateson:

  • data: the differences necessary for a process.
  • information: the differences that make a difference (Bateson).
  • variants: the differences that make no difference.
  • canonicalization: reducing the unnecessary differences.
  • consistency: the differences do not conflict.
  • protocol: what document data interchanges.
  • languages: human communication protocols.
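Three of these concepts (variants, canonicalization, consistency) translate directly into code. The sketch below is illustrative only: the two-entry `VARIANT_TABLE`, the `VariantRegistry` class, and the example names are hypothetical, and the addresses come from the documentation range 192.0.2.0/24.

```python
# Hypothetical variant table: differences that "make no difference".
VARIANT_TABLE = {"國": "国", "臺": "台"}


def canonicalize(name: str) -> str:
    """Canonicalization: reduce the unnecessary differences."""
    return "".join(VARIANT_TABLE.get(ch, ch) for ch in name.lower())


class VariantRegistry:
    """Maps every variant of a name to one address (several names -> one IP)."""

    def __init__(self) -> None:
        self._by_canonical = {}

    def register(self, name: str, address: str) -> None:
        key = canonicalize(name)
        existing = self._by_canonical.get(key)
        # Consistency: variants of the same name must not conflict.
        if existing is not None and existing != address:
            raise ValueError(f"{name!r} conflicts with an existing variant")
        self._by_canonical[key] = address

    def resolve(self, name: str) -> str:
        return self._by_canonical[canonicalize(name)]


registry = VariantRegistry()
registry.register("中國.example", "192.0.2.1")
print(registry.resolve("中国.example"))   # a variant resolves to the same address
```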

This means that every other notion that we may need (glossary [I fully agree with Raymond here]) has to be referenced in relation to new concepts that we first have to accept as pertinent and coherent with the seven master concepts above.

Why so? Because we need to ensure that we do not introduce any flaws (logic, security, etc.) to the reasoning and consequences. This is based upon RFC 1958 (we are to be ready for every possible "change" - here, a new kind of variant) and RFC 3439 (in a very large system, like the Internet, in which its naming is larger than the Internet itself as it may extend to other technologies, the prevalent principle is the principle of simplicity). Reasoning at the conceptual level gives us a better chance to keep things simple and coherent at the operational level.


Glossary

To understand each other, we need to use the same terms with the same meaning. We will probably discover that we have covered most of the work once we have worked out a common meaning for the terms we use. The current status glossary is at the Multilingualization - Glossary.


Pending issues

alias

alternative given name to a domain name (CNAME) or to an IP address (e.g. Host file). RFC 1034: "CNAME identifies the canonical name of an alias". RFC 2181: " It has been traditional to refer to the label of a CNAME record as "a CNAME". This is unfortunate, as "CNAME" is an abbreviation of "canonical name", and the label of a CNAME record is most certainly not a canonical name. It is, however, an entrenched usage. Care must therefore be taken to be very clear whether the label, or the value (the canonical name) of a CNAME resource record is intended. In this document, the label of a CNAME resource record will always be referred to as an alias."

polynym

alternative name of an address in the same or another language.

orthotypography

The term ‘orthotypography’ is seen from the viewpoint of readability. It means the correct use of typographic signs to convey the intended semantic or context. Also known as "Typographical syntax", orthotypography defines the meaning and rightful usage of typographic signs and cases. Orthotypographic rules may vary broadly from language to language, from country to country, etc.

IDNS

IDNA2008 has split the management of the naming space between the core of the network, its periphery (the Internet Use Interface - IUI) and the external users (as exemplified in RFC 5895). This has introduced a subsidiarity based system which is an integrated, intelligent, international extension of the Internet DNS that can be also used outside of the internet, with other technologies as well. It is proposed that the internet part of the whole subsidiary domain name system (SDNS) is named "IDNS". Following this logic
  • "ADN" means ASCII specific (legacy) domain name,
  • "IDN" means regular Internet domain name as supported by IDNA2008,
  • and "UDN" means UTF-8/16/32 User domain name.

presentation layer

In the Open Systems Interconnection (OSI) communications model, the presentation layer ensures that the communications passing through are in the appropriate form for the recipient. For example, a presentation layer program may format a file transfer request in binary code to ensure a successful file transfer. Programs in the presentation layer address three aspects of presentation:
  • Data formats - for example, Postscript, ASCII, or binary formats
  • Compatibility with the host operating system
  • Encapsulation of data into message "envelopes" for transmission through the network
An example of a program that generally adheres to the presentation layer of OSI is the program that manages the Web's Hypertext Transfer Protocol (HTTP). This program, sometimes called the HTTP daemon, usually comes included as part of an operating system. It forwards user requests passed to the Web browser on to a Web server elsewhere in the network. It receives a message back from the Web server that includes a Multi-Purpose Internet Mail Extensions (MIME) header. The MIME header indicates the kind of file (text, video, audio, and so forth) that has been received so that an appropriate player utility can be used to present the file to the user.

zonale

A file listing the parameters related to a TLD or to a domain name zone.

graphcode

Table of graphic symbols used as script characters, independently from the language being supported. It could serve as an algorithm for variants and anti-phishing.

Cybship/Cybcraft

The cybship is an image that is taken of a person with his or her digital environment, just as there are seaships or spaceships. Just as there are watercrafts or spacecrafts, a cybcraft is an image that is taken for a person with the sole online digital processor that he/she is currently using.
Using this paradigm matches the WSIS (World Summit on the Information Society) demand for a "people centered/à caractère humain/centrada en la persona" information society. It helps in thinking about the real nature of the world digital ecosystem, of which the Internet is a major part, as being a complex (i.e. intricated) network obeying the Cosmological principle.
As being the "kubernetes" (commanding officer, pilot, helmsman, lead user, etc. along Plato's original concept) of his or her cybship, each person is to assume his/her e-empowerment and decide about the short-term operations, mid-term governance, and long-term adminance of his/her cybship: people are the free masters and commanders of their digital experience and the center of their own "e-world". That is, if they have the liberty to enact the 1st law of cybernetics: "The unit within the system with the most behavioral responses available to it controls the system"; the danger for them is some other (commercial, financial, technical, political, etc.) environmental, contextual, or foreign influences directly or indirectly providing more behavioral responses. The ship image also helps in understanding that one can sail using the influences and counter influences of units of power, the same as the waves and winds (on sea) or attractions (in space), to steer one's own tack.


Key words for use to Indicate Requirement Levels

RFC 2119 defines words to be used to indicate local requirement levels. ALFA adds the IS/ARE external absolute which is necessary to the IUI (Internet/Intelligent Use Interface) when referring to central Internet MUSTs.

IS

This word, or the term "ARE", means that the definition is an absolute fact beyond the specification reach.

IS NOT

This phrase, or the phrase "ARE NOT", means that the definition is an absolute fact beyond the specification reach.

MUST

This word, or the terms "REQUIRED" or "SHALL", mean that the definition is an absolute requirement of the specification.

MUST NOT

This phrase, or the phrase "SHALL NOT", mean that the definition is an absolute prohibition of the specification.

SHOULD

This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

SHOULD NOT

This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.

MAY

This word, or the adjective "OPTIONAL", mean that an item is truly optional. One vendor may choose to include the item because a particular marketplace requires it or because the vendor feels that it enhances the product while another vendor may omit the same item. An implementation which does not include a particular option MUST be prepared to interoperate with another implementation which does include the option, though perhaps with reduced functionality. In the same vein an implementation which does include a particular option MUST be prepared to interoperate with another implementation which does not include the option (except, of course, for the feature the option provides.)


Layers in supporting the lingual diversity (origin IUTF)

semiotics

the brain to brain protocol set.

Universalization

Only figures from 0 to 9 or to F are being used in protocols.

Lingualization

Strings from a single language are used in protocols.

Plurilingualization

Several languages can be indifferently used by the iusers.

Globalization (proposed by Unicode)

Strings from a single language are used in protocols together with strings of other languages through:
  • internationalization of the media: the characters used by the scripts of these other languages are supported by the medium.
  • localization: adaptation of the protocols to the locale language and orthotypography.
  • taggization: indication of the language being used. It can be used for filtering.

Multilingualization

Every language is technically and socially treated the same.


DNS/IDNS/IUDNS/SDNS (origin IUTF)

There is a need to clarify the distinction between three system extensions that are:

DNS

The domain name system that permits one to manage and resolve ASCII names (DNs) into digital values (IPv4 and IPv6 addresses, phone numbers, radio frequencies, etc.).

IDNS

The International IDNA2008 oriented "International Domain Names" (IDNs) resolution system along the Internet community operational rules.

IUDNS

The Intelligent Use domain name system that permits one to manage and resolve User Domain Names (IUDNs) in any user chosen and documented format.

ML-DNS

This is an architectural concept that encapsulates the DNS, into the IDNS, into the IUDNS. Its purpose is to resolve any DN, IDN, or IUDN string, as part of the DN pile of all its possible polynyms, into its addresses in the world digital ecosystem.

SDNS

This is the generic concept of the world digital ecosystem subsidiary domain name system.


Extended Communications Model (ALFA)

A comprehensive layering of digital communications is still to be completed. This ECOMM reduced (up-down) model can help to model them better.

Telecommunications

"Layer 0". Electrical exchanges (signals) over a plug to plug physical (copper, optics, sounds, etc.) bandwidth.

Datacommunications

  • value-added service (ISO Layer 1 to 7): over network links on an end to end basis in using the data added in the packet header. Logical exchanges of dumb content (what you send is what you receive)
  • edge services: contextual content (what you receive is what you were intended to receive) managed on an edge to edge basis in routing the calls through specialized Opened Pluggable Edge Services. They are supported by two additional layers: the interoperation layers and the "network edge" layer (layer 8 and 9). These two layers are also called the "plugged layers on the user side" (PLUS, and "+" on the server side).

Metacommunications

extended services (what you receive is what the network was commanded to deliver) intelligently managed on a fringe to fringe basis through content embedded or parallel metadata exchanges. They are provided by smart local operational tasks (slots) as layer 10 supporting layer 11 as the facilitation agent. The user is layer 12, relational spaces are layer 13, and layer 14 is the world digital ecosystem (WDE).


Internet architectural principles (IETF, IUTF, ICANN, ALFA)

A few RFCs have documented the Internet architecture, research, and evolution.

Adaptation

[RFC 1958] documents the way the Internet is to be designed in order to adapt to evolution. This is the principle of constant change, perhaps the only principle of the Internet that should survive indefinitely.

Indefinite Growth

[RFC 3439] documents that the unifying principle is best expressed by the Simplicity Principle, which states complexity must be controlled if one hopes to efficiently scale a complex object.

Diversity

Until the IDNA2008 RFC set, diversity was addressed by adding new features to the architecture. This turned out to be an architectural limitation addressed in an "unusual" manner [RFC 5895] through the outside application of the principle of subsidiarity constrained by the necessity to fit the Internet services framework documented by RFCs at the Internet Use Interface (IUI). That IUI can be transparently implemented as:
  • "PLUS" (Pluggable layers on the User side) - usually as a fringe added intelligence.
  • "+" on the Host side (e.g. Google+) - usually as a fringe added organized system.

Evolution

  • The [ICANN/ICP-3] document stated in 2001: "In an ever-evolving Internet, ultimately there may be better architectures for getting the job done where the need for a single, authoritative root will not be an issue. But that is not the case today. And the transition to such an architecture, should it emerge, would require community-based approaches. In the interim, responsible experimentation should be encouraged, but it should not be done in a manner that affects those who do not consent after being informed of the character of the experiment." It was not the case in 2001; it is the case a decade later.

IDNA vs IDNApplication

The main question raised by IDNA is: what guarantees to users that the IDNAPIs in the different applications work the same way and lead to the same resolution? This is addressed by removing the API from applications, which are in this way kept neutral to the User Domain Name encoding, and by running a single IDNApplication that provides a common DNS Front-End IDN Service. Such an IDNS Front-End can support a full multi-layer IDN-Pile of polynyms in the different formats being used on the local or global network.


+-----------------------------------------+
|Host                                     |
|             +-------------+             |
|             | Application |             |
|             +------+------+             |
|                    |                    |
|             +------+------+             |
|             |   Generic   |             |
|             |    Name     |             |
|             |  Resolution |             |
|             |     API     |             |
|             +------+------+             |
|                    |                    |
|   +-----+------+---+--+-------+-----+   |
|   |     |      |      |       |     |   |
| +-+-++--+--++--+-++---+---++--+--++-+-+ |
| |DNS||LLMNR||mDNS||NetBIOS||hosts||...| |
| +---++-----++----++-------++-----++---+ |
|                                         |
+-----------------------------------------+
                     |
               ______|______
              /             \
             /               \
            /      local      \
            \     network     /
             \               /
              \_____________/
                     |
            _________|_________
           /                   \
          /                     \
         /                       \
        |         Internet        |
         \                       /
          \                     /
           \___________________/
                     |
               ______|______
              /             \
             /               \
            /      local      \
            \     network     /
             \               /
              \_____________/
                     |
                     |
+-----------------------------------------+ 
|Host                                     | 
| +-+-++--+--++--+-++---+---++--+--++-+-+ |
| |DNS||LLMNR||mDNS||NetBIOS||hosts||...| | 
| +---++-----++----++-------++-----++---+ |  
|                    |                    | 
|                    | +--------+         | 
|                    | |A-labels|         | 
|                    | +--------+         |
|                    | |U-labels|         |
|                    | +--------+         |
|                    | |...     |         |
|                    | +--------+         |
|                    |                    |
|                    |  IDN Pile          | 
|                    |                    |  
|         +----------+-----------+        |
|         |    IDNApplication    |        |
|         |Name System Front End |        |
|         |        ML-DNS        |        |
|         +----------+-----------+        |
|               /    |    \               |
|            UDNs   UDNs  UDNs            |
|             /      |      \             |
|  +-------------+   |   +-------------+  |
|  | Application |   |   | Application |  |
|  +-------------+   |   +-------------+  |
|             +------+------+             |
|             | Application |             |
|             +------+------+             |
+-----------------------------------------+
(fig. 1) Conceptual model of the IDNS.
  • The ML-DNS is at the border between Internet Datacommunications (metadata are documented in the packet header) and Metacommunications (metadata are documented through metadata registries) through embedded or parallel flows. ML-DNS supports domain name embedded metacommunications syntax. It was initiated by IDNA where the "xn--" header flags the IDNA presentation.
  • variant tables or variant algorithms can be supported at the ML-DNS and result in enlarged IDN Piles tables.
  • IDN Pile tables come by classes and presentations. Applications do not need to know which part of the DNS namespace will be used, since the IDN Pile supports them through the polynyms of the same address resolution. As such, RFC 6055 recommendations (Part 4) and security concerns (Part 5) are believed to be met.
  • the local centralization of the IDNS permits a local duplication of the network and relational space referents (such as the IANA).
  • This seems to address these remarks from Shawn Steele: "I don't think that applications should need to understand how DNS works. (Something of a separation of a business logic concept as probably taught in, like, CS101 - "Don't make your app know more than it has to.") IMO, it'd be nice if app developers that need to open a connection to a server had all the Punycode ugliness layered away by some nice set of DNS APIs, or even a higher level at the open connection APIs or whatever". "I realize there's an "A" in "IDNA", but if every app has to do punycode conversion themselves there's going to be tons of odd inconsistencies in what they're doing. It could also mean "tweaking" thousands of apps if the IDNA20xx rules change a little. (E.g.: like the bidi rules did this go around). The good thing about Punycode/IDN is that it enabled DNS. The bad thing is that suddenly any network app needs to become a DNS expert."
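The idea of a single front end shared by all local applications can be sketched as follows. Everything in this fragment is hypothetical: the class name `IdnPileFrontEnd`, its methods, and the example names are invented for illustration, and the A-label conversion uses Python's built-in `idna` codec, which follows IDNA2003 rather than IDNA2008. The point is only that applications pass whatever form of the name they hold and never perform the conversion themselves.

```python
from typing import Iterable, Optional


class IdnPileFrontEnd:
    """Hypothetical local front end holding piles of polynyms per address."""

    def __init__(self) -> None:
        self._address_by_name = {}

    @staticmethod
    def _forms(name: str) -> set:
        """All presentations of one polynym that the front end will accept."""
        a_form = name.encode("idna").decode("ascii")    # A-label presentation
        u_form = a_form.encode("ascii").decode("idna")  # U-label presentation
        return {name.lower(), a_form, u_form}

    def add_pile(self, address: str, polynyms: Iterable[str]) -> None:
        """Register every presentation of every polynym for one address."""
        for name in polynyms:
            for form in self._forms(name):
                self._address_by_name[form] = address

    def resolve(self, name: str) -> Optional[str]:
        """Called by applications with whatever form of the name they hold."""
        return self._address_by_name.get(name.lower())


front_end = IdnPileFrontEnd()
front_end.add_pile("192.0.2.10", ["bücher.example", "buecher.example"])

# Three applications holding three different forms get the same answer.
for held in ("Bücher.example", "xn--bcher-kva.example", "buecher.example"):
    print(held, "->", front_end.resolve(held))
```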


IAB RFC 6055 definitions

Domain Name

A domain name consists of a sequence of labels, conventionally written separated by dots.

IDN

An IDN is a domain name that contains one or more labels that, in turn, contain one or more non-ASCII characters. Just as with plain ASCII domain names, each IDN label must be encoded using some mechanism before it can be transmitted in network packets, stored in memory, stored on disk, etc. These encodings need to be reversible, but they need not store domain names the same way humans conventionally write them on paper. For example, when transmitted over the network in DNS packets, domain name labels are *not* separated with dots.
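The remark that labels are not separated by dots on the wire can be illustrated with a short encoder. In DNS messages each label is preceded by a one-octet length and the name ends with the zero-length root label. The function name `to_wire_format` is hypothetical, and the sketch handles only ASCII labels and skips the overall name length limit.

```python
def to_wire_format(name: str) -> bytes:
    """Encode a dotted ASCII domain name as length-prefixed DNS labels."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 1 <= len(raw) <= 63:
            raise ValueError(f"label length out of range: {label!r}")
        out.append(len(raw))   # one length octet per label
        out += raw
    out.append(0)              # zero-length root label terminates the name
    return bytes(out)


wire = to_wire_format("www.example.com")
print(wire)
```

The dots of the written form appear nowhere in the result; they are a presentation convention only.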

IDNA

Internationalized Domain Names for Applications (IDNA) is the standard that defines the use and coding of internationalized domain names for use on the public Internet [RFC 5890]. An earlier version of IDNA [RFC 3490] is now being phased out. Except where noted, the two versions are approximately the same with regard to the issues discussed in this document. However, some explanations appeared in the earlier documents that were no longer considered useful when the later revision was created; they are quoted here from the documents in which they appear. In addition, the terminology of the two versions differ somewhat; this document reflects the terminology of the current version.

UNICODE

Unicode [Unicode] is a list of characters (including non-spacing marks that are used to form some other characters), where each character is assigned an integer value, called a code point. In simple terms a Unicode string is a string of integer code point values in the range 0 to 1,114,111 (10FFFF in base 16). These integer code points must be encoded using some mechanism before they can be transmitted in network packets, stored in memory, stored on disk, etc. Some common ways of encoding these integer code point values in computer systems include UTF-8, UTF-16, and UTF-32. In addition to the material below, those forms and the tradeoffs among them are discussed in Chapter 2 of The Unicode Standard.

UTF-8

UTF-8 is a mechanism for encoding a Unicode code point in a variable number of 8-bit octets, where an ASCII code point is preserved as-is.
Those octets encode a string of integer code point values, which represent a string of Unicode characters. The authoritative definition of UTF-8 is in Sections 3.9 and 3.10 of The Unicode Standard, but a description of UTF-8 encoding can also be found in RFC 3629. Descriptions and formulae can also be found in Annex D of ISO/IEC 10646-1.

UTF-16

UTF-16 is a mechanism for encoding a Unicode code point in one or two 16-bit integers, described in detail in Sections 3.9 and 3.10 of The Unicode Standard [Unicode]. A UTF-16 string encodes a string of integer code point values that represent a string of Unicode characters.

UTF-32

UTF-32 (formerly UCS-4), also described in Sections 3.9 and 3.10 of The Unicode Standard [Unicode], is a mechanism for encoding a Unicode code point in a single 32-bit integer. A UTF-32 string is thus a string of 32-bit integer code point values, which represent a string of Unicode characters. Note that UTF-16 results in some all-zero octets when code points occur early in the Unicode sequence, and UTF-32 always has all-zero octets.
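The three encoding forms can be compared side by side with Python's standard codecs. This is a small illustrative sketch; the helper name `encoded_sizes` and the sample characters are arbitrary choices, and the big-endian codec variants are used so that no byte order mark is added.

```python
import sys


def encoded_sizes(ch: str) -> tuple:
    """Return the number of octets used by UTF-8, UTF-16, and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )


for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# ASCII is preserved as-is in UTF-8, while UTF-16 and UTF-32 pad it with
# all-zero octets.
print("a".encode("utf-8"), "a".encode("utf-16-be"), "a".encode("utf-32-be"))

# The highest code point is 10FFFF (base 16).
print(hex(sys.maxunicode))
```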

Valid labels

IDNA specifies validity of a label, such as what characters it can contain, relationships among them, and so on, in Unicode terms. Valid labels can be in either "U-label" or "A-label" form, with the appropriate one determined by particular protocols or by context.

U-label form

U-label form is a direct representation of the Unicode characters using one of the encoding forms discussed above. This document discusses UTF-8 strings in many places. While all U-labels can be represented by UTF-8 strings, not all UTF-8 strings are valid U-labels (see Section 2.3.2 of the IDNA Definitions document [RFC 5890] for a discussion of these distinctions).

A-label form

A-label form uses a compressed, ASCII-compatible encoding (an "ACE" in IDNA and other terminology) produced by an algorithm called Punycode. U-labels and A-labels are duals of each other: transformations from one to the other do not lose information. The transformation mechanisms are specified in the IDNA Protocol document [RFC5891].

Punycode

Punycode [RFC 3492] is thus a mechanism for encoding a Unicode string in an ASCII-compatible encoding, i.e., using only letters, digits, and hyphens from the ASCII character set. When a Unicode label that is valid under the IDNA rules (a U-label) is encoded with Punycode for IDNA purposes, it is prefixed with "xn--"; the result is called an A-label. The prefix convention assumes that no other DNS labels (at least no other DNS labels in IDNA-aware applications) are allowed to start with these four characters. Consequently, when A-label encoding is assumed, any DNS labels beginning with "xn--" now have a different meaning (the Punycode encoding of a label containing one or more non-ASCII characters) or no defined meaning at all (in the case of labels that are not IDNA-compliant, i.e., are not well-formed A-labels).
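The relationship between Punycode and the "xn--" prefix can be shown with Python's built-in `punycode` codec. This is a sketch under a stated limitation: the helper functions `to_a_label` and `to_u_label` are hypothetical and perform none of the IDNA validity checks, so their output is a well-formed A-label only when the input is already a valid U-label.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Punycode-encode a non-ASCII label and add the ACE prefix (no validation)."""
    if u_label.isascii():
        return u_label                      # plain ASCII labels are left alone
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Strip the ACE prefix and Punycode-decode the remainder (no validation)."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


a_label = to_a_label("bücher")
print(a_label)                 # ASCII only, beginning with "xn--"
print(to_u_label(a_label))     # the transformation loses no information
print(to_a_label("example"))   # unchanged
```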

ISO-2022-JP

ISO-2022-JP [RFC1468] is a mechanism for encoding a string of ASCII and Japanese characters, where an ASCII character is preserved as-is.


ISO-2022-JP is stateful: special sequences are used to switch between character coding tables. As a result, if there are lost or mangled characters in a character stream, it is extremely difficult to recover the original stream after such a lost character encoding shift.
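The statefulness can be illustrated with the `iso2022_jp` codec in the Python standard library. This is a sketch only: the sample text and the way the stream is damaged (dropping the first three octets) are assumptions chosen for illustration.

```python
text = "日本語"
encoded = text.encode("iso2022_jp")

# The stream opens with ESC $ B (switch to a Japanese table) and
# closes with ESC ( B (switch back to ASCII).
print(encoded)

# Simulate losing the opening shift sequence in transit.
damaged = encoded[3:]
garbled = damaged.decode("iso2022_jp", errors="replace")

print(encoded.decode("iso2022_jp"))  # intact stream decodes to the original text
print(garbled)                       # damaged stream is read in the wrong table
```

Once the shift sequence is gone, the remaining octets are interpreted in whatever table was previously active, so the decoder has no way to tell that the result is wrong.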

Comparison of Unicode strings

Comparison of Unicode strings is not as easy as comparing ASCII strings. First, there are a multitude of ways to represent a string of Unicode characters. Second, in many languages and scripts, the actual definition of "same" is very context-dependent. Because of this, comparison of two Unicode strings must take into account how the Unicode strings are encoded. Regardless of the encoding, however, comparison cannot simply be done by comparing the encoded Unicode strings byte by byte. The only time that is possible is when both strings have been mapped into the same canonical form and encoded in the same way.
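A small sketch of the point about canonical forms, using `unicodedata.normalize` from the Python standard library. The helper `same_text` and the choice of NFC are assumptions for illustration; canonical equivalence is only one of several possible definitions of "same".

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after mapping both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "caf\u00e9"   # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

print(precomposed == decomposed)                                  # raw comparison
print(precomposed.encode("utf-8") == decomposed.encode("utf-8"))  # byte comparison
print(same_text(precomposed, decomposed))                         # normalized comparison
```

The two strings display identically, yet only the normalized comparison treats them as equal.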

Encodings

In 1996 the IAB sponsored a workshop on character sets and encodings [RFC 2130]. This document adds to that discussion and focuses on the importance of agreeing on a single encoding and how complicated the state of affairs ends up being as a result of using different encodings today.
Different applications, APIs, and protocols use different encoding schemes today. Many of them were originally defined to use only ASCII. Internationalizing Domain Names in Applications (IDNA) [RFC 5890] defines a mechanism that requires changes to applications, but in an attempt not to change APIs or servers, specifies that the A-label format is to be used in many contexts. In some ways this could be seen as not changing the existing APIs, in the sense that the strings being passed to and from the APIs are still apparently ASCII strings. In other ways it is a very profound change to the existing APIs, because while those strings are still syntactically valid ASCII strings, they no longer mean the same thing that they used to. What looks like a plain ASCII string to one piece of software or library could be seen by another piece of software or library (with the application of out-of-band information) to be in fact an encoding of a Unicode string.


ICANN/VIP propositions

Abstract Character

A unit of information used for the organization, control, or representation of textual data. (Unicode Standard, section 3.4, D7)
  • When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, aural or visual). Examples of such symbolic data include letters, ideographs, digits, punctuation, technical symbols, and dingbats.
  • An abstract character has no concrete form and should not be confused with a glyph.
  • An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme.
  • The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters.
  • Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences.

A-label

An ASCII-Compatible Encoding form of an IDNA-valid string. It must be a complete label: IDNA is defined for labels, not for parts of them and not for complete domain names. This means, by definition, that every A-label will begin with the IDNA ACE prefix, "xn--", followed by a string that is a valid output of the Punycode algorithm (RFC 3492) and hence a maximum of 59 ASCII characters in length. The prefix and string together must conform to all requirements for a label that can be stored in the DNS, including conformance to the rules for LDH labels (see RFC 5890, Section 2.3.1). If and only if a string meeting the above requirements can be decoded into a U-label is it an A-label. (RFC 5890)
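The structural part of this definition can be sketched as a simple check: the 59-character limit follows from the 63-octet DNS label limit minus the four-character prefix. The function name `looks_like_a_label` is hypothetical, and the check covers only the prefix, the length, and the LDH syntax; it does not attempt the decisive step of decoding the string into a valid U-label.

```python
import re

ACE_PREFIX = "xn--"
MAX_LABEL_OCTETS = 63  # DNS label limit; 63 - len("xn--") = 59

# Letters, digits, and hyphens only, with no leading or trailing hyphen.
_LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def looks_like_a_label(label: str) -> bool:
    """Necessary (not sufficient) structural conditions for an A-label."""
    return (
        label.lower().startswith(ACE_PREFIX)
        and len(ACE_PREFIX) < len(label) <= MAX_LABEL_OCTETS
        and _LDH.fullmatch(label) is not None
    )


print(looks_like_a_label("xn--bcher-kva"))
print(looks_like_a_label("xn--" + "a" * 60))
```

A string that passes this check may still fail to be an A-label, because the Punycode part might not decode to an IDNA-valid U-label.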

Allocation

In a DNS context, the first step on the way to Delegation. A registry (the parent side) is managing a zone. The registry makes an administrative association between a string and some entity that requests the string, making the string a label inside the zone, and a candidate for delegation. Allocation does not affect the DNS itself at all.

Assigned Code Point

A mapping from an Abstract Character to a particular Code Point in the code space. See Unicode Standard, section 2.4. Not to be confused with Valid Code Point.

Character Variant

In a Language Variant Table, a second list of Code Points corresponding to each Valid Code Point and providing possible substitutions for it. Unlike the Preferred Variants, substitutions based on Character Variants are normally reserved but not actually registered (or "activated"). Character Variants appear in column 3 of the Language Variant Table. The term "Code Point Variants" is used interchangeably with this term. (RFC 3743)

Character Variant Label

A U-label generated from a Fundamental Label by use of Character Variants. The Character Variant Label must contain at least one Character Variant, but need not contain all the Character Variants possible for the Fundamental Label. This definition differs from that in RFC 3743 by specifying “U-label” rather than “label”.

Code Point

A value in the Unicode code space. The meaning here is restricted to meaning D10 in the Unicode Standard, section 3.4.

Delegation

In a DNS context, the act of entering parent-side NS (nameserver) records in a zone, thereby creating a subordinate namespace with its own SOA (start of authority) record. See RFC 1034 for detailed discussion of how the DNS name space is broken up into zones.

Fundamental Label

A U-label that consists only of Valid Code Points. In practice, this is the U-label requested to be registered.

Fundamental TLD

The Fundamental Label form of a Variant TLD Set.

IDNA Symmetry Constraint

A-label/U-label transformation must be symmetric: an A-label A1 must be capable of being produced by conversion from a U-label U1, and that U-label U1 must be capable of being produced by conversion from A-label A1. (RFC 5890)
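The round trip implied by this constraint can be sketched with the `punycode` codec from the Python standard library. The helper names are hypothetical, and the sketch checks only the round trip and the presence of a non-ASCII character, not the remaining IDNA validity rules.

```python
ACE_PREFIX = "xn--"


def a_to_u(a_label: str) -> str:
    """Decode the Punycode part of an A-label candidate."""
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def u_to_a(u_label: str) -> str:
    """Encode a U-label candidate and add the ACE prefix."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def round_trips(a_label: str) -> bool:
    """True if A -> U -> A returns exactly the starting string."""
    if not a_label.startswith(ACE_PREFIX):
        return False
    try:
        u_label = a_to_u(a_label)
        return not u_label.isascii() and u_to_a(u_label) == a_label
    except UnicodeError:
        return False


print(a_to_u("xn--bcher-kva"))
print(round_trips("xn--bcher-kva"))
```

A string that decodes but re-encodes to something different fails the constraint and is therefore not an A-label, even though it starts with "xn--".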

Language Character Repertoire

A set of Code Points identified by some identifier (such as a tag for identifying language as defined in RFC 5646). The definition of the Language Character Repertoire is ideally performed in a way appropriate to some community of language users, and might colloquially be understood as “the characters used to write a language”. In most cases, all the Code Points in a Language Character Repertoire will come from the same Script Table.

Language Variant Table

A three-column table for each Language Character Repertoire permitted to be registered in a zone. The columns are known, respectively, as "Valid Code Point", "Preferred Variant", and "Character Variant", which are defined separately. (This definition differs from RFC 3743 in the substitution of Language Character Repertoire for “language”.) Note that in the rest of this document "Table" and "Variant Table" are not used as short forms for Language Variant Table, as they are in RFC 3743. Note also that it is logically possible that a U-label would be consistent with more than one Language Variant Table. What to do in such a case is a matter of registry policy.

Preferred Variant

In a Language Variant Table, a list of Code Points corresponding to each Valid Code Point and providing possible substitutions for it. These substitutions are "preferred" in the sense that the variant labels generated using them are normally registered in the zone file, or "activated." The Preferred Code Points appear in column 2 of the Language Variant Table. "Preferred Code Point" is used interchangeably with this term. (RFC 3743)

Preferred Variant Label

A U-label generated by use of Preferred Variants. The Preferred Variant Label must contain at least one Preferred Variant, but need not contain all the Preferred Variants possible for the Fundamental Label. This definition differs from that in RFC 3743 by specifying “U-label” rather than “label”.

Preferred Variant TLD

The Preferred Variant Label form(s) of a Variant TLD Set.

Reserved Variant TLD

The Character Variant Label form(s) of a Variant TLD Set.

Script Table

A Script Table is a table of Unicode Code Points all having the same script property value. See Unicode Standard Annex #24.

U-label

An IDNA-valid string of Unicode Code Points, in Normalization Form C (NFC) and including at least one non-ASCII character, expressed in a standard Unicode Encoding Form (such as UTF-8). It is also subject to the constraints about permitted characters that are specified in Section 4.2 of RFC 5891 and the rules in Sections 2 and 3 of RFC 5892, the Bidi constraints in RFC 5893 if it contains any character from scripts that are written right to left, and the IDNA Symmetry Constraint. (RFC 5890)

Valid Code Point

In a Language Variant Table, the list of Code Points that is permitted at registration time for that language. Any other Code Points, or any string containing them, will be rejected. The Valid Code Point list appears as the first column of the Language Variant Table. (RFC 3743) Note that Valid Code Points are always both Assigned Code Points and Variant Members.

Variant Character Collection

All the characters listed in a single row of a Language Variant Table, as any of Valid Code Point, Preferred Variant, or Character Variant. (RFC 3743) It is important to recognize that the relationship may not be reciprocal (that is, if foo is a Valid Code Point and bar is a Character Variant, that does not mean that foo is a Character Variant for Valid Code Point bar).

Variant Label Set

A set of U-labels consisting of one Fundamental Label, zero or more Preferred Variant Labels, and zero or more Character Variant Labels.
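The relationship between a Language Variant Table and a Variant Label Set can be sketched as follows. Everything here is an illustrative assumption: the two table rows are hypothetical and do not come from any registry's table, and the classification rule (a label with at least one Character Variant substitution counts as a Character Variant Label, and a label with only Preferred Variant substitutions counts as a Preferred Variant Label) is one simple reading of the definitions above.

```python
from itertools import product

# Hypothetical rows: Valid Code Point -> (Preferred Variants, Character Variants).
VARIANT_TABLE = {
    0x56FD: ((0x570B,), (0x56EF,)),
    0x4E2D: ((), ()),
}


def variant_label_set(fundamental: str, table=VARIANT_TABLE) -> dict:
    """Generate the Variant Label Set for a Fundamental Label."""
    choices = []
    for ch in fundamental:
        cp = ord(ch)
        if cp not in table:
            # Strings containing other code points are rejected.
            raise ValueError(f"U+{cp:04X} is not a Valid Code Point")
        preferred, character = table[cp]
        choices.append(
            [(cp, "valid")]
            + [(v, "preferred") for v in preferred]
            + [(v, "character") for v in character]
        )

    preferred_labels, character_labels = set(), set()
    for combo in product(*choices):
        kinds = {kind for _, kind in combo}
        label = "".join(chr(cp) for cp, _ in combo)
        if "character" in kinds:
            character_labels.add(label)
        elif "preferred" in kinds:
            preferred_labels.add(label)
    return {
        "fundamental": fundamental,
        "preferred": preferred_labels,
        "character": character_labels,
    }


print(variant_label_set("\u4e2d\u56fd"))
```

With real tables the number of combinations can grow quickly, which is one reason registries usually activate only a few of the generated labels and merely reserve the rest.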

Variant Members

Code Points that appear in a Language Variant Table. The code point may appear in any of the Valid Code Point, Preferred Variant, or Character Variant positions.

Variant TLD

A Variant Domain Name Label corresponding to an A-label that appears or is intended to appear immediately below the root in the global DNS. Note that this definition includes TLDs that do not actually exist in the DNS at a given point in time. More informally, a Variant Domain Name Label that appears or is intended to appear immediately below the DNS root. Because the actual labels in the DNS are all A-labels, this informal use is not strictly true; but because A-labels and U-labels are symmetric, it amounts to the same thing.

Variant TLD Set

A set of Variant TLDs consisting of one Fundamental Label, zero or more Preferred Variant Labels, and zero or more Character Variant Labels.


Fundamental Terms (IAB)

This section covers basic topics that are needed for almost anyone who is involved with making IETF protocols more friendly to non-ASCII text (see Section 4.2) and with other aspects of internationalization.

language

A language is a way that humans interact. The use of language occurs in many forms, the most common of which are speech, writing, and signing. [NONE] Some languages have a close relationship between the written and spoken forms, while others have a looser relationship. The so-called LTRU (Language Tag Registry Update) standards [RFC5646] [RFC4647] discuss languages in more detail and provide identifiers for languages for use in Internet protocols. Note that computer languages are explicitly excluded from this definition.

script

A set of graphic characters used for the written form of one or more languages. [ISOIEC10646] Examples of scripts are Latin, Cyrillic, Greek, Arabic, and Han (the characters, often called ideographs after a subset of them, used in writing Chinese, Japanese, and Korean). RFC 2277 discusses scripts in detail.
It is common for internationalization novices to mix up the terms "language" and "script". This can be a problem in protocols that differentiate the two. Almost all protocols that are designed (or were re-designed) to handle non-ASCII text deal with scripts (the written systems) or characters, while fewer actually deal with languages.
A single name can mean either a language or a script; for example, "Arabic" is both the name of a language and the name of a script.
In fact, many scripts borrow their names from the names of languages. Further, many scripts are used for many languages; for example, the Russian and Bulgarian languages are written in the Cyrillic script. Some languages can be expressed using different scripts or were used with different scripts at different times; the Mongolian language can be written in either the Mongolian or Cyrillic scripts; Malay is primarily written in Latin script today but the earlier, Arabic-script-based, Jawi form is still in use; and a number of languages were converted from other scripts to Cyrillic in the first half of the last century, some of which have switched again more recently. Further, some languages are normally expressed with more than one script at the same time; for example, the Japanese language is normally expressed in the Kanji (Han), Katakana, and Hiragana scripts in a single string of text.


writing system

A set of rules for using one or more scripts to write a particular language. Examples include the American English writing system, the British English writing system, the French writing system, and the Japanese writing system. [UNICODE]

character

A member of a set of elements used for the organization, control, or representation of data. [ISOIEC10646]
There are at least three common definitions of the word "character":
  • a general description of a text entity
  • a unit of a writing system, often synonymous with "letter" or similar terms, but generalized to include digits and symbols of various sorts
  • the encoded entity itself
When people talk about characters, they usually intend one of the first two definitions.
A particular character is identified by its name, not by its shape. A name may suggest a meaning, but the character may be used for representing other meanings as well. A name may suggest a shape, but that does not imply that only that shape is commonly used in print, nor that the particular shape is associated only with that name.

coded character

A character together with its coded representation. [ISOIEC10646]

coded character set

A coded character set (CCS) is a set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation. [ISOIEC10646]

character encoding form

A character encoding form is a mapping from a coded character set (CCS) to the actual code units used to represent the data. [UNICODE]

repertoire

The collection of characters included in a character set. Also called a character repertoire. [UNICODE]

glyph

A glyph is an abstract form that represents one or more glyph images. The term "glyph" is often a synonym for glyph image, which is the actual, concrete image of a glyph representation having been rasterized or otherwise imaged onto some display surface. In displaying character data, one or more glyphs may be selected to depict a particular character. These glyphs are selected by a rendering engine during composition and layout processing. [UNICODE]

glyph code

A glyph code is a numeric code that refers to a glyph. Usually, the glyphs contained in a font are referenced by their glyph code.
Glyph codes are local to a particular font; that is, a different font containing the same glyphs may use different codes. [UNICODE]

transcoding

Transcoding is the process of converting text data from one character encoding form to another. Transcoders work only at the level of character encoding and do not parse the text.
Note: Transcoding may involve one-to-one, many-to-one, one-to-many or many-to-many mappings. Because some legacy mappings are glyphic, they may not only be many-to-many, but also unordered: thus XYZ may map to yxz. [CHARMOD]
In this definition, "many-to-one" means a sequence of characters mapped to a single character. The "many" does not mean alternative characters that map to the single character.
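A minimal transcoding sketch using the codecs in the Python standard library. The helper name `transcode` is hypothetical, and going through a Python `str` (that is, through Unicode) as the pivot is an assumption of this sketch rather than a requirement of the definition.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert encoded text from one encoding to another via Unicode."""
    return data.decode(source).encode(target)


shift_jis = "日本語".encode("shift_jis")
utf8 = transcode(shift_jis, "shift_jis", "utf-8")

print(len(shift_jis), shift_jis.hex())
print(len(utf8), utf8.hex())
```

The transcoder never inspects what the text means; it only maps each encoded character from one representation to the other, which is why the octet counts differ while the text stays the same.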

character encoding scheme

A character encoding scheme (CES) is a character encoding form plus byte serialization. There are many character encoding schemes in Unicode, such as UTF-8 and UTF-16BE. [UNICODE] Some CESs are associated with a single CCS; for example, UTF-8 [RFC3629] applies only to the identical CCSs of ISO/IEC 10646 and Unicode. Other CESs, such as ISO 2022, are associated with many CCSs.
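The "encoding form plus byte serialization" split can be sketched with UTF-16: the code units are the same 16-bit values in both cases, and only the serialization differs. The helper names are illustrative, and the sketch relies on the Python `utf-16-be` codec to obtain the code units.

```python
import struct


def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units (encoding form) as integers."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))


def serialize(units: list, big_endian: bool) -> bytes:
    """Serialize 16-bit code units to octets (encoding scheme)."""
    byte_order = ">" if big_endian else "<"
    return struct.pack(f"{byte_order}{len(units)}H", *units)


units = utf16_code_units("\u00e9")
print([hex(u) for u in units])
print(serialize(units, big_endian=True).hex())   # UTF-16BE octets
print(serialize(units, big_endian=False).hex())  # UTF-16LE octets
```

For UTF-8 the code unit is already a single octet, so the encoding form and the encoding scheme coincide and no byte order question arises.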

charset

A charset is a method of mapping a sequence of octets to a sequence of abstract characters. A charset is, in effect, a combination of one or more CCSs with a CES. Charset names are registered by the IANA according to procedures documented in [RFC2978]. [NONE] Many protocol definitions use the term "character set" in their descriptions. The terms "charset" or "character encoding scheme" and "coded character set" are strongly preferred over the term "character set" because "character set" has other definitions in other contexts and this can be confusing.
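Because a charset maps octets to characters, the same octet can mean different characters under different charsets. The following sketch uses codec names from the Python standard library; the sample octet is an arbitrary choice.

```python
octets = b"\xe9"

as_latin1 = octets.decode("iso8859_1")    # Latin small letter e with acute
as_cyrillic = octets.decode("iso8859_5")  # Cyrillic small letter shcha

print(as_latin1, as_cyrillic)

try:
    octets.decode("utf-8")
except UnicodeDecodeError as error:
    # A lone 0xE9 octet is not a complete UTF-8 sequence.
    print("not valid UTF-8:", error.reason)
```

This is why a protocol that carries text has to identify the charset explicitly instead of leaving it to guesswork.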

internationalization

In the IETF, "internationalization" means to add or improve the handling of non-ASCII text in a protocol. [NONE] Many protocols that handle text only handle one charset (US-ASCII), or leave the question of what CCS and encoding are used up to local guesswork (which leads, of course, to interoperability problems). If multiple charsets are permitted they must be explicitly identified [RFC2277]. Adding non-ASCII text to a protocol allows the protocol to handle more scripts, hopefully all of the ones useful in the world. In today's world, that is normally best accomplished by allowing Unicode encoded in UTF-8 only, thereby shifting conversion issues away from individual choices.

localization

The process of adapting an internationalized application platform or application to a specific cultural environment. In localization, the same semantics are preserved while the syntax may be changed. [FRAMEWORK] Localization is the act of tailoring an application for a different language or script or culture. Some internationalized applications can handle a wide variety of languages. Typical users only understand a small number of languages, so the program must be tailored to interact with users in just the languages they know.


The major work of localization is translating the user interface and documentation. Localization involves not only changing the language interaction, but also other relevant changes such as display of numbers, dates, currency, and so on. The better internationalized an application is, the easier it is to localize it for a particular language and character encoding scheme.
Localization is rarely an IETF matter, and protocols that are merely localized, even if they are serially localized for several locations, are generally considered unsatisfactory for the global Internet.
Do not confuse "localization" with "locale", which is described in Section 8 of this document.

i18n, l10n

These are abbreviations for "internationalization" and "localization". "18" is the number of characters between the "i" and the "n" in "internationalization", and "10" is the number of characters between the "l" and the "n" in "localization".
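The counting rule can be written out as a one-line sketch; the function name `numeronym` is an illustrative choice.

```python
def numeronym(word: str) -> str:
    """First letter, count of the letters in between, last letter."""
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("internationalization"))
print(numeronym("localization"))
```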

multilingual

The term "multilingual" has many widely-varying definitions and thus is not recommended for use in standards. Some of the definitions relate to the ability to handle international characters; other definitions relate to the ability to handle multiple charsets; and still others relate to the ability to handle multiple languages. [NONE]

displaying and rendering text

To display text, a system puts characters on a visual display device such as a screen or a printer. To render text, a system analyzes the character input to determine how to display the text.
The terms "display" and "render" are sometimes used interchangeably. Note, however, that text might be rendered as audio and/or tactile output, such as in systems that have been designed for people with visual disabilities. [NONE]
Combining characters modify the display of the character (or, in some cases, characters) that precede them. When rendering such text, the display engine must either find the glyph in the font that represents the base character and all of the combining characters, or it must render the combination itself. Such rendering can be straightforward, but it is sometimes complicated when the combining marks interact with each other, such as when there are two combining marks that would appear above the same character. Formatting characters can also change the way that a renderer would display text. Rendering can also be difficult for some scripts that have complex display rules for base characters, such as Arabic and Indic scripts.


Standards bodies

ISO and ISO/IEC JTC 1

The International Organization for Standardization has been involved with standards for characters since before the IETF was started. ISO is a non-governmental group made up of national bodies. Most of ISO's work in information technology is performed jointly with a similar body, the International Electrotechnical Commission (IEC) through a joint committee known as "JTC 1". ISO and ISO/IEC JTC 1 have many diverse standards in the international characters area; the one that is most used in the IETF is commonly referred to as "ISO/IEC 10646", sometimes with a specific date.
ISO/IEC 10646 describes a CCS that covers almost all known written characters in use today.
ISO/IEC 10646 is controlled by the group known as "ISO/IEC JTC 1/SC 2 WG2", often called "SC2/WG2" or "WG2" for short. ISO standards go through many steps before being finished, and years often go by between changes to the base ISO/IEC 10646 standard although amendments are now issued to track Unicode changes.
Information on WG2, and its work products, can be found at <http://www.dkuug.dk/JTC1/SC2/WG2/>. Information on SC2, and its work products, can be found at <http://www.iso.org/iso/standards_development/technical_committees/list_of_iso_technical_committees/iso_technical_committee.htm?commid=45050>.
The standard comes as a base part and a series of attachments or amendments. It is available in PDF form for downloading or in a CD-ROM version. One example of how to cite the standard is given in [RFC3629]. Any standard that cites ISO/IEC 10646 needs to evaluate how to handle the versioning problem that is relevant to the protocol's needs.
ISO is responsible for other standards that might be of interest to protocol developers concerned about internationalization. ISO 639 [ISO639] specifies the names of languages and forms part of the basis for the IETF's Language Tag work [RFC5646]. ISO 3166 [ISO3166] specifies the names and code abbreviations for countries and territories and is used in several protocols and databases including names for country-code top level domain names. The responsibilities of ISO TC 46 on Information and Documentation <http://www.iso.org/iso/standards_development/technical_committees/list_of_iso_technical_committees/iso_technical_committee.htm?commid=48750> include a series of standards for transliteration of various languages into Latin characters.
Another relevant ISO group was JTC 1/SC22/WG20, which was responsible for internationalization in JTC1, such as for international string ordering. Information on WG20, and its work products, can be found at <http://www.dkuug.dk/jtc1/sc22/wg20/>.
The specific tasks of SC22/WG20 were moved from SC22 into SC2 and there has been little significant activity since that occurred.

Unicode Consortium

The second important group for international character standards is the Unicode Consortium. The Unicode Consortium is a trade association of companies, governments, and other groups interested in promoting the Unicode Standard [UNICODE]. The Unicode Standard is a CCS whose repertoire and code points are identical to ISO/IEC 10646. The Unicode Consortium has added features to the base CCS which make it more useful in protocols, such as defining attributes for each character. Examples of these attributes include case conversion and numeric properties.
The actual technical and definitional work of the Unicode Consortium is done in the Unicode Technical Committee (UTC). The terms "UTC" and "Unicode Consortium" are often treated, imprecisely, as synonymous in the IETF.
The Unicode Consortium publishes addenda to the Unicode Standard as Unicode Technical Reports. There are many types of technical reports at various stages of maturity. The Unicode Standard and affiliated technical reports can be found at <http://www.unicode.org/>.
A reciprocal agreement between the Unicode Consortium and ISO/IEC JTC 1/SC 2 provides for ISO/IEC 10646 and The Unicode Standard to track each other for definitions of characters and assignments of code points. Updates, often in the form of amendments, to the former sometimes lag updates to the latter for a short period, but the gap has rarely been significant in recent years.
At the time that the IETF character set policy [RFC2277] was established and the first version of this terminology specification was published, there was a strong preference in the IETF community for references to ISO/IEC 10646 (rather than Unicode) when possible. That preference largely reflected a more general IETF preference for referencing established open international standards in preference to specifications from consortia. However, the Unicode definitions of character properties and classes are not part of ISO/IEC 10646. Because IETF specifications are increasingly dependent on those definitions (for example, see the explanation in Section 4.2) and the Unicode specifications are freely available online in convenient machine-readable form, the IETF's preference has shifted to referencing the Unicode Standard. The latter is especially important when version consistency between code points (either standard) and Unicode properties (Unicode only) is required.

World Wide Web Consortium (W3C)

This group created and maintains the standard for XML, the markup language for text that has become very popular. XML has always been fully internationalized so that there is no need for a new version to handle international text. However, in some circumstances, XML files may be sensitive to differences among Unicode versions.

local and regional standards organizations

Just as there are many native CCSs and charsets, there are many local and regional standards organizations to create and support them. Common examples of these are ANSI (United States), CEN/ISSS (Europe), JIS (Japan), and SAC (China).


Encodings and Transformation Formats of ISO/IEC 10646

Characters in the ISO/IEC 10646 CCS can be expressed in many ways.

Encoding forms are direct addressing methods, while transformation formats are methods for expressing encoding forms as bits on the wire.

[anchor9: Note in Draft: The current Unicode Standard, e.g., Section 2.5 of version 5, refers to UTF-8, UTF-16, and UTF-32 as "encoding forms". Consequently, the distinction made above may no longer be useful or its definition precisely correct. Comments and suggestions welcome.]

Documents that discuss characters in the ISO/IEC 10646 CCS often need to list specific characters. RFC 5137 describes the common methods for doing so in IETF documents, and these practices have been adopted by many other communities as well.


Basic Multilingual Plane (BMP)

The BMP is composed of the first 2^16 code points in ISO/IEC 10646 and contains almost all characters in contemporary use. The BMP is also called "Plane 0".

UCS-2 and UCS-4

UCS-2 and UCS-4 are the two encoding forms historically defined for ISO/IEC 10646. UCS-2 addresses only the BMP. Because many useful characters (such as many Han characters) have been defined outside of the BMP, many people consider UCS-2 to be obsolete.
UCS-4 addresses the entire range of code points from ISO/IEC 10646 (by agreement between ISO/IEC JTC1 SC2 and the Unicode Consortium, a range from 0..0x10FFFF) as 32-bit values with zero padding to the left. UCS-4 is identical to UTF-32BE (without use of a BOM (see below)); UTF-32BE is now the preferred term.

UTF-8

UTF-8 [RFC3629] is the preferred encoding for IETF protocols.
  • Characters in the BMP are encoded as one, two, or three octets.
  • Characters outside the BMP are encoded as four octets.
Characters from the US-ASCII repertoire have the same on-the-wire representation in UTF-8 as they do in US-ASCII. The IETF-specific definition of UTF-8 in RFC 3629 is identical to that in recent versions of the Unicode Standard (e.g., in Section 3.9 of Version 5.2 [UNICODE]).

UTF-16, UTF-16BE, and UTF-16LE

UTF-16, UTF-16BE, and UTF-16LE, three transformation formats described in [RFC2781] and defined in The Unicode Standard (Sections 3.9 and 16.8 of Version 5.2), are not required by any IETF standards, and are thus used much less often in protocols than UTF-8. Characters in the BMP are always encoded as two octets, and characters outside the BMP are encoded as four octets using a "surrogate pair" arrangement. The latter is not part of UCS-2, marking the difference between UTF-16 and UCS-2. The three UTF-16 formats differ based on the order of the octets and the presence or absence of a special lead-in ordering identifier called the "byte order mark" or "BOM".
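The octet counts for UTF-8 and the "surrogate pair" arrangement of UTF-16 can be sketched directly from the code point value. The function names are illustrative, and the sketch assumes its input is a Unicode scalar value (it does not reject the surrogate range U+D800..U+DFFF).

```python
def utf8_length(code_point: int) -> int:
    """Number of octets UTF-8 uses for one code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3  # last range inside the BMP
    return 4      # outside the BMP


def utf16_units_for(code_point: int) -> list:
    """UTF-16 code units: one inside the BMP, a surrogate pair outside."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    units = [f"{u:04X}" for u in utf16_units_for(cp)]
    print(f"U+{cp:04X}: {utf8_length(cp)} UTF-8 octets, UTF-16 units {units}")
```

The results can be compared against the built-in codecs, for example `len(chr(cp).encode("utf-8"))`.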

UTF-32

The Unicode Consortium and ISO/IEC JTC 1 have defined UTF-32 as a transformation format that incorporates the integer code point value right-justified in a 32-bit field. As with UTF-16, the byte order mark (BOM) can be used and UTF-32BE and UTF-32LE are defined. UTF-32 and UCS-4 are essentially equivalent and the terms are often used interchangeably.
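A short sketch of the three UTF-32 variants using the Python standard library codecs. Note that the plain `utf-32` codec in Python writes a BOM in the platform's native byte order, so the sketch checks for either BOM instead of assuming one.

```python
import codecs

text = "A"  # U+0041

big_endian = text.encode("utf-32-be")     # no BOM, most significant octet first
little_endian = text.encode("utf-32-le")  # no BOM, least significant octet first
with_bom = text.encode("utf-32")          # BOM followed by native-order data

print(big_endian.hex())
print(little_endian.hex())
print(with_bom.hex())

bom = with_bom[:4]
print(bom in (codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE))
```

In the big-endian form the code point value 0x41 sits in the last octet of the 32-bit field, which is the right-justified layout that the definition describes.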

SCSU and BOCU-1

The Unicode Consortium has defined an encoding, SCSU [UTR6], which is designed to offer good compression for typical text. A different encoding that is meant to be MIME-friendly, BOCU-1, is described in [UTN6]. Although compression is attractive (UTF-8 offers none), neither of these (at the time of this writing) has attracted much interest.
The compression provided as a side effect of the Punycode algorithm [RFC3492] is heavily used in some contexts, especially IDNA [RFC5890], but imposes some restrictions (see also Section 7).


Native CCSs and charsets

Before ISO/IEC 10646 was developed, many countries developed their own CCSs and charsets. Some of these were adopted into international standards for the relevant scripts or writing systems. Many dozens of these are in common use on the Internet today. Examples include ISO 8859-5 for Cyrillic and Shift-JIS for Japanese scripts.

The official list of the registered charset names for use with IETF protocols is maintained by IANA and can be found at [1]. The list contains preferred names and aliases. Note that this list has historically contained many errors, such as names that are in fact not charsets or references that do not give enough detail to reliably map names to charsets.

Probably the most well-known native CCS is ASCII [US-ASCII]. This CCS is used as the basis for keywords and parameter names in many IETF protocols, and as the sole CCS in numerous IETF protocols that have not yet been internationalized. ASCII became the basis for ISO/IEC 646 which, in turn, formed the basis for many national and international standards, such as the ISO 8859 series, that mix Basic Latin characters with characters from another script.

It is important to note that, strictly speaking, "ASCII" is a CCS and repertoire, not an encoding. The encoding used for ASCII in IETF protocols involves the seven-bit integer ASCII code point right-justified in an 8-bit field and is sometimes described as the "Network Virtual Terminal" or "NVT" encoding [RFC5198]. Less formally, "ASCII" and "NVT" are often used interchangeably. However, "non-ASCII" refers only to characters outside the ASCII repertoire and is not linked to a specific encoding. See Section 4.2.

A Unicode publication describes issues involved in mapping character data between charsets, and an XML format for mapping table data [UTR22].


Character Issues

This section contains terms and topics that are commonly used in character handling and therefore are of concern to people adding non-ASCII text handling to protocols. These topics are standardized outside the IETF.

code point

A value in the codespace of a repertoire. For all common repertoires developed in recent years, code point values are integers (code points for ASCII and its immediate descendants were defined in terms of column and row positions of a table).

combining character

A member of an identified subset of the coded character set of ISO/IEC 10646 intended for combination with the preceding non-combining graphic character, or with a sequence of combining characters preceded by a non-combining character. Combining characters are inherently non-spacing. <ISOIEC10646>

composite sequence or combining character sequence

A sequence of graphic characters consisting of a non-combining character followed by one or more combining characters. A graphic symbol for a composite sequence generally consists of the combination of the graphic symbols of each character in the sequence. The Unicode Standard often uses the term "combining character sequence" to refer to composite sequences. A composite sequence is not a character and therefore is not a member of the repertoire of ISO/IEC 10646. <ISOIEC10646> However, Unicode now assigns names to some such sequences, especially when the names are required to match terminology in other standards [UAX34].
In some CCSs, some characters consist of combinations of other characters. For example, the letter "a with acute" might be a combination of the two characters "a" and "combining acute", or it might be a combination of the three characters "a", a non-destructive backspace, and an acute. In the same or other CCSs, it might be available as a single code point. The rules for combining two or more characters are called "composition rules", and the rules for taking apart a character into other characters are called "decomposition rules". The result of composition is called a "precomposed character"; the result of decomposition is called a "decomposed character".

normalization

Normalization is the transformation of data to a normal form, for example, to unify spelling. <UNICODE> Note that the phrase "unify spelling" in the definition above does not mean unifying different strings with the same meaning as words (such as "color" and "colour"). Instead, it means unifying different character sequences that are intended to form the same composite characters, such as "<n><combining tilde>" and "<n with tilde>" (where "<n>" is U+006E, "<combining tilde>" is U+0303, and "<n with tilde>" is U+00F1).
The purpose of normalization is to allow two strings to be compared for equivalence. The strings "<a><n><combining tilde><o>" and "<a><n with tilde><o>" would be shown identically on a text display device. If a protocol designer wants those two strings to be considered equivalent during comparison, the protocol must define where normalization occurs.
The terms "normalization" and "canonicalization" are often used interchangeably. Generally, they both mean to convert a string of one or more characters into another string based on standardized rules. However, in Unicode, "canonicalization" or its variants are used to refer to a particular type of normalization equivalence ("canonical equivalence" in contrast to "compatibility equivalence"), so the term should be used with some care. Some CCSs allow multiple equivalent representations for a written string; normalization selects one among multiple equivalent representations as a base for reference purposes in comparing strings. In strings of text, these rules are usually based on decomposing combined characters or composing characters with combining characters. Unicode Standard Annex #15 [UTR15] describes the process and many forms of normalization in detail.
Normalization is important when comparing strings to see if they are the same.
The Unicode NFC and NFD normalizations support canonical equivalence; NFKC and NFKD support canonical and compatibility equivalence.
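The comparison problem described above can be sketched with Python's standard unicodedata module. The variable names are illustrative assumptions; the strings are the "<a><n><combining tilde><o>" and "<a><n with tilde><o>" examples from the text, plus a fullwidth exclamation mark to show the difference between canonical and compatibility forms.

```python
import unicodedata

decomposed = "an\u0303o"    # <a><n><combining tilde><o>
precomposed = "a\u00f1o"    # <a><n with tilde><o>

# Compared code point by code point, the two strings differ.
raw_equal = decomposed == precomposed

# After conversion to a common normalization form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", decomposed)
             == unicodedata.normalize("NFC", precomposed))
nfd_equal = (unicodedata.normalize("NFD", decomposed)
             == unicodedata.normalize("NFD", precomposed))

# NFC leaves a compatibility variant alone; NFKC maps it as well.
fullwidth = "\uff01"        # FULLWIDTH EXCLAMATION MARK
nfc_fullwidth = unicodedata.normalize("NFC", fullwidth)
nfkc_fullwidth = unicodedata.normalize("NFKC", fullwidth)
```

Which form to use, and at which point in a protocol it is applied, remains a design decision for the protocol; the sketch only shows that the choice changes comparison results.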

case

Case is the feature of certain alphabets where the letters have two (or occasionally more) distinct forms. These variants, which may differ markedly in shape and size, are called the uppercase letter (also known as capital or majuscule) and the lowercase letter (also known as small or minuscule). Case mapping is the association of the uppercase and lowercase forms of a letter.
<UNICODE> There is usually (but not always) a one-to-one mapping between the same letter in the two cases. However, there are many examples of characters which exist in one case but for which there is no corresponding character in the other case or for which there is a special mapping rule, such as the Turkish dotless "i", some Greek characters with modifiers, and characters like the German Sharp S (Eszett) and Greek Final Sigma that traditionally do not have uppercase forms. Case mapping can even be dependent on locale or language. Converting text to have only a single case, primarily for comparison purposes, is called "case folding". Because of the various unusual cases, case mapping can be quite controversial and some case folding algorithms even more so.
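A few of the irregular mappings mentioned above can be observed with Python's built-in string methods, which apply locale-independent Unicode case operations. The sample words are assumptions chosen for illustration, and the sketch does not attempt language-specific rules such as Turkish casing.

```python
# Sharp S: lowercasing keeps it, full uppercasing expands it to "SS".
sharp_s_lower = "Straße".lower()
sharp_s_upper = "ß".upper()

# Case folding, intended for comparison, maps both spellings together.
folded_equal = "Straße".casefold() == "STRASSE".casefold()

# Dotless i uppercases to plain "I", but "I" lowercases to dotted "i",
# so the round trip does not return the original letter.
dotless_upper = "\u0131".upper()
round_trip = "\u0131".upper().lower()
```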

sorting and collation

Collating is the process of ordering units of textual information.
Collation is usually specific to a particular language or even to a particular application or locale. It is sometimes known as alphabetizing, although alphabetization is just a special case of sorting and collation. <UNICODE> Collation is concerned with the determination of the relative order of any particular pair of strings, and algorithms concerned with collation focus on the problem of providing appropriate weighted keys for string values, to enable binary comparison of the key values to determine the relative ordering of the strings.


The relative orders of letters in collation sequences can differ widely based on the needs of the system or protocol defining the collation order. For example, even within ASCII characters, there are two common and very different collation orders: "A, a, B, b,..." and "A, B, C, ..., Z, a, b,...", with additional variations for lower case first and digits before and after letters.
In practice, it is rarely necessary to define a collation sequence for characters drawn from different scripts, but arranging such sequences so as to not surprise users is usually particularly problematic.
Sorting is the process of actually putting data records into specified orders, according to criteria for comparison between the records. Sorting can apply to any kind of data (including textual data) for which an ordering criterion can be defined. Algorithms concerned with sorting focus on the problem of performance (in terms of time, memory, or other resources) in actually putting the data records into the desired order.
A sorting algorithm for string data can be internationalized by providing it with the appropriate collation-weighted keys corresponding to the strings to be ordered.
Many processes have a need to order strings in a consistent (sorted) sequence. For only a few CCS/CES combinations, there is an obvious sort order that can be applied without reference to the linguistic meaning of the characters: the code point order is sufficient for sorting. That is, the code point order is also the order that a person would use in sorting the characters. For many CCS/CES combinations, the code point order would make no sense to a person and therefore is not useful for sorting if the results will be displayed to a person.
Code point order is usually not how any human educated by a local school system expects to see strings ordered; if one orders to the expectations of a human, one has a language-specific sort.
Sorting to code point order will seem inconsistent if the strings are not normalized before sorting because different representations of the same character will sort differently. This problem may be smaller with a language-specific sort.
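The effects described above can be sketched in a few lines of Python. The word lists are assumptions for illustration, and the key functions are simple stand-ins for real collation-weighted keys, not a language-specific collation.

```python
import unicodedata

words = ["Zebra", "apple", "Banana"]

# Plain sorted() compares code points: all uppercase ASCII letters
# come before all lowercase ones ("A, B, ..., Z, a, b, ...").
code_point_order = sorted(words)

# Supplying a key changes the collation; case folding here gives an
# order closer to "A, a, B, b, ...".
folded_order = sorted(words, key=str.casefold)

# Two representations of e-with-acute sort far apart unless the
# strings are normalized before comparison.
mixed = ["e\u0301", "f", "\u00e9"]
raw_order = sorted(mixed)
normalized_order = sorted(mixed, key=lambda s: unicodedata.normalize("NFC", s))
```

In the raw order the two spellings of the same letter are separated by "f"; with the normalizing key they end up adjacent.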

code table

A code table is a table showing the characters allocated to the octets in a code. <ISOIEC10646> Code tables are also commonly called "code charts".


Types of Characters

The following definitions of types of characters do not clearly delineate each character into one type, nor do they allow someone to accurately predict what types would apply to a particular character.

The definitions are intended for application designers to help them think about the many (sometimes confusing) properties of text.

alphabetic

An informative Unicode property. Characters that are the primary units of alphabets and/or syllabaries, whether combining or noncombining. This includes composite characters that are canonical equivalents to a combining character sequence of an alphabetic base character plus one or more combining characters: letter digraphs; contextual variant of alphabetic characters; ligatures of alphabetic characters; contextual variants of ligatures; modifier letters; letterlike symbols that are compatibility equivalents of single alphabetic letters; and miscellaneous letter elements. <UNICODE>

ideographic

Any symbol that primarily denotes an idea (or meaning) in contrast to a sound (or pronunciation), for example, a symbol showing a telephone or the Han characters used in Chinese, Japanese, and Korean. <UNICODE> While Unicode and many other systems use this term to refer to all Han characters, strictly speaking not all of those characters are actually ideographic. Some are pictographic (such as the telephone example above), some are used phonetically, and so on.
However, the convention is to describe the script as ideographic as contrasted to alphabetic.

digit or number

All modern writing systems use decimal digits in some form; some older ones use non-positional or other systems. Different scripts may have their own digits. Unicode distinguishes between numbers and other kinds of characters by assigning a special General Category value to them and subdividing that value to distinguish between decimal digits, letter digits, and other digits.
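The General Category subdivision described above can be inspected with Python's unicodedata module. The sample characters are assumptions chosen to show one member of each numeric subcategory.

```python
import unicodedata

# Nd: decimal digits, in any script.
ascii_seven = unicodedata.category("7")
arabic_indic_three = unicodedata.category("\u0663")
arabic_indic_value = unicodedata.decimal("\u0663")

# Nl: letter-like numbers, such as ROMAN NUMERAL EIGHT.
roman_eight = unicodedata.category("\u2167")

# No: other numbers, such as VULGAR FRACTION ONE HALF.
one_half = unicodedata.category("\u00bd")
one_half_value = unicodedata.numeric("\u00bd")
```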

punctuation

Characters that separate units of text, such as sentences and phrases, thus clarifying the meaning of the text. The use of punctuation marks is not limited to prose; they are also used in mathematical and scientific formulae, for example.

symbol

One of a set of characters other than those used for letters, digits, or punctuation, and representing various concepts generally not connected to written language use per se. <NONE> Examples of symbols include characters for mathematical operators, symbols for OCR, symbols for box-drawing or graphics, as well as symbols for dingbats, arrows, faces, and geometric shapes.
Unicode has a property that identifies symbol characters.

nonspacing character

A combining character whose positioning in presentation is dependent on its base character. It generally does not consume space along the visual baseline in and of itself. <UNICODE> A combining acute accent (U+0301) is an example of a nonspacing character.

diacritic

A mark applied or attached to a symbol to create a new symbol that represents a modified or new value. They can also be marks applied to a symbol irrespective of whether it changes the value of that symbol. In the latter case, the diacritic usually represents an independent value (for example, an accent, tone, or some other linguistic information). Also called diacritical mark or diacritical. <UNICODE>

control character

The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F.
The basic space character, U+0020, is often considered as a control character as well, making the total number 66. They are also known as control codes or control characters. In terminology adopted by Unicode from ASCII and the ISO 8859 standards, these codes are treated as belonging to three ranges: "C0" (for U+0000..U+001F), "C1" (for U+0080..U+009F), and the single control character "DEL" (U+007F). <UNICODE>

formatting character

Characters that are inherently invisible but that have an effect on the surrounding characters. <UNICODE> Examples of formatting characters include characters for specifying the direction of text and characters that specify how to join multiple characters.
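The counts and categories given above can be checked against the Unicode character database shipped with Python. This is only a sketch; the General Category values used ("Cc" for control, "Cf" for format, "Mn" for nonspacing mark, "Zs" for space separator) come from Unicode, while the variable names are illustrative.

```python
import sys
import unicodedata

# Collect every code point whose General Category is Cc (control).
controls = [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Cc"]

# U+0020 SPACE is formally a space separator, not a control character.
space_category = unicodedata.category(" ")

# LEFT-TO-RIGHT MARK is a formatting character.
lrm_category = unicodedata.category("\u200e")

# COMBINING ACUTE ACCENT is a nonspacing mark.
acute_category = unicodedata.category("\u0301")
```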

compatibility character or compatibility variant

A graphic character included as a coded character of ISO/IEC 10646 primarily for compatibility with existing coded character sets. <ISOIEC10646> The Unicode definition of compatibility character also includes characters that have been incorporated for other reasons. Their list includes several separate groups of characters included for compatibility purposes: halfwidth and fullwidth characters used with East Asian scripts, Arabic contextual forms (e.g., initial or final forms), some ligatures, deprecated formatting characters, variant forms of characters (or even copies of them) for particular uses (e.g., phonetic or mathematical applications), font variations, CJK compatibility ideographs, and so on. For additional information and the separate term "compatibility decomposable character", see the Unicode standard.
For example, U+FF01 (FULLWIDTH EXCLAMATION MARK) was included for compatibility with Asian character sets that include full-width and half-width ASCII characters.

Some efforts in the IETF have concluded that it would be useful to support mapping of some groups of compatibility equivalents and not others (e.g., supporting or mapping width variations while preserving or rejecting mathematical variations).
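One way to picture such selective mapping is sketched below. It relies on the decomposition tags in the Unicode character database ("<wide>" and "<narrow>" for width variants, "<font>" for mathematical font variants). The function name and the choice of which tags to map are assumptions for illustration and do not reproduce any particular IETF specification.

```python
import unicodedata

# Hypothetical policy: map width variants only, keep everything else.
WIDTH_TAGS = ("<wide>", "<narrow>")

def map_width_variants(text):
    """Replace fullwidth/halfwidth characters by their decompositions."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith(WIDTH_TAGS):
            # The field looks like "<wide> 0041": a tag, then code points.
            code_points = decomp.split()[1:]
            out.append("".join(chr(int(cp, 16)) for cp in code_points))
        else:
            out.append(ch)
    return "".join(out)

fullwidth_mapped = map_width_variants("\uff21\uff22\uff01")   # fullwidth A, B, !
math_bold_kept = map_width_variants("\U0001d400")             # MATHEMATICAL BOLD CAPITAL A
```

By contrast, applying NFKC to the same input would also map the mathematical variant, which is exactly the distinction the paragraph above draws.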


Differentiation of Subsets

Especially as existing IETF standards are internationalized, it is necessary to describe collections of characters including especially various subsets of Unicode. Because Unicode includes ways to code substantially all characters in contemporary use, subsets of the Unicode repertoire can be a useful tool for defining these collections as repertoires independent of specific Unicode coding.

However specific collections are defined, it is important to remember that, while older CCSs such as ASCII and the ISO 8859 family are close-ended and fixed, Unicode is open-ended, with new character definitions, and often new scripts, being added every year or so.

So, while, e.g., an ASCII subset, such as "upper case letters", can be specified as a range of code points (4/1 to 5/10 for that example), similar definitions for Unicode either have to be specified in terms of Unicode properties or are very dependent on Unicode versions (and the relevant version must be identified in any specification). See the IDNA code point specification [RFC5892] for an example of specification by combinations of properties.
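The two styles of definition can be contrasted in a short sketch. The column/row notation is converted to a code point as column times 16 plus row; the helper name is an assumption, and the property-based variant uses the General Category value "Lu" from whatever Unicode version the running Python ships with.

```python
import string
import unicodedata

def from_column_row(column, row):
    """Convert ASCII table position column/row to a code point."""
    return column * 16 + row

# Range-based definition: 4/1 through 5/10.
first = from_column_row(4, 1)
last = from_column_row(5, 10)
ascii_upper = "".join(chr(cp) for cp in range(first, last + 1))

# Property-based definition: depends on the Unicode version in use.
def is_uppercase_letter(ch):
    return unicodedata.category(ch) == "Lu"

unicode_version = unicodedata.unidata_version
```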

Some terms are commonly used in the IETF to define character ranges and subsets. Some of these are imprecise and can cause confusion if not used carefully.

non-ASCII

The term "non-ASCII" strictly refers to characters other than those that appear in the ASCII repertoire, independent of the CCS or encoding used for them. In practice, if a repertoire such as that of Unicode is established as context, "non-ASCII" refers to characters in that repertoire that do not appear in the ASCII repertoire. "Outside the ASCII repertoire" and "outside the ASCII range" are practical, and more precise, synonyms for "non-ASCII".
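A brief sketch of this distinction, with a hypothetical helper name and sample word: the test is made on characters (code points), so the answer is the same whichever encoding carried the text.

```python
def non_ascii_chars(text):
    """Return the characters of text that are outside the ASCII repertoire."""
    return [ch for ch in text if ord(ch) > 0x7F]

word = "café"
found = non_ascii_chars(word)

# The encoded octets differ between charsets...
latin1_octets = word.encode("iso-8859-1")
utf8_octets = word.encode("utf-8")

# ...but after decoding, the same character is identified as non-ASCII.
found_via_latin1 = non_ascii_chars(latin1_octets.decode("iso-8859-1"))
found_via_utf8 = non_ascii_chars(utf8_octets.decode("utf-8"))
```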

letters

The term "letters" does not have an exact equivalent in the Unicode standard. Letters are generally characters that are used to write words, but that means very different things in different languages and cultures.


User Interface for Text

Although the IETF does not standardize user interfaces, many protocols make assumptions about how a user will enter or see text that is used in the protocol. Internationalization challenges assumptions about the type and limitations of the input and output devices that may be used with applications that use various protocols. It is therefore useful to consider how users typically interact with text that might contain one or more non-ASCII characters.

input methods

An input method is a mechanism for a person to enter text into an application. <NONE> Text can be entered into a computer in many ways. Keyboards are by far the most common device used, but many characters cannot be entered on typical computer keyboards in a single stroke. Many operating systems come with system software that lets users input characters outside the range of what is allowed by keyboards.
For example, there are dozens of different input methods for Han characters in Chinese, Japanese, and Korean. Some start with phonetic input through the keyboard, while others use the number of strokes in the character. Input methods are also needed for scripts that have many diacritics, such as European or Vietnamese characters that have two or three diacritics on a single alphabetic character.
The term "input method editor" (IME) is often used generically to describe the tools and software used to deal with input of characters on a particular system.

rendering rules

A rendering rule is an algorithm that a system uses to decide how to display a string of text. <NONE> Some scripts can be directly displayed with fonts, where each character from an input stream can simply be copied from a glyph system and put on the screen or printed page. Other scripts need rules that are based on the context of the characters in order to render text for display.
Some examples of these rendering rules include:
  • Scripts such as Arabic (and many others), where the form of the letter changes depending on the adjacent letters, whether the letter is standing alone, at the beginning of a word, in the middle of a word, or at the end of a word. The rendering rules must choose between two or more glyphs.
  • Scripts such as the Indic scripts, where consonants may change their form if they are adjacent to certain other consonants or may be displayed in an order different from the way they are stored and pronounced. The rendering rules must choose between two or more glyphs.
  • Arabic and Hebrew scripts, where the order in which the characters are displayed is changed by the bidirectional properties of the alphabetic and other characters and by right-to-left and left-to-right ordering marks. The rendering rules must choose the order in which characters are displayed.
  • Some writing systems cannot have their rendering rules suitably defined using mechanisms that are now defined in the Unicode Standard. None of those languages are in active non-scholarly use today.
  • Many systems use a special rendering rule when they lack a font or other mechanism for rendering a particular character correctly. That rule typically involves substitution of a small open box or a question mark for the missing character.
See "undisplayable character" below.

graphic symbol

A graphic symbol is the visual representation of a graphic character or of a composite sequence. <ISOIEC10646>

font

A font is a collection of glyphs used for the visual depiction of character data. A font is often associated with a set of parameters (for example, size, posture, weight, and serifness), which, when set to particular values, generate a collection of imagable glyphs. <UNICODE> The term "font" is often used interchangeably with "typeface". As historically used in typography, a typeface is a family of one or more fonts that share a common general design. For example, "Times Roman" is actually a typeface, with a collection of fonts such as "Times Roman Bold", "Times Roman Medium", "Times Roman Italic", and so on. Some sources even consider different type sizes within a typeface to be different fonts. While those distinctions are rarely important for internationalization purposes, there are exceptions. Those writing specifications should be very careful about definitions in cases in which the exceptions might lead to ambiguity.

bidirectional display

The process or result of mixing left-to-right oriented text and right-to-left oriented text in a single line is called bidirectional display, often abbreviated as "bidi". <UNICODE> Most of the world's written languages are displayed left-to-right.
However, many widely-used written languages such as ones based on the Hebrew or Arabic scripts are displayed primarily right-to-left (numerals are a common exception in the modern scripts). Right-to-left text often confuses protocol writers because they have to keep thinking in terms of the order of characters in a string in memory, an order that might be different from what they see on the screen. (Note that some languages are written both horizontally and vertically and that some historical ones use other display orderings.) Further, bidirectional text can cause confusion because there are formatting characters in ISO/IEC 10646 that cause the order of display of text to change. These explicit formatting characters change the display regardless of the implicit left-to-right or right-to-left properties of characters. Text that might contain those characters typically requires careful processing before being sorted or compared for equality.
It is common to see strings with text in both directions, such as strings that include both text and numbers, or strings that contain a mixture of scripts.
Unicode has a long and incredibly detailed algorithm for displaying bidirectional text [UAX9].
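The sketch below does not implement that algorithm; it only shows the per-character bidirectional classes that the algorithm takes as input, as exposed by Python's unicodedata module. The sample characters and the helper name are assumptions for illustration.

```python
import unicodedata

latin_a = unicodedata.bidirectional("a")           # left-to-right letter
hebrew_alef = unicodedata.bidirectional("\u05d0")  # right-to-left letter
arabic_alef = unicodedata.bidirectional("\u0627")  # Arabic letter
digit_one = unicodedata.bidirectional("1")         # European number
rlo = unicodedata.bidirectional("\u202e")          # explicit override character

def has_explicit_bidi_controls(text):
    """Flag text containing explicit embedding, override, or isolate controls."""
    explicit = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}
    return any(unicodedata.bidirectional(ch) in explicit for ch in text)
```

A check like has_explicit_bidi_controls is one possible first step in the "careful processing" mentioned above before strings are sorted or compared.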

undisplayable character

A character that has no displayable form. For instance, the zero-width space (U+200B) cannot be displayed because it takes up no horizontal space. Formatting characters such as those for setting the direction of text are also undisplayable. Note, however, that every character in [UNICODE] has a glyph associated with it, and that the glyphs for undisplayable characters are enclosed in a dashed square as an indication that the actual character is undisplayable.
The property of a character that causes it to be undisplayable is intrinsic to its definition. Undisplayable characters can never be displayed in normal text (the dashed square notation is used only in special circumstances). Printable characters whose Unicode definitions are associated with glyphs that cannot be rendered on a particular system are not, in this sense, undisplayable.


Text in Current IETF Protocols

Many IETF protocols started off being fully internationalized, while others have been internationalized as they were revised. In this process, IETF members have seen patterns in the way that many protocols use text. This section describes some specific protocol interactions with text.

protocol elements

Protocol elements are uniquely-named parts of a protocol. <NONE> Almost every protocol has named elements, such as "source port" in TCP. In some protocols, the names of the elements (or text tokens for the names) are transmitted within the protocol. For example, in SMTP and numerous other IETF protocols, the names of the verbs are part of the command stream. The names are thus part of the protocol standard. The names of protocol elements are not normally seen by end users and it is rarely appropriate to internationalize protocol element names (even while the elements themselves can be internationalized).

name spaces

A name space is the set of valid names for a particular item, or the syntactic rules for generating these valid names. Many items in Internet protocols use names to identify specific instances or values. The names may be generated (by some prescribed rules), registered centrally (e.g., with IANA), or have a distributed registration and control mechanism, such as the names in the DNS.

on-the-wire encoding

The encoding and decoding used before and after transmission over the network is often called the "on-the-wire" (or sometimes just "wire") format. <NONE> Characters are identified by code points. Before being transmitted in a protocol, they must first be encoded as bits and octets. Similarly, when characters are received in a transmission, they have been encoded, and a protocol that needs to process the individual characters needs to decode them before processing.
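The encode-before-sending and decode-after-receiving steps can be sketched as follows. UTF-8 is used here only as an example wire encoding, and the sample word is an assumption.

```python
text = "naïve"

# Characters are identified by code points...
code_points = [ord(ch) for ch in text]

# ...and must be encoded as octets before transmission.
wire = text.encode("utf-8")

# The receiver decodes the octets before processing characters.
received = wire.decode("utf-8")
```

Note that the number of octets on the wire need not equal the number of characters: here five characters occupy six octets.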

parsed text

Text strings that are analyzed for subparts. <NONE> In some protocols, free text in text fields might be parsed. For example, many mail user agents (MUAs) will parse the words in the text of the Subject: field to attempt to thread based on what appears after the "Re:" prefix.
Such conventions are very sensitive to localization. If, for example, a form like "Re:" is altered by an MUA to reflect the language of the sender or recipient, a system that subsequently does threading may not recognize the replacement term as a delimiter string.

charset identification

Specification of the charset used for a string of text. <NONE> Protocols that allow more than one charset to be used in the same place should require that the text be identified with the appropriate charset. Without this identification, a program looking at the text cannot definitively discern the charset of the text. Charset identification is also called "charset tagging".
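Why the tag matters can be shown with a single octet. In this sketch the octet value and the two charsets are assumptions chosen for illustration; both decodings succeed, so nothing in the data itself reveals which one was intended.

```python
octets = b"\xe9"

# The same octet is a different character in each charset.
as_latin1 = octets.decode("iso-8859-1")    # Latin small letter e with acute
as_cyrillic = octets.decode("iso-8859-5")  # Cyrillic small letter shcha
```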

language identification

Specification of the human language used for a string of text.
Some protocols (such as MIME and HTTP) allow text that is meant for machine processing to be identified with the language used in the text. Such identification is important for machine processing of the text, such as by systems that render the text by speaking it. Language identification is also called "language tagging".
The IETF "LTRU" standards [RFC5646] and [RFC4647] provide a comprehensive model for language identification.
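As one small example of what that model enables, the sketch below follows the basic filtering scheme of RFC 4647, in which a language range matches a tag if the two are equal or the range is a prefix of the tag followed by a hyphen, compared case-insensitively. The function name and sample tags are assumptions; extended filtering and lookup are not covered.

```python
def basic_filter(language_range, tag):
    """Return True if tag matches language_range under basic filtering."""
    rng = language_range.lower()
    candidate = tag.lower()
    if rng == "*":
        return True
    return candidate == rng or candidate.startswith(rng + "-")

tags = ["de", "de-CH", "den", "en-US"]
german = [t for t in tags if basic_filter("de", t)]
```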

MIME

MIME (Multipurpose Internet Mail Extensions) is a message format that allows for textual message bodies and headers in character sets other than US-ASCII in formats that require ASCII (most notably RFC 5322, the standard for Internet mail headers [RFC5322]). MIME is described in RFCs 2045 through 2049, as well as more recent RFCs. <NONE>

transfer encoding syntax

A transfer encoding syntax (TES) (sometimes called a transfer encoding scheme) is a reversible transform of already-encoded data that is represented in one or more character encoding schemes.
TESs are useful for encoding types of character data into another format, usually for allowing new types of data to be transmitted over legacy protocols. The main examples of TESs used in the IETF include Base64 and quoted-printable. MIME identifies the transfer encoding syntax for body parts as a Content-Transfer-Encoding, occasionally abbreviated C-T-E.

Base64

Base64 is a transfer encoding syntax that allows binary data to be represented by the ASCII characters A through Z, a through z, 0 through 9, +, /, and =. It is defined in [RFC2045]. <NONE>

quoted printable

Quoted printable is a transfer encoding syntax that allows strings that have non-ASCII characters mixed in with mostly ASCII printable characters to be somewhat human readable. It is described in [RFC2047]. <NONE> The quoted printable syntax is generally considered to be a failure at being readable. It is jokingly referred to as "quoted unreadable".
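Both transfer encoding syntaxes are available in Python's standard library, which makes the difference in readability easy to see. The sample word and the choice of UTF-8 as the underlying character encoding are assumptions for this sketch.

```python
import base64
import quopri

# Already-encoded data: the UTF-8 octets of a mostly-ASCII word.
raw = "café".encode("utf-8")

b64 = base64.b64encode(raw)     # opaque, uses only A-Z a-z 0-9 + / =
qp = quopri.encodestring(raw)   # ASCII octets pass through, others become =XX

# Both transforms are reversible.
b64_back = base64.b64decode(b64)
qp_back = quopri.decodestring(qp)
```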

XML

XML (which is an approximate abbreviation for Extensible Markup Language) is a popular method for structuring text. XML text is explicitly tagged with charsets. The specification for XML can be found at <http://www.w3.org/XML/>. <NONE>

ASN.1 text formats

The ASN.1 data description language has many formats for text data. The formats allow for different repertoires and different encodings. Some of the formats that appear in IETF standards based on ASN.1 include IA5String (all ASCII characters), PrintableString (most ASCII characters, but missing many punctuation characters), BMPString (characters from ISO/IEC 10646 plane 0 in UTF-16BE format), UTF8String (just as the name implies), and TeletexString (also called T61String).

ASCII-compatible encoding (ACE)

Starting in 1996, many ASCII-compatible encoding schemes (which are actually transfer encoding syntaxes) have been proposed as possible solutions for internationalizing host names and some other purposes. Their goal is to be able to encode any string of ISO/IEC 10646 characters using the preferred syntax for domain names (as described in STD 13). At the time of this writing, only the ACE encoding produced by Punycode [RFC3492] has become an IETF standard.
The choice of ACE forms to internationalize legacy protocols must be made with care as it can cause some difficult side effects [RFC6055].

LDH label

The classical label form used in the DNS and most applications that call on it, albeit with some additional restrictions, reflects the early syntax of "hostnames" [RFC0952] and limits those names to ASCII letters, digits, and embedded hyphens. The hostname syntax is identical to that described as the "preferred name syntax" in Section 3.5 of RFC 1034 [RFC1034] as modified by RFC 1123 [RFC1123]. LDH labels are defined in a more restrictive and precise way for internationalization contexts as part of the IDNA2008 specification [RFC5890].
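The classical letters-digits-hyphen rule can be approximated with a regular expression. This is a sketch of the hostname-style syntax only (ASCII letters and digits, hyphens in the interior, at most 63 octets); the names are assumptions, and the stricter IDNA2008 classification of LDH labels in RFC 5890 is not implemented here.

```python
import re

# One letter or digit, optionally followed by up to 61 LDH characters
# and a final letter or digit: 63 octets at most, no edge hyphens.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_ldh_label(label):
    """Return True if label matches the classical LDH syntax."""
    return LDH_LABEL.fullmatch(label) is not None
```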


Terms Associated with Internationalized Domain Names

The current specification for Internationalized Domain Names (IDNs), known formally as Internationalized Domain Names for Applications or IDNA, is referred to in the IETF and parts of the broader community as "IDNA2008" and consists of several documents. Section 2.3 of the first of those documents, commonly known as "IDNA2008 Definitions" [RFC5890] provides definitions and introduces some specialized terms for differentiating among types of DNS labels in an IDN context.

Those terms are listed in the table below; see RFC 5890 for the specific definitions if needed.

ACE Prefix

The "ACE prefix" is defined in this document to be a string of ASCII characters, "xn--", that appears at the beginning of every A-label. "ACE" stands for "ASCII-Compatible Encoding".

A-label

An "A-label" is the ASCII-Compatible Encoding (ACE) form of an IDNA-valid string. It must be a complete label: IDNA is defined for labels, not for parts of them and not for complete domain names. This means, by definition, that every A-label will begin with the IDNA ACE prefix, "xn--" (see Section 2.3.2.5), followed by a string that is a valid output of the Punycode algorithm [RFC3492] and hence a maximum of 59 ASCII characters in length. The prefix and string together must conform to all requirements for a label that can be stored in the DNS including conformance to the rules for LDH labels. If and only if a string meeting the above requirements can be decoded into a U-label is it an A-label.
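The relationship between the ACE prefix, the Punycode output, and the length limit can be sketched as below. The function names are hypothetical, and the sketch assumes its input is already a valid, lowercase U-label: none of the IDNA2008 validity rules are checked, so the result is only a candidate A-label.

```python
ACE_PREFIX = "xn--"
MAX_LABEL_OCTETS = 63   # DNS label limit: 4 for the prefix + 59 for Punycode

def to_a_label_sketch(u_label):
    """Build a candidate A-label from a string assumed to be a U-label."""
    if u_label.isascii():
        raise ValueError("a U-label contains at least one non-ASCII character")
    encoded = u_label.encode("punycode").decode("ascii")
    a_label = ACE_PREFIX + encoded
    if len(a_label) > MAX_LABEL_OCTETS:
        raise ValueError("label too long for the DNS")
    return a_label

def from_a_label_sketch(a_label):
    """Decode a candidate A-label back to its Unicode form."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("missing ACE prefix")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

candidate = to_a_label_sketch("bücher")
restored = from_a_label_sketch(candidate)
```

Python also ships an "idna" codec, but it implements the earlier IDNA2003 rules, which is why this sketch composes the label by hand instead.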

Domain Name Slot

A "domain name slot" is defined in this document to be a protocol element or a function argument or a return value (and so on) explicitly designated for carrying a domain name. Examples of domain name slots include the QNAME field of a DNS query; the name argument of the gethostbyname() or getaddrinfo() standard C library functions; the part of an email address following the at sign ("@") in the parameter to the SMTP MAIL or RCPT commands or the "From:" field of an email message header; and the host portion of the URI in the "src" attribute of an HTML "<IMG>" tag. A string that has the syntax of a domain name but that appears in general text is not in a domain name slot. For example, a domain name appearing in the plain text body of an email message is not occupying a domain name slot.
An "IDNA-aware domain name slot" is defined for this set of documents to be a domain name slot explicitly designated for carrying an internationalized domain name as defined in this document. The designation may be static (for example, in the specification of the protocol or interface) or dynamic (for example, as a result of negotiation in an interactive session).
Name slots that are not IDNA-aware obviously include any domain name slot whose specification predates IDNA. Note that the requirements of some protocols that use the DNS for data storage prevent the use of IDNs. For example, the format required for the underscore labels used by the service location protocol [RFC2782] precludes representation of a non-ASCII label in the DNS using A-labels because those SRV-related labels must start with underscores. Of course, non-ASCII IDN labels may be part of a domain name that also includes underscore labels.

IDNA-valid string

A string is "IDNA-valid" if it meets all of the requirements of these specifications for an IDNA label. IDNA-valid strings may appear in either of the two forms defined immediately below, or may be drawn from the NR-LDH label subset. IDNA-valid strings must also conform to all basic DNS requirements for labels. These documents make specific reference to the form appropriate to any context in which the distinction is important.

Internationalized Domain Name

An "internationalized domain name" (IDN) is a domain name that contains at least one A-label or U-label, but that otherwise may contain any mixture of NR-LDH labels, A-labels, or U-labels. Just as has been the case with ASCII names, some DNS zone administrators may impose restrictions, beyond those imposed by DNS or IDNA, on the characters or strings that may be registered as labels in their zones. Because of the diversity of characters that can be used in a U-label and the confusion they might cause, such restrictions are mandatory for IDN registries and zones even though the particular restrictions are not part of these specifications (the issue is discussed in more detail in Section 4.3 of the Protocol document [RFC5891]). Because these restrictions, commonly known as "registry restrictions", only affect what can be registered and not lookup processing, they have no effect on the syntax or semantics of DNS protocol messages; a query for a name that matches no records will yield the same response regardless of the reason why it is not in the zone. Clients issuing queries or interpreting responses cannot be assumed to have any knowledge of zone-specific restrictions or conventions. See the section on registration policy in the Rationale document [RFC5894] for additional discussion.

Internationalized Label

"Internationalized label" is used when a term is needed to refer to a single label of an IDN, i.e., one that might be any of an NR-LDH label, A-label, or U-label. There are some standardized DNS label formats, such as the "underscore labels" used for service location (SRV) records [RFC2782], that do not fall into any of the three categories and hence are not internationalized labels.


LDH Label

This is the classical label form used, albeit with some additional restrictions, in hostnames [RFC0952]. Its syntax is identical to that described as the "preferred name syntax" in Section 3.5 of RFC 1034 [RFC1034] as modified by RFC 1123 [RFC1123]. Briefly, it is a string consisting of ASCII letters, digits, and the hyphen with the further restriction that the hyphen cannot appear at the beginning or end of the string. Like all DNS labels, its total length must not exceed 63 octets.
LDH labels include the specialized labels used by IDNA (described as "A-labels" below) and some additional restricted forms (also described below).
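The LDH syntax summarized above can be sketched as a small check. This is only an illustration: the function name `is_ldh_label` is hypothetical, and the sketch covers just the rules stated in this entry (ASCII letters, digits, hyphen, no leading or trailing hyphen, at most 63 octets).

```python
import re

# One letter or digit, optionally followed by up to 61 letters, digits, or
# hyphens and a final letter or digit: 1 to 63 ASCII characters in total.
_LDH_PATTERN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label):
    """Return True if `label` matches the LDH label syntax described above."""
    return _LDH_PATTERN.fullmatch(label) is not None


if __name__ == "__main__":
    for candidate in ("example", "xn--bcher-kva", "-example", "_sip"):
        print(candidate, is_ldh_label(candidate))
```

Underscore labels such as "_sip" fail the check because the underscore is not a letter, digit, or hyphen.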

R-LDH label

Reserved LDH labels, known as "tagged domain names" in some other contexts, have the property that they contain "--" in the third and fourth characters but which otherwise conform to LDH label rules. Only a subset of the R-LDH labels can be used in IDNA-aware applications. That subset consists of the class of labels that begin with the prefix "xn--" (case independent), but otherwise conform to the rules for LDH labels. That subset is called "XN-labels" in this set of documents. XN-labels are further divided into those whose remaining characters (after the "xn--") are valid output of the Punycode algorithm [RFC3492] and those that are not (see below). The XN-labels that are valid Punycode output are known as "A-labels" if they also meet the other criteria for IDNA-validity described below. Because LDH labels (and, indeed, any DNS label) must not be more than 63 octets in length, the portion of an XN-label derived from the Punycode algorithm is limited to no more than 59 ASCII characters.

NR-LDH label

Non-Reserved LDH labels are the set of valid LDH labels that do not have "--" in the third and fourth positions.
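As a rough illustration of how the LDH categories relate, the sketch below sorts a label into NR-LDH, XN-label, or other R-LDH. The function name and the returned strings are hypothetical, and the sketch does not test whether an XN-label is valid Punycode output or IDNA-valid, so it cannot by itself identify an A-label.

```python
import re

# LDH syntax: letters, digits, hyphen; no hyphen first or last; 1..63 octets.
_LDH_PATTERN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def classify_ldh_label(label):
    """Classify a label using the third-and-fourth-character rule."""
    if _LDH_PATTERN.fullmatch(label) is None:
        return "not LDH"
    if label[2:4] != "--":
        return "NR-LDH"
    # The "xn--" prefix is compared case-independently.
    if label[:4].lower() == "xn--":
        return "XN-label"
    return "R-LDH (non-XN)"


# The 63-octet label limit minus the 4-octet prefix leaves 59 characters.
MAX_PUNYCODE_PART = 63 - len("xn--")

if __name__ == "__main__":
    for candidate in ("example", "XN--BCHER-KVA", "ab--cd", "_sip"):
        print(candidate, "->", classify_ldh_label(candidate))
```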

U-label

A "U-label" is an IDNA-valid string of Unicode characters, in Normalization Form C (NFC) and including at least one non-ASCII character, expressed in a standard Unicode Encoding Form (such as UTF-8). It is also subject to the constraints about permitted characters that are specified in Section 4.2 of the Protocol document and the rules in Sections 2 and 3 of the Tables document, the Bidi constraints in that document if it contains any character from scripts that are written right to left, and the symmetry constraint described immediately below. Conversions between U-labels and A-labels are performed according to the "Punycode" specification [RFC3492], adding or removing the ACE prefix as needed.
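The conversion between the two forms can be sketched with Python's standard "punycode" codec, which implements the RFC 3492 algorithm. This is a minimal sketch under stated assumptions: the function names are hypothetical, and only the NFC, non-ASCII, and length conditions are checked. The permitted-character, contextual, and Bidi rules are not applied, so a successful conversion does not prove that a string is IDNA-valid.

```python
import unicodedata

ACE_PREFIX = "xn--"


def to_a_label(u_label):
    """Punycode-encode a candidate U-label and add the ACE prefix."""
    if u_label.isascii():
        raise ValueError("a U-label contains at least one non-ASCII character")
    if unicodedata.normalize("NFC", u_label) != u_label:
        raise ValueError("a U-label must be in Normalization Form C")
    a_label = ACE_PREFIX + u_label.encode("punycode").decode("ascii")
    if len(a_label) > 63:
        raise ValueError("the resulting label exceeds 63 octets")
    return a_label


def to_u_label(a_label):
    """Remove the ACE prefix and Punycode-decode the remainder."""
    lowered = a_label.lower()  # the prefix is matched case-independently
    if not lowered.startswith(ACE_PREFIX):
        raise ValueError("missing ACE prefix")
    return lowered[len(ACE_PREFIX):].encode("ascii").decode("punycode")


if __name__ == "__main__":
    label = "b\u00fccher"  # "bücher" with a precomposed u-umlaut
    encoded = to_a_label(label)
    print(encoded)
    # Symmetry: converting back must give the original string.
    print(to_u_label(encoded) == label)
```

Converting in both directions and comparing the results is one way to exercise the symmetry constraint mentioned in the definition.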


Two additional terms entered the IETF's vocabulary as part of the earlier IDN effort [RFC3490] (IDNA2003):

Stringprep

Stringprep [RFC3454] provides a model and character tables for preparing and handling internationalized strings. It was used in the original IDN specification (IDNA2003) via a profile called "Nameprep" [RFC3491]. It is no longer in use in IDNA, but continues to be used in profiles by a number of other protocols.

Punycode

This is the name of the algorithm [RFC3492] used to convert otherwise-valid IDN labels from native-character strings expressed in Unicode to an ASCII-compatible encoding (ACE).
Strictly speaking, the term applies to the algorithm only. In practice, it is widely, if erroneously, used to refer to strings that the algorithm encodes.


Other Common Terms In Internationalization

This is a hodge-podge of other terms that have appeared in internationalization discussions in the IETF. It is likely that additional terms will be added as this document matures.

locale

Locale is the user-specific location and cultural information managed by a computer. <NONE> Because languages and orthographic conventions differ from country to country (and even region to region within a country), the locale of the user can often be an important factor. Typically, the locale information for a user includes the language(s) used.
Locale issues go beyond character use, and can include things such as the display format for currency, dates, and times. Some locales (especially the popular "C" and "POSIX" locales) do not include language information.
It should be noted that there are many thorny, unsolved issues with locale. For example, should text be viewed using the locale information of the person who wrote the text or the person viewing it? What if the person viewing it is travelling to different locations? Should only some of the locale information affect creation and editing of text?

Latin characters

"Latin characters" is a not-precise term for characters historically related to ancient Greek script as modified in the Roman Republic and Empire and currently used throughout the world.
The base Latin characters are a subset of the ASCII repertoire and have been augmented by many single and multiple diacritics and quite a few other characters. ISO/IEC 10646 encodes the Latin characters in ranges including U+0020..U+024F and U+1E00..U+1EFF.
Because "Latin characters" is used in different contexts to refer to the letters from the ASCII repertoire, the subset of those characters used late in the Roman Republic period or the different subset used to write Latin in medieval times, the entire ASCII repertoire, all of the code points in the extended Latin script as defined by Unicode, and other collections, the term should be avoided in IETF specifications when possible. Similarly, "Basic Latin" should not be used as a synonym for "ASCII".

romanization

The transliteration of a non-Latin script into Latin characters.
Because of the widespread use of Latin characters, people have tried to represent many languages that are not based on a Latin repertoire in Latin. For example, there are two popular romanizations of Chinese: Wade-Giles and Pinyin, the latter of which is by far more common today. Many romanization systems are inexact and do not give perfect round trip mappings between the native script and the Latin characters.

CJK characters and Han characters

The ideographic characters used in Chinese, Japanese, Korean, and traditional Vietnamese writing systems are often called 'CJK characters' after the initial letters of the language names in English. They are also called "Han characters", after the term in Chinese that is often used for these characters. <NONE> Note that Han characters do not include the phonetic characters used in the Japanese and Korean languages. Users of the term "CJK characters" may or may not assume those additional characters are included.
In ISO/IEC 10646, the Han characters were "unified", meaning that each set of Han characters from Japanese, Chinese, and/or Korean that had the same origin was assigned a single code point. The positive result of this was that many fewer code points were needed to represent Han; the negative result of this was that characters that people who write the three languages think are different have the same code point. There is a great deal of disagreement on the nature, the origin, and the severity of the problems caused by Han unification.

translation

The process of conveying the meaning of some passage of text in one language, so that it can be expressed equivalently in another language. Many language translation systems are inexact and cannot be applied repeatedly to go from one language to another to another.

transliteration

The process of representing the characters of an alphabetical or syllabic system of writing by the characters of a conversion alphabet. Many script transliterations are exact, and many have perfect round-trip mappings. The notable exception to this is romanization, described above. Transliteration involves converting text expressed in one script into another script, generally on a letter-by-letter basis. There are many official and unofficial transliteration standards, most notably those from ISO TC 46 and the U.S. Library of Congress.

transcription

The process of systematically writing the sounds of some passage of spoken language, generally with the use of a technical phonetic alphabet (usually Latin-based) or other systematic transcriptional orthography. Transcription also sometimes refers to the conversion of written text into a transcribed form, based on the sound of the text as if it had been spoken. <NONE> Unlike transliterations, which are generally designed to be round-trip convertible, transcriptions of written material are almost never round-trip convertible to their original form, at least without some supplemental information.

regular expressions

Regular expressions provide a mechanism to select specific strings from a set of character strings. Regular expressions are a language used to search for text within strings, and possibly modify the text found with other text. <NONE> Pattern matching for text involves being able to represent one or more code points in an abstract notation, such as searching for all capital Latin letters or all punctuation. The most common mechanism in IETF protocols for naming such patterns is the use of regular expressions. There is no single regular expression language, but there are numerous very similar dialects that are not quite consistent with each other.
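As a small illustration of the dialect differences, the sketch below selects capital letters and punctuation from a sample string. It uses Python's built-in `re` module, which (as far as we know) has no Unicode property classes such as `\p{Lu}`, so the Unicode-aware selections fall back on `unicodedata` categories. The function names and the sample string are illustrative only.

```python
import re
import unicodedata


def ascii_capitals(text):
    """Capital letters matched by the ASCII-only class [A-Z]."""
    return re.findall(r"[A-Z]", text)


def unicode_capitals(text):
    """Characters whose Unicode general category is Lu (uppercase letter)."""
    return [ch for ch in text if unicodedata.category(ch) == "Lu"]


def punctuation(text):
    """Characters whose Unicode general category starts with P."""
    return [ch for ch in text if unicodedata.category(ch).startswith("P")]


if __name__ == "__main__":
    # "Ärger in Zürich, OK?" written with escapes for the non-ASCII letters.
    sample = "\u00c4rger in Z\u00fcrich, OK?"
    print(ascii_capitals(sample))
    print(unicode_capitals(sample))
    print(punctuation(sample))
```

The ASCII class misses the capital A with diaeresis, which is the kind of gap a Unicode-aware pattern language is meant to close.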
The Unicode Consortium has a good discussion about how to adapt regular expression engines to use Unicode. [UTR18]

private use

ISO/IEC 10646 code points from U+E000 to U+F8FF, U+F0000 to U+FFFFD, and U+100000 to U+10FFFD are available for private use. This refers to code points of the standard whose interpretation is not specified by the standard and whose use may be determined by private agreement among cooperating users. <UNICODE>
The use of these "private use" characters is defined by the parties who transmit and receive them, and is thus not appropriate for standardization. (The IETF has a long history of private use names for things such as "x-" names in MIME types, charsets, and languages. Most of the experience with these has been quite negative, with many implementors assuming that private use names are in fact public and long-lived.)
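The three private use ranges quoted above can be checked with a few lines of code. This is a sketch only: the function name is hypothetical, and the cross-check against the Unicode general category "Co" relies on the character database shipped with the Python interpreter.

```python
import unicodedata

# Private use ranges as listed in the text above (inclusive bounds).
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(code_point):
    """Return True if the code point falls in one of the private use ranges."""
    return any(low <= code_point <= high for low, high in PRIVATE_USE_RANGES)


if __name__ == "__main__":
    total = sum(high - low + 1 for low, high in PRIVATE_USE_RANGES)
    print("private use code points:", total)
    for cp in (0xE000, 0xF900, 0x10FFFD):
        # General category "Co" is the Unicode label for private use.
        print(hex(cp), is_private_use(cp), unicodedata.category(chr(cp)))
```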


IANA General Glossary

A record

The representation of an IPv4 address in the DNS system.

AAAA record

The representation of an IPv6 address in the DNS system.

ACE

see A-label.

A-label

The ASCII-compatible encoded (ACE) representation of an internationalised domain name, i.e. how it is transmitted internally within the DNS protocol. A-labels always commence with the prefix “xn--”. Contrast with U-label.

APIPA

A subcategory of private IP address. See Private IP Addresses.

AREG

A subset of IRIS for performing registration lookups on IP addresses.

.ARPA

Originally a reference to the US Government agency that managed some of the Internet’s initial development, now a top-level domain used solely for machine-readable use by computers for certain protocols — such as for reverse IP address lookups, and ENUM. The domain is not designed for general registrations. IANA manages .ARPA in conjunction with the Internet Architecture Board.
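As an illustration of the reverse IP address lookups mentioned above, the sketch below builds the .ARPA name that would be queried for a given address, using the `reverse_pointer` attribute of Python's standard `ipaddress` module. The wrapper function name is hypothetical, and the addresses come from documentation ranges.

```python
import ipaddress


def reverse_lookup_name(address):
    """Return the .ARPA domain name used for a reverse lookup of `address`."""
    return ipaddress.ip_address(address).reverse_pointer


if __name__ == "__main__":
    # IPv4: octets are reversed under in-addr.arpa.
    print(reverse_lookup_name("192.0.2.1"))
    # IPv6: hexadecimal digits are reversed under ip6.arpa.
    print(reverse_lookup_name("2001:db8::1"))
```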

ASCII (American Standard Code for Information Interchange)

The standard for transmitting English (or “Latin”) letters over the Internet. DNS was originally limited to only Latin characters because it uses ASCII as its encoding format, although this has been expanded using Internationalised Domain Names in Applications.

ASCII-compatible encoding

see A-label.

ASN, AS number

see Autonomous System Number.

authoritative name server

a domain name server configured to host the official record of the contents of a DNS zone. Each domain name must have a set of these so computers on the Internet can find out the contents of that domain. The set of authoritative name servers for any given domain must be configured as NS records in the parent domain.

authority

see authoritative name server.

Automatic Private IP Addresses (APIPA)

A subcategory of private IP address that is automatically assigned, as per RFC 3927. See also Private IP addresses.

autonomous system number (AS number, ASN)

A number used by Internet routing protocols to uniquely identify the routing policy of a particular network operator. They can be considered to be similar to a ‘postcode’ used for physical mail. They are allocated to network operators via regional Internet registries.

bundle

see variant bundle.

caching name server

a domain name server that remembers the results of previous lookups in a cache to speed future lookups. Usually in combination with recursive name server functionality.

caching resolver

the combination of a recursive name server and a caching name server.

ccNSO

see Country-code Name Supporting Organisation.

ccTLD

see country-code top-level domain.

chain of trust

A property of an Internet resource where the delegation of responsibility from one party to another can be verified because there is a chain of custody that can be cryptographically verified using electronic certificates. To verify this chain of trust, the chain must be valid and unbroken all the way from a known trust anchor to the resource in question.

clandestine redelegation

The act of performing a redelegation by changing the practical details (i.e. the contact details and/or name server records) of a top-level domain subversively, rather than applying for a redelegation using proper procedure.

Country-code top-level domain (ccTLD)

A class of top-level domains only assignable to represent countries listed in the ISO 3166-1 standard. At present these are two-letter codes like “.UK”, “.DE” etc., however in the future it is expected there will be non-Latin equivalents also available. Much of the policy-making for individual country-code top-level domains is vested with a local sponsoring organisation, as opposed to other top-level domains where ICANN sets the policy. It is a requirement that ccTLDs are operated within the country they are designated so appropriate local laws, governments etc. have a say in how the domain is run.

Country-code Name Supporting Organisation (ccNSO)

A component of ICANN’s policy development forums (a “constituency”) that is responsible for discussing and developing policy relating to how ccTLDs are delegated.

CRISP

see Cross-Registry Information Service Protocol.

Cross-Registry Information Service Protocol (CRISP)

The name of the working group at the IETF that developed the Internet Registry Information Service (IRIS), a next-generation WHOIS protocol replacement.

DCHK

A subset of IRIS for performing checks on whether a domain name is available to register. It is more lightweight, and has fewer privacy implications, than DREG as it does not transmit registration data other than simple availability.

delegation

Any transfer of responsibility to another entity. In the domain name system, one name server can provide pointers to more useful name servers for a given request by returning NS records. On an administrative level, sub-domains are delegated to other entities. IANA also delegates IP address blocks to regional Internet registries.

DNS

See Domain Name System.

DNSSEC

A technology that can be added to the Domain Name System to verify the authenticity of its data. It works by adding verifiable chains of trust to the Domain Name System.

DNS zone

a section of the Domain Name System name space. By default, the Root Zone contains all domain names, however in practice sections of this are delegated into smaller zones in a hierarchical fashion. For example, the “.COM” zone would refer to the portion of the DNS delegated that ends in “.COM”.

domain name

A unique identifier with a set of properties attached to it so that computers can perform conversions. A typical domain name is “icann.org”. Most commonly the property attached is an IP address, like “208.77.188.103”, so that computers can convert the domain name into an IP address. However the DNS is used for many other purposes. The domain name may also be a delegation, which transfers responsibility of all sub-domains within that domain to another entity.

domain name label

a constituent part of a domain name. The labels of domain names are connected by dots. For example, “www.iana.org” contains three labels — “www”, “iana” and “org”. For internationalised domain names, the labels may be referred to as A-labels and U-labels.

domain name registrar

An entity offering domain name registration services, as an agent between registrants and registries. Usually multiple registrars exist who compete with each other, and are accredited. For most generic top-level domains, domain name registrars are accredited by ICANN.

domain name registry

A registry tasked with managing the contents of a DNS zone, by giving registrations of sub-domains to registrants.

domain name server

A general term for a system on the Internet that answers requests to convert domain names into something else. These can be subdivided into authoritative name servers, which store the database for a particular DNS zone; as well as recursive name servers and caching name servers.

Domain Name System (DNS)

The global hierarchical system of domain names. A global distributed database contains the information to perform the domain name conversions, and the most central part of that database, known as the root zone, is coordinated by IANA.

Domain Name System Root

see Root Zone.

dot [string]

common way of referring to a specific top-level domain. For example “dot info” refers to the “INFO” top-level domain. Written in text as “.INFO”.

DREG

A subset of IRIS for performing registration lookups on domain names.

eIANA

see RZM Automation.

E.164

see ENUM.

ENUM

A system of mapping telephone numbers (formally known as E.164 numbers after the telephone numbering standard) to Internet resources.
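The mapping can be sketched as follows. The glossary does not spell out the procedure; the sketch assumes the convention from the ENUM specification (RFC 6116), in which the digits of the E.164 number are reversed, separated by dots, and placed under "e164.arpa". The function name and the sample number are illustrative only.

```python
def enum_domain(e164_number, suffix="e164.arpa"):
    """Build the domain name under which ENUM data for a number is looked up.

    Assumption: digits are reversed and dot-separated under `suffix`,
    following the RFC 6116 convention.
    """
    # Keep only ASCII digits, dropping "+", spaces, and other separators.
    digits = [ch for ch in e164_number if ch in "0123456789"]
    if not digits:
        raise ValueError("no digits found in the number")
    return ".".join(reversed(digits)) + "." + suffix


if __name__ == "__main__":
    print(enum_domain("+44 20 7946 0123"))
```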

EPP

see Extensible Provisioning Protocol.

Extensible Markup Language

see XML.

Extensible Provisioning Protocol (EPP)

A protocol used for electronic communication between a registrar and a registry for provisioning domain names.

first come, first served (FCFS)

The principle of allocation of most Internet resources. It means that, assuming you meet any relevant qualifying criteria (such as meeting policy requirements, including possibly demonstrating need, and paying any relevant fees), you are allowed to register a given resource if you are the first one to lay claim to it. Most IANA registries are administered on a “first come, first served” basis.

fully-qualified domain name (FQDN)

A complete domain name including all its components, i.e. “www.icann.org” as opposed to “www”.

GAC Principles

A document, formally known as the Principles for the Delegation and Administration of ccTLDs. This document was developed by the ICANN Governmental Advisory Committee and documents a set of principles agreed by governments on how ccTLDs should be delegated and run. It is one of a number of documents considered when ICANN evaluates a ccTLD delegation request.

generic top-level domains (gTLDs)

A class of top-level domains that are used for general purposes, where ICANN has a strong role in coordination (as opposed to country-code top-level domains, which are managed locally). For policy reasons, these are usually subdivided into sponsored top-level domains and unsponsored top-level domains.

glue record

An explicit notation of the IP address of a name server, placed in a zone outside of the zone that would ordinarily contain that information. This is required because in some circumstances it would be impossible to find the name server otherwise, such as when the name server is in-bailiwick. All name servers are in-bailiwick of the Root Zone, therefore glue records are required for all name servers listed there. Also referred to as just “glue”.

hints file

A file stored in DNS software (i.e. recursive name servers) that tells it where the DNS root servers are located. Because the DNS is used to self-discover where its servers are located, this file is used to boot-strap the process when the DNS software knows nothing.

hostname

The name of a computer. Typically the left-most part of a fully-qualified domain name. The rules for what is a valid hostname are more strict than for domain names, and this can impact registration policy in some circumstances. The application of hostname rules is sometimes called “STD3” rules. Defined in technical standard RFC 1123.

IAB

See Internet Architecture Board.

IANA

See Internet Assigned Numbers Authority.

IANA Considerations

A component of RFCs that refers to any work required by IANA to maintain registries for a specific protocol.

IANA Contract

The contract between ICANN and the US Government that governs how the IANA functions are performed.

IANA Staff

see Internet Assigned Numbers Authority.

ICANN

See Internet Corporation for Assigned Names and Numbers.

ICP-1

A document written by IANA staff in 1999 describing how they manage top-level domains. Compare RFC 1591.

ICP-2

A document describing how new regional Internet registries may be created.

ICP-3

A document describing the requirement for a unique, authoritative DNS root zone. See also RFC 2826.

IDN

See Internationalised Domain Name.

IDNA

See Internationalised Domain Names in Applications.

IDN Table

A list of permissible Unicode code points allowed for registration in domain names by a registry. Usually, these are applied on a language or script basis.

IDN Practices Repository

A repository on IANA’s website where top-level domain registries contribute the IDN tables they use. This allows other registries to re-use the tables if they wish.

IESG

See Internet Engineering Steering Group.

IETF

See Internet Engineering Task Force.

in-bailiwick

when a domain name is a sub-domain of another, used for identifying whether a glue record is required. For example, “iana.org” is in the bailiwick of “org”. All domains are considered in-bailiwick of the DNS Root Zone.
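The in-bailiwick test amounts to a label-wise suffix comparison, sketched below. The function names are hypothetical, and the sketch makes two assumptions beyond the definition above: names are compared case-insensitively, and a name equal to the zone name is treated as in-bailiwick.

```python
def split_labels(name):
    """Split a domain name into lower-cased labels; the root gives []."""
    if name.endswith("."):
        name = name[:-1]  # drop the trailing dot of a fully-qualified name
    return [label.lower() for label in name.split(".")] if name else []


def is_in_bailiwick(name, zone):
    """Return True if `name` lies at or below `zone` in the DNS hierarchy."""
    name_labels = split_labels(name)
    zone_labels = split_labels(zone)
    if len(zone_labels) > len(name_labels):
        return False
    # Compare whole labels from the right, so "badexample.org" does not
    # count as being inside "example.org".
    return name_labels[len(name_labels) - len(zone_labels):] == zone_labels


if __name__ == "__main__":
    print(is_in_bailiwick("iana.org", "org"))
    print(is_in_bailiwick("iana.org", "com"))
    print(is_in_bailiwick("iana.org", "."))
```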

infrastructure domain, infrastructure top-level domain

A term sometimes used for “.ARPA” and its sub-domains, as it does not fit into the other categorisations of top-level domains.

internationalised domain name (IDN)

A domain name that uses characters outside the 37 characters allowed by the “LDH rule”, using a system known as IDNA. This allows for domain names in non-Latin scripts, such as Arabic, Japanese or Cyrillic.

Internationalised Domain Names in Applications (IDNA)

The Internet standard defining the encoding of internationalised domain names. The “in Applications” is in reference to the way the standard works, as the conversion happens in application software rather than in the network, and therefore does not affect the wire format of the DNS. The domains are internally coded in a special representation using the prefix “xn--”, known as an A-label. Described in Internet Standard RFC 3490.

Internet Architecture Board (IAB)

The oversight body of the IETF, responsible for overall strategic direction of Internet standardisation efforts. The IAB works with ICANN on how the IANA protocol parameter registries should be managed. The IAB is an activity of the Internet Society, a non-profit organisation.

Internet Assigned Numbers Authority (IANA)

A department of ICANN tasked with providing the functions described in a contract between ICANN and the US Government. The functions relate to ensuring globally-unique protocol parameter assignment, including management of the root of the Domain Name System and IP Address Space. ICANN staff within this department is often referred to as “IANA Staff”.

Internet Coordination Policy (ICP)

A series of documents created by ICANN between 1999 and 2000 describing management procedures. Three such documents were published before the numbering system stopped being used. Subsequent ICANN publications have not been given ICP numbers.

Internet Engineering Steering Group (IESG)

The committee of area experts of the IETF’s areas of work, that acts as its board of management.

Internet Engineering Task Force (IETF)

The key Internet standardisation forum. The standards developed within the IETF are published as RFCs. IANA’s protocol parameter registries are closely aligned with the work of the IETF.

.INT

A top-level domain devoted solely to international treaty organisations that have independent legal personality. Such organisations are not governed by the laws of any specific country, rather by mutual agreement between multiple countries. IANA maintains the domain registry for this domain.

Internet Protocol (IP)

The fundamental protocol that is used to transmit information over the Internet. Data transmitted over the Internet is transmitted using the Internet Protocol, usually in conjunction with a more specialised protocol. Computers are uniquely identified on the Internet using an IP Address.

Internet Protocol address

see IP Address.

Internet Registry Information Service (IRIS)

A sophisticated protocol for looking up registration data. It is designed to supplant the WHOIS protocol, by offering many technological improvements such as internationalisation, access control, automatic server discovery and structured formatting; however, to date it has not been adopted in any significant way. Documented in technical standard RFC 3981 and others.

Internet Telephony Administrative Domain (ITAD)

A unique numbering system used by Telephone Routing over Internet Protocol (TRIP) to label phone services within an organisation. A company may apply for an ITAD number to use in numbering systems without conflicting with other companies and users. See RFC 3219.

Interim Trust Anchor Repository (ITAR)

A proposed IANA service whereby the trust anchors for top-level domains can be listed separately from the DNS root zone. This is a temporary measure due to the inability to use DNSSEC to sign the root zone.

Internet standard

see protocol.

IP

see Internet Protocol.

IP address

A unique identifier for a device on the Internet. The identifier is used to accurately route Internet traffic to that device. IP addresses must be unique on the global Internet, although some are re-used within private networks using a system of private IP addresses and network address translation.

IP address block

A range of IP addresses that is assigned in a contiguous block. Usually the size of the range is described as the number of binary “bits” masked by the allocation. For example a “slash 24” or “/24” refers to a block of 256 IP addresses in IPv4.
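The relationship between the prefix length and the block size can be checked with Python's standard `ipaddress` module. This is a small sketch; the function name is hypothetical and the sample blocks are illustrative. For IPv4, a "/n" block holds 2 to the power (32 - n) addresses.

```python
import ipaddress


def block_size(cidr):
    """Return the number of addresses in a block written in slash notation."""
    return ipaddress.ip_network(cidr).num_addresses


if __name__ == "__main__":
    for block in ("192.0.2.0/24", "10.0.0.0/8"):
        prefix_length = ipaddress.ip_network(block).prefixlen
        # Both computations give the same count for an IPv4 block.
        print(block, block_size(block), 2 ** (32 - prefix_length))
```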

IP address Space

The entire range of conceivable IP addresses. Managed by IANA, and generally delegated in blocks to Regional Internet Registries.

IPv4

Internet Protocol version 4. Refers to the version of Internet protocol that supports 32-bit IP addresses. This allows for approximately 4 billion unique IP addresses, which is not enough to cope with projected Internet demand in the next 5-10 years. Therefore, a new protocol called IPv6 has been developed that increases the number of possible IP addresses substantially.

IPv6

Internet Protocol version 6. Refers to the version of Internet protocol that supports 128-bit IP addresses. This protocol is not yet widely deployed, but allows for orders-of-magnitude more IP addresses than the more common IPv4 protocol.
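To put the 32-bit and 128-bit address sizes side by side, the sketch below computes the size of each address space with the standard `ipaddress` module. The variable names are illustrative only.

```python
import ipaddress

# The whole IPv4 and IPv6 address spaces, written as prefix-length-0 blocks.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
ipv6_total = ipaddress.ip_network("::/0").num_addresses

if __name__ == "__main__":
    print("IPv4 addresses:", ipv4_total)   # 2**32, roughly 4 billion
    print("IPv6 addresses:", ipv6_total)   # 2**128
    print("ratio is 2**96:", ipv6_total // ipv4_total == 2 ** 96)
```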

IRIS

See Internet Registry Information Service.

ISO

International Organisation for Standardisation. An international organisation comprised mostly of national standardisation agencies.

ISO 3166

A suite of international standards for labelling countries, territories, sub-national entities and former countries. Most notably, Part 1 of ISO 3166 (aka ISO 3166-1) is used by IANA to determine country-codes for top-level domains.

ISO 3166-1

A part of the ISO 3166 suite of standards describing two- and three-letter codes that represent countries. The two-letter codes in ISO 3166-1 are used to determine the domains used for country-code top-level domains.

ISO 3166 Maintenance Agency (ISO 3166/MA)

The agency of ISO tasked with maintaining the ISO 3166 standard. It is responsible for any updates, for example, when a country is created or ceases to exist. ICANN is one of the ten members of the ISO 3166/MA.

ITAD

See Internet Telephony Administrative Domain.

ITAR

See Interim Trust Anchor Repository.

Jon Postel

see Postel, Jon.

label

see domain name label.

language table

see IDN table.

Letters-Digits-Hyphen (LDH)

The set of permissible characters in a domain label, when applying hostname rules.

local Internet community

The community of Internet users within a country who benefit from the country’s top-level domain. Country-code top-level domains are delegated to sponsoring organisations to operate domains in the best interests of this community, particularly by implementing policies the community has developed.

MIME type

A formalised text string that identifies the type of a file that is included in the headers of an email or web transmission. IANA maintains the registry of MIME types.

name server

See domain name server.

NAT

see Network Address Translation.

network address translation (NAT)

A system of using private IP addresses within an internal network (such as within a home, an office, or even within an ISP), and then having those numbers converted into a real IP address when Internet traffic leaves that network using a specialised router. This is commonly used within homes, for example, so that users do not have to apply for an extra IP address each time they connect a device to the network. It is very similar to using “extension numbers” within an office telephone system.

NS record

a type of record in a DNS zone that signifies part of that zone is delegated to a different set of authoritative name servers. Operators of domain names must have their authoritative name servers correctly listed in the parent domain.

number resources

Used to describe the hierarchically assigned number resources used for Internet routing, namely IP addresses and autonomous system numbers. These are usually distributed through regional Internet registries.

object identifier

see Private Enterprise Number.

OID

object identifier. See Private Enterprise Number.

parent domain

The domain above a domain in the DNS hierarchy. For all top-level domains, the Root Zone is the parent domain. The Root Zone has no parent domain as it is at the top of the hierarchy. Opposite of sub-domain.
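
A minimal sketch of this relationship, assuming domain names are written as dot-separated labels and the Root Zone is written as "." (the helper name `parent_domain` is hypothetical):

```python
from typing import Optional

ROOT = "."


def parent_domain(name: str) -> Optional[str]:
    """Return the parent of `name`: "." for a top-level domain, None for the root."""
    name = name.rstrip(".")
    if name == "":
        return None  # the Root Zone has no parent
    # Drop the leftmost label; whatever remains is the parent domain.
    _, separator, rest = name.partition(".")
    return rest if separator else ROOT
```

Here `parent_domain("www.icann.org")` gives `"icann.org"`, `parent_domain("org")` gives `"."`, and `parent_domain(".")` gives `None`.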

PDP

See Policy Development Process.

PEN

see Private Enterprise Number.

Policy Development Process (PDP)

The formal policy creation process employed within ICANN by a number of its constituencies.

port number

A number used for identifying the type of Internet traffic being transmitted between two computers over the Internet. For example, the web uses port 80, DNS uses port 53, and email uses port 25. IANA assigns these numbers, and it is one of the more high-profile protocol registries IANA maintains.

Postel, Jon

The progenitor of IANA. A computer scientist responsible for IANA until 1998, initially individually and later with other IANA staff within the University of Southern California. He was also responsible for the RFC Editor.

Principles for the Delegation and Administration of ccTLDs

See GAC Principles.

private enterprise numbers (PENs)

A unique numbering system used by several different Internet protocols (such as SNMP and LDAP) that use Abstract Syntax Notation One (ASN.1). It can be used to label services within an organisation. A company may apply for a private enterprise number to use these numbering systems without conflicting with other companies and users. Private enterprise numbers are a subset of the numbers known as object identifiers, or OIDs.
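
To show how a private enterprise number sits inside the wider OID tree, here is a small sketch. It relies on the fact that private enterprise numbers are registered under the OID arc 1.3.6.1.4.1; the helper name `pen_to_oid` is hypothetical, and 32473 is the enterprise number reserved for documentation use in RFC 5612.

```python
# Private enterprise numbers hang under this OID arc:
# iso(1) org(3) dod(6) internet(1) private(4) enterprise(1)
ENTERPRISE_ARC = (1, 3, 6, 1, 4, 1)


def pen_to_oid(pen: int, *sub_ids: int) -> str:
    """Build a dotted OID for an enterprise number plus optional private sub-identifiers."""
    if pen < 0 or any(s < 0 for s in sub_ids):
        raise ValueError("OID components must be non-negative integers")
    return ".".join(str(n) for n in (*ENTERPRISE_ARC, pen, *sub_ids))
```

An organisation holding the number 32473 could then label its own services as, say, `pen_to_oid(32473, 1, 2)`, which yields `1.3.6.1.4.1.32473.1.2`, without colliding with any other holder.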

private IP addresses

A set of IP addresses only used within private networks, and therefore not reachable from the global Internet. Commonly used within home or office networks in conjunction with network address translation, which converts private IP addresses into a valid IP address when data leaves the local network. IANA maintains some special ranges of IP addresses solely for use as private IP addresses, as described in technical standards RFC 1918 and RFC 3927.
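
The ranges in question can be checked with Python's standard `ipaddress` module. This is a sketch only: the table `PRIVATE_RANGES` and the function `private_range_for` are hypothetical names, and the IPv4 blocks listed are those set aside by RFC 1918 (private use) and RFC 3927 (link-local).

```python
import ipaddress
from typing import Optional

# IPv4 blocks reserved by the two RFCs named in the definition above.
PRIVATE_RANGES = {
    "RFC 1918": [
        ipaddress.ip_network("10.0.0.0/8"),
        ipaddress.ip_network("172.16.0.0/12"),
        ipaddress.ip_network("192.168.0.0/16"),
    ],
    "RFC 3927": [ipaddress.ip_network("169.254.0.0/16")],
}


def private_range_for(address: str) -> Optional[str]:
    """Return the RFC that reserves `address`, or None if it is not in these ranges."""
    addr = ipaddress.ip_address(address)
    for rfc, networks in PRIVATE_RANGES.items():
        if any(addr in network for network in networks):
            return rfc
    return None
```

For instance, `private_range_for("192.168.1.10")` reports `"RFC 1918"`, while `private_range_for("8.8.8.8")` returns `None`.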

protocol

Any form of inter-computer communication that has been standardised to ensure computers can communicate with one another. Internet protocols are usually standardised in RFCs.

protocol assignments

The assignment of protocol parameters by IANA.

protocol parameters

Unique systems of numbering or encoding used by protocols that must be consistently applied for the protocols to be interoperable. The global unique assignment of protocol parameters is the task of IANA.

protocol registry

An individual protocol parameter registry managed by IANA, usually tied to a specific Internet standard.

PTR record

The representation of an IP address to domain name mapping in the DNS.

recursive name server

A domain name server configured to perform DNS lookups on behalf of other computers. This is often configured at corporate network boundaries and ISPs for their network customers to use. As an individual domain name lookup can often involve multiple queries to different servers, these name servers do these iterative lookups and only provide back to the computer the final answer. They are often combined with the functions of a caching name server to improve network performance, and therefore are also known as caching resolvers.
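
The iterative lookups and the caching behaviour can be illustrated with a toy model. Everything here is an assumption made for illustration: the `SERVERS` table is an in-memory stand-in for authoritative name servers, the address 192.0.2.10 comes from a documentation range, and no real DNS traffic is involved.

```python
# Hypothetical stand-ins for authoritative servers. Each one either
# delegates a suffix to another server or answers with an address.
SERVERS = {
    "root": {"delegations": {"org": "org-server"}, "answers": {}},
    "org-server": {"delegations": {"icann.org": "icann-server"}, "answers": {}},
    "icann-server": {"delegations": {}, "answers": {"www.icann.org": "192.0.2.10"}},
}


def resolve(name: str, cache: dict) -> tuple:
    """Return (address, servers_queried), following referrals down from the root."""
    if name in cache:
        return cache[name], []  # answered from cache, no queries needed
    server, queried = "root", []
    while True:
        queried.append(server)
        zone = SERVERS[server]
        if name in zone["answers"]:
            cache[name] = zone["answers"][name]
            return cache[name], queried
        # Look for a delegation whose suffix covers the requested name.
        referral = next(
            (target for suffix, target in zone["delegations"].items()
             if name == suffix or name.endswith("." + suffix)),
            None,
        )
        if referral is None:
            raise LookupError(f"{name} not found")
        server = referral
```

A first call to `resolve("www.icann.org", cache)` queries three servers in turn; a second call with the same `cache` queries none, and the client only ever sees the final answer.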

redelegation

The transfer of a delegation from one entity to another. Most commonly used to refer to the redelegation process used for top-level domains.

Redelegation process

A special type of root zone change where there is a significant change involving the transfer of operations of a top-level domain to a new entity. Such a change must be evaluated by ICANN staff to ensure that the new entity meets a number of criteria, and must be voted on and agreed by the ICANN Board of Directors.

Regional Internet Registry (RIR)

A registry responsible for allocation of IP address resources within a particular region. There are five RIRs, and within each region network operators apply to their RIR to get IP address blocks allocated.

registrant

The entity that has acquired the right to use an Internet resource. Usually this is via some form of revocable grant given by a registrar to list their registration in a registry.

registrar

An entity that can act on requests from a registrant in making changes in a registry. Usually the registrar is the same entity that operates a registry, although for domain names this role is often split to allow for competition between multiple registrars who offer different levels of support. See also domain name registrar.

registry

1. The authoritative record of registrations for a particular set of data. Most often used to refer to domain name registry, but all protocol parameters that IANA maintains are also registries. 2. registry operator.

registry operator

The entity that runs a registry.

Request for Comments (RFC)

see RFCs.

reverse IP

A method of translating an IP address into a domain name, so-called as it is the opposite of a typical lookup that converts a domain name to an IP address. Utilises PTR records in the IN-ADDR.ARPA zone for IPv4, and IP6.ARPA for IPv6.
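
The domain name under which a PTR record is looked up can be derived from the address itself. A brief sketch using the `reverse_pointer` attribute of Python's standard `ipaddress` module (the wrapper name `reverse_name` is hypothetical, and the addresses are from documentation ranges):

```python
import ipaddress


def reverse_name(address: str) -> str:
    """Return the domain name queried for the PTR record of `address`."""
    # IPv4 octets (or IPv6 hex digits) are reversed and placed under
    # in-addr.arpa (or ip6.arpa) respectively.
    return ipaddress.ip_address(address).reverse_pointer
```

For example, `reverse_name("192.0.2.1")` gives `1.2.0.192.in-addr.arpa`, and an IPv6 address such as `2001:db8::1` produces a name of 32 reversed hexadecimal digits ending in `ip6.arpa`.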

RFCs

A series of Internet engineering documents describing Internet standards, as well as discussion papers, informational memorandums and best practices. Internet standards that are published in an RFC originate from the IETF. The RFC series is published by the RFC Editor.

RFC 812

See WHOIS.

RFC 954

See WHOIS.

RFC 1123

see hostname.

RFC 1591

A document written by IANA staff in 1994 describing how they manage top-level domains. The document is well-referenced as it describes some of the key principles that govern the appointment of country-code top-level domains. Compare ICP-1.

RFC 1918

See Private IP Addresses.

RFC 3912

See WHOIS.

RFC 3927

See Private IP Addresses.

RIR

see Regional Internet Registry.

root

the most central (or all-encompassing) authority of any naming or numbering system. Usually used to refer to the domain name system root (see Root Zone). However, IANA is also the root for IP addresses, and other systems.

Root Servers

The authoritative name servers for the Root Zone. These are considered unlike regular name servers in part because they are generally the most critical and heavily-used name servers. They are also special as they are not easily replaced, as changes to them need to be stored in every name server worldwide in a hints file.

Root Zone

The top of the domain name system hierarchy. The root zone contains all of the delegations for top-level domains, as well as the list of root servers, and is managed by IANA.

Root Zone Management

The management of the DNS Root Zone by IANA.

RZM

see Root Zone Management.

RZM Automation

A project to automate many aspects of the Root Zone Management function within IANA. Based on a software tool originally called “eIANA”.

script table

see IDN table.

secure entry point (SEP)

synonym for trust anchor.

slash [number]

(e.g. /24) See IP address block.

sponsored top-level domain

A sub-classification of generic top-level domain, where there is a formal community of interest that the domain is dedicated to serve.

sponsoring organisation

The entity acting as the trustee of a top-level domain on behalf of its designated community. Sponsoring organisations are not assigned ownership of a domain, rather, are custodians appointed by their local Internet community to act as proper stewards in that community’s best interests. The Sponsoring Organisation can generally be re-assigned if the local Internet community wishes using the redelegation process.

STD 3

see hostname.

sub-domain

A domain that resides within another domain. For example, “www.icann.org” is a sub-domain of “icann.org”, and “icann.org” is a sub-domain of “org”. Sub-domains are entrusted to other entities through a process of delegation.

TLD

see top-level domain.

top-level domain (TLD)

The highest level of subdivisions within the domain name system. These domains, such as “.COM” and “.UK”, are delegated from the DNS Root Zone. They are generally divided into two distinct categories, generic top-level domains and country-code top-level domains.

TRIP number

see Internet Telephony Administrative Domain (ITAD).

trust anchor

A known good cryptographic certificate that can be used to validate a chain of trust.

trust anchor repository (TAR)

Any repository of public keys that can be used as trust anchors for validating chains of trust. See Interim Trust Anchor Repository (ITAR) for one such repository for top-level domain operators using DNSSEC.

trustee

An entity entrusted with the operations of an Internet resource for the benefit of the wider community. In IANA circles, usually in reference to the sponsoring organisation of a top-level domain.

U-label

The Unicode representation of an internationalised domain name, i.e. how it is shown to the end-user. Contrast with A-label.
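
As a rough illustration of how a U-label relates to its A-label, the sketch below uses Python's built-in `punycode` codec and prepends the `xn--` prefix. This is a simplification: the helper names are hypothetical, and the code skips the validation and normalisation steps that IDNA2008 requires before a label may be converted.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Convert one U-label to its ASCII form (no IDNA validation is performed)."""
    if u_label.isascii():
        return u_label  # plain LDH labels are left as they are
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Convert one A-label back to the Unicode form shown to the end-user."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
```

With these helpers, the U-label “bücher” corresponds to the A-label `xn--bcher-kva`, and the conversion is reversible.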

Unicode

A standard describing a repertoire of characters used to represent most of the world's languages in written form. The collection of scripts used to do this is maintained by the Unicode Consortium and is constantly growing. Unicode is the basis for internationalised domain names.

unsponsored top-level domain

a sub-classification of generic top-level domain, where there is no formal community of interest.

UTF-8

A standard used for storing and transmitting Unicode characters, in which each character is encoded as a sequence of one to four bytes.
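
A short sketch showing the bytes that result when individual Unicode characters are encoded as UTF-8 (the helper name `utf8_hex` is hypothetical):

```python
def utf8_hex(char: str) -> str:
    """Return the UTF-8 bytes of `char` as space-separated hexadecimal."""
    return " ".join(f"{byte:02x}" for byte in char.encode("utf-8"))
```

ASCII characters occupy a single byte (`utf8_hex("A")` is `41`), whereas `utf8_hex("\u00e9")` (é) is `c3 a9` and `utf8_hex("\u20ac")` (€) is `e2 82 ac`.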

variant

In the context of internationalised domain names, an alternative domain name that can be registered, or that means the same thing, because some of its characters can be written in multiple different ways due to the way the language works. Depending on registry policy, variants may be registered together in one block called a variant bundle. For example, “internationalise” and “internationalize” may be considered variants in English.

variant bundle

A collection of multiple domain names that are grouped together because some of the characters are considered variants of the others.

variant table

A type of IDN table that describes the variants for a particular language or script. For example, a variant table may map Simplified Chinese characters to Traditional Chinese characters for the purpose of constructing a variant bundle.
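
A toy sketch of how a variant table might be used to construct a variant bundle. The two-pair table below is purely hypothetical and far smaller than any real registry table; the function name `variant_bundle` is likewise an assumption.

```python
from itertools import product

# Hypothetical variant table: each character maps to its variant(s).
# Simplified <-> Traditional Chinese pairs, for illustration only.
VARIANT_TABLE = {
    "国": {"國"},
    "國": {"国"},
    "华": {"華"},
    "華": {"华"},
}


def variant_bundle(label: str) -> set:
    """Return every label obtainable by swapping characters for their variants."""
    # For each position, the choices are the character itself plus its variants.
    choices = [sorted({ch} | VARIANT_TABLE.get(ch, set())) for ch in label]
    return {"".join(combo) for combo in product(*choices)}
```

Under this table, `variant_bundle("中国")` contains “中国” and “中國”. Note that the bundle grows multiplicatively with the number of characters that have variants, which is one reason registry policy matters.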

WHOIS

A simple plain text-based protocol for looking up registration data within a registry. Typically used for domain name registries and IP address registries to find out who has registered a particular resource. May also be used informally to refer to the database of registrants that a registry publishes over WHOIS, see WHOIS Database. Described in technical standard RFC 3912, formerly in RFCs 812 and 954. (Usage note: not “Whois” or “whois”)

WHOIS database

Used to refer to parts of a registry’s database that are made public using the WHOIS protocol, or via similar mechanisms using other protocols (such as web pages, or IRIS). Most commonly used to refer to a domain name registry’s public database.

WHOIS gateway

An interface, usually a web-based form, that will perform a look-up to a WHOIS server. This allows one to find WHOIS information without needing a specialised computer program that speaks the WHOIS protocol.

WHOIS protocol

see WHOIS.

WHOIS server

A system running on port number 43 that accepts queries using the WHOIS protocol.
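
A minimal client sketch, assuming the exchange described in RFC 3912: the client opens a TCP connection to port 43, sends the query followed by CR LF, and reads until the server closes the connection. The function names are hypothetical, and error handling is omitted.

```python
import socket

WHOIS_PORT = 43


def build_query(name: str) -> bytes:
    """A WHOIS request is the query text terminated by CR LF."""
    return name.encode("ascii") + b"\r\n"


def whois(server: str, name: str, timeout: float = 10.0) -> str:
    """Send one query to `server` and return the full plain-text response."""
    with socket.create_connection((server, WHOIS_PORT), timeout=timeout) as conn:
        conn.sendall(build_query(name))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server signals the end by closing the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With network access, a call such as `whois("whois.iana.org", "org")` would return the text that server publishes for that query.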

wire format

The format of data when it is transmitted over the Internet (i.e. “over the wire”). For example, an A-label is the wire format of an internationalised domain name; and UTF-8 is a possible wire format of Unicode.

xn--

see A-label.

XML

A machine-readable file format for storing structured data. Used to represent web pages (in a variant called XHTML), among other things. Used by IANA for storing protocol parameter registries.



Removed UNICODE Glossary

IDNA2008 strove to make the Internet independent of Unicode versions and thereby to reduce the impact of Unicode on the IDNA architecture and, in turn, on the Internet.

However,

  • some considered Unicode as central to their understanding of the Variants and of their algorithm.
  • that the Internet architecture is to be independent of Unicode variations does not mean that it cannot capitalize on Unicode's accumulated observations and experience.
  • fruitful relations between the multilinguists' network systems area and the linguists' language and scripts area called for the bridge of polynym terms.

This is why the possible integration of some definitions from the Unicode Glossary was perceived as important. We wanted to work publicly on reducing the possible distance between the Unicode glossary (as with any SDO's glossary) and the IUsers' understanding and own glossary. However, it turns out that Unicode wants to thwart people who are subject to its tables and policy from trying to understand them.

Quoting a document in order to conduct research and comment on it falls under "fair use" and is accepted worldwide. However, this is not the case with Unicode, which shut down our site by stating to our hosting company:

"I have a good faith belief that use of the copyrighted materials described above as allegedly infringing is not authorized by the copyright owner, its agent, or the law.
I swear, under penalty of perjury, that the information in the notification is accurate and that I am the copyright owner or am authorized to act on behalf of the owner of an exclusive right that is allegedly infringed."

What is particularly worrying is that:

  • the quoted document is a glossary, in which authors are not supposed to invent on their own but rather to tell the truth. This means that either Unicode claims copyrights on the truth (hence their UniGod nickname), or that their "truth" is not the common truth.
  • as Internet Users, we are subject to the influence (RFC 3935) of several RFCs that rely on Unicode documentation that we are not allowed to jointly explore, discuss, or learn and, therefore, use.
  • we looked for some of the definitions. Google found them in other more documented (and also copyrighted) texts. We are users, not lawyers.


We, therefore, had to decide if we would or would not consider the Unicode contribution as a reliable source for an involved Internet users’ glossary. We had started working on the Unicode Glossary, making this part of the IUWW very different from the Unicode copyrighted page (in particular, starting the translation of it into French). Then, we had to review the "algorithm" definition that is central for us and an important item for the ICANN/VIP variants discussion.

The Unicode definition is (there is no direct link to the sole definition, so we will quote it): "A term used in a broad sense in the Unicode Standard, to mean the logical description of a process used to achieve a specified result. This does not require the actual procedure described in the algorithm to be followed; any implementation is conformant as long as the results are the same."

We noted that this departed from the usual definition of algorithms, which pertains to what a Turing machine can execute. Our own understanding is that an algorithm is a logical suite of qualitative, quantitative, and significative functions that a Turing machine can execute in the quantitative mathematical area, and that other equivalent machines, still to be defined, could execute in the pragmatic and semantic areas. The debate here has shown that the Unicode definition of what we consider a scientific and cultural point is too far from our own understanding. We then identified the fact that our two main problems with Unicode, homography and orthotypography, actually came from this difference.

We, therefore, realized that if the US Unicode consortium does not respect our work, they are unfortunately also not interested in respecting our needs.
