CVS libidn/doc/specifications
From: libidn-commit
Subject: CVS libidn/doc/specifications
Date: Wed, 17 Nov 2004 22:42:59 +0100
Update of /home/cvs/libidn/doc/specifications
In directory dopio:/tmp/cvs-serv3574
Added Files:
draft-klensin-reg-guidelines-05.txt
Log Message:
Add.
--- /home/cvs/libidn/doc/specifications/draft-klensin-reg-guidelines-05.txt 2004/11/17 21:42:59 NONE
+++ /home/cvs/libidn/doc/specifications/draft-klensin-reg-guidelines-05.txt 2004/11/17 21:42:59 1.1
Network Working Group J. Klensin
Internet-Draft November 16, 2004
Expires: May 17, 2005
Suggested Practices for Registration of Internationalized Domain
Names (IDN)
draft-klensin-reg-guidelines-05.txt
Status of this Memo
This document is an Internet-Draft and is subject to all provisions
of section 3 of RFC 3667. By submitting this Internet-Draft, each
author represents that any applicable patent or other IPR claims of
which he or she is aware have been or will be disclosed, and any of
which he or she become aware will be disclosed, in accordance with
RFC 3668.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on May 17, 2005.
Copyright Notice
Copyright (C) The Internet Society (2004).
Abstract
Registration of names in the DNS has traditionally involved
restrictions beyond those very few required by the DNS itself, both
to match requirements imposed by applications (the "hostname" rules)
and to avoid confusion among similar-looking names. With the
introduction of internationalized domain names (IDNs), the standards
make a much larger number of characters available for inclusion in
domain names as they appear to the user. Concerns similar to those
Klensin Expires May 17, 2005 [Page 1]
Internet-Draft IDN Registration November 2004
that caused the original adoption of the hostname rules have led several
groups to explore models for appropriate restrictions by registries
on the IDNs that can be registered and whether some combinations of
names should be restricted even if the characters in them are valid.
This document explores those issues, including how methods developed
for Chinese, Japanese, and Korean can be adapted to other languages
and scripts, and makes suggestions for mechanisms registries might
use to define and implement such rules.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 The Nature and Status of these Recommendations . . . . . . 4
1.3 Terminology . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Languages and Scripts . . . . . . . . . . . . . . . . 4
1.3.2 Characters, Variants, Registrations, and Other
Issues . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Confusion, Fraud, and Cybersquatting . . . . . . . . . 7
1.4 A Review of the JET Guidelines . . . . . . . . . . . . . . 7
1.4.1 JET Model . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Reserved Names and Label Packages . . . . . . . . . . 8
1.5 Languages, Scripts, and Variants . . . . . . . . . . . . . 9
1.5.1 Languages and Scripts . . . . . . . . . . . . . . . . 9
1.5.2 Variant Selection . . . . . . . . . . . . . . . . . . 10
1.6 Variants are not a Universal Remedy . . . . . . . . . . . 12
1.7 Reservations and Exclusions . . . . . . . . . . . . . . . 12
1.7.1 Sequence Exclusions for Valid Characters . . . . . . . 12
1.7.2 Character Pairing Issues . . . . . . . . . . . . . . . 13
1.8 The Registration Bundle . . . . . . . . . . . . . . . . . 13
1.8.1 Definitions and Structure . . . . . . . . . . . . . . 13
1.8.2 Application of the Registration Bundle . . . . . . . . 13
2. Some Implications of This Approach . . . . . . . . . . . . . . 14
3. Required Modifications to JET Model Needed Under Some of
the Models Above . . . . . . . . . . . . . . . . . . . . . . . 15
4. Conclusions and Recommendations About the General Approach . . 16
5. A Model Table Format . . . . . . . . . . . . . . . . . . . . . 17
6. A Model Label Registration Procedure: "CreateBundle" . . . . . 18
6.1 Description of the CreateBundle Mechanism . . . . . . . . 18
6.2 The "no-variants" Case . . . . . . . . . . . . . . . . . . 19
6.3 CreateBundle and Nameprep Mapping . . . . . . . . . . . . 20
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 20
8. Security Considerations . . . . . . . . . . . . . . . . . . . 21
9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 22
10. References . . . . . . . . . . . . . . . . . . . . . . . . . 22
Author's Address . . . . . . . . . . . . . . . . . . . . . . . 24
Intellectual Property and Copyright Statements . . . . . . . . 25
1. Introduction
1.1 Background
Once work on the basic model for encoding non-ASCII strings in the
DNS with IDNA ([RFC3490], [RFC3491], [RFC3492]) was nearing
completion, it became clear that it would be desirable for registries
to impose additional restrictions on the names that could actually be
registered (e.g., see [IESG-IDN] and [ICANN-IDN]) as a means of
reducing potential confusion among characters that were similar in
some way. These restrictions were, in many respects, part of a long
tradition. For example, while the original DNS specifications
[RFC1035] permitted any string of octets to be used in a DNS label,
they also recommended the use of a much more restricted subset, one
that was derived from the much older "hostname" rules [RFC0952] and
defined by the "LDH" convention (for the three permitted types of
characters, letters, digits, and the hyphen). Enforcement of those
restricted rules in registrations was the responsibility of the
registry or domain administrator. They were not embedded in the DNS
protocol itself, although some application protocols, notably those
concerned with electronic mail, did impose and then enforce similar
rules.
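As an illustration only, the LDH convention described above can be
sketched as a check on a single label. The helper name is_ldh_label is
hypothetical. The 63-octet limit comes from [RFC1035]; the rule that a
label may not begin or end with a hyphen is an assumption taken from
the hostname rules and their later relaxation, not from the text above.

```python
import re

# Hypothetical helper, not part of any standard library or of IDNA.
# One label: letters, digits, and hyphens only (the "LDH" convention),
# 1 to 63 characters, and (by assumption) no leading or trailing hyphen.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if `label` satisfies the LDH convention sketched above."""
    return _LDH_LABEL.fullmatch(label) is not None
```

Under this sketch both "example" and "examp1e" pass, which shows that a
purely syntactic rule does nothing about similar-looking names; a label
containing a non-ASCII character is rejected.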
If there are no constraints on registration in a zone, people can
register characters that increase the risk of misunderstandings,
cybersquatting, and other forms of confusion. That a similar
situation existed even before the introduction of IDNA is exemplified
by domain names such as example.com and examp1e.com (note that the
latter domain contains the digit "1" instead of the letter "l").
For non-ASCII names (so-called "internationalized domain names" or
"IDNs"), the problem was more complicated than that which led to the
LDH (hostname) rules. In the earlier situation, all protocols,
hosts, and DNS zones used ASCII exclusively in practice, so the LDH
restriction could reasonably be applied uniformly across the
Internet. With the introduction of a very large character
repertoire, and with different geographical and political locations
and languages having requirements for different collections of
characters, the optimal registration restrictions ceased to be a
global matter and instead differed among areas and, hence, among
DNS zones.
For some human languages, there are characters and/or strings that
have equivalent or near-equivalent usages. If someone is allowed to
register a name with such a character or string, the registry might
want to automatically associate all of the names that have the same
meaning with the registered name. The registry might also decide
whether the names that are associated with, or generated by, one
registration should, as a group or individually, go into the zone or
be blocked from registration by different parties.
To date, the best-developed system for handling registration
restrictions for IDNs is the JET Guidelines for Chinese, Japanese,
and Korean [RFC3743], the so-called "CJK" languages. That system is
limited to those languages and, in particular, to their common script
base. Those languages are also the best-known and most widely-used
ones in the world whose writing system is constructed on
"ideographic" or "pictographic" principles. This document explores
the principles behind the JET guidelines. It then examines some of
the issues that might arise in trying to adapt them to alphabetic
languages, i.e., ones whose characters primarily represent sounds,
rather than meanings.
This document describes five things:
1. The general background and considerations for non-ASCII scripts
in names. Just as the JET Guidelines contain some suggestions
that may not be applicable to alphabetic scripts, some of the
suggestions here, especially the more specific ones, may be
applicable to some scripts and not others.
2. Suggested practices for describing character variants
3. A method for using a zone's character variants to determine which
names should be associated with a registration
4. A format for publishing a zone's table of character variants.
Such tables are referred to below simply as "the table".
5. A model algorithm for name registration given the presence of
language tables.
1.2 The Nature and Status of these Recommendations
The document makes recommendations for consideration by registries
and, where relevant, those who coordinate them and use their
services. None of the recommendations are intended to be normative.
Indeed, the intent of the document is to illustrate a framework from
which variations to meet the needs of particular registries and their
processing of particular languages can be developed. Of course, if
registries make similar decisions and utilize similar tools, it may
reduce costs and confusion -- both between registries and for users
and registrars who have relationships with more than one domain.
1.3 Terminology
1.3.1 Languages and Scripts
This document uses the term "language" in what may be, to many
readers, an odd way. Neither this specification, nor IDNA, nor the
DNS is directly concerned with natural language, but only with the
characters that make up a given label. In some respects, the term
"script", as used in the character coding community, might be more
appropriate. However, different subsets of the same script may be
used with different languages and the same language may be written
using different characters (or even completely different scripts) in
different locations, so that term is not precisely correct either.
Long-standing confusion has also resulted from the fact that most
scripts are, informally at least, named after one of the languages
written in them: "Chinese" describes both a language and a collection
of characters also used in writing Japanese, Korean, and, at least
historically, some other languages; "Latin" describes a
language, the characters used to write that language, and, often,
characters used to write a number of contemporary languages that are
derived from or similar to those used to write Latin; the script used
to write the Arabic language is called "Arabic" but is also used
(typically with some additions or deletions) to write a number of
other languages, and so on. Situations in which a script has a
clearly-defined name independent of the name of a language are the
exception, rather than the rule; examples include Hangul, used to
write Korean, Katakana and Hiragana, used to write Japanese, and a
few others. And some scholars have historically used "Roman" or
"Roman-derived" in an attempt to distinguish between a script and the
Latin language.
The term "language" is hence used in this document in the informal
sense of a written language and is defined, for this purpose, by the
characters used to write it. In this context, a "language" is
defined by the combination of a code (see Section 1.4.1) and an
authority that has chosen to use that code and establish a
character-listing for it. Authorities are normally TLD registries
(see Section 7 and [IANA-language-registry]), but it is expected that
they will find appropriate experts and that advice from language and
script experts selected by international neutral bodies will also
become part of the registration system. In addition, as discussed
below in Section 7, registries may conclude that the best interests
of registrants, stakeholders, and the Internet community would be
served by constructing "language tables" that mix scripts and
characters in ways that conform to no known language. Conventions
should be developed for such registrations that do not misleadingly
reflect specific language codes.
1.3.2 Characters, Variants, Registrations, and Other Issues
1. Characters in this document are given as their Unicode codepoints
in U+xxxx format, with their official names, or both.
2. The following terms are used in this document.
* A "string" is a sequence of one or more characters.
* This document discusses characters that may have equivalent or
near-equivalent characters or strings. The "base character"
is the character that has zero or more equivalents. In the
JET Guidelines, base characters are referred to as "valid
characters". In a table with variants, as described in
Section 5, the base characters occupy the first column.
Normally (and always if the recommendation of Section 6.3 is
adopted) the base characters will be the characters that
appear in registration requests from registrants; all other
characters will be considered to make the registration attempt
invalid.
* The "variant(s)" are the character(s) and/or string(s) that
are treated as equivalent to the base character. Note that
these might not be true equivalent characters: a particular
original character may be a base character with a mapping to a
particular variant character, but that variant character may
not have a mapping to the original base character and, indeed,
the variant character may not appear in the base character
list, and hence may not be valid for use in a registration.
Usually, characters or strings to be designated as variants
are considered either equivalent or sufficiently similar (by
some registry-specific definition) that confusion between them
and the base character might occur.
* The "base registration" is the single name that the registrant
requested from the registry. The JET Guidelines use the term
"label string" for this name.
* A label (or "name") is described as "registered" if it is
actually entered into a domain (i.e., a zone file) by the
registry, so that it can be accessed and resolved using
standard DNS tools. The JET Guidelines describe a
"registered" label as "activated".
* A "registration bundle" is the set of all labels that come
from expanding the base characters for a single name into
their variants. The presence of a label in a registration
bundle does not imply that it is registered. In the JET
Guidelines, a registration bundle is called an "IDN Package".
* A "reserved label" is a label in a registration bundle that is
not actually registered.
* A "registry" is the administrative authority for a DNS zone.
That is, the registry is the body that enforces, and typically
makes, policies that are used in a particular zone in the DNS.
* "Coded Character Set" ("CCS") is a term for a list of
characters and the code positions assigned to them. ASCII and
Unicode are CCSs.
* A "language" is something spoken by humans, independent of how
it is written or coded. ISO Standard 639 and IETF BCP 47 (RFC
3066) [RFC3066] list and define codes for identifying
languages.
* A "script" is a collection of characters (glyphs, independent
of coding) that are used together, typically to represent one
or more languages. Note that the script for one language may
heavily overlap the script for another. This does not imply
that they have identical scripts.
* "Charset" is an IETF-invented term to describe, more or less,
the combination of a script, a CCS that encodes that script,
and rules for serializing the bytes when those are stored on a
computer or transmitted over the network.
The last four of these definitions are redundant with, but
deliberately somewhat less precise than, the definitions in
[RFC3536], which also provides sources. The two sets of definitions
are intended to be consistent.
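The relationship among base characters, variants, the base
registration, and the registration bundle can be sketched roughly as
follows. This is an illustration under stated assumptions, not the
procedure the document specifies later: the function name
registration_bundle, the dictionary layout of the table, and the sample
entries are all hypothetical, and the sketch assumes that only the base
registration is registered while every other label in the bundle is
reserved.

```python
from itertools import product


def registration_bundle(base_registration, table):
    """Expand a base registration into its registration bundle.

    `table` is a hypothetical in-memory variant table that maps each
    base character to a list of its variants (possibly empty). Any
    character missing from the table is not a base character, so the
    registration attempt is treated as invalid.
    """
    choices = []
    for ch in base_registration:
        if ch not in table:
            raise ValueError(f"U+{ord(ch):04X} is not a base character")
        # Each position may hold the base character or any of its variants.
        choices.append([ch, *table[ch]])
    # Every combination of per-position choices is one label in the bundle.
    return {"".join(combo) for combo in product(*choices)}


# Made-up sample table: "o" has the variant "0" and "l" has the variant "1".
sample_table = {"c": [], "o": ["0"], "l": ["1"]}
bundle = registration_bundle("cool", sample_table)
# Assumption: only the base registration is registered; the rest are reserved.
reserved_labels = bundle - {"cool"}
```

With this sample table the bundle for "cool" holds eight labels, seven
of them reserved. Because the variant mapping need not be symmetric, a
variant such as "0" is not itself a base character here and could not
appear in a registration request.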
1.3.3 Confusion, Fraud, and Cybersquatting
The term "confusion" is used very generically in this document to
cover the entire range from accidental user misperception of the
relationship between characters with some characteristic in common
(typically appearance, sound, or meaning) to cybersquatting and
other deliberate, fraudulent attempts to exploit those relationships
or others based on the nature of the characters.
1.4 A Review of the JET Guidelines
1.4.1 JET Model
In the JET Guidelines model, a prospective registrant approaches the
registry for a zone (perhaps through an intermediate registrar) with
a candidate base registration -- a proposed name to be registered --
and a list of languages in which that name is to be interpreted. The
languages are defined according to the fairly high-resolution coding
of [RFC3066] -- Chinese as used on the mainland of the People's
Republic of China ("zh-cn") can, at registry option, consist of a
somewhat different list of characters (code points) and be
represented by a separate table compared to Chinese as used in Taiwan
("zh-tw").
The design of the JET Guidelines took one important constraint as a
basis: IDNA was treated as a firm standard. A procedure that
modified some portion of the IDNA functions, or was a variant on
them, was considered a violation of those standards and should not be
encouraged (or, probably, even permitted).
Each registry is expected to construct (or obtain) a table for each
language it considers relevant and appropriate. These tables list,
for the particular zone, the characters permitted for that language.
If a character does not appear as a "valid code point" (called a
"base character" in the rest of this document) in that table, then a
name containing it cannot be registered. If multiple languages are
listed for the registration, then the character must appear in the
[997 lines skipped]