help-libidn

Re: libidn2 0.13


From: Simon Josefsson
Subject: Re: libidn2 0.13
Date: Sun, 8 Jan 2017 14:28:33 +0100

On Sat, 07 Jan 2017 19:48:42 +0100,
Re: libidn2 0.13 wrote:

> On Tuesday, 3 January 2017 10:00:53 CET Nikos Mavrogiannopoulos
> wrote:
> > On Mon, Jan 2, 2017 at 10:17 PM, Tim Rühsen <address@hidden>
> > wrote:
> > >> * APIs more like libidn's that take a full domain name and do
> > >>   proper operations on them.  In several forms: UTF-8, UCS-4,
> > >>   locale encoded, etc.
> > >>
> > >> * APIs to decode an IDNA2008 domain from ACE to Unicode format.
> > >>   That is not described by the IDNA2008 RFCs, interestingly
> > >>   enough, but I suspect people will want it, hah!
> > > 
> > > Wget used to use ACE decoding from libidn, but only for
> > > logging/display purposes. Since we switched to libidn2, the
> > > UTF-8/locale names are not displayed any more :-). With such
> > > a function I would reactivate the logging code.
> > 
> > For gnutls unfortunately the reverse is really necessary and that's
> > the reason we are stuck with libidn. We need to be able to print the
> > actual name of the certificate and not only the punycode which is
> > non-human readable for most languages.
> 
> Then let's define a function.
> 
> Let me start with a suggestion to get the ball rolling
>       int idn2_fromASCII (const uint8_t *src, uint8_t **dst)
> 
> 'src' is a UTF-8 encoded string (domain name)
> 'dst' is the punycode-decoded output, also UTF-8.

How about copying the libidn APIs?  Here are the low-level per-label
primitives:

  /* Core functions */
  extern IDNAPI int idna_to_ascii_4i (const uint32_t * in, size_t inlen,
                                      char *out, int flags);
  extern IDNAPI int idna_to_unicode_44i (const uint32_t * in,
                                         size_t inlen, uint32_t * out,
                                         size_t * outlen, int flags);

The idna_to_ascii_4i call is roughly equivalent to idn2_lookup.
There is no counterpart to idna_to_unicode_44i in libidn2.

Then the interesting APIs for applications:

  extern IDNAPI int idna_to_ascii_4z (const uint32_t * input,
                                      char **output, int flags);

  extern IDNAPI int idna_to_ascii_8z (const char *input, char **output,
                                      int flags);

  extern IDNAPI int idna_to_ascii_lz (const char *input, char **output,
                                      int flags);

  extern IDNAPI int idna_to_unicode_4z4z (const uint32_t * input,
                                          uint32_t ** output,
                                          int flags);

  extern IDNAPI int idna_to_unicode_8z4z (const char *input,
                                          uint32_t ** output,
                                          int flags);

  extern IDNAPI int idna_to_unicode_8z8z (const char *input,
                                          char **output, int flags);

  extern IDNAPI int idna_to_unicode_8zlz (const char *input,
                                          char **output, int flags);

  extern IDNAPI int idna_to_unicode_lzlz (const char *input,
                                          char **output, int flags);


Mimicking these APIs is probably what's interesting.

I have mixed feelings about exposing LOOKUP vs REGISTER as separate
APIs. How about using a flag to select REGISTER functionality?  Most
applications will want to use LOOKUP; REGISTER is uncommon.  I think it
is wasteful to burn a separate API entry point on the REGISTER
functionality.
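
A rough sketch of the flag idea, with made-up names: SKETCH_REGISTER
and sketch_select_path are illustrations only, not libidn or libidn2
API.  Lookup is the default, and a single flag bit opts in to
registration processing:

```c
/* Hypothetical flag bit; the name and value are assumptions. */
enum { SKETCH_REGISTER = 1 << 0 };

/* The two IDNA2008 protocols a single entry point could dispatch to. */
enum sketch_path { SKETCH_PATH_LOOKUP, SKETCH_PATH_REGISTER };

/* Lookup unless the caller explicitly asks for registration. */
static enum sketch_path sketch_select_path (int flags)
{
  return (flags & SKETCH_REGISTER) ? SKETCH_PATH_REGISTER
                                   : SKETCH_PATH_LOOKUP;
}
```

Callers that pass 0 for flags get lookup behaviour, so the common case
needs no extra thought.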

> Examples:
> foo.bar -> foo.bar
> übel.de -> übel.de
> xn--bel-goa.de -> übel.de
> xn--bel-goa.größer.de -> übel.größer.de

Depending on IDNA2003 vs IDNA2008, TR46, transitional and
non-transitional, phase of the moon, and so on, of course.
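
For the plain "decode only the xn-- labels" case, here is a minimal
sketch of what such a function could do.  It is not libidn2 code:
sketch_decode_name and the helpers are hypothetical names, the Punycode
decoder follows RFC 3492 but leaves out the overflow checks, the ACE
prefix is matched case-insensitively (my assumption), and no IDNA2008
or TR46 validation is done on the result.

```c
#include <stdint.h>
#include <string.h>

/* Punycode parameters from RFC 3492, section 5. */
enum { BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700,
       INITIAL_BIAS = 72, INITIAL_N = 128, MAXCP = 63 };

static int digit_value (char c) {
  if (c >= 'a' && c <= 'z') return c - 'a';
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= '0' && c <= '9') return c - '0' + 26;
  return -1;
}

/* Bias adaptation, RFC 3492 section 6.1. */
static uint32_t adapt (uint32_t delta, uint32_t numpoints, int first) {
  uint32_t k = 0;
  delta = first ? delta / DAMP : delta / 2;
  delta += delta / numpoints;
  for (; delta > ((BASE - TMIN) * TMAX) / 2; k += BASE)
    delta /= BASE - TMIN;
  return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
}

/* Decode the label text after "xn--" into at most MAXCP code points.
   Returns the count, or -1 on bad input.  Overflow checks are omitted. */
static int puny_decode (const char *in, size_t inlen, uint32_t *cp) {
  uint32_t n = INITIAL_N, i = 0, bias = INITIAL_BIAS;
  size_t len = 0, pos, b = 0, j;
  for (j = 0; j < inlen; j++)
    if (in[j] == '-') b = j;      /* last '-' ends the basic code points */
  if (b > MAXCP) return -1;
  for (j = 0; j < b; j++) {
    if ((unsigned char) in[j] >= 0x80) return -1;
    cp[len++] = (unsigned char) in[j];
  }
  for (pos = b > 0 ? b + 1 : 0; pos < inlen; len++) {
    uint32_t oldi = i, w = 1, k, t;
    for (k = BASE;; k += BASE) {  /* read one variable-length integer */
      int d = pos < inlen ? digit_value (in[pos++]) : -1;
      if (d < 0) return -1;
      i += (uint32_t) d * w;
      t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
      if ((uint32_t) d < t) break;
      w *= BASE - t;
    }
    bias = adapt (i - oldi, (uint32_t) len + 1, oldi == 0);
    n += i / (uint32_t) (len + 1);
    i %= (uint32_t) (len + 1);
    if (len >= MAXCP || n > 0x10FFFF) return -1;
    memmove (cp + i + 1, cp + i, (len - i) * sizeof *cp);
    cp[i++] = n;                  /* insert the new code point at i */
  }
  return (int) len;
}

/* Write code point C as UTF-8; returns the number of bytes written. */
static size_t utf8_put (uint32_t c, char *out) {
  static const unsigned char lead[] = { 0, 0x00, 0xC0, 0xE0, 0xF0 };
  size_t n = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4, j;
  for (j = n - 1; j > 0; j--, c >>= 6)
    out[j] = (char) (0x80 | (c & 0x3F));   /* continuation bytes */
  out[0] = (char) (lead[n] | c);
  return n;
}

/* Hypothetical name-level decoder: labels starting with "xn--" are
   decoded to UTF-8, every other label is copied unchanged.
   Returns 0 on success, -1 on bad input or if OUT is too small. */
static int sketch_decode_name (const char *in, char *out, size_t outsize) {
  size_t o = 0;
  if (outsize == 0) return -1;
  while (*in) {
    size_t n = strcspn (in, "."), j;
    uint32_t cp[MAXCP];
    if (n > 4 && (in[0] | 0x20) == 'x' && (in[1] | 0x20) == 'n'
        && in[2] == '-' && in[3] == '-') {
      int count = puny_decode (in + 4, n - 4, cp);
      if (count < 0) return -1;
      for (j = 0; j < (size_t) count; j++) {
        if (o + 4 >= outsize) return -1;
        o += utf8_put (cp[j], out + o);
      }
    } else {
      if (o + n >= outsize) return -1;
      memcpy (out + o, in, n);
      o += n;
    }
    in += n;
    if (*in == '.') {
      if (o + 1 >= outsize) return -1;
      out[o++] = *in++;
    }
  }
  out[o] = '\0';
  return 0;
}
```

With a large enough buffer, sketch_decode_name ("xn--bel-goa.de", ...)
produces "übel.de" in UTF-8, and "foo.bar" comes back unchanged.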

> Casing: we leave input as it is - only domain labels that start with
> xn-- will be converted without any casing check.
> 
> Why utf-8 and utf-8 ?
> - Most applications internally work already with UTF-8.
> - It is easy to convert to utf-16/utf-32 (ucs2/ucs4).
> - Leave charset transcoding out of the library
> - ...

I'd say most applications actually don't care about
encoding -- they use strings in Unix locale encoded
format.  Locale-encoded APIs are easy if we link to libiconv anyway.
UCS-4 is also easy to provide for completeness.  UTF-8 is simple since
we use it internally so much.

> Do we need an additional 'flags' for future use ? Why not.

Indeed.

> If we want charset transcoding, we also need input and output
> charset, maybe also language (e.g. think of Turkish i/I casing). Do
> we want that ?

I don't recall anyone requesting that from libidn -- and it is possible
to do that in the application (like the "idn" command line tool does).
Very few applications deal with multiple charsets natively.  The ones
that do often want to do the conversion internally.

I think it makes sense to focus only on UTF-8, UCS-4, and
locale-encoded strings.  The majority of Unix applications would want
to use the locale-encoded API.  A few will want UTF-8 if they are
UTF-8-pure applications.  A couple will prefer UCS-4.
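
To illustrate why a UCS-4 variant is cheap once the core works on
UTF-8, here is a small sketch of the conversion such a wrapper would
need.  sketch_utf8_to_ucs4 is a hypothetical helper, not a libidn or
libidn2 function, and it only checks the byte structure: overlong
forms and surrogates are not rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert a zero-terminated UTF-8 string into a
   zero-terminated UCS-4 array of at most OUTMAX elements.
   Returns the number of code points, or -1 on malformed input or if
   OUT is too small. */
static long sketch_utf8_to_ucs4 (const char *in, uint32_t *out,
                                 size_t outmax)
{
  const unsigned char *p = (const unsigned char *) in;
  size_t o = 0;

  if (outmax == 0) return -1;
  while (*p) {
    uint32_t c = *p++;
    /* Number of continuation bytes implied by the lead byte. */
    int more = c < 0x80 ? 0 : (c & 0xE0) == 0xC0 ? 1
             : (c & 0xF0) == 0xE0 ? 2 : (c & 0xF8) == 0xF0 ? 3 : -1;
    if (more < 0 || o + 1 >= outmax) return -1;
    if (more) c &= 0x3F >> more;        /* keep the lead byte payload */
    while (more-- > 0) {
      if ((*p & 0xC0) != 0x80) return -1;
      c = c << 6 | (*p++ & 0x3F);
    }
    out[o++] = c;
  }
  out[o] = 0;
  return (long) o;
}
```

The reverse direction is just as short, so the 4z entry points could be
thin wrappers around the UTF-8 ones.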

/Simon
