guile-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)


From: Ludovic Courtès
Subject: Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
Date: Wed, 03 Apr 2013 13:58:50 +0200
User-agent: Gnus/5.130005 (Ma Gnus v0.5) Emacs/24.3 (gnu/linux)

Hello, Mark!

Mark H Weaver <address@hidden> skribis:

> * All kinds of streams are supported in a uniform way: files, pipes,
>   sockets, terminals, etc.
>
> * As specified in Unicode 6.2, BOMs are only handled specially at the
>   start of a stream, and only if the encoding is set to "UTF-16" or
>   "UTF-32".  BOMs are *not* handled specially if the encoding is set to
>   "UTF-16LE", etc.

OK.

> * This code never tries to read a BOM until the user has asked to read.
>   If the user writes before reading, it chooses big-endian and writes a
>   BOM if appropriate (if the encoding is set to "UTF-16" or "UTF-32").
>
> * The encodings "UTF-16" and "UTF-32" are *never* passed to iconv,
>   because BOM handling varies between iconv implementations.  Creation
>   of the iconv descriptors is always postponed until the first read or
>   write, at which point a decision is made about the endianness, and
>   then "UTF-16BE", "UTF-16LE", "UTF-32BE", or "UTF-32LE" is passed to
>   iconv.
>
> * If 'rw_random' is zero, then the input and output streams are
>   considered independent: the first read will consume a BOM if
>   appropriate, *and* the first write will produce a BOM if appropriate.
>
> * If 'rw_random' is non-zero, then the input and output streams are
>   considered linked: if the user reads first, then a BOM will be
>   consumed if appropriate, but later writes will *not* produce a BOM.
>   Similarly, if the user writes first, then later reads will *not*
>   consume a BOM.
>
> * If 'set-port-encoding!' is called in the middle of a stream, it treats
>   it as a new logical "start of stream", i.e. if the encoding is set to
>   "UTF-16" or "UTF-32" then a BOM will be consumed the next time you
>   read and/or produced the next time you write.
>
> * Seeks to the beginning of the file set the "start of stream" flags.
>   Seeks anywhere else clear the "start of stream" flags.

Woow, well thought out.  The semantics seem good.  (It’s interesting to
see how BOMs complicate things, but that’s life, I guess.)

The patch looks good to me.  The test suite is nice.  It doesn’t seem to
cover all the corner cases listed above, but that can be added later on
perhaps?

Perhaps the text above could be added to the manual, in a
@ununnumberedsec or something?

Remarks:

> diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
> index 73a788f..cd1746b 100644
> --- a/libguile/ports-internal.h
> +++ b/libguile/ports-internal.h
> @@ -48,14 +48,19 @@ struct scm_port_internal
>  {
>    scm_t_port_encoding_mode encoding_mode;
>    scm_t_iconv_descriptors *iconv_descriptors;
> +  int at_stream_start_for_bom_read;
> +  int at_stream_start_for_bom_write;

Add “:1”?

> +#define SCM_UNICODE_BOM  0xFEFF  /* Unicode byte-order mark */

0xfeffUL to be on the safe side.

> +/* If the next LEN bytes from port are equal to those in BYTES, then

s/port/PORT/

> +   return 1, else return 0.  Leave the port position unchanged.  */
> +static int
> +looking_at_bytes (SCM port, unsigned char *bytes, int len)

const unsigned char *bytes

> +{
> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +  int result;
> +  int i = 0;
> +
> +  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
> +    {
> +      pt->read_pos++;
> +      i++;
> +    }
> +
> +  result = (i == len);
> +
> +  while (i > 0)
> +    scm_unget_byte (bytes[--i], port);
> +
> +  return result;
> +}

Should it be scm_get_byte_or_eof given that scm_unget_byte is used later?

What if pt->read_buf_size == 1?  What if there’s data in saved_read_buf?

> +/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
> +   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
> +   and specifies which operation is about to be done.  The MODE
> +   determines how we will decide the endianness.  We deliberately avoid
> +   reading from the port unless the user is about to do so.  If the user
> +   is about to read, then we look for a BOM, and if present, we use it
> +   to determine the endianness.  Otherwise we choose big-endian, as
> +   recommended by the Unicode Consortium.  */
> +static char *
> +decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)

static const char *

> +static char *
> +decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)

Likewise.

> +      /* If the specified encoding is UTF-16 or UTF-32, then make
> +         that more precise by deciding what endianness to use.  */
> +      if (strcmp (pt->encoding, "UTF-16") == 0)
> +        precise_encoding = decide_utf16_encoding (port, mode);
> +      else if (strcmp (pt->encoding, "UTF-32") == 0)
> +        precise_encoding = decide_utf32_encoding (port, mode);

Shouldn’t it be strcasecmp?  (Actually there are other uses of strcmp
already, but I think it’s a mistake.)

> +      if (SCM_UNLIKELY (strcmp(pt->encoding, "UTF-16") == 0
> +                        || strcmp(pt->encoding, "UTF-32") == 0))

Likewise, + space before paren.

Thanks!

Ludo’.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]