bug#18520: string ports should not have an encoding

From: David Kastrup
Subject: bug#18520: string ports should not have an encoding
Date: Tue, 23 Sep 2014 15:02:54 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

address@hidden (Ludovic Courtès) writes:

> David Kastrup <address@hidden> skribis:
>> address@hidden (Ludovic Courtès) writes:
>>> David Kastrup <address@hidden> skribis:
>>>>> Line/column info remains identical regardless of the encoding, so I tend
>>>>> to think it’s more robust to use that.
>>>> Column info remains identical regardless of the encoding?  Since when?
>>> The character on line L and column M is always there, regardless of
>>> whether the file is encoded in UTF-8, Latin-1, etc.
>>> Would that work for LilyPond?
>> Last time I looked, in the following line x was in column 3 in latin-1
>> encoding and in column 2 in utf-8 encoding:
>> üx
> I’m not sure what you mean.  This line contains two characters: ‘u’ with
> umlaut followed by ‘x’.  ‘ü’ is in the first column, and ‘x’ in the
> second column.

It contains three bytes: 0xc3, 0xbc, 0x78.  In UTF-8, this is üx; in
Latin-1 it is Ã¼x.
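
To make the column difference concrete, here is a minimal sketch in
plain C++ (not GUILE code; the function names column_latin1 and
column_utf8 are made up for illustration).  It computes the 1-based
column of an ASCII byte in a byte sequence, once treating every byte as
a character and once treating UTF-8 continuation bytes as part of the
preceding character.  It assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Column (1-based) of the first byte equal to `target` when every byte
// counts as one character, as in Latin-1.  Returns 0 if not found.
std::size_t column_latin1 (const std::string &bytes, char target)
{
  for (std::size_t i = 0; i < bytes.size (); ++i)
    if (bytes[i] == target)
      return i + 1;
  return 0;
}

// Same, but counting UTF-8 characters: continuation bytes (10xxxxxx)
// do not start a new character.  `target` must be an ASCII character.
// Assumes well-formed UTF-8; no validation is done here.
std::size_t column_utf8 (const std::string &bytes, char target)
{
  std::size_t column = 0;
  for (unsigned char b : bytes)
    {
      if ((b & 0xC0) != 0x80)
        ++column;
      if (b == static_cast<unsigned char> (target))
        return column;
    }
  return 0;
}
```

For the three bytes 0xc3, 0xbc, 0x78, column_latin1 puts 'x' in column
3 and column_utf8 puts it in column 2.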

This whole issue is about string ports _not_ being represented in terms
of characters but bytes.

> Is there a simple way to reproduce the issue with LilyPond?

This issue is at best marginally about LilyPond, in that the semantics
chosen for GUILE-2.0 (and switched again in GUILE-2.2) are both
surprising and a source of headaches.

They result in code like

  // we do our own utf8 encoding and verification in the parser, so we
  // use the no-conversion equivalent of latin1
  SCM str = scm_from_latin1_string (c_str ());
  scm_dynwind_begin ((scm_t_dynwind_flags)0);
  // Why doesn't scm_set_port_encoding_x work here?
  scm_dynwind_fluid (ly_lily_module_constant ("%default-port-encoding"), 
  str_port_ = scm_open_input_string (str);
  scm_dynwind_end ();
  scm_set_port_filename_x (str_port_, ly_string2scm (name_));

which will, incidentally, stop working in GUILE-2.2 at which time
another workaround will be found.

GUILE is an extension language.  The stance that any kind of dealing
with characters/strings that is not under control of GUILE and its
character model is simply inappropriate does not fit that role.  It is
not the job of GUILE to dictate how an application has to organize
matters internally.  For that
reason, its behavior needs to be straightforward and unsurprising.  That
includes sane boundaries between strings as character vectors, byte
vectors, and encoding and decoding operations.  Going through a
byte-based encoding when copying a character-based string to a string,
even when going through a string port, does not make sense.
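
As a rough illustration of the boundary meant here, the following is a
sketch of an input port that works on characters only.  It is plain
C++, not GUILE's implementation or API; the class CharStringPort and
its members are hypothetical.  The port holds code points, so reading
from it and tracking line/column never touches an encoding.  Encoding
and decoding would be separate, explicit steps at the byte boundary.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical character-level input port over an in-memory string.
// It stores code points, so no encoding is involved when reading.
class CharStringPort
{
public:
  explicit CharStringPort (std::u32string text) : text_ (std::move (text)) {}

  bool at_end () const { return pos_ >= text_.size (); }

  // Read one character and update the position.
  // Precondition: !at_end ().
  char32_t read_char ()
  {
    char32_t c = text_[pos_++];
    if (c == U'\n')
      {
        ++line_;
        column_ = 0;
      }
    else
      ++column_;
    return c;
  }

  std::size_t line () const { return line_; }     // newlines consumed
  std::size_t column () const { return column_; } // chars consumed on this line

private:
  std::u32string text_;
  std::size_t pos_ = 0;
  std::size_t line_ = 0;
  std::size_t column_ = 0;
};
```

With such a port, the column after reading ‘ü’ and ‘x’ is 2 no matter
how the text was encoded before it became a string.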

As a sign that this does not make sense, the effects of
%default-port-encoding and set-port-encoding! on input and output string
ports are asymmetric.  More so in GUILE-2.2 than in GUILE-2.0, but
already in GUILE-2.0.

That inconsistency (and its effects on overall performance) is what this
issue is about.  That I am tripping all over GUILE in the course of
working with LilyPond is at best incidental to this issue.  I could
equally well be tripping over it when working with TeXmacs.

I am not going to further reply to this issue since this is _not_,
I repeat _not_ some complaint that I am too stupid to understand what
GUILE is doing here.  I understand it perfectly well, and I am perfectly
able to hack around GUILE's deficiencies and inconsistencies.  One
consequence of design problems like this is that the chosen semantics
under such a fundamental design problem are arbitrary and thus more
likely to change to different semantics in future versions.  That means
a higher likelihood of future maintenance.  Since I am going to have to
redo this for GUILE-2.2 anyway, I prefer doing it in a sane manner that
will stick around for good.

I don't see that here.  That does not mean that I am too stupid to work
with the GUILE 2.0 behavior or the GUILE 2.2 behavior or the GUILE 1.8
behavior (in fact, the first port to GUILE 2 will set LC_CTYPE to C and
just stick with GUILE 1.8 behavior, but that's not a long-term
perspective since working with characters rather than bytes as string
constituents _is_ nicer for the user).

David Kastrup
