Re: utf-8-strings

From: David Kastrup
Subject: Re: utf-8-strings
Date: Sun, 08 Jul 2012 11:47:29 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.1.50 (gnu/linux)

Thomas Morley <address@hidden> writes:

> Hi,
> together with Arnold I worked on a method to compress or stretch a
> text, limiting the change to the space between characters, i.e. the
> letters themselves shouldn't be scaled.
> (Comes out of a discussion at the German LilyPond-Forum:
> )
> The difficulty is to achieve a functionality which turns a string into
> a list of single-character strings and works with accented letters,
> German Umlaute, non-European fonts etc.
> e.g.:
> "áèçäöüテスト" → '("á" "è" "ç" "ä" "ö" "ü" "テ" "ス" "ト")
> We're coming up with the attached code.
> Problems:
> Unicode keeps growing, so the code needs updating from time to time.
> Once LilyPond uses guile 2.0 the situation may be completely
> different. (I've not a clue about guile 2.0)
> What do you think?
> Or let me ask differently: Are there any objections to turning it into
> a patch?

Several observations:

a) guilev2 is going to become a definite issue this year.  We may either
   decide to support both guilev1 and guilev2, or ditch guilev1 support.

   So it does not make sense to design a solution that is not easy to
   support with guilev2.

b) LilyPond's lexer goes to considerable lengths to not let any invalid
   utf-8 pass into strings.  It would be reasonably straightforward, if
   required, to make sure that this also holds for embedded Scheme.  In
   that case, the only way to arrive at invalid utf-8 would be
   synthesizing strings in Scheme from bytes.  So I'd not bother about
   invalid utf-8.  This means that, diacriticals apart, you can just
   split the string before any byte outside the range 0x80-0xbf (the
   utf-8 continuation bytes).
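
As a rough sketch of the byte-level rule in b) (not code from this
thread): assuming Guile 1.8, where a string is a sequence of bytes and
char->integer returns the byte value, and assuming the input is valid
utf-8, the split could look like the following.  The names
continuation-byte? and utf8-split-bytes are made up for illustration.

```scheme
;; Assumption: Guile 1.8 semantics, i.e. each "char" of STR is one byte.

;; A utf-8 continuation byte lies in the range 0x80-0xbf.
(define (continuation-byte? c)
  (<= #x80 (char->integer c) #xbf))

;; Split STR before every byte that is not a continuation byte.
;; The bytes are walked from right to left: continuation bytes are
;; collected in GROUP until the byte that starts the sequence shows up.
(define (utf8-split-bytes str)
  (let loop ((bytes (reverse (string->list str)))
             (group '())
             (result '()))
    (cond
     ((null? bytes)
      ;; Leftover continuation bytes can only occur with invalid input;
      ;; they are kept as one group instead of being dropped.
      (if (null? group)
          result
          (cons (list->string group) result)))
     ((continuation-byte? (car bytes))
      (loop (cdr bytes) (cons (car bytes) group) result))
     (else
      (loop (cdr bytes)
            '()
            (cons (list->string (cons (car bytes) group)) result))))))
```

In a LilyPond file each top-level form would need the usual # prefix.
Under guilev2, strings hold characters rather than bytes, so this byte
test would no longer be appropriate there.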

This can basically be done using charsets.  I tried doing this with
regexps, but curiously enough, in contrast to Guile proper, those appear
to be already utf-8 aware, so

#(use-modules (ice-9 regex))

#(define (utf8-substrings str)
   (define char-pat (make-regexp "."))
   (map match:substring (list-matches char-pat str)))

#(write (utf8-substrings "áèçäöüテスト"))

works just fine (if you overlook the fact that write misbehaves, writing
some byte codes quoted as \xhh inside of a string and others literally).
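
Regarding a): a minimal sketch for the guilev2 case, assuming Guile 2.0
or later, where strings are sequences of Unicode characters rather than
bytes, so no byte-level handling is needed.  The name
string->char-strings is made up for illustration, and this has not been
tried inside LilyPond.

```scheme
;; Assumption: Guile 2.0 or later, where string->list yields one
;; element per Unicode character.
(define (string->char-strings str)
  ;; `string' turns a single character into a one-character string.
  (map string (string->list str)))
```

Under guilev1 the same definition would split into single bytes, so it
is not a drop-in replacement for the regexp version above.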

David Kastrup
