Re: utf-8-strings

From: Thomas Morley
Subject: Re: utf-8-strings
Date: Sun, 8 Jul 2012 13:39:28 +0200

2012/7/8 David Kastrup <address@hidden>:
> Thomas Morley <address@hidden> writes:
>> Hi,
>> together with Arnold I worked on a method to compress or stretch a
>> text by changing only the space between characters, i.e. the letters
>> themselves shouldn't be scaled.
>> (Comes out of a discussion at the German LilyPond-Forum:
>> )
>> The difficulty is to write a function which turns a string into
>> a list of single-character strings and works with accented letters,
>> German umlauts, non-European scripts etc.
>> e.g.:
>> "áèçäöüテスト" → '("á" "è" "ç" "ä" "ö" "ü" "テ" "ス" "ト")
>> We're coming up with the attached code.
>> Problems:
>> Unicode keeps growing, so the code needs updating from time to time.
>> Once LilyPond uses guile 2.0 the situation may be completely
>> different. (I've not a clue about guile 2.0)
>> What do you think?
>> Or let me ask differently: Are there any objections to turning it
>> into a patch?
> Several observations:
> a) guilev2 is going to become a definite issue this year.  We may either
>    decide to support both guilev1 and guilev2, or ditch guilev1 support
>    completely.
>    So it does not make sense to design a solution that is not easy to
>    support with guilev2.
> b) LilyPond's lexer goes to considerable lengths to not let any invalid
>    utf8 pass into strings.  It would be reasonably straightforward, if
>    required, to make sure that this also holds for embedded Scheme.  In
>    that case, the only way to arrive at invalid utf-8 would be
>    synthesizing strings in Scheme from bytes.  So I'd not bother about
>    invalid utf-8.  This means that, diacriticals apart, you can just
>    split the string before any byte outside the range 0x80-0xbf (the
>    utf-8 continuation bytes).
> This can basically be done using charsets.  I tried doing this with
> regexps, but curiously enough, in contrast to Guile proper, those appear
> to be already utf-8 aware, so
> #(use-modules (ice-9 regex))
> #(define (utf8-substrings str)
>    (define char-pat (make-regexp "."))
>    (map match:substring (list-matches char-pat str)))
> #(write (utf8-substrings "áèçäöüテスト"))
> works just fine (if you overlook the fact that write misbehaves, writing
> some byte codes quoted as \xhh inside of a string and others literally).
> --
> David Kastrup

Following your suggestion I managed to drop about 300 lines, reducing
it to a quarter of the original.
You definitely should earn more money!!

Of course I had to redefine `string-list->string'. I used recursion,
which was the best I could think of.
(`string-list->string' isn't used here, but I need it elsewhere)
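
For illustration, a recursive `string-list->string' could look roughly
like this. It is a minimal sketch and an assumption about the general
shape, not the attached code:

```scheme
;; Hypothetical sketch: join a list of strings into one string.
;; It only concatenates, so it does not care whether each element is
;; a single character or a multi-byte utf-8 sequence.
(define (string-list->string ls)
  (if (null? ls)
      ""                                        ; empty list -> empty string
      (string-append (car ls)                   ; first element ...
                     (string-list->string (cdr ls))))) ; ... then the rest
```

In a .ly file each top-level form needs a leading `#', as in the quoted
example. `(apply string-append ls)' gives the same result without
explicit recursion.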

Do you agree if I turn it into a patch?
I think `string->string-list' and `string-list->string' are very
useful tools and `char-space' might be of interest, too.

Thanks a lot,
