Re: German Umlauts / UTF8 with comparse

From: kooda
Subject: Re: German Umlauts / UTF8 with comparse
Date: Tue, 18 Feb 2020 12:43:23 +0100
User-agent: OtterMail

Christoph Lange <address@hidden> wrote:
> Yes, this helps. Kind of ;-) ... using the character set
> char-set:alphabetic, my umlauts are now parsed. But I don't get them back
> in my result, at least not as printable characters. Instead, the following
> happens, and utterly confuses me:

Hmm, indeed. From what I can see, the result of parse is not encoded in UTF-8.

I went to see comparse’s code and found that the (as-string) combiner
uses (->string) internally. But since comparse doesn’t use the utf8 egg,
it uses the core version of (->string), which happens to encode #\ä in

The only workaround I can think of right now is to move the conversion
back to a string out of the comparse egg and into your own, utf8-aware
code.

This would look something like this:

(import comparse utf8 utf8-srfi-14 unicode-char-sets)

(define s "Gänsesäger 2,1")
(define s1 "Rotkehlchen 1,0")

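;; Match a single character that belongs to the char-set cs, using the
;; char-set-contains? imported from utf8-srfi-14.
(define (utf8-in cs)
  (satisfies (lambda (c) (char-set-contains? cs c))))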

(define letter
  (utf8-in char-set:alphabetic))

(define letters
  (repeated letter 1 20))

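;; Run the parser and build the string ourselves, so that the conversion
;; goes through the list->string imported from the utf8 egg instead of
;; comparse's (as-string).
(define (parse-as-string parser input)
  (list->string (parse parser input)))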

(define p1 (parse-as-string letters (string->list s1)))
(define p (parse-as-string letters (string->list s)))

PS: a trick I used to check the encoding of the strings was using the ,d
csi command, which prints the contents of the string byte by byte. There
it’s easy to see if non ascii characters indeed take more than one byte
as they should in UTF-8.
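
The same check can also be done from code rather than with ,d. The following is only a rough sketch, assuming CHICKEN 5, the utf8 egg, and a source file saved as UTF-8; byte-count is a helper name made up for this example, not something provided by comparse or utf8.

```scheme
;; Sketch only: assumes CHICKEN 5 and a UTF-8 encoded source file.
(import (chicken base) (chicken blob) utf8)

;; Hypothetical helper: number of bytes backing the string.
;; string->blob copies the raw bytes, so this does not depend on
;; which string-length is currently imported.
(define (byte-count str)
  (blob-size (string->blob str)))

(define s "Gänsesäger")

(print (string-length s)) ; characters, as counted by the utf8 egg
(print (byte-count s))    ; raw bytes
```

If the umlauts are really stored as UTF-8, the second number should be larger than the first; if both are equal, the non-ASCII characters are taking one byte each.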
