Re: Problem with UTF-8, "write, " and some characters using initial locale
Sat, 20 Nov 2010 09:53:51 -0800 (PST)
> From: Taylor Venable <address@hidden>
> Hi there, I'm having a strange problem using "write" in recent Git
> versions. When I include certain characters in a string passed to
> write, it prints odd hex representations of the Latin-1 encodings of
> those characters: "odd" because the result is not valid UTF-8 even
> though I believe my environment indicates it should be outputting in
> UTF-8. I've put an example interaction on my website:
> [http://metasyntax.net/tmp/guile.txt] (opening it in a hex editor is
> helpful to check that the characters which are correct are properly
> UTF-8 encoded) After I queried my locale environment using (setlocale
> LC_ALL) then everything gets written properly - from the documentation
> it seemed to me that this should not have a side effect unless the
> "locale" argument was provided. I'm using Guile 18.104.22.168-b7106 on
> Linux x86_64. It seemed to me like it had bug potential, but maybe my
> understanding of locales and encodings is flawed. Please let me know
> if there's any other information I can provide or things I can test.
> Best regards,
You should basically always call (setlocale LC_ALL "") before
working on non-ASCII code.
Guile starts up in Latin-1. It may seem that Guile should
pick up your environment's LANG or LC_ALL on startup, but most
C programs (including ones built with gcc) don't do that by default:
they stay in the "C" locale until they call setlocale.
When you call setlocale, Guile picks up the locale of your session.
So, in your first line in your example, you pasted in a string of
utf-8 text. Guile read it in raw bytes and never tried to unpack
those bytes into Unicode characters. You can prove it to yourself by
passing your string to the string-length procedure. You'll get
the number of utf-8 bytes in your string, not the actual number
of characters.
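The byte count versus character count difference can be seen outside Guile too. The following is only a Python sketch of the idea, not Guile code: it assumes that "reading raw bytes" behaves like decoding the UTF-8 bytes as Latin-1, one character per byte.

```python
# Not Guile: a Python model of UTF-8 input that is never decoded as UTF-8.
text = "\u0110"                  # LATIN CAPITAL LETTER D WITH STROKE, one character
raw = text.encode("utf-8")       # the bytes a UTF-8 terminal sends for it
misread = raw.decode("latin-1")  # assumption: each byte becomes one character

print(len(text))     # 1
print(len(misread))  # 2
```

The second length is the one string-length would be expected to report before setlocale is called.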
The weird escapes come from trying to write a string of utf-8 bytes
in the latin-1 encoding. The latin-1 characters from 0x80 to 0x9F
are the ISO-8859-1 C1 control characters and not printable.
So, (write) prints them as escapes instead.
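As a rough illustration of that rule, here is a small Python sketch. The helper name write_like is made up for this example, and the exact escape syntax Guile emits may differ; the only point is that code points in the C0 and C1 control ranges get escaped while the rest of Latin-1 passes through.

```python
def write_like(s: str) -> str:
    """Hypothetical helper: escape Latin-1 control characters, keep the rest."""
    out = []
    for ch in s:
        cp = ord(ch)
        # C0 controls, DEL, and the C1 controls 0x80-0x9F are not printable.
        if cp < 0x20 or 0x7F <= cp <= 0x9F:
            out.append("\\x%02x" % cp)
        else:
            out.append(ch)
    return "".join(out)

# First code point at or above 0x80 that this model leaves unescaped.
printable = [cp for cp in range(0x80, 0x100) if write_like(chr(cp)) == chr(cp)]
print(hex(printable[0]))  # 0xa0
```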
For example, the stroked D (U+0110)
- is passed in to Guile as the utf-8 representation 0xC4 0x90
- Guile knows that ISO-8859-1 0x90 is an unprintable control character
- Guile prints the 0xC4 as ISO-8859-1 umlaut A (Ä) and prints 0x90 as an
escape string "\x90"
- Your terminal sees a lone 0xC4 byte, which is invalid under UTF-8
because no continuation byte follows it, and probably prints a
question mark
- and then your terminal prints the "\x90" string
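The steps above can be traced end to end in a few lines of Python. This is a sketch under assumptions, not what Guile does internally: the escape is hard-coded as a backslash, "x" and two hex digits, and the terminal is modelled as a UTF-8 decoder that substitutes U+FFFD for bytes it cannot decode (a real terminal may show a question mark instead).

```python
# Not Guile: a byte-level trace of the stroked D example.
pasted = "\u0110".encode("utf-8")        # bytes 0xC4 0x90 from the terminal
as_latin1 = pasted.decode("latin-1")     # read as two Latin-1 characters

# 0xC4 stays as is; 0x90 (a C1 control) is replaced by a textual escape.
written = as_latin1[0] + "\\x%02x" % ord(as_latin1[1])

on_the_wire = written.encode("latin-1")  # output port encodes as Latin-1
shown = on_the_wire.decode("utf-8", errors="replace")  # terminal expects UTF-8

print(on_the_wire)
print(shown)
```

The 0xC4 byte is followed by a backslash rather than a continuation byte, so the decoder gives up on it and then passes the ASCII escape text through unchanged.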
So, counterintuitive, but, not a bug.