guile-devel



From: Mike Gran
Subject: Re: Problem with UTF-8, "write, " and some characters using initial locale
Date: Sat, 20 Nov 2010 09:53:51 -0800 (PST)

> From: Taylor Venable <address@hidden>


> Hi there, I'm having a strange problem using "write" in recent Git
> versions.  When I include certain characters in a string passed to
> write, it prints odd hex representations of the Latin-1 encodings of
> those characters: "odd" because the result is not valid UTF-8 even
> though I believe my environment indicates it should be outputting in
> UTF-8. I've put an example interaction on my website:
> [http://metasyntax.net/tmp/guile.txt] (opening it in a hex editor is
> helpful to check that the characters which are correct are properly
> UTF-8 encoded). After I queried my locale environment using (setlocale
> LC_ALL) then everything gets written properly - from the documentation
> it seemed to me that this should not have a side effect unless the
> "locale" argument was provided. I'm using Guile 1.9.13.91-b7106 on
> Linux x86_64. It seemed to me like it had bug potential, but maybe my
> understanding of locales and encodings is flawed. Please let me know
> if there's any other information I can provide or things I can test.
>
> Best regards,

Hi Taylor,

You should basically always call (setlocale LC_ALL "") before 
working on non-ASCII code.

Guile starts up in Latin-1.  It may seem that Guile should
pick up your environment's LANG or LC_* settings on startup, but
programs built with most compilers (including gcc) don't do that by
default: a C program starts in the "C" locale until it calls setlocale.

When you call setlocale, Guile picks up the locale of your session.

So, in the first line of your example, you pasted in a string of
UTF-8 text.  Guile read it in as raw bytes and never tried to unpack
those bytes into Unicode characters.  You can prove it to yourself by
passing your string to the string-length procedure.  You'll get
the number of UTF-8 bytes in your string, not the actual number
of characters.
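
As a rough illustration of that byte-versus-character count (written in
Python rather than Guile, purely to show the arithmetic; the sample string
is made up and is not the one from your example), decoding the same UTF-8
bytes as Latin-1 yields one character per byte:

```python
# Hypothetical sample text; any non-ASCII string behaves the same way.
text = "\u0110\u00e0"               # stroked D followed by a-grave
raw = text.encode("utf-8")          # the bytes a UTF-8 terminal would send

as_latin1 = raw.decode("latin-1")   # one character per byte, nothing unpacked
as_utf8 = raw.decode("utf-8")       # bytes unpacked into Unicode characters

# Byte count, length when read as Latin-1, length when read as UTF-8.
print(len(raw), len(as_latin1), len(as_utf8))   # prints: 4 4 2
```

The Latin-1 length matches the byte count, which is the same effect
string-length shows before the locale has been set.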

The weird escapes come from trying to write a string of UTF-8 bytes
in the Latin-1 encoding.  The Latin-1 characters from 0x80 to 0x9F
are the ISO-8859-1 C1 control characters and are not printable.
So, (write) prints them as escapes instead.

For example, the stroked D (U+0110)
- is passed in to Guile as the UTF-8 representation 0xC4 0x90
- Guile knows that ISO-8859-1 0x90 is an unprintable control character
- Guile prints the 0xC4 as ISO-8859-1 A with umlaut and prints 0x90 as
  the escape string "\x90"
- Your terminal sees a 0xC4 byte that is not followed by a continuation
  byte, which is illegal under UTF-8, and probably prints a question mark
- and then your terminal prints the "\x90" string
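
The steps above can be traced byte by byte.  This is only a sketch in
Python, not Guile's actual code: write_latin1 is a hypothetical stand-in
for the behaviour described, and it assumes the escape is emitted as the
four ASCII characters \x90.

```python
raw = "\u0110".encode("utf-8")      # stroked D as UTF-8: 0xC4 0x90
chars = raw.decode("latin-1")       # read as Latin-1: U+00C4 and U+0090


def write_latin1(s):
    """Hypothetical stand-in for the described output step.

    C1 controls (0x80-0x9F) become a textual \\xNN escape; every other
    character is emitted as its single Latin-1 byte.  Only handles
    characters up to U+00FF, which is all this sketch needs.
    """
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            out += b"\\x%02x" % cp          # escape as ASCII text
        else:
            out += ch.encode("latin-1")     # raw Latin-1 byte
    return bytes(out)


out = write_latin1(chars)
print(out)                                  # prints: b'\xc4\\x90'

# A UTF-8 terminal cannot decode this: 0xC4 starts a two-byte sequence
# but is followed by a backslash instead of a continuation byte.
try:
    out.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)                                # prints: False

# With a replacement character standing in for the bad byte, the result
# is roughly what ends up on screen.
shown = out.decode("utf-8", errors="replace")
```

Here shown is a replacement character followed by the literal text \x90,
which matches the question mark and escape seen on the terminal.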

So, counterintuitive, but not a bug.

Thanks,

Mike Gran



