
Re: [mtools] Short filenames, codepages and possible mtools/kernel bug


From: Alain Knaff
Subject: Re: [mtools] Short filenames, codepages and possible mtools/kernel bug
Date: Wed, 31 May 2006 11:00:41 +0200
User-agent: Thunderbird 1.5 (X11/20051201)

Jaime wrote:
On Mon, 2006-05-29 at 12:57 +0200, Alain Knaff wrote:
If you see C7 for the Ç, it is OK (and the mess-up only happened on display); if something else, then it is indeed an mtools bug.

Yup, the C7s are there in all the right places.

Good

(the capital C cedilla has been replaced by a tiny white question mark
inside a black diamond/lozenge). Just to check, I mount the filesystem
using the following command:

mount -t msdos -o codepage=850 /dev/fd0 temp
(Try mount -t vfat instead to get long names and extended characters)

Er, I don't think I want long names. But please bear with me here - I'm
a complete "character encoding" newbie, and I'm trying to learn how it
works.

"Long names" and "Unicode" are tied together. Vfat works by having Unicode filename entries in addition to the old size-limited MSDOS Codepage entries. So, if you have long names, you'll get unicode (or iso-8859-1) encoding for free.

"ls -b" returns "ab\200de.txt" so it's using octal 200 rather than octal
307, but I assume that's because the 80s (hex) on the disk are 200s in
octal (my "mount -t msdos" means that it's the short filenames which are
used, rather than the long filenames, so I get the 8.3 "codepaged"
version, rather than the long filename in unicode).

Yes, \200 octal is indeed 0x80 hex, i.e. the old MSDOS encoding of Ç.
As you mounted the filesystem as MSDOS, it only "sees" the legacy entry (in the MSDOS codepage). With vfat, it would look at the extended entries, which have a saner encoding. Even where vfat doesn't support full Unicode, you should still get correct results, as the lower 256 Unicode codepoints are _identical_ to ISO-Latin-1.
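You can check the codepage translation from the shell (a sketch assuming iconv and od are installed; CP850 is iconv's name for codepage 850):

  printf '\200' | iconv -f CP850 -t ISO-8859-1 | od -An -to1

This should print 307: byte 0x80 (octal 200) is Ç in codepage 850, and the same character is 0xC7 (octal 307) in ISO-Latin-1.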

Theoretically, even for the MS-DOS filesystem, there is a codepage= option to perform appropriate translations of encodings, but it doesn't appear to work.
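In theory, that would look something like this (hypothetical invocation; as just said, the translation doesn't seem to take effect for short names):

  mount -t msdos -o codepage=850,iocharset=iso8859-1 /dev/fd0 temp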

Vfat does work, by looking at the "long" entries.


I suspect the error might be in the terminal program that you are using (which might be set to display UTF-8; try changing that to ISO-8859-1, a.k.a. ISO-Latin-1).

Apart from not knowing (yet) how to do this, wouldn't this affect the
output for other mounted filesystems that _do_ use utf-8?

Yes, if you have any such filenames.
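As for how to do it: with xterm, for instance, you can start one non-UTF-8 window in a Latin-1 locale (a sketch; it assumes the en_US.ISO-8859-1 locale is installed, and +u8 is xterm's switch for disabling UTF-8 mode):

  LC_ALL=en_US.ISO-8859-1 xterm +u8

That only affects this one terminal window, so filesystems with UTF-8 names can still be viewed from a UTF-8 terminal next to it.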


I read somewhere that short filenames on fat filesystems (on windows
systems) are encoded using the local codepage (which, when I created the
file under windows, was 850). This at least agrees with the hexdump of
my raw diskette (seeing the 80s make sense to me). What I'm really after
is the ability to view the short filenames on the diskette as they were
typed in under Windows.

After doing some more investigation, I found the following statement here:
http://svn.haxx.se/dev/archive-2005-05/0406.shtml

"The POSIX way of making filename encoding locale-dependent is
fundamentally broken IMO. But I don't think each tool can solve a system
problem. On POSIX systems, I think the best solution is to rely on the
locale like we currently do. People should set up their locale correctly
and ensure that filenames are in the encoding of the locale."

I don't fully understand this, but I think it means that POSIX (and
therefore Unix/Linux?) assumes that filenames are stored in a
character-encoding which is "represented" by the user's locale. But then
I have a problem: I only have one locale (at a time) but I have several
mounted filesystems - the majority have unicodish (new word?)
filenames (as they're ext3) but I want a filesystem with codepage 850
filenames mounted at the same time (my dos diskette). And there's my
problem. Many different simultaneous filename encoding mechanisms, but
only one locale.

The way Unicode is handled on "standard" filesystems (reiserfs, ext2, ext3, minix, etc.) is by the applications. UTF-8 (the usual Unicode representation) was defined in such a way that:
1. Codepoints 0-127 (legacy ASCII) are encoded as single bytes, identical to ASCII (and thus to ISO-Latin-1).
2. Codepoints 128 and above are encoded as multi-byte sequences, using only bytes above 127.
3. Any encoded filename is still a proper filename (it contains no zero bytes or slashes).

I.e. the filesystem doesn't need to know about it; it all happens in the terminal programs and editors (which read their environment variables to know whether they should display a sequence of bytes as UTF-8 or ISO-Latin-1). Applications which are not concerned with actually displaying such strings don't need to care (to them it's just a string of bytes; they don't need to know its visual representation).
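You can see properties 2 and 3 from the shell (a sketch assuming iconv and od are available):

  printf '\307' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1

This prints c3 87: Ç, a single byte 0xC7 in ISO-Latin-1, becomes the two bytes 0xC3 0x87 in UTF-8, both above 127, with no zero bytes or slashes anywhere.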

Garbled filenames will result if you create names in one terminal which is set to Unicode, and read them in another, which is set to ISO-Latin-1. But that happens only for characters above 128, never for standard ASCII. So, if all your filenames are standard ASCII anyway, you won't see the difference.
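A quick way to provoke the garbling (hypothetical demo; \303\207 are the UTF-8 bytes of Ç):

  touch "$(printf '\303\207')"
  ls

A UTF-8 terminal lists the file as Ç; a Latin-1 terminal shows the two bytes separately, as Ã followed by an unprintable byte.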

Now, with Dossy filesystems, it's a little bit more complicated. As you noticed, VFAT stores fixed-width 16-bit Unicode (UCS-2) filenames on disk. So these need to be translated back to UTF-8 (or ISO-Latin-1) by the filesystem layer (vfat module, or mtools). In mtools, Unicode is not yet implemented; it just assumes ISO-Latin-1 (ignoring the high byte).
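You can see why dropping the high byte works for Western European characters (a sketch; UCS-2LE is iconv's name for little-endian UCS-2, the byte order VFAT uses):

  printf '\307' | iconv -f ISO-8859-1 -t UCS-2LE | od -An -tx1

This prints c7 00: on disk, Ç is the 16-bit value 0x00C7, so ignoring the high (zero) byte gives back the ISO-Latin-1 byte.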


I'm now beginning to think that mtools really isn't at fault here (and
that it's more a Posix/Linux limitation). Many thanks, and apologies for
the noise.

Jaime

No problem

Alain


