Re: [mtools] Short filenames, codepages and possible mtools/kernel bug

info-mtools

From:	Alain Knaff
Subject:	Re: [mtools] Short filenames, codepages and possible mtools/kernel bug
Date:	Wed, 31 May 2006 11:00:41 +0200
User-agent:	Thunderbird 1.5 (X11/20051201)

Jaime wrote:

On Mon, 2006-05-29 at 12:57 +0200, Alain Knaff wrote:
If you see C7 for the Ç, it is ok (and the mess-up only happened on display); if something else, then it is indeed an mtools bug.
Yup, the C7s are there in all the right places.


Good

(the capital C cedilla has been replaced by a tiny white question mark
inside a black diamond/lozenge). Just to check, I mount the filesystem
using the following command:

mount -t msdos -o codepage=850 /dev/fd0 temp

(Try mount -t vfat instead to get long names and extended characters)


Er, I don't think I want long names. But please bear with me here - I'm
a complete "character encoding" newbie, and I'm trying to learn how it
works.

"Long names" and "Unicode" are tied together. Vfat works by having Unicode filename entries in addition to the old size-limited MSDOS codepage entries. So, if you have long names, you'll get Unicode (or iso-8859-1) encoding for free.

"ls -b" returns "ab\200de.txt" so it's using octal 200 rather than octal
307, but I assume that's because the 80s (hex) on the disk are 200s in
octal (my "mount -t msdos" means that it's the short filenames which are
used, rather than the long filenames, so I get the 8.3 "codepaged"
version, rather than the long filename in unicode).


Yes, \0200 octal is indeed 0x80 hex, i.e. the old MSDOS encoding of Ç.
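
To make the byte values concrete, here is a minimal Python sketch (not part of mtools or the kernel) that decodes the byte found in the short directory entry under codepage 850 and shows the byte the same character has in ISO-8859-1. The variable names are made up for illustration.

```python
# The byte stored in the short (8.3) directory entry, as reported by "ls -b".
raw = bytes([0o200])                 # octal 200 is hex 0x80

# Under DOS codepage 850 this byte is the capital C cedilla.
as_cp850 = raw.decode("cp850")

# The same character in ISO-8859-1 is the single byte 0xC7 (octal 307).
as_latin1_bytes = as_cp850.encode("latin-1")

print(hex(raw[0]), as_cp850, hex(as_latin1_bytes[0]))
```

So 0x80 on disk and C7 in an ISO-8859-1 name are the same character, only in two different single-byte encodings.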

As you mounted the filesystem as MSDOS, it only "sees" the legacy entry (in MSDOS codepage). With vfat, it would look at the extended entries, which do have a more sane encoding. Even if vfat doesn't support Unicode, you should still get correct results, as the lower 256 codepoints are _identical_ with ISO-Latin-1.

Theoretically, even for the MS-DOS filesystem, there is a codepage= option to perform appropriate translations of encodings, but it doesn't appear to work.


Vfat does work, by looking at the "long" entries.

I suspect the error might be in the terminal program that you are using (which might be set to display UTF-8. Try changing that to ISO-8859-1, a.k.a. Iso-Latin-1)
Apart from not knowing (yet) how to do this, wouldn't this affect the
output for other mounted filesystems that _do_ use utf-8?


Yes, if you have any such filenames.

I read somewhere that short filenames on fat filesystems (on windows
systems) are encoded using the local codepage (which, when I created the
file under windows, was 850). This at least agrees with the hexdump of
my raw diskette (seeing the 80s make sense to me). What I'm really after
is the ability to view the short filenames on the diskette as they were
typed in under Windows.

After doing some more investigation, I found the following statement here:
http://svn.haxx.se/dev/archive-2005-05/0406.shtml

"The POSIX way of making filename encoding locale-dependent is
fundamentally broken IMO. But I don't think each tool can solve a system
problem. On POSIX systems, I think the best solution is to rely on the
locale like we currently do. People should set up their locale correctly
and ensure that filenames are in the encoding of the locale."

I don't fully understand this, but I think it means that POSIX (and
therefore Unix/Linux?) assumes that filenames are stored in a
character-encoding which is "represented" by the user's locale. But then
I have a problem: I only have one locale (at a time) but I have several
mounted filesystems - the majority have unicodish (<<new word?)
filenames (as they're ext3) but I want a filesystem with codepage 850
filenames mounted at the same time (my dos diskette). And there's my
problem. Many different simultaneous filename encoding mechanisms, but
only one locale.

The way Unicode is handled on "standard" filesystems (reiserfs, ext2, ext3, minix, etc.) is by the applications. UTF-8 (the usual Unicode representation) was defined in such a way that:

1. Codepoints 0-127 (legacy ASCII) are encoded as the same single bytes as in ASCII and Iso-Latin-1
2. From 128 upward, escape bytes are introduced.
3. All escaped filenames are proper filenames (do not contain zero bytes or slashes)
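
The three properties above can be checked with a few lines of Python. This is only an illustrative sketch using Python's built-in codecs; the sample name and the variable names are assumptions made for the example.

```python
name = "ab\u00c7de.txt"              # "abÇde.txt"
encoded = name.encode("utf-8")

# 1. ASCII characters keep their single-byte values.
ascii_unchanged = "abde.txt".encode("utf-8") == "abde.txt".encode("ascii")

# 2. A codepoint from 128 upward becomes a multi-byte sequence,
#    and every byte of that sequence has its high bit set.
cedilla = "\u00c7".encode("utf-8")
all_high = all(b >= 0x80 for b in cedilla)

# 3. Hence neither the zero byte nor the slash (0x2F) can appear
#    inside such a sequence, so the result is still a legal filename.
legal = all(b not in (0x00, 0x2F) for b in cedilla)

print(encoded, ascii_unchanged, len(cedilla), all_high, legal)
```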

I.e. the filesystem doesn't need to know about it; it all happens in the terminal programs and editors (which read their environment variables to know whether they should display a sequence of bytes as UTF-8 or Iso-latin-1). Applications which are not concerned with actually displaying such strings don't need to care (it's just a string of bytes to them; they don't need to know the visual representation of it).

Garbled filenames will result if you create names in one terminal which is set to Unicode, and read them using another, which is set to iso-latin-1. But that happens only for characters from 128 upward, never for standard Ascii. So, if all your filenames are standard Ascii anyway, you won't see the difference.
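
A small Python sketch of that mismatch, in both directions, might look like this. It only simulates what a terminal does when it interprets the bytes of a name; the sample names are assumptions for illustration. The second case yields U+FFFD, the replacement character that many terminals draw as a question mark inside a black diamond.

```python
# Name created under a UTF-8 terminal, displayed by an iso-latin-1 one.
written_utf8 = "ab\u00c7de.txt".encode("utf-8")
read_as_latin1 = written_utf8.decode("latin-1")

# Name created as iso-latin-1, displayed by a UTF-8 terminal.
# The lone byte 0xC7 is not valid UTF-8, so it is replaced by U+FFFD.
written_latin1 = "ab\u00c7de.txt".encode("latin-1")
read_as_utf8 = written_latin1.decode("utf-8", errors="replace")

# Pure ASCII names are unaffected either way.
ascii_name = "abde.txt".encode("ascii")
ascii_same = ascii_name.decode("utf-8") == ascii_name.decode("latin-1")

print(repr(read_as_latin1), repr(read_as_utf8), ascii_same)
```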

Now, with Dossy filesystems, it's a little bit more complicated. As you noticed, VFAT stores fixed-length-character Unicode (UCS-2, 16 bits per character) filenames on disk. So these need to be translated back to utf-8 (or iso-latin-1) by the filesystem layer (vfat module, or mtools). As Unicode is not yet implemented in mtools, it just assumes iso-latin-1 (ignoring the high byte).
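
As a rough illustration of "ignoring the high byte", the following Python sketch takes a long name stored as little-endian 16-bit units and keeps only the low byte of each. It is not the mtools code; the helper name ucs2_to_latin1_lossy is hypothetical and exists only for this example.

```python
def ucs2_to_latin1_lossy(raw: bytes) -> str:
    """Hypothetical helper: keep the low byte of each 16-bit unit."""
    low_bytes = raw[0::2]            # little-endian, so the low byte comes first
    return low_bytes.decode("latin-1")

# Codepoints below 256 survive unchanged, since they match iso-latin-1.
long_name = "ab\u00c7de.txt".encode("utf-16-le")
converted = ucs2_to_latin1_lossy(long_name)

# Codepoints from 256 upward are mangled: U+0141 loses its high byte 0x01
# and comes out as 0x41, i.e. the letter "A".
mangled = ucs2_to_latin1_lossy("\u0141".encode("utf-16-le"))

print(converted, mangled)
```

This is why names limited to the first 256 codepoints display correctly under this scheme, while anything beyond them does not.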

I'm now beginning to think that mtools really isn't at fault here (and
that it's more a Posix/Linux limitation). Many thanks, and apologies for
the noise.

Jaime


No problem

Alain

_______________________________________________
mtools mailing list
address@hidden
http://www.tux.org/mailman/listinfo/mtools
