
Re: [Texmacs-dev] string encoding


From: Joris van der Hoeven
Subject: Re: [Texmacs-dev] string encoding
Date: Fri, 15 Nov 2002 15:36:48 +0100 (MET)

> What format are texmacs strings encoded in? ASCII, ANSI, ISO-8859-1,...?

At the moment this is a bit tricky, because we do not yet use a universal
encoding (in the case of Cyrillic and Greek, the encodings are not
as they should be).

Let me rather tell you what the encoding should and will be
(for most European languages, this is already the case):
we use the Cork encoding for all characters except < and >
(Cork coincides with ISO-8859-1 for most west-European characters).

The < and > characters serve as delimiters for extensions:
< followed by any sequence of characters other than < and >, followed by >,
is interpreted as one single character. This allows us to use
a potentially infinite alphabet. In particular, the TeXmacs encoding
will encompass Unicode in the future, without being limited to Unicode
(we do have some mathematical characters which are not in Unicode).
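
A minimal sketch of this convention in Python (not TeXmacs code: the name
split_symbols is made up for illustration, and Python strings stand in
for Cork-encoded bytes). It splits a string into symbols, counting each
<...> group as one character:

```python
def split_symbols(s: str) -> list[str]:
    """Split s into symbols; each <...> group counts as one symbol."""
    symbols = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == "<":
            end = s.find(">", i + 1)
            if end == -1:
                raise ValueError(f"unterminated '<' at index {i}")
            if "<" in s[i + 1:end]:
                # The body of an extension may not contain '<' or '>'.
                raise ValueError(f"nested '<' after index {i}")
            symbols.append(s[i:end + 1])
            i = end + 1
        elif ch == ">":
            raise ValueError(f"stray '>' at index {i}")
        else:
            # Any other character is a symbol by itself.
            symbols.append(ch)
            i += 1
    return symbols


if __name__ == "__main__":
    print(split_symbols("a<alpha>b"))
```

With this reading, "a<alpha>b" has length three, although it occupies
nine positions in storage.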

I still have to think about how to reencode alphabets like Cyrillic
and Greek. We might denote Cyrillic characters by <cyr:a>, <cyr:b>, etc.
or otherwise by <ça>, <çb>, where "ç" is just a character in the range
0x80-0xff which stands for "Cyrillic". Notice that it may be important
to provide for operations like "upcase" in an intelligent way.
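
To illustrate why the naming scheme matters for "upcase", here is a
hedged sketch in Python. It assumes, purely for illustration, that the
<cyr:a> form is chosen and that the upper-case variant is written
<cyr:A>; neither the function nor that convention is actual TeXmacs
behaviour:

```python
import re

# One symbol is either a <...> group or a single character other than < and >.
# Stray, unmatched < or > characters are silently skipped by this pattern.
SYMBOL = re.compile(r"<[^<>]*>|[^<>]")


def upcase(s: str) -> str:
    """Upper-case s symbol by symbol (illustrative convention only)."""
    out = []
    for sym in SYMBOL.findall(s):
        if sym.startswith("<"):
            prefix, sep, name = sym[1:-1].partition(":")
            if sep:
                # Assumed convention: <cyr:a> becomes <cyr:A>.
                out.append(f"<{prefix}:{name.upper()}>")
            else:
                # No script prefix: a lookup table would be needed, so leave it.
                out.append(sym)
        else:
            out.append(sym.upper())
    return "".join(out)


if __name__ == "__main__":
    print(upcase("ab<cyr:a>"))
```

With a scheme like <ça>, the same operation would instead have to know
which byte in 0x80-0xff marks a script, so the choice of notation
directly affects how simple such operations can be.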

Note that it seems to be too late to switch to pure Unicode,
but I may still look at that in more detail.
In any case, it would be semi-Unicode in the sense that
we absolutely want to keep a potentially infinite alphabet.
Furthermore, I do not know whether there has been detailed thought
about operations like "upcase" for Unicode characters.

Finally, I note that we merely need something with at least
the same power as Unicode, together with converters to other encodings.
The particular choice of a given universal encoding should be made
according to the ease and efficiency with which such an encoding
can be manipulated by programs. For instance, Unicode might be bad
for moving one character back in a string. The current TeXmacs encoding
is sometimes a bit clumsy because it does not fully coincide
with ASCII in the range 0x00-0x7f (it does, except for < and >).
Unicode is good from the storage point of view though.
Nevertheless, the Cork encoding is better for most Western
European languages. Notice finally that export/import filters
may always store files on disk in an encoding which suits the user.
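
As a rough illustration of the "moving one character back" operation,
the Python sketch below compares the two cases. It assumes that the
remark about Unicode refers to a variable-width form such as UTF-8; both
function names are hypothetical and the input is assumed to be
well formed:

```python
def prev_symbol_start(s: str, pos: int) -> int:
    """Start index of the symbol ending just before pos (<...> convention)."""
    if not 0 < pos <= len(s):
        raise ValueError("pos must satisfy 0 < pos <= len(s)")
    if s[pos - 1] == ">":
        # An extension ends here: scan back to its opening '<'.
        start = s.rfind("<", 0, pos - 1)
        if start == -1:
            raise ValueError("'>' without a matching '<'")
        return start
    # Ordinary one-byte character.
    return pos - 1


def prev_utf8_start(b: bytes, pos: int) -> int:
    """Start index of the UTF-8 character ending just before pos."""
    if not 0 < pos <= len(b):
        raise ValueError("pos must satisfy 0 < pos <= len(b)")
    i = pos - 1
    # Continuation bytes have the bit pattern 10xxxxxx; skip back over them.
    while i > 0 and (b[i] & 0xC0) == 0x80:
        i -= 1
    return i


if __name__ == "__main__":
    print(prev_symbol_start("ab<alpha>", 9))
    print(prev_utf8_start("aé".encode("utf-8"), 3))
```

In both cases stepping back is a short backward scan rather than a plain
index decrement, which is the kind of cost to weigh when choosing the
encoding.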

> BTW, has my string_search_and_replace.patch been applied or has it
> been rejected? Just curious...

I don't remember having seen that patch (but I receive soooo many messages)...
If it is on Savannah, then I will take a look at it and let you know
as soon as I work through the incoming patches.




