Re: [Texmacs-dev] string encoding
From: Joris van der Hoeven
Subject: Re: [Texmacs-dev] string encoding
Date: Tue, 19 Nov 2002 10:42:46 +0100 (MET)
> A first draft of the dictionary mapping Cork (TeXmacs) encoding to
> Unicode encoding is now finished. You can take a look at it here:
>
> http://www.fbreuer.de/texmacs/corktounicode.scm
Great, this looks good.
It is probably not necessary to write the comments by hand (a lot of extra work).
It would be better to provide a separate dictionary that maps characters to
their symbolic names. This probably already exists for Unicode. With Andrey's
tool you can then compose the two dictionaries in order to get an explanation :^)
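
The composition suggested above could look roughly like this. It is only a sketch: Python's standard `unicodedata` module stands in for the Unicode name dictionary, the two `CORK_TO_UNICODE` entries are illustrative stand-ins for the real dictionary, and `compose` is a hypothetical helper, not Andrey's tool.

```python
import unicodedata

# Illustrative excerpt only (assumption); the real entries live in corktounicode.scm.
CORK_TO_UNICODE = {0x41: 0x0041, 0xE9: 0x00E9}

def unicode_name(code_point):
    """Look up the symbolic Unicode name, with a placeholder for unnamed code points."""
    return unicodedata.name(chr(code_point), "<unnamed>")

def compose(first, second):
    """Compose a dictionary with a lookup function: key -> second(first[key])."""
    return {key: second(value) for key, value in first.items()}

# Cork byte -> symbolic name, obtained without writing any comment by hand.
cork_names = compose(CORK_TO_UNICODE, unicode_name)
```

The resulting `cork_names` dictionary plays the role of the hand-written comments.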
> Any suggestions and/or corrections are welcome. Does anybody have an
> idea how to test this mapping? (I.e. generate a document/table where one
> can visually verify that the mapping is correct?)
You have to be careful with the number of bytes for each character.
In the Cork encoding, each character takes only one byte,
so you should write "#41" for "A" rather than "#0041".
In Unicode encodings such as UTF-8, some characters take one byte, some two,
and some even more.
We still have to develop something in C++ for testing all this.
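
To make the byte counts concrete, here is a small sketch that assumes UTF-8 as the Unicode encoding. `utf8_length` and `check_table` are hypothetical helpers; the second one prints one row per dictionary entry so that the mapping can be checked by eye.

```python
def utf8_length(code_point):
    """Number of bytes the code point occupies in UTF-8."""
    return len(chr(code_point).encode("utf-8"))

def check_table(cork_to_unicode):
    """Yield one printable row per entry, rejecting keys that are not one byte."""
    for cork_byte, code_point in sorted(cork_to_unicode.items()):
        if not 0 <= cork_byte <= 0xFF:
            raise ValueError(f"Cork key {cork_byte:#x} does not fit in one byte")
        yield (f"#{cork_byte:02X} -> U+{code_point:04X} {chr(code_point)} "
               f"utf8_bytes={utf8_length(code_point)}")

# Example: print the rows for a dictionary holding only the entry for "A".
for row in check_table({0x41: 0x0041}):
    print(row)
```

Cork keys are written with two hex digits and Unicode code points with at least four, which mirrors the "#41" versus "#0041" distinction.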
> I didn't make a patch from the dictionary because I don't know where to
> put it in the TeXmacs source tree. How to use this dictionary to convert
> between encodings? I guess just a bit of Scheme code would do the trick,
> but I don't know Scheme well enough (yet).
We would rather write these conversion routines in C++ (they must
be really fast), in src/Resources/Translators.
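
The routines themselves are planned for C++; the Python sketch below only shows the shape a fast table-driven conversion might take. The two table entries are illustrative, the name `cork_to_utf8` is hypothetical, and falling back to U+FFFD for unmapped bytes is an assumption rather than a decided policy.

```python
# Illustrative entries only (assumption); a real table would be generated
# from the Cork -> Unicode dictionary.
CORK_TO_UNICODE = {0x41: "A", 0xE9: "\u00e9"}

# Precompute a 256-entry table once, so each input byte costs one index lookup.
# Unmapped bytes become U+FFFD (replacement character) in this sketch.
TABLE = [CORK_TO_UNICODE.get(byte, "\ufffd") for byte in range(256)]

def cork_to_utf8(data: bytes) -> bytes:
    """Convert a Cork-encoded byte string to UTF-8 by table lookup."""
    return "".join(TABLE[byte] for byte in data).encode("utf-8")
```

Because every Cork character is exactly one byte, a flat 256-entry array is enough; the reverse direction would need a real dictionary or trie keyed on code points.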
> Next, I am going to write a TeXmacs universal encoding <-> Unicode
> dictionary. I noticed that sometimes the universal characters are
> encoded this way: \<char\> and sometimes this way: <char>. Which of
> these two should I use in the dictionary? Or should I use just char?
You should use the <char> form, which is the one used internally.
Andrey: I noticed that you do not put the <> around the characters
when converting from .enc to .scm. This should be done for strings
of length >1.
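
As a sketch of that rule, reading "length >1" as a count of characters (`universal_symbol` is a hypothetical name):

```python
def universal_symbol(name: str) -> str:
    """Wrap multi-character names in <>; leave single characters bare."""
    return f"<{name}>" if len(name) > 1 else name
```

For instance, `universal_symbol("alpha")` gives `<alpha>`, while `universal_symbol("a")` stays `a`.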
> Regarding ISO-8859-*: I noticed that ISO-8859-1 is a subset of Unicode
> (see ftp://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT). How about
> the other ISO-8859-* encodings? Instead of writing a dictionary it would
> probably be more sensible to just use iconv to convert
> ISO-8859<->Unicode.
Absolutely. Maybe you can actually find this somewhere on the web.
Note that it would be good to include the "la" encoding for Cyrillic too.
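
A rough illustration of the "just use a converter" approach. Python's built-in codecs stand in for iconv here (an assumption made for the sake of a self-contained example), and `iso8859_to_utf8` is a hypothetical helper. The "la" encoding is not covered, since no standard codec for it is assumed.

```python
def iso8859_to_utf8(data: bytes, part: int) -> bytes:
    """Convert ISO-8859-<part> bytes to UTF-8 using Python's built-in codecs."""
    return data.decode(f"iso-8859-{part}").encode("utf-8")

# ISO-8859-1 maps byte n to code point U+00nn for every n, which is why
# it needs no dictionary at all.
latin1_is_subset = all(
    ord(bytes([n]).decode("iso-8859-1")) == n for n in range(256)
)
```

The other parts do need a real table: in ISO-8859-5, for example, byte 0xB0 is the Cyrillic capital letter A (U+0410), not U+00B0.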