Re: [Help-source-highlight] Unicode files ?

From: Lionel Fumery
Subject: Re: [Help-source-highlight] Unicode files ?
Date: Tue, 30 Mar 2010 16:09:10 +0200
User-agent: Thunderbird (Windows/20100228)


I'm not a Unicode expert either, so it's great if this feature is easier than I thought...

About working in UTF-8 all the time, do you mean (for example) converting UTF-16 or anything else to UTF-8 and then working in UTF-8? Or only supporting UTF-8 files?

Thanks for your help,

Dario Teixeira wrote:

Thanks for your answers. Again, as I discovered Source-highlight very
recently, I don't know if Unicode is an important feature for you or
not... I sometimes read source code from Japanese or Chinese
developers, and am French myself, so it's not unusual to store code
or text files in Unicode (I mostly work with Visual Studio).

I would say that Unicode is an essential feature.  In fact, I thought
Source-highlight was already Unicode-compliant, since this is 2010 and
it is hard to imagine an application that isn't.

Unicode files (UTF-8 for example, which is widely used on the Internet)
can store characters in 1 to 4 bytes. So of course it's very difficult
to use (length() and so on are difficult).

I think you are exaggerating the difficulty of dealing with variable-length
encodings such as UTF-8.  In fact, almost every library I know that deals
with Unicode does so using the UTF-8 encoding.  Sure, finding the Nth
element of a string is an O(n) operation instead of O(1), but many other
common operations such as strcpy() and strcat() are done the same way as
with a fixed-length encoding.
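
To make the O(n) point concrete, here is a minimal sketch of locating the Nth code point in a UTF-8 string. The helper name `utf8_offset_of` is hypothetical (it is not part of Source-highlight or any library mentioned in this thread), and the code assumes the input is already valid, NUL-terminated UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
static bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: byte offset of the n-th code point (0-based) in a
// valid, NUL-terminated UTF-8 string. Returns strlen(s) if the string has
// n or fewer code points. The scan always starts at the beginning, which
// is why indexing is O(n) rather than O(1).
std::size_t utf8_offset_of(const char* s, std::size_t n) {
    assert(s != nullptr);
    std::size_t i = 0;
    while (s[i] != '\0') {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            if (n == 0) return i;  // found the start of the requested code point
            --n;
        }
        ++i;
    }
    return i;
}
```

For the string "héllo" encoded as UTF-8, `utf8_offset_of(s, 2)` yields byte offset 3, because "é" occupies two bytes.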

1) First you have to know if the file is Unicode or not. They
should have a header, described here:

Some of us use Source-highlight as a library, and therefore that
determination should be made by the main application.  I suggest
that the core functions of Source-highlight be parameterised over
the encoding used.  Almost everyone uses either single-byte
(non-Unicode, thus) or Unicode in the form of UTF-8. There's also
some UTF-16 out there and even UTF-32 (aka UCS-4), but these are
less common.  In fact, if Source-highlight were to support only
single-byte encodings and UTF-8, I would deem it Unicode-compliant.
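
As a rough illustration of this split of responsibilities, the sketch below lets the calling application check for a UTF-8 byte-order mark and then pass the encoding to a library-side function. The names `Encoding`, `has_utf8_bom` and `count_characters` are assumptions made up for this example; they are not Source-highlight's actual API. Note that many UTF-8 files carry no byte-order mark at all, so its absence proves nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoding tag chosen by the calling application.
enum class Encoding { SingleByte, Utf8 };

// Application side: does the data start with the UTF-8 byte-order mark
// (bytes EF BB BF)?
bool has_utf8_bom(const std::string& data) {
    return data.size() >= 3 && data.compare(0, 3, "\xEF\xBB\xBF") == 0;
}

// Library side: a function parameterised over the encoding.
// Assumes that text tagged Utf8 is well-formed.
std::size_t count_characters(const std::string& text, Encoding enc) {
    if (enc == Encoding::SingleByte) return text.size();  // one byte per character
    assert(enc == Encoding::Utf8);  // only two encodings in this sketch
    std::size_t n = 0;
    for (unsigned char b : text) {
        if ((b & 0xC0) != 0x80) ++n;  // skip continuation bytes (10xxxxxx)
    }
    return n;
}
```

With this shape the library never guesses: the same bytes "café" in UTF-8 count as 4 characters when tagged `Encoding::Utf8` and as 5 when tagged `Encoding::SingleByte`.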

2) The second thing is to convert the whole file to a "fixed bytes
per character" format, so you can work with it. A wide char format
(16 bits wchar) is a good choice most of the time.

Actually, a 16-bit wchar is a terrible choice, since Unicode code points
need up to 21 bits and so do not fit in 16 (in practice you need a 32-bit
type).  Also, you don't necessarily need to convert the whole file to a
fixed-length encoding.  Why not simply work natively in UTF-8?
It's really not as difficult as you make it out to be...
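
A small sketch of what working natively in UTF-8 can look like: decoding one code point at a time into a 32-bit value, without converting the whole buffer first. The function name `next_code_point` is hypothetical, and the code assumes well-formed UTF-8 (it performs no validation).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes the code point that starts at s[pos] and advances pos past it.
// Assumes well-formed UTF-8; malformed input is not detected.
char32_t next_code_point(const std::string& s, std::size_t& pos) {
    assert(pos < s.size());
    unsigned char lead = static_cast<unsigned char>(s[pos++]);
    int extra;     // number of continuation bytes that follow the lead byte
    char32_t cp;   // payload bits collected so far
    if (lead < 0x80)      { extra = 0; cp = lead; }         // 0xxxxxxx
    else if (lead < 0xE0) { extra = 1; cp = lead & 0x1F; }  // 110xxxxx
    else if (lead < 0xF0) { extra = 2; cp = lead & 0x0F; }  // 1110xxxx
    else                  { extra = 3; cp = lead & 0x07; }  // 11110xxx
    while (extra-- > 0 && pos < s.size()) {
        // Each continuation byte contributes its low 6 bits.
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    }
    return cp;
}
```

Decoding the four bytes F0 9F 98 80 gives U+1F600, a value above 0xFFFF, which is the kind of code point a single 16-bit unit cannot hold.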

Don't know too much on the Linux side, but it's simply a matter of
wcslen, wcscpy, wcscat instead of length(), strcpy(), strcat() with
Visual Studio.

Using UTF-8, you will need a special length() function, but you can
use the regular strcpy() and strcat().  I don't use C++, but a quick
Google search tells me there are libraries out there that provide
UTF-8 support for C++.
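
The point about needing only a special length function can be sketched as follows. The names `utf8_length` and `join_utf8` are hypothetical, the input is assumed to be valid NUL-terminated UTF-8, and the length counted is the number of code points (not bytes, and not necessarily user-visible characters when combining marks are involved).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Counts code points: every byte except continuation bytes (10xxxxxx)
// starts a new code point.
std::size_t utf8_length(const char* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// The byte-oriented strcpy() and strcat() work unchanged on UTF-8, because
// no byte inside a multi-byte sequence is ever zero.
// The caller must ensure out can hold strlen(a) + strlen(b) + 1 bytes.
void join_utf8(char* out, const char* a, const char* b) {
    std::strcpy(out, a);
    std::strcat(out, b);
}
```

Joining "naïve " and "日本" this way produces a 13-byte string for which `utf8_length` reports 8, while `strlen` keeps reporting the byte count needed for buffer sizing.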

Dario Teixeira