Re: [Help-source-highlight] Unicode files ?

From: Dario Teixeira
Subject: Re: [Help-source-highlight] Unicode files ?
Date: Tue, 30 Mar 2010 06:26:03 -0700 (PDT)


> Thanks for your answers. Again, as I discovered Source-highlight very
> recently, I don't know if Unicode is an important feature for you or
> not... I sometimes read source code from Japanese or Chinese
> developers, and am French myself, so it's not unusual to store code
> or text files in Unicode (I mostly work with Visual Studio).

I would say that Unicode is an essential feature.  In fact, I thought
Source-highlight was already Unicode-compliant, since this is 2010 and
it is hard to imagine an application that isn't.

> Unicode files (UTF-8 for example, which is widely used on the Internet)
> can store characters on 1 to 6 bytes. So of course it's very difficult
> to use (length() and so on are difficult)

I think you are exaggerating the difficulty of dealing with variable-length
encodings such as UTF-8.  In fact, almost every library I know that deals
with Unicode does so using the UTF-8 encoding.  Sure, finding the Nth
element of a string is an O(n) operation instead of O(1), but many other
common operations such as strcpy() and strcat() work unchanged, since they
only copy bytes up to the terminating NUL.
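
To illustrate the O(n) lookup, here is a minimal C sketch that finds the
byte offset of the Nth code point.  The name utf8_offset() is hypothetical
(it is not part of Source-highlight or of the C library), and the sketch
assumes the input is valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the byte offset of the n-th code point (0-based) in the
 * NUL-terminated UTF-8 string s, or (size_t)-1 if the string holds
 * fewer than n + 1 code points.  Assumes s is valid UTF-8. */
size_t utf8_offset(const char *s, size_t n)
{
    assert(s != NULL);
    size_t seen = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        /* Continuation bytes have the form 10xxxxxx; every other
         * byte starts a new code point. */
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (seen == n)
                return i;
            seen++;
        }
    }
    return (size_t)-1;
}
```

The scan has to walk the string from the start, which is where the linear
cost comes from; nothing else about the string handling changes.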

> 1) First you have to know if the file is Unicode or not. They
> should have a header, described here:

Some of us use Source-highlight as a library, and therefore that
determination should be made by the main application.  I suggest
that the core functions of Source-highlight be parameterised over
the encoding used.  Almost everyone uses either single-byte
(non-Unicode, thus) or Unicode in the form of UTF-8. There's also
some UTF-16 out there and even UTF-32 (aka UCS-4), but these are
less common.  In fact, if Source-highlight were to support only
single-byte encodings and UTF-8, I would deem it Unicode-compliant.

> 2) The second thing is to convert the whole file to a "fixed bytes
> per character" format, so you can work with it. A wide char format
> (16 bits wchar) is a good choice most of the time.

Actually, a 16-bit wchar is a terrible choice, since Unicode code points
range up to U+10FFFF and do not fit in 16 bits; a fixed-width encoding
needs 32-bit units.  Also, you don't necessarily need to convert the whole
file to a fixed-length encoding.  Why not simply work natively in UTF-8?
It's really not as difficult as you make it out to be...
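
As a rough illustration of working natively in UTF-8, the C sketch below
decodes one code point at a time into a 32-bit value.  The function name
utf8_decode() is hypothetical, and the sketch is deliberately incomplete:
it checks the lead and continuation bytes but does not reject overlong
forms or surrogate values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes the code point that starts at s and stores it in *cp.
 * Returns the number of bytes consumed (1 to 4), or 0 if the lead
 * byte is invalid or a continuation byte is missing.
 * Sketch only: overlong forms and surrogates are not rejected. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t len;
    uint32_t value;

    if (p[0] < 0x80)                { len = 1; value = p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; value = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; value = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; value = p[0] & 0x07; }
    else return 0;

    for (size_t i = 1; i < len; i++) {
        /* A terminating NUL fails this test too, so a truncated
         * sequence is reported instead of being read past. */
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        value = (value << 6) | (p[i] & 0x3F);
    }
    *cp = value;
    return len;
}
```

For example, the four bytes F0 9F 98 80 decode to U+1F600, a value above
0xFFFF that a single 16-bit unit cannot hold.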

> Don't know too much on the Linux side, but it's simply a matter of
> wcslen, wcscpy, wcscat instead of length(), strcpy(), strcat() with
> Visual Studio.

Using UTF-8, you will need a special length() function, but you can
use the regular strcpy() and strcat().  I don't use C++, but a quick
Google search tells me there are libraries out there that provide
UTF-8 support for C++.
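
A minimal C sketch of that split, assuming valid NUL-terminated UTF-8:
utf8_length() counts code points, while concatenation goes through the
ordinary strcpy() and strcat().  Both utf8_length() and utf8_join() are
hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts code points by skipping continuation bytes (10xxxxxx).
 * Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_length(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Concatenates a and b into dst using the regular byte-oriented
 * functions.  Returns 0 on success, -1 if dst is too small.
 * Sizes are in bytes, so strlen() is the right measure here. */
int utf8_join(char *dst, size_t dst_size, const char *a, const char *b)
{
    assert(dst != NULL && a != NULL && b != NULL);
    if (strlen(a) + strlen(b) + 1 > dst_size)
        return -1;
    strcpy(dst, a);
    strcat(dst, b);
    return 0;
}
```

Note that buffer sizes still have to be computed with strlen(), because
they are measured in bytes; utf8_length() is only needed when the number
of characters matters, for instance when computing column positions.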

Dario Teixeira
