Re: [Help-source-highlight] Unicode files ?

From: Dario Teixeira
Subject: Re: [Help-source-highlight] Unicode files ?
Date: Tue, 30 Mar 2010 12:52:48 -0700 (PDT)


> The problem isn't the number of the year, but that C/C++ and its
> standard libs neglected advanced string processing aside char (even
> wchar is kind of a step child even in todays C++ programming) for a
> long time, so you are always reliant on some advanced lib that
> supports this (non-trivial, if correctly done) encoding stuff (QString
> is an excellent example, and in my eyes still a reference for how
> string classes should be done), or you had to roll your own, using
> what little support C is able to give. Should get better with C++0x,
> but for source-highlight I wouldn't count on it, as it will take a
> while until it's available on most platforms and installations.

I abandoned C++ almost a decade ago, and nowadays I mostly use OCaml.
Nevertheless, the situation in the two languages is similar in the sense
that the string type in the core language is not encoding-aware.  OCaml
users have grown used to relying on external libraries whenever they need
encoding-aware handling of UTF-8.

> That may be the case, but you still need some non-standard
> infrastructure around it to make UTF-8 string processing work
> properly, and usually that's nothing that you do in one evening for
> your home-brew projects (not meant to slag you, Lorenzo ;-)).

Yes, I would not recommend that Lorenzo implement his own UTF-8
handling functions either.  And even if he is reluctant to link against
yet another library, perhaps he can just copy and paste the required
code, if the license allows it.  This latter solution is feasible
because many applications need only one or two UTF-8 specific
functions, such as a strlen that counts characters rather than bytes.
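
To illustrate how small such a function can be, here is a minimal sketch
of a UTF-8 aware strlen in C++.  The name utf8_length is hypothetical and
is not taken from Source-highlight or any library.  It assumes the input
is already valid UTF-8 and counts code points, which is not always the
same as the number of characters a user sees (combining marks count
separately).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-8 string.
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        // Continuation bytes have the bit pattern 10xxxxxx.
        // Every other byte starts a new code point.
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the two-byte sequence encoding "é", std::string::size() reports 2
while utf8_length reports 1.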

> One problem, aside from strlen() (without which it's IMHO hard to
> write any string processing at all), is how to determine which type
> the string literal in your code is, or which encoding the file you're
> processing has.

Or you can just expect the caller of Source-highlight's core functions
to tell you the encoding and/or to always provide UTF-8 strings.  This
simplifies things tremendously.
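
One way such an interface could look is sketched below in C++.  The
Encoding enum and the to_utf8 function are hypothetical names and do not
describe Source-highlight's actual API.  The sketch assumes only two
encodings: the caller declares which one the input uses, the input is
normalized to UTF-8 once at the boundary, and everything downstream can
assume UTF-8.

```cpp
#include <string>

// Hypothetical: the encodings a caller may declare for its input.
enum class Encoding { Utf8, Latin1 };

// Hypothetical boundary function: returns the input as UTF-8.
// UTF-8 input is passed through unchanged (and not validated here).
std::string to_utf8(const std::string& input, Encoding enc) {
    if (enc == Encoding::Utf8) {
        return input;
    }
    // Latin-1: each byte is the code point U+0000..U+00FF.
    std::string out;
    out.reserve(input.size() * 2);
    for (unsigned char byte : input) {
        if (byte < 0x80) {
            out.push_back(static_cast<char>(byte));  // ASCII stays one byte
        } else {
            // Code points U+0080..U+00FF take two bytes in UTF-8.
            out.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return out;
}
```

With this shape, guessing the encoding is the caller's problem, and the
core code never has to handle more than one representation.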

> Never heard of any environment using UTF-32 seriously. And UTF-16 I
> know mostly from VFAT and NTFS... However, my experience in this field
> is limited.

UTF-32 is used by some people who prefer to deal with fixed-length
encodings.  And if your application requires frequent access to arbitrary
character positions, then it may be worth paying the price in extra memory
in exchange for O(1) access.
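
The trade-off can be sketched in C++ as follows.  The function name
utf8_to_utf32 is hypothetical, and the decoder assumes valid UTF-8 input
(it performs no error checking).  After a single O(n) conversion, the
i-th code point is a plain array lookup, at a cost of four bytes per
code point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode valid UTF-8 into UTF-32 code points.
// Malformed input is not detected.
std::u32string utf8_to_utf32(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 0; k < extra && i < s.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Indexing the result with operator[] is constant time, whereas finding the
same position in the UTF-8 original requires scanning from the start.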

Dario Teixeira
