Re: [Help-source-highlight] Unicode files ?

From: Lionel Fumery
Subject: Re: [Help-source-highlight] Unicode files ?
Date: Tue, 30 Mar 2010 11:59:01 +0200
User-agent: Thunderbird (Windows/20100228)

Hi Lorenzo, Martin (and others maybe),

Thanks for your answers. Again, as I discovered Source-highlight only very recently, I don't know whether Unicode is an important feature for you or not... I sometimes read source code from Japanese or Chinese developers, and I am French myself, so it is not unusual for me to store code or text files in Unicode (I mostly work with Visual Studio).

Unicode files (UTF-8 for example, which is widely used on the Internet) store each character in 1 to 4 bytes. So of course they are harder to work with: byte-oriented functions such as length() no longer return the number of characters.
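
To illustrate why a byte count and a character count differ, here is a minimal sketch that counts code points in a UTF-8 string. The function name utf8_code_points is made up for this example, and the code assumes the input is already valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 encoded byte string.
// Assumption: the input is valid UTF-8; nothing is validated here.
std::size_t utf8_code_points(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char c : bytes) {
        // Continuation bytes have the form 10xxxxxx.
        // Every other byte starts a new character.
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the word "été" encoded as the five bytes C3 A9 74 C3 A9, size() reports 5 while this function reports 3.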

1) First you have to know whether the file is Unicode or not. Such files should start with a header (the byte order mark, or BOM), described here:

(Note that "bad" Unicode text files, i.e. files without any header, are quite common, but there is no need to address this here.)
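
Checking for the header could look roughly like the sketch below. The Encoding enum and the detect_bom function are hypothetical names chosen for this example, not part of Source-highlight. Only the UTF-8 and UTF-16 marks are recognized; UTF-32, whose little-endian mark also begins with FF FE, is deliberately left out.

```cpp
#include <cstddef>
#include <string>

enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE };

// Inspects the first bytes of a buffer for a byte order mark (BOM).
// Returns Encoding::Unknown when no BOM is found, which covers both
// plain 8-bit files and "bad" Unicode files without a header.
// Assumption: UTF-32 input is not expected and is not detected.
Encoding detect_bom(const std::string& data) {
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(data[i]);
    };
    if (data.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF) {
        return Encoding::Utf8;
    }
    if (data.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE) {
        return Encoding::Utf16LE;
    }
    if (data.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF) {
        return Encoding::Utf16BE;
    }
    return Encoding::Unknown;
}
```

A file beginning with FF FE, like the "test" file from my first message, would be reported as Encoding::Utf16LE.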

2) The second step is to convert the whole file to a "fixed bytes per character" format, so you can work with it. A wide char format (16-bit wchar_t, as in Visual Studio; on Linux wchar_t is typically 32 bits) is a good choice most of the time.

Here is a FAQ explaining how to read Unicode files:

I can provide some C source code snippets to match this.
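
As a rough idea of what such a conversion could look like (written in C++ here, and only a sketch), the function below turns little-endian UTF-16 bytes into 16-bit code units. The name utf16le_to_units is invented for this example. It uses std::u16string rather than wchar_t so that the unit size is 16 bits on every platform. Characters outside the Basic Multilingual Plane still take two units (a surrogate pair), so the result is not strictly one unit per character.

```cpp
#include <cstddef>
#include <string>

// Converts UTF-16 little-endian bytes into 16-bit code units.
// A leading FF FE byte order mark is skipped.
// Assumption: a trailing odd byte is silently ignored, and
// surrogate pairs are copied through without being combined.
std::u16string utf16le_to_units(const std::string& bytes) {
    std::size_t i = 0;
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE) {
        i = 2;  // skip the BOM
    }
    std::u16string out;
    for (; i + 1 < bytes.size(); i += 2) {
        unsigned lo = static_cast<unsigned char>(bytes[i]);
        unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return out;
}
```

Fed the bytes FF FE 74 00 65 00 73 00 74 00, this yields the four code units of "test".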

3) And then you can work with wchar functions.

I don't know much about the Linux side, but with Visual Studio it's simply a matter of using wcslen(), wcscpy(), wcscat() instead of strlen(), strcpy(), strcat().
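
A small sketch of that substitution, using the standard <cwchar> functions, which are available with Visual Studio and with GCC on Linux. The helper name join_wide is made up for this example.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Joins two wide strings with the C wide-character functions,
// mirroring what strlen/strcpy/strcat do for char strings.
std::wstring join_wide(const wchar_t* a, const wchar_t* b) {
    // Room for both strings plus the terminating null character.
    std::wstring buffer(std::wcslen(a) + std::wcslen(b) + 1, L'\0');
    std::wcscpy(&buffer[0], a);  // wide counterpart of strcpy
    std::wcscat(&buffer[0], b);  // wide counterpart of strcat
    // Drop the terminator slot so size() matches wcslen().
    buffer.resize(std::wcslen(buffer.c_str()));
    return buffer;
}
```

For instance, join_wide(L"source-", L"highlight") gives the wide string L"source-highlight", whose length is counted in wide characters, not in bytes.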

I'm going to take a look at the Source-highlight code to see whether this could be easy to add...


Lorenzo Bettini wrote:
Lionel Fumery wrote:

I'm new to Source-highlight; I began just a week ago, in fact. It works fine, but I have trouble understanding how it handles Unicode files.

For example, create a simple text file, saved as Unicode, containing only the word "test". If you open this text file in a hexadecimal editor, the content will be FF FE 74 00 65 00 73 00 74 00. In this sequence, FF FE is the Unicode marker (the byte order mark for little-endian UTF-16).

When highlighting this file:
    source-highlight test.txt --line-number

the resulting HTML file is incorrect:
    <pre><tt><font color="#000000">1:</font> ??t?e?s?t?</tt></pre>

As you can see, the Unicode encoding is simply not handled.

Could you please tell me whether this is supported by Source-highlight, and how I can enable it?

Thank you a lot for your help!

Hi Lionel

Actually, I have never dealt with Unicode characters, so source-highlight probably does not support them...

Does anybody have an idea of how to add such support to a C++ program? Is it just a matter of using wchar for strings?

