octave-maintainers
Re: char type in Octave


From: Rik
Subject: Re: char type in Octave
Date: Thu, 24 May 2018 09:36:51 -0700

On 05/24/2018 09:00 AM, address@hidden wrote:
Subject: Re: char type in Octave
From: mmuetzel <address@hidden>
Date: 05/24/2018 08:29 AM

TL;DR: Let's stay with UTF-8.

Longer version:
I had a (not so) quick look at the code, and the effort required to
switch our char representation seems unreasonably high.
If we kept our current 8-bit representation, the main "issue" from a user's
point of view might be with indexing: a user might expect that a char
vector with N characters would always have N elements and that indexing the
n-th element would return the n-th character.
But even if we moved from an 8-bit representation of characters to a 16-bit
representation, we wouldn't be able to represent characters from higher
Unicode planes with one char element. Even if we went one step further and
used a 32-bit representation, there are character modifiers (e.g. accents).
So one character could always be represented by several basic elements
(8-bit, 16-bit, or 32-bit).
Thus, indexing into character arrays will always be problematic in some
cases, no matter which UTF flavour we use.
I am seconding Rik's and Michael's reasoning and would like to vote for
staying with 8-bit chars.
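
To make the indexing argument above concrete, here is a minimal sketch that counts how many code units the same text needs in UTF-8, UTF-16 and UTF-32. The helper name code_unit_counts is hypothetical (it is not an existing Octave function), and the sketch assumes the input is well-formed UTF-8 stored in 8-bit chars. The byte values are written out explicitly so the example does not depend on the encoding of the source file.

```octave
1;  % mark this file as a script so the function below can be defined inline

% Hypothetical helper: derive code-unit counts from the UTF-8 bytes alone.
% Assumes STR is well-formed UTF-8.
function [n8, n16, n32] = code_unit_counts (str)
  bytes = double (str);
  % Continuation bytes are 10xxxxxx (128..191); everything else starts a
  % new code point.
  is_lead = (bytes < 128) | (bytes > 191);
  n8  = numel (bytes);      % UTF-8: one code unit per byte
  n32 = sum (is_lead);      % UTF-32: one code unit per code point
  % UTF-16 needs a surrogate pair for code points above U+FFFF, which are
  % exactly those with a 4-byte UTF-8 form (lead byte >= 240).
  n16 = n32 + sum (bytes >= 240);
endfunction

emoji  = char ([240 159 152 128]);  % U+1F600, outside the Basic Multilingual Plane
accent = char ([101 204 129]);      % 'e' followed by U+0301 COMBINING ACUTE ACCENT

[n8, n16, n32] = code_unit_counts (emoji)    % 4, 2, 1
[n8, n16, n32] = code_unit_counts (accent)   % 3, 2, 2
```

The second case is the one that no fixed-width encoding fixes: a reader sees a single accented letter, yet even UTF-32 needs two elements for it.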
I do think that is a good idea.  And UTF-8 is well understood, which means we don't need to work out a solution from scratch.  There must be loads of other programs that have made the transition, and we can use the same strategy they did.


However, I am still in favor of consistently using and supporting Unicode
(UTF-8) wherever possible.
We could mitigate the possible issue with indexing by providing dedicated
functions. These could help with indexing into char arrays by identifying
the elements that belong to one character.
Something along the lines of:
str = 'aäbc'
str_idx = u8_char_idx(str)

which could result in:
str_idx = [ 1 2 2 3 4 ]

Indexing the n-th character would be as easy as:
str(str_idx==n)
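
One possible way to implement such a helper is sketched below. The name u8_char_idx comes from the proposal above and is not an existing Octave function. The sketch assumes well-formed UTF-8 input and relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```octave
1;  % mark this file as a script so the function below can be defined inline

% Hypothetical helper, assuming STR holds well-formed UTF-8 in 8-bit chars.
function idx = u8_char_idx (str)
  bytes = double (str);
  % Continuation bytes have the bit pattern 10xxxxxx, i.e. values 128..191.
  is_continuation = (bytes >= 128) & (bytes <= 191);
  % Every non-continuation byte starts a new code point.
  idx = cumsum (! is_continuation);
endfunction

str = char ([97 195 164 98 99]);   % UTF-8 bytes of 'aäbc'
str_idx = u8_char_idx (str)        % displays 1 2 2 3 4
second = str(str_idx == 2);        % the two bytes that encode 'ä'
```

Note that this groups bytes by code point only. A base letter followed by a combining accent would still get two different index values, so a function meant to index user-perceived characters would need additional rules.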

That also leads back to my initial doubt of whether "element-wise" operators
on character arrays like isupper or islower should return an array of the
same size as the input. IMHO they should.

Yes, one core programming idea is the Principle of Least Surprise (https://en.wikipedia.org/wiki/Principle_of_least_astonishment).  As a programmer I would be very surprised--even upset--if I called a function like toupper with a 5-byte string, and it came back as a 10-byte string.
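
As an illustration of the size-preserving behaviour discussed above, here is a minimal sketch of a byte-wise upper-casing routine. The name ascii_toupper is hypothetical, and the sketch deliberately handles only the ASCII letters a-z, leaving every other byte untouched. That restriction is what guarantees the output has the same size as the input. Full Unicode case mapping cannot make that promise in general: for example, U+0131 (dotless i, two bytes in UTF-8) upper-cases to the one-byte 'I'.

```octave
1;  % mark this file as a script so the function below can be defined inline

% Hypothetical helper: upper-case only the ASCII letters a-z, byte by byte.
function out = ascii_toupper (str)
  bytes = double (str);
  is_lower = (bytes >= 97) & (bytes <= 122);   % 'a'..'z'
  bytes(is_lower) -= 32;                       % shift to 'A'..'Z'
  out = char (bytes);                          % same shape as the input
endfunction

str = char ([97 195 164 98 99]);   % UTF-8 bytes of 'aäbc'
up  = ascii_toupper (str);
isequal (size (up), size (str))    % true: no element added or removed
double (up)                        % 65 195 164 66 67
```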

--Rik

