Re: RFC: add a string-desc module
From: Jeffrey Walton
Subject: Re: RFC: add a string-desc module
Date: Fri, 24 Mar 2023 19:20:19 -0400
On Fri, Mar 24, 2023 at 5:50 PM Bruno Haible <bruno@clisp.org> wrote:
>
> In most application areas, it is not a problem if strings cannot contain NUL
> bytes, and thus the C type 'char *' with its NUL terminator is well usable.
>
> In areas where strings with embedded NUL bytes need to be handled, the common
> approach is to use a 'char * data' pointer together with a 'size_t nbytes'
> size. This works fine in code that constructs or manipulates strings with
> embedded NUL bytes. But when it comes to *storing* them, for example in an
> array or as key or value of a hash table, one needs a type that combines these
> two fields:
>
> struct
> {
>   size_t nbytes;
>   char * data;
> }
>
> I propose to add a module that adds such a type, together with elementary
> functions that work on them.
>
> Such a type was long known as a "string descriptor" in VMS. It's also known
> as basic_string_view<char> in C++, or as String in Java.
>
> The type that I'm proposing does not have a NUL byte appended to the data
> always and automatically, because I think it is more important to have a
> string_desc_substring function that does not cause memory allocation,
> than to have a string_desc_c function (conversion to 'char *') that does
> not cause memory allocation.
I would be cautious about omitting the terminating NUL. A natural thing
to want to do is print a string, and C library routines usually expect
a terminating NUL.
Also, if you initialize the struct from a string literal, then the
underlying storage will likely include a terminating NUL. I understand
the size member will not count the NUL, but it will be present in the
storage anyway. (Unless you do something ugly, like spelling out the
characters of the string one by one.)
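To make the printing concern concrete, here is a minimal sketch in C. The
type name string_desc and the helpers SD_LITERAL, sd_fwrite and sd_snprintf
are made up for this illustration and are not part of the proposal. It
assumes only the two fields quoted above, with a const-qualified data
pointer so that literals can be referenced.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical descriptor type mirroring the struct quoted above. */
typedef struct
{
  size_t nbytes;
  const char *data;
} string_desc;

/* Build a descriptor from a string literal.  sizeof counts the
   terminating NUL that the compiler stores, so subtract 1: the NUL
   stays in memory but is not part of the described string.  */
#define SD_LITERAL(lit) ((string_desc) { sizeof (lit) - 1, (lit) })

/* Write every byte of S to FP, embedded NULs included.
   Returns 0 on success, -1 on a short write.  */
static int
sd_fwrite (string_desc s, FILE *fp)
{
  assert (fp != NULL);
  return fwrite (s.data, 1, s.nbytes, fp) == s.nbytes ? 0 : -1;
}

/* Format S into BUF using a printf precision.  "%.*s" reads at most
   'precision' bytes, so S.data need not be NUL-terminated, but output
   stops early at the first embedded NUL.  Returns snprintf's result.  */
static int
sd_snprintf (char *buf, size_t size, string_desc s)
{
  int prec = s.nbytes > INT_MAX ? INT_MAX : (int) s.nbytes;
  return snprintf (buf, size, "%.*s", prec, s.data);
}
```

So a printf-style call works on unterminated data as long as the length is
passed as the precision, but it silently truncates at an embedded NUL;
fwrite is the variant that emits all nbytes bytes.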
> The type that I'm proposing does not have two distinct fields
> nbytes_used and nbytes_allocated. Such a type, e.g. [1] attempts to
> cover the use-case of accumulating a string as well. But
>   - The Java experience with String vs. StringBuffer/StringBuilder
>     shows that it is cleaner to separate the two use cases.
>   - For the use-case of accumulating a string, C programmers have been using
>     ad-hoc code with n_used and n_allocated for a long time; there is
>     no need for anything else (except for lazy people who want C to be
>     a scripting language).
>
> The type that I'm proposing also does not have fields for heap management,
> such as a 'bool heap' [2] or a reference count. That's because I think that
>   - managing the allocated memory of a data structure is a different
>     problem than that of representing a string, and it can be achieved
>     with data outside the string descriptor,
>   - such a field would make it wrong to simply assign a string descriptor
>     to a variable.
>
> Please let me know what you think: Does this have a place in Gnulib? (Or
> should it stay in GNU gettext, where I need it for the Perl parser?)
A length-prefixed string may be a good idea. It could also help with
safer string handling functions and more efficient string operations,
because the length is already available.
So if you are going to add the "string descriptor", then I hope you
add some functions to make it easier for less experienced folks to
write safer code.
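As a rough sketch of what such helpers might look like, the C fragment
below uses the stored length for bounds checks and comparisons. All names
here (string_desc, sd_equals, sd_substring, sd_to_cstr) are hypothetical
and chosen only for this example; they are not the functions of the
proposed module.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor type with the two fields discussed above. */
typedef struct
{
  size_t nbytes;
  const char *data;
} string_desc;

/* Equal iff the lengths match and all bytes match.  No scan for a
   terminator is needed, and embedded NULs are compared like any
   other byte.  */
static bool
sd_equals (string_desc a, string_desc b)
{
  return a.nbytes == b.nbytes
         && (a.nbytes == 0 || memcmp (a.data, b.data, a.nbytes) == 0);
}

/* Store in *OUT the bytes [START, END) of S without allocating; the
   result shares storage with S.  Returns false, leaving *OUT
   untouched, if the range is invalid.  */
static bool
sd_substring (string_desc s, size_t start, size_t end, string_desc *out)
{
  assert (out != NULL);
  if (start > end || end > s.nbytes)
    return false;
  out->nbytes = end - start;
  out->data = s.data + start;
  return true;
}

/* Copy S into BUF as a NUL-terminated C string.  Returns false
   instead of truncating when BUF is too small.  */
static bool
sd_to_cstr (string_desc s, char *buf, size_t bufsize)
{
  if (s.nbytes >= bufsize)
    return false;
  if (s.nbytes > 0)
    memcpy (buf, s.data, s.nbytes);
  buf[s.nbytes] = '\0';
  return true;
}
```

Because every function knows the length up front, an out-of-range request
or an undersized buffer is reported to the caller rather than producing an
overrun or a silent truncation.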
> [1] https://github.com/websnarf/bstrlib/blob/master/bstrlib.txt
> [2] https://github.com/maxim2266/str
Also see libbsd's stringlist.h for some inspiration:
https://cgit.freedesktop.org/libbsd/tree/include/bsd/stringlist.h
Jeff