bug-gnulib


Re: RFC: add a string-desc module


From: Jeffrey Walton
Subject: Re: RFC: add a string-desc module
Date: Fri, 24 Mar 2023 19:20:19 -0400

On Fri, Mar 24, 2023 at 5:50 PM Bruno Haible <bruno@clisp.org> wrote:
>
> In most application areas, it is not a problem if strings cannot contain NUL
> bytes, and thus the C type 'char *' with its NUL terminator is well usable.
>
> In areas where strings with embedded NUL bytes need to be handled, the common
> approach is to use a 'char * data' pointer together with a 'size_t nbytes'
> size. This works fine in code that constructs or manipulates strings with
> embedded NUL bytes. But when it comes to *storing* them, for example in an
> array or as key or value of a hash table, one needs a type that combines these
> two fields:
>
>   struct
>   {
>     size_t nbytes;
>     char * data;
>   }
>
> I propose to add a module that adds such a type, together with elementary
> functions that work on them.
>
> Such a type was long known as a "string descriptor" in VMS. It's also known
> as basic_string_view<char> in C++, or as String in Java.
>
> The type that I'm proposing does not always and automatically have a NUL
> byte appended to the data, because I think it is more important to have a
> string_desc_substring function that does not cause memory allocation
> than to have a string_desc_c function (conversion to 'char *') that does
> not cause memory allocation.

I would be cautious about omitting the terminating NUL. A natural thing
to want to do is print a string, and C library routines usually expect a
terminating NUL.
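
To make the printing concern concrete, here is a rough sketch of how a
descriptor without a terminating NUL could still be written out or
converted. The names struct sd, sd_fwrite, sd_snprintf and sd_to_c are
made up for illustration; they are not the proposed module's API.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical descriptor type, mirroring the struct quoted above.  */
struct sd
{
  size_t nbytes;
  const char *data;
};

/* Write all bytes of S, including embedded NULs, to STREAM.
   Returns 0 on success, -1 on a short write.  */
int
sd_fwrite (struct sd s, FILE *stream)
{
  if (s.nbytes == 0)
    return 0;
  return fwrite (s.data, 1, s.nbytes, stream) == s.nbytes ? 0 : -1;
}

/* Format S into BUF through a precision.  printf-style functions stop
   after nbytes bytes or at the first NUL byte, whichever comes first,
   so embedded NULs truncate the output.  */
int
sd_snprintf (char *buf, size_t bufsize, struct sd s)
{
  assert (s.nbytes <= INT_MAX);   /* the precision argument is an int */
  return snprintf (buf, bufsize, "%.*s", (int) s.nbytes, s.data);
}

/* Return a freshly allocated NUL-terminated copy of S, or NULL if the
   allocation fails.  The caller frees the result.  */
char *
sd_to_c (struct sd s)
{
  char *copy = malloc (s.nbytes + 1);
  if (copy != NULL)
    {
      if (s.nbytes > 0)
        memcpy (copy, s.data, s.nbytes);
      copy[s.nbytes] = '\0';
    }
  return copy;
}
```

So printing is possible without the NUL, but only through fwrite or an
explicit precision; anything that takes a plain 'char *' needs the
allocating copy.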

Also, if you initialize the struct from a string literal, the underlying
storage will likely include a terminating NUL. I understand the size
member will not count the NUL, but it will be present in the string
anyway (unless you do something ugly, like spelling out the characters
of the string one by one).
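
A small sketch of what I mean, again with made-up names (struct sd,
SD_LITERAL, sd_from_c) that are only assumptions for illustration:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical descriptor type, mirroring the struct quoted above.  */
struct sd
{
  size_t nbytes;
  const char *data;
};

/* Build a descriptor from a string literal.  sizeof counts the
   terminating NUL, so subtract 1; embedded NULs are kept.  */
#define SD_LITERAL(lit) ((struct sd) { sizeof (lit) - 1, (lit) })

/* Wrap an existing C string.  strlen stops at the first NUL, so this
   cannot represent embedded NUL bytes.  */
struct sd
sd_from_c (const char *s)
{
  struct sd r = { strlen (s), s };
  return r;
}

/* The "ugly" variant: the array holds exactly three bytes and no
   terminating NUL at all.  */
static const char spelled_out[3] = { 'a', 'b', 'c' };

struct sd
sd_spelled_out (void)
{
  struct sd r = { sizeof spelled_out, spelled_out };
  return r;
}
```

With SD_LITERAL ("abc") the byte at data[nbytes] is a NUL even though
nbytes is 3. With sd_spelled_out () there is no such byte, and reading
data[nbytes] would be out of bounds.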

> The type that I'm proposing does not have two distinct fields
> nbytes_used and nbytes_allocated. Such a type, e.g. [1] attempts to
> cover the use-case of accumulating a string as well. But
>   - The Java experience with String vs. StringBuffer/StringBuilder
>     shows that it is cleaner to separate the two use cases.
>   - For the use-case of accumulating a string, C programmers have been using
>     ad-hoc code with n_used and n_allocated for a long time; there is
>     no need for anything else (except for lazy people who want C to be
>     a scripting language).
>
> The type that I'm proposing also does not have fields for heap management,
> such as a 'bool heap' [2] or a reference count. That's because I think that
>   - managing the allocated memory of a data structure is a different
>     problem than that of representing a string, and it can be achieved
>     with data outside the string descriptor,
>   - Such a field would make it wrong to simply assign a string descriptor
>     to a variable.
>
> Please let me know what you think: Does this have a place in Gnulib? (Or
> should it stay in GNU gettext, where I need it for the Perl parser?)

A length-prefixed string may be a good idea. It could also help with
safer string-handling functions and more efficient string operations,
because the length is already available.
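
As a sketch of the kind of operations I have in mind (sd_equals and
sd_substring are hypothetical names, not the proposed API): equality can
reject on length before touching the bytes, and a substring is just
pointer arithmetic with a bounds check.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical descriptor type, mirroring the struct quoted above.  */
struct sd
{
  size_t nbytes;
  const char *data;
};

/* Compare two descriptors for equality.  Different lengths are
   rejected without reading any bytes; memcmp handles embedded NULs.  */
bool
sd_equals (struct sd a, struct sd b)
{
  return a.nbytes == b.nbytes
         && (a.nbytes == 0 || memcmp (a.data, b.data, a.nbytes) == 0);
}

/* Return the bytes [START, END) of S.  No memory is allocated; the
   result points into the same storage as S and is only valid while
   that storage is.  */
struct sd
sd_substring (struct sd s, size_t start, size_t end)
{
  assert (start <= end && end <= s.nbytes);   /* bounds check */
  struct sd r = { end - start, s.data + start };
  return r;
}
```

The bounds check in sd_substring is the sort of thing that helps less
experienced folks: the length is right there, so checking it is cheap.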

So if you are going to add the "string descriptor", then I hope you
add some functions to make it easier for less experienced folks to
write safer code.

> [1] https://github.com/websnarf/bstrlib/blob/master/bstrlib.txt
> [2] https://github.com/maxim2266/str

Also see libbsd's stringlist.h for some inspiration,
https://cgit.freedesktop.org/libbsd/tree/include/bsd/stringlist.h .

Jeff


