help-bash

Re: any plans for command substitution that preserves trailing newlines?


From: Christoph Anton Mitterer
Subject: Re: any plans for command substitution that preserves trailing newlines?
Date: Wed, 26 Jan 2022 18:25:20 +0100
User-agent: Evolution 3.42.2-1

On Wed, 2022-01-26 at 18:32 +0900, Koichi Murase wrote:
> > So if that holds true... simply appending . or / as sentinel within
> > the command substitution, and removing that afterwards (without any
> > need for locale changes) should *always* work, regardless of the
> > locale/encoding.
> > Can anyone confirm this?
> 
> No.  I guess that should practically work in most cases, but I don't
> think POSIX requires that it should always work.  When the data is
> not encoded by the current LC_CTYPE or contains misencoded byte
> sequences,

But AFAIU that shouldn't matter, because even if the data is wrongly
encoded or doesn't match the current locale/encoding:

\n \r . / are required to have the exact same (binary) representation
in all of them... *AND* are not allowed to be part of the (binary)
representation of any other character.

So that should guarantee (at least as long as every locale/encoding
obeys these two rules) that even a wrongly encoded byte sequence cannot
combine with the sentinel to form a valid character.

Or so, I'd think at least ^^
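
For reference, a minimal sketch of the sentinel approach under
discussion. The function name `produce` and its sample output are made
up for illustration; only the append-then-strip pattern is the point.
Note that the exit status of the substitution is that of the final
printf, not of `produce`.

```shell
# Hypothetical producer whose output ends in two newlines.
produce() { printf 'line\n\n'; }

# Plain command substitution strips every trailing newline.
plain=$(produce)

# Append "." as a sentinel inside the substitution, then strip it.
kept=$(produce; printf .)
kept=${kept%.}

# All data here is ASCII, so the lengths do not depend on the locale.
printf '%s\n' "${#plain}" "${#kept}"   # prints 4, then 6
```

Whether the `${kept%.}` step is safe for arbitrary (possibly misencoded)
data is exactly the open question here.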


> it is difficult to impose any well-defined requirements on how the
> implementation should treat them.  In fact, XBD 6.1 says that the
> result is unspecified:
> 
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_01
> > POSIX.1-2017 places only the following requirements on the encoded
> > values of the characters in the portable character set:
> > 
> > * If the encoded values associated with each member of the portable
> >   character set are not invariant across all locales supported by
> >   the implementation, if an application uses any pair of locales
> >   where the character encodings differ, or accesses data from an
> >   application using a locale which has different encodings from the
> >   locales used by the application, the results are unspecified.
> 

But that says specifically:
"If the encoded values associated with each member of the portable
character set are not invariant"
However:
\n \r . / are defined to be exactly that (invariant) a bit further below.


> For example, suppose we have an encoding where bytes X and Y are used
> for the first and second bytes of double-byte characters, L is used
> for single-byte characters, and these sets of bytes X, Y, and L are
> disjoint (e.g., a byte that belongs to Y does not belong to the other
> sets). According to the above quotes on the POSIX, <period>, <slash>,
> etc. are required to be in L. Data correctly encoded in that encoding
> should look like e.g. "LLXYLLXYLLXYXYLL" where "X" and "Y" always
> need to appear in pairs. The combination "XL" is not allowed in
> correctly encoded data, but how should the implementation behave when
> it actually finds "XL"? One possible behavior is to replace "XL" with
> "<Error>" where <Error> is a replacement character such as "�"
> (U+FFFD) or "?" that indicates that there was originally misencoded
> data at its position. Now let us consider misencoded data "X"
> suffixed by <period>. I wouldn't be surprised even if there is an
> implementation that converts (or sanitizes) "X<period>" to "�" before
> storing it in a variable. Then the trailing <period> cannot be
> removed, and even the original byte X is replaced by different data.

Ah, I see what you have in mind.

Well, first I'd say that if the implementation already does any kind of
"replacement" of invalid encodings, as with X<period>, then we're
screwed anyway, aren't we?
Because that would already happen when the result of the command
substitution is assigned to the variable.
So regardless of whether we use LC_ALL=C or not, our sentinel would be
gone.

But even if it doesn't do that, we could still end up in trouble if
an implementation (without LC_ALL=C set) simply failed to strip
off the last byte from a string 'XL'.
That wouldn't happen if we had LC_ALL=C.
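
A rough sketch of what I mean by stripping under LC_ALL=C, assuming a
shell that honours a runtime change of LC_ALL for its own string
operations. The variable names `saved_lc_all` and `had_lc_all` are just
placeholders, and the sample data is made up.

```shell
# Capture with a "." sentinel so the trailing newlines survive.
data=$(printf 'abc\n\n'; printf .)

# Remember whether LC_ALL was set, so it can be restored exactly.
if [ "${LC_ALL+set}" = set ]; then
    saved_lc_all=$LC_ALL
    had_lc_all=1
else
    had_lc_all=0
fi

# In the C locale every byte is one character, so the pattern
# removal below should operate on the final byte only.
LC_ALL=C
data=${data%.}

# Put LC_ALL back the way it was (set, empty, or unset).
if [ "$had_lc_all" -eq 1 ]; then
    LC_ALL=$saved_lc_all
else
    unset LC_ALL
fi
```

This only covers the stripping step; it doesn't help if the shell has
already altered the bytes when the substitution was assigned.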



> > I found several more shells that seem to not support changing
> > LC_ALL during runtime (at least without effect for the shell
> > itself): [2], [3]
> 
> These shells seem to support only the locale "LC_CTYPE=C" which is
> exactly what we want to force the shell for the present purpose, so
> there aren't any problems for the present purpose, are they?

I guess you are right.
I was confused by dash printing Unicode characters correctly, so I
assumed it would actually support such encodings.
$ c="$(printf '\342\210\213')"
$ printf '%s\n' "$c"
∋

However, that's then probably just something done by the terminal and
its fonts, since e.g.:

$ echo ${#c}
3

But in theory, I guess, there could be shells which violate POSIX, do
support other encodings, but don't support switching to C.
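
A small sketch for checking how a given shell behaves in this respect:
it compares the shell's own idea of the string length with the byte
count from wc -c. The variable names `chars` and `bytes` are arbitrary,
and the result depends on the shell and the locale in effect.

```shell
# The three UTF-8 bytes of U+220B, as in the example above.
c=$(printf '\342\210\213')

# Length as the shell itself counts it (bytes or characters).
chars=${#c}

# wc -c always counts bytes; tr removes any padding wc adds.
bytes=$(printf '%s' "$c" | wc -c | tr -d '[:space:]')

if [ "$chars" -eq "$bytes" ]; then
    echo "this shell counts bytes here"
else
    echo "this shell decodes multibyte characters here"
fi
```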



> > It became more robust now with what Thorsten Glaser pointed out.
> 
> Yes, it is right that it was actually more robust than I thought
> then.  Thank you for the information.  I hadn't thought that POSIX
> imposes requirements on the details of the encoding so that full
> support for the ISO-2022 encoding is actually not allowed on POSIX
> systems.

But still, with your idea about how the charset decoders might fail to
handle invalid encodings gracefully, it's probably better to use the
LC_ALL=C method, I'd guess.


Thanks for your help :-)
Chris.


