help-bash
Re: any plans for command substitution that preserves trailing newlines?


From: Chet Ramey
Subject: Re: any plans for command substitution that preserves trailing newlines?
Date: Fri, 28 Jan 2022 11:19:01 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.4.1

On 1/27/22 7:29 PM, Christoph Anton Mitterer wrote:

The following variables shall affect the execution of the shell:

which I'd have interpreted as "during runtime"?

Sure, if you inherit them from the environment. There's no requirement that
the shell update its idea of the locale based on assignments.

I agree in so far, that it's not said explicitly, but doesn't it kinda
follow from:

I think so, too, but as you say the requirement is not explicit.

a) all the other variables there, which AFAIU clearly have a 1:1
    binding between variable value and whatever the shell internally
    thinks is going on, especially:
    PWD  PATH  IFS
    Sure, the shell could internally set something different than what
    these contain, but it would kinda defeat their purpose.

Again, the shell uses these explicitly to perform actions, where the effect
of the locale variables is not quite so spelled out.


b) the wording, e.g.:
    "LC_CTYPE Determine the interpretation..."
    When that is the "thing" that determines the property, that must be
    bound 1:1 to the actual state, or otherwise it wouldn't determine it
    or then at least have to "act" to immediately set it again, when it
    would differ

This one's tricky, since the shell is expressly prohibited from using a
change to LC_CTYPE to modify parsing (which primarily affects which
characters are <blank>s).

There is an active ongoing discussion about this that periodically flares
up. For instance, how do you write portable scripts that contain non-ASCII
characters that might not exist in the user's current locale? If you use
a euro or yen character, how do you ensure that it's treated as such and
not as some other character sequence (say, when matching in a case
statement pattern) if the current locale doesn't contain such a character?
It's not good form to force your own locale on the user, who might not
even have it installed. So some shells dodge the question entirely: you get
what you got when the shell started. Others, like bash, treat everything as
multibyte characters whose interpretation is subject to the current locale.


AFAIU, there is a subtle difference between the LANG/LC_* shell
variables on the one side and setlocale(), respectively the process'
real "internal" locale state, on the other side.

I think the difference is in what the system considers to be the "default
locale."

Which AFAIU is implementation defined, right?

Yes.


TBH, I didn't even fully understand from the manpage what e.g. glibc
does if *nothing* (no LANG/LC_*/etc) is set.
I.e. what if one calls setlocale(LC_ALL, ""); but no env vars are set?

I'd guess then these apply:
  > If its value is not a valid locale specification, the locale is
  > unchanged, and setlocale() returns NULL.

The "taking into account" language offers some flexibility.

plus:
  > On  startup of the main program, the portable "C" locale is
  > selected as default.

From that I'd deduce that, for glibc, the default/"native" locale (if
no envvars are set) would be "C"?

I suppose, since there's no portable way for it to interrogate what the
user has set in some random preference pane.


But in principle an implementation would be free to say that the
default locale is set in /etc/defaultlocale or that it depends on the
passwd GECOS field of a user and where that user lives.

Sure.

So effectively, *within* the shell (e.g. a script) there is no
guaranteed way to determine the original status of the locale.

At an arbitrary point? After the script may have modified/unset any of the
locale variables? No. You have a decent chance at an approximation if you
save the values and attributes of each locale variable as the first thing
your script does and go from there, but you have to do some work.
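
A minimal sketch of that save-and-restore step, assuming a POSIX sh with
default IFS. The names save_locale_vars, restore_locale_vars and the saved_*
variables are invented for illustration and do not come from the thread. It
records only whether each variable is set and its value; the export attribute
is not captured here.

```sh
# Hypothetical helpers; the variable list is the POSIX set of locale variables.
locale_vars='LANG LC_ALL LC_COLLATE LC_CTYPE LC_MESSAGES LC_MONETARY LC_NUMERIC LC_TIME'

save_locale_vars() {
    for v in $locale_vars; do
        # Record set/unset state and, if set, the value.
        eval "if [ \"\${$v+set}\" = set ]; then
                  saved_set_$v=1; saved_val_$v=\$$v
              else
                  saved_set_$v=0; unset -v saved_val_$v
              fi"
    done
}

restore_locale_vars() {
    for v in $locale_vars; do
        # Re-create the recorded state: same value, or unset again.
        eval "if [ \"\$saved_set_$v\" = 1 ]; then
                  $v=\$saved_val_$v
              else
                  unset -v $v
              fi"
    done
}
```

Calling save_locale_vars as the first statement of a script and
restore_locale_vars before exiting a locale-sensitive section gives the
approximation described above, nothing more.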


Right?

If you want to see what happens when a process gets called without any
locale environment variables and, in theory, should execute in the system's
"native environment," do it:

env -i /usr/bin/locale


- With the shell variables, both are stored, the default/overriding
    LANG/LC_ALL as well as the "real" categories LC_* (all but ALL).

- With setlocale() however, LC_ALL means basically just to go over each
    "real" category and set them,... so only the "real" categories are
    stored and internally LC_ALL isn't kept.

Right so far?

I see what you mean, for some value of "kept."

"kept" in the sense as to what the actual locale state of each non-
LC_ALL category is set to.

Whereas LC_ALL isn't a "real" category, but just some "do it for all"
special value (with setlocale() - unlike with the env vars).

OK, yes, LC_ALL as a distinguished value passed as the first argument to
setlocale() has a special meaning. But so does an environment variable
named `LC_ALL', which results in the former action. I guess I don't see
an effective distinction.

    At least glibc seems to only use LC_ALL, LC_* and LANG (in that
    order) for that, so most likely some combination of them *is*
    actually set in the environment and thus also as shell variable.

Not always.

My guess is that you're going to get LANG, at least, set to the user's
preferred locale in the environment. The `system', whether that's the
OS or window system, will arrange that very early on and rely on
environment variable inheritance. A terminal emulator might look at a
preference pane value stored somewhere and start the shell with one or
more environment variables reflecting that preference. But this is all
very environment-dependent (heh).


If in the shell, any of LANG/LC_* is changed (set to a(nother) value or
unset)... e.g. when unsetting LC_ALL after stripping off the sentinel,
then it has to:

- call setlocale(category, "")
   with category being the same as the shell variable that was
   just changed (and when LANG was changed, I'd guess LC_ALL must be
   used)

Not quite. Let's say your sane shell wants to effect changes to the locale
if the user modifies shell variables that are not exported. You have to
duplicate the behavior of setlocale() using the normal shell variable
lookup mechanism.

If you change any variable except LC_ALL, and LC_ALL is set, you don't have
to do anything. If you change LANG, and LC_ALL is not set, you have to
iterate through all the locale variables you know about, but only change
the ones for which the corresponding locale variable is unset. If you
change another LC_ variable, you call setlocale() only if LC_ALL is not
set, and so on. If you change or unset LC_ALL, you have to iterate through
all the locale variables and set the corresponding category to the value of
the variable or $LANG.

You have to maintain the environment variable precedence hierarchy, but use
shell variables while doing it.

You have to do all this yourself if you want shell variable modifications
to affect the locale, since setlocale() only looks at the environment,
which by definition contains only exported variables. We kind of covered
this process in the last message.
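
As a rough illustration of that precedence lookup done with shell variables
rather than the environment, here is a sketch in POSIX sh. The function name
effective_locale is hypothetical. The sketch follows the POSIX rule that a
locale variable which is set but empty is treated like an unset one.

```sh
# effective_locale CATEGORY   (for example: effective_locale LC_CTYPE)
# Prints the value selected by the LC_ALL > LC_xxx > LANG precedence,
# looking only at shell variables, exported or not.
effective_locale() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif eval "[ -n \"\${$1:-}\" ]"; then
        eval "printf '%s\n' \"\$$1\""
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Nothing set: the implementation-defined default applies.
        printf '\n'
    fi
}
```

A shell that wants assignments to take effect would pass the resulting value,
not "", to setlocale() for the category in question.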


   Using "" as value should, AFAIU, automatically consider all the
   LANG/LC_* and ultimately fall back to the implementation defined
   default (which might e.g. be C).

Only if you want to ignore any unexported shell variable changes.

        
And that should already guarantee, that if the LANG/LC_* shell
variables are restored to the original state (in terms of set/unset and
value)... the locale state should be back to what it was before - even
without knowing what it originally was.

Sure, if you duplicate the setlocale() behavior. But remember that if you
unset an arbitrary LC_ variable without knowing its export status, if you
reset it without exporting it (if it was previously exported), you've just
changed the behavior of future child processes. That may or may not matter
to you.


There's just one aspect which I don't understand yet:
Further below you wrote before that shells don't update their own
environment with the values of their own LANG/LC_* shell variables
(Nobody does that.).
If so, how can the setlocale(foo, "") call know what the current
LANG/LC_* is? AFAIU it takes these from the env vars?

You can't use setlocale(LC_XX, "") except under strict circumstances where
you've already done the precedence checks using shell variable lookup. You
have to figure out the values to use yourself.

There are other things you can try. Bash, for instance, provides an
internal implementation of getenv() that looks in the shell variable table.
A traditional Unix linker will force library functions to use that
definition instead of the one in libc. Bash will also set `environ' to
the export environment it passes to its child processes, and library
functions can see that, but that's not guaranteed to work, either. It's
dangerous to assume too much about the libc implementation of a particular
function.


   I couldn't find what happens when for a category, no value can be
   determined (e.g. LANG, LC_ALL and LC_CTYPE unset)... but I guess it
   falls back to "C"?!

"the empty string "" (which denotes the native environment)"

Sure it understood that,... what I couldn't find for glibc was, what it
considers as "native environment".. and I guess *that* would be "C"
(for glibc).

Probably. Or "POSIX".

And I'd assume that if the shell has none of LANG/LC_* in its
shell variables, then it simply takes "" ... and then it's up to the
implementation to set the default/"native" locale?


    (unless it updated its own environment before then it could simply
    use "")?

Nobody does that.

But... if the above (**) is right... wouldn't it have to at least clear
the LANG/LC_* from its env, so that when they had all been unset as
shell variables and "" is used with setlocale, the latter wouldn't take
them from the env?

There's just not a portable way to do that that's guaranteed to work. Even
if you change `environ', there's nothing prohibiting a libc getenv
implementation from reading the environment on the first call and caching
it internally.


As I said: "The only thing you really need to do is to set and reset LC_ALL
around the single assignment statement that removes the last byte from the
string."

I assume you mean this only with respect to bash, but I'd expect that
with any other shell that handles locales in a "proper" way... it would
work the same way, right?

Yes. And most shells that don't.

From that reply of yours in the last mail I already suspected that we
just had some misunderstanding.

I was confused by:

So if you want to temporarily control the locale a command
substitution, or any program the shell runs, gets, you have to save,
set, export, and then optionally reset all the variables you care
about.

Which made me think, that even in my use case, I need to reset *all* of
the variables (i.e. even those that I don't touch).

It gets tricky. Say you unset LC_ALL, which the shell inherited from the
environment, and that `unshadows' a value of, for instance, LC_CTYPE that
the shell also inherited in its environment. Child processes will see a
different value for that category than you may have intended.
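
The unshadowing effect can be reproduced in a few lines. This is only a
sketch, assuming a POSIX sh; the values C and POSIX are stand-ins for whatever
the shell really inherited, and child_view is a name made up for the example.
The command substitution keeps the changes out of the calling shell.

```sh
child_view=$(
    # Stand-ins for inherited, already exported values: LC_ALL masks LC_CTYPE.
    LC_ALL=C; LC_CTYPE=POSIX; export LC_ALL LC_CTYPE
    # Removing LC_ALL exposes the LC_CTYPE that was there all along.
    unset -v LC_ALL
    # What a child process now finds in its environment.
    sh -c 'printf "LC_ALL=%s LC_CTYPE=%s\n" "${LC_ALL-unset}" "$LC_CTYPE"'
)
printf '%s\n' "$child_view"
```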



You don't need to mess with setting LC_ALL to anything earlier in the
script, and you don't need to worry about hypotheticals like the shell
doing some character conversion on assignment. Nor do you need to worry
about the effect of adding a byte to some incomplete multibyte character.

So long story short:


result="$(command ; e=$?; printf '.' ; exit $e)"

You've already seen the problems with this in another message thread.


#optionally error out if OLD_LC_ALL is already set
unset -v OLD_LC_ALL ; [ "${LC_ALL+is_set}" ] && OLD_LC_ALL="${LC_ALL}"

LC_ALL=C
result="${result%.}"

[ "${OLD_LC_ALL+is_set}" ] && LC_ALL="${OLD_LC_ALL}" || unset -v LC_ALL

and you need to unset OLD_LC_ALL. But yes, this should work just fine.
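
Put together with the final cleanup, the whole sequence might look like the
following sketch. It assumes a POSIX sh; the printf pipeline merely stands in
for the real command, and old_lc_all is an arbitrary scratch name.

```sh
# Append a sentinel so the command substitution cannot strip the
# trailing newlines of the real output.
result=$(printf 'two trailing newlines\n\n'; printf '.')

# Remember whether LC_ALL was set, and to what.
unset -v old_lc_all
if [ "${LC_ALL+is_set}" ]; then
    old_lc_all=$LC_ALL
fi

# Strip the sentinel as a single byte.
LC_ALL=C
result=${result%.}

# Put LC_ALL back the way it was, then drop the scratch variable.
if [ "${old_lc_all+is_set}" ]; then
    LC_ALL=$old_lc_all
else
    unset -v LC_ALL
fi
unset -v old_lc_all
```

The export attribute of LC_ALL survives this, since the variable is only
reassigned or, if it was unset before, unset again.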



I skimmed your message to the austin-group mailing list, and I don't
really see any of these concerns as making a difference.

Saw it, and thanks for your replies there as well.

I'd also say they make no big difference. Question (1) there was mainly
just trying to understand whether one could bail out of the whole
locale stuff (which would make the solution a bit easier),...

But I already suspected that, as Koichi said, one could *not* just
trust in the stripping to work "properly" (or rather "as one would
wish") if the string contains some encoding that is invalid in the
current locale... even if the rightmost character would be invariant
and not allowed to be part of any other encoding.

The wide-character behavior of yash aside, any shell that doesn't treat
a character that is not part of any other encoding as the character itself
(e.g., `.') has a bug. But if you can't guarantee that it's not part of
any other encoded character, you're right -- all bets are off. That's one
of the attractive features of UTF-8 encoding.


Well, I mean from inside.

Sure. Interrogate the state of the relevant shell variables and apply the
appropriate precedence rules. If none are set, run `locale' and parse its
output, for example

locale | sed -n 's/^LANG="\(.*\)"/\1/p'

That will give you a pretty good idea of the native environment.

But that would depend on env (which is AFAIK not a special built-in)
using the same libc implementation as the shell, right?

Sure, but you can only go so far down the rabbit hole of `what if' before
the exercise loses its value.


Otherwise, if neither of LANG/LC_* is set, the shell's call of
setlocale(foo, "") could result in one native locale ... while env's
implementation of setlocale() with another libc might use something
different?

Okay... I'm pedantic, sorry ;)

As the hypotheticals become less and less likely, their value decreases
proportionally.


Uhm, I found no portable way to get the export state.

Parse the output of `export'.

I looked at that, but:
a)
e.g. bash uses a format like this:
declare -x DESKTOP_SESSION="cinnamon"
declare -x DISPLAY=":0"

whereas POSIX would mandate:
export DESKTOP_SESSION='cinnamon'
export DISPLAY=':0'

Only if you use `export -p'. Otherwise the output is unspecified.

Bash uses that output format in posix mode. You can also use an external
program like `env' or `printenv' and parse the output directly. If you're
worried about running the script in different shells, I mean.
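
One way to use the external-program route without parsing anything is to ask
`printenv' for a single name and look only at its exit status. This is a
sketch under the assumption that a `printenv' utility is installed (it is
common but not specified by POSIX); the function name is_exported is invented
for the example.

```sh
# is_exported NAME
# Succeeds if NAME is in the environment handed to child processes.
# Assumes printenv exits non-zero when the name is not found.
is_exported() {
    printenv "$1" >/dev/null 2>&1
}

# Example: remember the export state of LC_ALL before touching it.
if is_exported LC_ALL; then
    lc_all_was_exported=1
else
    lc_all_was_exported=0
fi
```

Because only the exit status is used, a value containing newlines or text
that looks like another assignment cannot confuse the check.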


Also,... what if a variable like:
export FOO=$'\nexport LC_ALL=bar'

You're already in non-portable territory there, but your example would work
even if you use a portable assignment.

would be in it? I couldn't differentiate between what's really a
variable and what's just a value.

export FOO=$'\nexport LC_ALL=bar'
POSIXLY_CORRECT=1 export -p | grep '^export FOO='

As the hypotheticals become less and less likely...

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet@case.edu    http://tiswww.cwru.edu/~chet/


