bug-bash

Re: 'wait -n' with and without id arguments


From: Zachary Santer
Subject: Re: 'wait -n' with and without id arguments
Date: Fri, 30 Aug 2024 23:06:25 -0400

CWRU/CWRU.chlog:
>    8/26
>    ----

> execute_cmd.c
> [...]
> - execute_connection: in default mode, bash performs jobs notifications
>   in an interactive shell between commands separated by ';' or '\n'.
>   It shouldn't do this in posix mode, since posix now specifies when
>   notifications can take place

I forgot your comment below about the shell not being interactive any time
it's not accepting input from the user and took this to mean that 'jobs'
notifications would only ever be printed immediately prior to a prompt when
bash is in posix mode. I don't understand what posix mode changes relative
to the existing behavior if not that.

> jobs.c
> - notify_and_cleanup: make interactive shells notifying during sourced
>   scripts dependent on the shell compatibility level and inactive in
>   versions beyond bash-5.2
>   Inspired by report from Zachary Santer <zsanter@gmail.com>

Making 'jobs' notifications not happen while the interactive shell is
sourcing a script misses the cases where a function is otherwise executed
directly from the command line and of course a whole bunch of commands
separated by semicolons entered in one command line.

New wait-n-failure attached. (Apparently ${SECONDS} can't be declared local
and still work.)
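
One way to sidestep the ${SECONDS} problem without depending on how 'local' interacts with special variables is to leave SECONDS global and keep only a snapshot of it local. This is a minimal sketch; time_command and elapsed are hypothetical names, not taken from the attached script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: measure how long a command takes, in whole seconds.
# SECONDS itself is never declared local; only the snapshot is.
time_command () {
  local start=${SECONDS}            # local copy of the counter's current value
  "$@"                              # run the command being timed
  elapsed=$(( SECONDS - start ))    # global result variable (assumed name)
}

time_command sleep 1
printf '%d seconds\n' "${elapsed}"
```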

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: msys
Compiler: gcc
Compilation CFLAGS: -g -O2
uname output: MSYS_NT-10.0-19045 Zack2021HPPavilion 3.5.3-d8b21b8c.x86_64
2024-07-09 18:03 UTC x86_64 Msys
Machine Type: x86_64-pc-msys

Bash Version: 5.3
Patch Level: 0
Release Status: alpha

Devel branch commit 2610d40b.

$ ./bash ~/random/wait-n-failure run
run true
explicit_pids false
monitor false
notify false
posix false
bash 5.3.0(1)-alpha
100 processes waited / 100 processes forked
11 seconds

$ ./bash ~/random/wait-n-failure run explicit_pids
run true
explicit_pids true
monitor false
notify false
posix false
bash 5.3.0(1)-alpha
100 processes waited / 100 processes forked
12 seconds

$ ./bash ~/random/wait-n-failure run monitor
run true
explicit_pids false
monitor true
notify false
posix false
bash 5.3.0(1)-alpha
100 processes waited / 100 processes forked
11 seconds

$ ./bash ~/random/wait-n-failure run monitor notify
run true
explicit_pids false
monitor true
notify true
posix false
bash 5.3.0(1)-alpha
100 processes waited / 100 processes forked
11 seconds

$ ./bash ~/random/wait-n-failure run monitor posix
run true
explicit_pids false
monitor true
notify false
posix true
bash 5.3.0(1)-alpha
100 processes waited / 100 processes forked
11 seconds

$ ./bash ~/random/wait-n-failure run explicit_pids monitor
run true
explicit_pids true
monitor true
notify false
posix false
bash 5.3.0(1)-alpha
100 processes waited / 100 processes forked
12 seconds

All good.

$ source ~/random/wait-n-failure run
run true
explicit_pids false
monitor false
notify false
posix false
bash 5.3.0(1)-alpha
96 processes waited / 100 processes forked
12 seconds

Hmm.

$ source ~/random/wait-n-failure run explicit_pids
run true
explicit_pids true
monitor false
notify false
posix false
bash 5.3.0(1)-alpha
100 processes waited / 100 processes forked
10 seconds

Better.

$ source ~/random/wait-n-failure run monitor
run true
explicit_pids false
monitor true
notify false
posix false
bash 5.3.0(1)-alpha
[5]+  Done                       wait-n-failure_random_sleep
[1]   Done                       wait-n-failure_random_sleep
[2]   Done                       wait-n-failure_random_sleep
[3]   Done                       wait-n-failure_random_sleep
[4]-  Done                       wait-n-failure_random_sleep
[5]-  Done                       wait-n-failure_random_sleep
[6]-  Done                       wait-n-failure_random_sleep
[7]-  Done                       wait-n-failure_random_sleep
[8]   Done                       wait-n-failure_random_sleep
[9]   Done                       wait-n-failure_random_sleep
[10]+  Done                       wait-n-failure_random_sleep
[1]+  Done                       wait-n-failure_random_sleep
[1]+  Done                       wait-n-failure_random_sleep
[1]+  Done                       wait-n-failure_random_sleep
[... All following "Done" notifications are for jobs with job id 1.]
96 processes waited / 100 processes forked
11 seconds

I did not expect to see job notifications here. The changelog seems pretty
clear that there shouldn't be any. We get to see what was going on above,
though. After a little while, there's only one child process running at a
time, which is why they all get assigned job id 1. So 'wait -n' is now
guaranteed to wait for *something*, but it won't necessarily wait for
everything. Four concurrent processes have been lost by the time the script
completes.

$ source ~/random/wait-n-failure run monitor notify
run true
explicit_pids false
monitor true
notify true
posix false
bash 5.3.0(1)-alpha
[1]   Done                       wait-n-failure_random_sleep
[2]   Done                       wait-n-failure_random_sleep
[3]   Done                       wait-n-failure_random_sleep
[4]   Done                       wait-n-failure_random_sleep
[5]-  Done                       wait-n-failure_random_sleep
[6]+  Done                       wait-n-failure_random_sleep
[1]+  Done                       wait-n-failure_random_sleep
[1]+  Done                       wait-n-failure_random_sleep
[1]+  Done                       wait-n-failure_random_sleep
[... All job id 1 again.]
96 processes waited / 100 processes forked
12 seconds

Same deal here.

$ source ~/random/wait-n-failure run monitor posix
run true
explicit_pids false
monitor true
notify false
posix true
bash 5.3.0(1)-alpha
[2]   Done                       wait-n-failure_random_sleep
[1]   Done                       wait-n-failure_random_sleep
[3]   Done                       wait-n-failure_random_sleep
[4]   Done                       wait-n-failure_random_sleep
[5]-  Done                       wait-n-failure_random_sleep
[6]+  Done                       wait-n-failure_random_sleep
[1]+  Done                       wait-n-failure_random_sleep
[1]+  Done                       wait-n-failure_random_sleep
[1]+  Done                       wait-n-failure_random_sleep
[... All job id 1.]
96 processes waited / 100 processes forked
12 seconds

I wasn't expecting 'jobs' output while sourcing, and I thought posix mode
would make it not output any 'jobs' info until immediately prior to a
prompt.

$ source ~/random/wait-n-failure run explicit_pids monitor
run true
explicit_pids true
monitor true
notify false
posix false
bash 5.3.0(1)-alpha
[1]   Done                       wait-n-failure_random_sleep
[2]   Done                       wait-n-failure_random_sleep
[3]   Done                       wait-n-failure_random_sleep
[4]   Done                       wait-n-failure_random_sleep
[5]   Done                       wait-n-failure_random_sleep
[6]   Done                       wait-n-failure_random_sleep
[7]   Done                       wait-n-failure_random_sleep
[8]+  Done                       wait-n-failure_random_sleep
[1]   Done                       wait-n-failure_random_sleep
[2]   Done                       wait-n-failure_random_sleep
[...]
100 processes waited / 100 processes forked
11 seconds

Much more what I would expect to see for job ids. This is already a whole
lot of testing output to throw in the body of an email, but the ids go up
and down, never settling to all jobs having job id 1.

We allow the functions to be available on the command line:
$ source ~/random/wait-n-failure
run false

$ wait-n-failure_main
explicit_pids false
monitor false
notify false
posix false
bash 5.3.0(1)-alpha
[1] 2295
[2] 2296
[3] 2297
[4] 2298
[5] 2299
[6] 2300
1 processes waited / 6 processes forked
0 seconds

I guess telling the user the job id and pid of the background job just
forked isn't specifically a monitor mode thing? We can basically see the
same behavior we saw when sourcing wait-n-failure before you made this
change to the devel branch, though. You're just narrowing the cases where
you get this behavior.

$ wait-n-failure_main explicit_pids
explicit_pids true
monitor false
notify false
posix false
bash 5.3.0(1)-alpha
[1] 2315
[2] 2316
[3] 2317
[4] 2318
[5] 2319
[6] 2320
[7] 2322
[1] 2324
[1] 2326
[1] 2328
[... We see job ids hovering around 1 and 2 a lot, but they find their way
all the way up to 9 and back down again. Probably just the foreground call
to wait-n-failure_random_sleep () causing this.]
100 processes waited / 100 processes forked
10 seconds

Again, the goal is for calling this function without explicit_pids to give
the same behavior as we currently see, calling it with that argument, at
least given a lack of preexisting, un-waited-for child processes.

$ wait-n-failure_main monitor
explicit_pids false
monitor true
notify false
posix false
bash 5.3.0(1)-alpha
[1] 2510
[2] 2511
[3] 2512
[4] 2513
[5] 2514
[1]   Done                       wait-n-failure_random_sleep
[2]   Done                       wait-n-failure_random_sleep
[4]-  Done                       wait-n-failure_random_sleep
[6] 2515
[3]   Done                       wait-n-failure_random_sleep
[5]-  Done                       wait-n-failure_random_sleep
[6]+  Done                       wait-n-failure_random_sleep
[1] 2517
[1]+  Done                       wait-n-failure_random_sleep
2 processes waited / 7 processes forked
0 seconds

Bad.

$ wait-n-failure_main monitor notify
explicit_pids false
monitor true
notify true
posix false
bash 5.3.0(1)-alpha
[1] 2519
[2] 2520
[1]-  Done                       wait-n-failure_random_sleep
[3] 2521
[4] 2522
[5] 2523
[4]-  Done                       wait-n-failure_random_sleep
[6] 2524
[5]-  Done                       wait-n-failure_random_sleep
[2]   Done                       wait-n-failure_random_sleep
[3]   Done                       wait-n-failure_random_sleep
[6]+  Done                       wait-n-failure_random_sleep
1 processes waited / 6 processes forked
0 seconds

Bad.

$ wait-n-failure_main monitor posix
explicit_pids false
monitor true
notify false
posix true
bash 5.3.0(1)-alpha
[1] 2526
[2] 2527
[3] 2528
[4] 2529
[5] 2530
[1]   Done                       wait-n-failure_random_sleep
[6] 2531
[2]   Done                       wait-n-failure_random_sleep
[3]   Done                       wait-n-failure_random_sleep
[4]   Done                       wait-n-failure_random_sleep
[5]-  Done                       wait-n-failure_random_sleep
[6]+  Done                       wait-n-failure_random_sleep
1 processes waited / 6 processes forked
1 seconds

Bad.

$ wait-n-failure_main explicit_pids monitor
explicit_pids true
monitor true
notify false
posix false
bash 5.3.0(1)-alpha
[1] 2533
[2] 2534
[3] 2535
[4] 2536
[5] 2537
[2]   Done                       wait-n-failure_random_sleep
[6] 2538
[1]   Done                       wait-n-failure_random_sleep
[3]   Done                       wait-n-failure_random_sleep
[4]   Done                       wait-n-failure_random_sleep
[7] 2540
[...]
100 processes waited / 100 processes forked
10 seconds

Good.

On Mon, Aug 26, 2024 at 10:57 AM Chet Ramey <chet.ramey@case.edu> wrote:
>
> On 8/14/24 11:22 PM, Zachary Santer wrote:
> > On Wed, Aug 14, 2024 at 3:22 PM Chet Ramey <chet.ramey@case.edu> wrote:
> >>
> >> On 8/7/24 2:47 PM, Zachary Santer wrote:
> >
> >>> If you want the behavior of 'wait -n' to be
> >>> consistent between scripts and the interactive shell, then it should
> >>> choose one terminated child process from the list of those that is
> >>> maintained in the interactive shell, if it's nonempty, to report to
> >>> the user and to clear from that list, any time it is called.
> >>
> >> I'm not sure returning the status of some random process from some
> >> arbitrary point in the past is going to be valuable.
> >
> > I think the value is in the consistent behavior of 'wait -n', which
> > this would provide. If the user is intent on running 'wait -n' without
> > id arguments in the interactive shell, they can ensure that child
> > processes forked long ago are ignored by simply calling 'wait' without
> > -n before moving on to what they're trying to do.
>
> Sure, they can do that. That's a new requirement, though.

I've seen you point out "I can't imagine why a person would do X, so it
must never happen" as being fallacious. However, I think the benefit to
consistent behavior far outweighs the hardship caused to whoever would
write a script intended for use within the interactive shell that depends
on 'wait -n' without id arguments ignoring background processes that the
user has already been notified of via the 'jobs' output.

If the behavior here isn't modified, the man page really should note that
'wait -n' without id arguments won't return the termination status of a
child process that has already been notified through the 'jobs' output.
This still happens in the interactive shell when job control is disabled,
for that matter. Just having to come up with a way to explain this behavior
in the man page seems like solid motivation to change it.

> > On Wed, Aug 14, 2024 at 4:44 PM Robert Elz <kre@munnari.oz.au> wrote:
> >>
> >>    | Maybe the thing to do is to retain jobs in the job list, even after
> >>    | they're marked as notified,
> >>
> >> I'd do the opposite, once they're notified, they should be deleted
> >> from the jobs table, and everywhere else.   But "notified" only happens
> >> when the script explicitly asks (in a non-interactive shell, never because
> >> of any other event than an appropriate command issued by the script, and
> >> in an interactive shell, the same, or the implicit "jobs" before each PS1).
> >
> > The implicit 'jobs' isn't happening before each PS1,
>
> This isn't what POSIX says to do, anyway.
>
>   but after each
> > command completes. Thus, all the
> >> [1]   Done                    random_sleep
> > notifications when sourcing wait-n-failure, before it prints
> >> 3 processes waited / 8 processes forked
> >> 1 seconds
> > and exits.
>
> Kind of. The `interactive shell' isn't interactive while it's not reading
> input from the terminal, so the shell prints notifications when a job
> terminates. This is what happens when you source a file.

So my initial understanding of what 'set -o posix' was supposed to do now
was wrong?

> > So, actually only doing the implicit 'jobs' work and moving things
> > from the jobs table to the list of saved pids and statuses before each
> > PS1 *would* be a solution here.
>
> Before the next prompt, you probably mean.
>
> > When sourcing wait-n-failure, it's
> > going to do all its work before any PS1 prompt.
>
> The behavior of performing notifications and removing jobs from the table
> is long-standing: it's been this way since 1999, and is a mechanism to
> prevent long-running sourced scripts from filling up the jobs list (which
> was a lot smaller in '99). So you need to accommodate those backwards
> compatibility issues somehow.

'wait -n' without id arguments reporting the termination status of a child
process that has already been reported to the user through the 'jobs'
output and clearing that information from the list of saved ids and
statuses would then be less of a disruption.
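
To make the proposal concrete, here is a toy model of "report one saved status and clear it from the list". It says nothing about how bash stores this internally; saved_status, record_exit, and take_any_saved are all hypothetical names.

```shell
#!/usr/bin/env bash
# Toy model only: a pid -> status table standing in for the list of saved
# pids and statuses discussed above.
declare -A saved_status=()

# Remember that pid $1 terminated with status $2.
record_exit () {
  saved_status[$1]=$2
}

# Hand back any one saved entry and forget it; fail if the table is empty.
take_any_saved () {
  local pid
  for pid in "${!saved_status[@]}"; do
    taken_pid=${pid}
    taken_status=${saved_status[${pid}]}
    unset "saved_status[${pid}]"
    return 0
  done
  return 1
}

record_exit 2295 0
record_exit 2296 1
take_any_saved && printf 'pid %s exited with status %s\n' "${taken_pid}" "${taken_status}"
printf '%d entries left\n' "${#saved_status[@]}"
```

Each forked child would then correspond to exactly one successful take_any_saved call, whichever order the entries come back in.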

> > I'm less concerned about what happens when a user types 'wait -n'
> > independently on the command line. The human is in the loop at that
> > point.
>
> The shell is interactive at that point; different rules apply.
>
>
> >>> So basically, 'wait -n' should be implemented such that sourcing the
> >>> script with a false argument gives the same behavior as you've seen
> >>> when sourcing it with a true argument: the infinite loop.
> >>
> >> How long should notification be deferred? Until the script completes?
> >
> > That's more or less the solution I presented above. 'wait -n' without
> > id arguments returning the termination status of a child process that
> > the user has already been informed of through the implicit 'jobs'
> > output would also work, and might be less of a weird behavior change
> > for users to get over.
>
> OK. How would you reconcile the backwards compatibility issue?

There's always ${BASH_COMPAT}, but considering the surprising and arguably
undesirable nature of 'wait -n' without id arguments not returning the
termination status of a child process that has already been reported to the
user through the 'jobs' output, I would really question why someone would
write code dependent on that behavior in the first place. And again, this
issue has never come up in a script intended to be called normally (without
it calling 'jobs').

This whole issue is such a corner case, though it seems like an
easily-solved problem.

> There are only three approaches.

And those are?

On Mon, Aug 26, 2024 at 11:01 AM Chet Ramey <chet.ramey@case.edu> wrote:
>
> On 8/16/24 8:21 AM, Zachary Santer wrote:
> > On Wed, Aug 14, 2024 at 11:22 PM Zachary Santer <zsanter@gmail.com> wrote:
> >>
> >> The implicit 'jobs' isn't happening before each PS1, but after each
> >> command completes. Thus, all the
> >>> [1]   Done                    random_sleep
> >> notifications when sourcing wait-n-failure, before it prints
> >>> 3 processes waited / 8 processes forked
> >>> 1 seconds
> >> and exits.
> >>
> >> So, actually only doing the implicit 'jobs' work and moving things
> >> from the jobs table to the list of saved pids and statuses before each
> >> PS1 *would* be a solution here. When sourcing wait-n-failure, it's
> >> going to do all its work before any PS1 prompt. Same deal if a user
> >> wants to call a function with 'wait -n' in it from the command line,
> >> invoke the edit-and-execute-command readline command, or just type a
> >> bunch of different commands separated by semicolons into a single
> >> command line.
> >
> > This breaks down with 'set -b'/'set -o notify'. Short of 'wait -n'
> > printing a warning message or erroring out when it is invoked while
> > 'set -b' is active, this isn't a complete solution.
>
> If you enable the notify option, which is not the default, you should be
> responsible for managing the consequences. notify is always going to result
> in different behavior; see
>
> https://pubs.opengroup.org/onlinepubs/9799919799/utilities/V3_chap02.html#tag_19_11

It's not clear from the bash manual that there's a relationship between
printed 'jobs' notifications and what 'wait -n' without id arguments will
report. Under the (fair) assumption that there is none, one would think
that 'set -b' would also have no effect.
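
Absent a change in bash itself, a script could at least warn about the combination on its own. The following is a sketch of such a wrapper; wait_n_checked is a hypothetical function, not anything bash provides.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: warn when 'wait -n' is used while the notify
# option ('set -b') is enabled, then call 'wait -n' as usual.
wait_n_checked () {
  if [[ -o notify ]]; then
    printf '%s\n' "warning: notify is set; 'wait -n' may miss children" >&2
  fi
  wait -n "$@"
}

# Fork and wait inside one compound command so nothing runs in between.
{
  (exit 4) &
  checked_status=0
  wait_n_checked || checked_status=$?
}
printf 'wait -n returned %d\n' "${checked_status}"
```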

> > I really think the solution here is for 'wait -n' to return the
> > termination status of a child process that has already terminated and
> > that the user has already been informed of. Ultimately, whatever set
> > of commands is being invoked together and the user who is being
> > informed of terminated child processes are two different things.
> > Informing the user does nothing for the set of commands.
>
> No, that counts as notification. After the user is notified, the shell
> is free to remove the job from the list. Bash happens to keep the status
> around for a while;

Bash does that because that behavior is more useful. The user might want to
call 'wait' with an id argument and find that process's termination status
programmatically, despite the 'jobs' output having already informed them.
In the same vein, it's more useful for 'wait -n' to be able to guarantee a
one-to-one relationship of forked child process to 'wait -n'-returned
termination status.
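
The 'wait' with an id argument case looks like this in a script. It is a small illustration with a made-up exit status; the child has long since terminated by the time 'wait' is called, and its status is still retrievable.

```shell
#!/usr/bin/env bash
# Illustration: fetch the status of a child that has already terminated.
(exit 5) &
pid=$!
sleep 1                      # give the child time to terminate
rc=0
wait "${pid}" || rc=$?       # the saved status is still available by pid
printf 'child %d exited with status %d\n' "${pid}" "${rc}"
```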

> kre, for instance, advocates removing it entirely.

That would preclude what he was asking for earlier, wouldn't it?

On Fri, Jul 12, 2024 at 8:41 PM Robert Elz <kre@munnari.oz.au> wrote:
>
> [U]se the first definition of "next job to
> finish" - and in the case when there are already several of them,
> pick one, any one - you could order them by the time that bash reaped
> the jobs internally, but there's no real reason to do so, as that
> isn't necessarily the order the actual processes terminated, just
> the order the kernel picked to answer the wait() sys call, when
> there are several child zombies ready to be reaped.

Removing the status entirely after 'jobs'-output notification would prevent
the above from working, right? Or maybe he was then under the same
impression that I was: that 'wait -n' would fail to report the termination
status of child processes that had terminated prior to the call to 'wait
-n' in all circumstances. When it's the result of a race between the 'jobs'
output and the call to 'wait -n', it's okay?

Attachment: wait-n-failure
Description: Binary data

