bug-coreutils

Re: Feature request for cut and/or sort


From: The Wanderer
Subject: Re: Feature request for cut and/or sort
Date: Mon, 23 Jul 2007 17:44:21 -0400
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922

I am directing this to the list again, since most of it is still in some
sense discussion of whether it is appropriate to add the requested
feature to cut et al. However, parts of it are trending off-topic, and I
would not object to private replies to those parts if someone feels that
appropriate; I am also not unwilling to take further discussion off-list
entirely if the local bigwigs request it, although I would bring it back
on if anything resembling a conclusion in my favour were reached in
off-list discussion.

Bob Proulx wrote:

The Wanderer wrote:

(&*%&*!@ incorrect reply behaviour... I *hate* having to type out
the address by hand in every post.)

Since you are talking to someone who *strongly* opposes munging the
Reply-To: header your distress is noted as theatrical and also being
self-inflicted since you could easily do a group-followup-to-all.
:-/

Reply-to-all - in the case where part of "all" is a mailing list - is
just as bad in a different way, since it means that if the sender of the
message being replied to is in fact subscribed then that person will
receive multiple copies of the reply. It also does not prevent me from
receiving multiple copies from people who do take that as the easy way
out.

However, I am not strongly attached to setting Reply-To - or even merely
modifying it, which IMO does not have the negative aspects of munging as
they are usually conceived - as The One True Way to solve the problem
(although it is the only widely supported one I know of), nor am I
presently attempting to argue for a change in the relevant policy for
this list; I am merely continuing to more-or-less-quietly mention my
dissatisfaction with the result of the existing state of affairs, and I
intend to continue to do so - not merely in this forum but in all forums
where it applies - until such time as the problem no longer exists.

I do not presently intend to make further public comment on this topic
in this thread.

Bob Proulx wrote:

The Wanderer wrote:

How about for sort?

Hmm...  sort...  I guess you just have to count the fields. (shrug)

How am I to do that when there are different numbers of fields on
different lines in the input data?
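(For the record, the usual answer when field counts vary is the
decorate-sort-undecorate pattern: prepend the key, sort on it, strip it
off again. A sketch with made-up paths, assuming the input contains no
TAB characters:)

```shell
# Decorate each line with its final slash-separated field, sort on
# that key, then cut the decoration back off.  Illustrative input only.
printf '%s\n' /b/zed /a/apple /c/mango \
  | awk -F/ '{ print $NF "\t" $0 }' \
  | sort \
  | cut -f2-
```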

That is correct.  But cut is really not the best tool for the
job.

In that case, it probably is not the best tool for the job of
cutting from the beginning of the line, either. Are there any other
than historical reasons ("it's become standard, and people expect
it to be there") for not removing it entirely?

I would happily do without 'cut' but after thirty years of having it
around is there any reason not to simply leave it?  As mom says,
don't poke at it.

If having it around leads people to think of it as the tool for the job,
when it cannot accomplish that job in all cases and there are other
tools which can, then it should either be extended to accomplish that
job in those cases or be removed.

(Or, at minimum, a note should be added to the documentation stating
"This tool cannot accomplish X aspect of the job you might expect it to
be able to. If you need something to do that, consider tool Y instead."
But that runs afoul of the fact that tool Y is orthogonal to the tool at
hand.)

(Naturally, I don't want it removed. The point is that the fact
that there are better tools available for a given task does not
inherently mean it is not worthwhile to have "worse" tools capable
of performing the same task.)

But in the case of awk the number of characters to type to use it and
the "obviousness" of how it is used is quite good.  And it is also
very useful to know how to use so the effort spent for the basic
usage pays itself back in increased functionality very quickly.

...I'm not sure I see how this counters the stated point. Just because
awk can do it, however easily, when one knows how to use it, that does
not mean that cut should not be able to do it. cut is easier to find
when someone is attempting to locate a tool for the job, and easier to
learn if only by virtue of having less involved documentation. (Which
itself is a consequence of being more limited.)

Instead I recommend and use awk for these types of things.

echo /path/to/somefile | awk -F/ '{print$NF}'

I've never had occasion (or, beyond the general availability
Somewhere of documentation, opportunity) to learn awk. In any case
a program to which I have to pass an esoteric incantation is less
convenient than one to which I can simply pass arguments to simple
options.

I think I will simply have to agree to disagree.  In basic usage awk
is quite simple to learn and use and well worth learning.  I rarely
write long awk programs anymore though because I prefer ruby, and
before that perl, so much more.  These days my use of awk is mostly
one-liner commands of the basic 'grep' and 'column printing' type.

Having once learned and become familiar with any of those, they are
probably at least as convenient to the mind as cut is. However, cut is
still easier to find (when looking for a tool to accomplish the desired
task) and to learn than awk is, at least for the ordinary user.

How well will these work in cases where there is other, extraneous
data on the line before the path begins?

Examples please?  Bilbo's riddle is quite a bit too challenging for
me today.

The output of the 'moosic' program, specifically 'moosic list', takes
the form of:

[0] /path/to/files/file1
[1] /path/to/files/file2
[2] /path/to/files/subdirectory/file3
...
[84] /path/to/files/subdirectory7/subsubdirectory/file83

Et cetera. Reaching four-digit numbers in the brackets is not uncommon.
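(For output of that shape, one hedged approach is to strip the bracketed
prefix with sed rather than selecting fields at all; unlike a
field-based cut or awk, this also survives paths that themselves contain
spaces. The input lines here are made up for illustration:)

```shell
# Remove a leading "[NN] " prefix of any width, leaving the path
# intact even when the path contains spaces.
printf '%s\n' '[0] /path/to/file one' '[1234] /path/to/file2' \
  | sed 's/^\[[0-9]*\] //'
```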

But pretty much in any extensive case I would use 'sed' to do any
complicated stream editing.  Again, it is very standard and would
work on any posix platform.  So I will jump ahead and suggest it.

I do know (sort of) and use sed, but again, adding another command with
another incantation into the pipeline reduces the convenience and the
obviousness of the solution.

(Regardless, the 'find' solution will not work in most of the cases
I see, because the program which outputs the data I need to parse
does not and cannot be made to null-terminate.)

The 'find' example was merely to generate a largish number of
pathnames quickly and easily and without a lot of setup.

However, it also had the side effect of producing null-terminated
records, such that xargs could parse them cleanly even if the input
contained spaces. Without null-terminated input, xargs cannot reliably
handle many real-world datasets.
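(The pairing in question is find's -print0 with xargs -0: names are
separated by NUL bytes, so embedded spaces pass through intact. A small
sketch with synthetic input standing in for find's output:)

```shell
# NUL-separated records survive xargs even when the names contain
# spaces; here printf plays the role of 'find -print0'.
printf '%s\0' 'a file' 'b file' | xargs -0 -n1 echo
```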

You say that you want the second from the end?  Subtract the
number from the end.

  $ echo /one/two/three/four | awk -F/ '{print$(NF-1)}'
  three

How about the final two, including their separator?

  $ echo /one/two/three/four | awk -F/ '{print$(NF-1)"/"$NF}'
  three/four

Clunky, in that you have to explicitly give the separator every time -
and, therefore, not trivially extensible for larger collections of final
fields. Still, it works.
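(It can be made extensible, at the cost of a little more awk: pass the
desired count in as a variable and rebuild the separators in a loop
rather than writing "/" once per field. A sketch, not anything from the
thread above:)

```shell
# Print the final n slash-separated fields, joined with "/".
# n is supplied with -v, so the one-liner need not change per count.
echo /one/two/three/four \
  | awk -F/ -v n=3 '{ s = $(NF-n+1); for (i = NF-n+2; i <= NF; i++) s = s "/" $i; print s }'
```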

On the command line I jam everything together without spaces just
like an old Fortran programmer. But in scripts I do use whitespace as
appropriate to increase readability. This might be easier to read.

  $ echo /one/two/three/four | awk -F/ '{ print $(NF-1) "/" $NF }'
  three/four

Noted. I figured that awk would be able to do this, but did not know
how.

(I kind of expect a "RTFM awk" here. The point, however, is that
this is much less convenient and intuitive to *find out about* -
much less to use - than are the options to cut.)

I would rather make it attractive to you such that you *want* to
learn more about it.  But I can see that I have already failed.

Oh, I do want to - but I do not necessarily have the time or the mental
energy to do so "in time" to have this be a convenient solution. If I
had already learned awk, I would almost certainly use it - but I do not
think that I would consider its availability a good enough reason to
refuse to make such a natural extension to these tools.

Language shapes the way people think[1].  If the only tool available
is a hammer then all problems look like nails.  If the only tools
available to you are 'cut' then sure expand it infinitely.  But then
eventually 'cut' will look like Ada or C++ with features solely
because their absence would offend someone.

I am aware of this tendency. I've had to restrain an urge towards
featureitis in a few of my own (private-use) programs. I would not want
to push for extending almost any program indefinitely. However, the
feature at hand seems like something *missing* from cut - because cut
provides select-by-field functionality already, but does not do so
completely. If cut did not provide select-by-field functionality, it
might be reasonable to argue that it should not be extended to do so -
but it would also be much less useful.

(Of course, then someone would probably request the ability to have cut
count characters from the end of the line, and the same basic discussion
would arise. But that's another matter.)

Fortunately shell programmers have a rich set of tools available and
are not limited to simply using 'cut'.  Tools such as 'awk' and 'sed'
and others are too good to want to avoid learning.  Personally I
enjoy learning about alternative ways to do things.  It makes
programming a joy and not a drudgery.

In principle and philosophy I agree with you. That does not change my
position on this matter.

This does not by itself address doing the same thing with programs
other than cut, though the much more terse message from Andreas
Schwab seems to indicate that there are ways to accomplish it
generically, but it does provide at least minimal incentive for me
to attempt to learn awk. (I'm having enough trouble with sed and
bash, and haven't improved at C in much of a decade - attempting to
gain reflexive-recall mastery of another language is not a pleasant
prospect...)

At least in the one-liner form it lends itself to use very quickly.
This covers 99.44% of everything you need to know.

  awk '/RE-pattern/{print $NUMBER}'

Assuming that I use it often enough to remember even that much.

Which I may well come to do; awk is looking more attractive the longer
this discussion continues.
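(To make that 99.44% pattern concrete, here is one instance of the
'grep plus column printing' shape Bob describes, run over made-up input
rather than any real command's output:)

```shell
# Select lines matching a pattern, print one field from each --
# the awk '/RE-pattern/{print $NUMBER}' template in action.
printf '%s\n' 'alice 42 active' 'bob 17 idle' 'carol 99 active' \
  | awk '/active/ { print $1 }'
```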

Nah...  Just use awk.  It is standard, portable and already does
what you ask along with many more features.  The syntax of doing
this with awk is quite obvious.  It is short and quick to type
when doing it on the command line.

The same argument could, presumably, be provided for the functions
for which cut does provide options. The advantages of having a
separate utility appear to be that it is more convenient to use
quickly and is easier to discover and learn.

I was not opposed to seeing fields-from-the-end added to cut.  I was
opposed to using cut at all.  :-)

This was not clear.

In that case, you are probably not the best person to be responding to
my initial query, and you should - if consistent (not that consistency
is necessarily All That) - be arguing for dropping cut entirely.

I do consider myself a comparatively advanced user (having built my
own system from parts and administered it more or less
independently for years), although I am nowhere near anything like
mastery yet, and I still find the (reputedly both complex and
powerful) languages whose use seems to be the standard response to
requests for enhancement to these tools to be intimidating.

Hmm...  But if a complex tool intimidates and by being too complex
prevents people from using it then keeping the core utilities
simple should be a prime guideline so that they remain usable in the
future.

The difference is that in order to use e.g. awk for this purpose, you
need to learn at least the basics of the complexity of its language,
whereas to use cut, you need only to know standard command-line syntax
and a few options easily discovered by scanning the manual.
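(That is the whole of what the cut man page asks of a reader: a
delimiter and a field list. An example over passwd-style lines, made up
here for illustration:)

```shell
# Everything cut needs: -d names the delimiter, -f names the fields.
printf '%s\n' root:x:0:0 daemon:x:1:1 \
  | cut -d: -f1
```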

That being another benefit of separate utilities: separate, complete,
brief documentation for each. That cut cannot do what I wanted to do
with it can be learned in a matter of seconds by reading its man page.
To learn that awk can would require reading through the man page far
enough to discover the section on fields, reading that closely enough to
understand it (not a given, since getting that far in a manual that long
holds the potential to involve much skimming), and realizing that the
feature described could be used to achieve the desired effect. (Assuming
that I even had occasion to look at the awk manual in the first place,
since to someone not already familiar with it it is not an obvious place
to look.)

If there is no resistance to adding features then the utilities would
become very cluttered and would turn into the sort of complex programs
which you are now objecting to as being too complex.

This seems to follow at first glance, but on further consideration, I
think it's nonsense. There are obvious, natural boundaries to cut, by
virtue of the fact that it is a program with a single purpose. The
addition of functionality which is irrelevant to that purpose would not
be reasonable, and could - indeed, should - be rejected.

The 'cp' and 'rsync' programs come to mind.  I know that you are
talking about cut and not cp but it is illustrative of the issue.
Often people ask for features in cp that already exist in rsync.  The
rsync program has a long list of features and continues to evolve.
Right now most people do not use rsync when they simply want to copy
a file from here to there.  Why not?  It is perfectly capable of the
task.  And it will do a zillion other tasks too.

For that matter, going in the other direction (lower-level features
instead of higher-level ones), so is dd.

I think many would answer because rsync is too complex for that task
or too bloated or too slow or other answers indicating that they used
cp because it was simple, direct and to the point.

Interestingly, this is roughly the same as one of the arguments I have
provided (or attempted to provide) against using awk rather than cut.

What program would people be able to use if cp became the same as
rsync?  There would in that case no longer be a simple program to
fall back upon.

The purpose of cp is complex enough that its boundaries are not so clear
and obvious as are those of some programs. If rsync did not exist, it
could be argued that cp would be a natural place for its features to be
provided. Since rsync does exist, I would agree that its features should
not be added to cp.

However, rsync is not a general-purpose tool; it is a special-purpose
program which fulfills that purpose rather well. awk, on the other hand,
is - if not necessarily truly general-purpose (depending on precisely
how one defines that) - at least a strongly multipurpose tool, covering
a multitude of functions.

Suggesting awk to someone who has asked for an extension to cut seems
vaguely akin to suggesting a Swiss Army knife to someone who has asked
for a screwdriver. Sure, the tool can probably handle the job they need
done, but it has a lot of unnecessary complexity unrelated to that job,
and the tool they asked for would have been entirely sufficient.

Expecting a more basic user - who may have been lucky in stumbling
across e.g. cut or sort at all - to spend the time to learn them,
when these capabilities seem natural extensions of the abilities
the tools already have, seems to me like at best a dubious and at
worst a damagingly elitist position.

Arguably programs such as awk and such are part of the set of shell
programming programs that every shell programmer should know.  By
suggesting awk (and at other times find-xargs and such) I am not
trying to be elitist but simply trying to pass along knowledge of the
rich set of possible tools in common use for shell programming.  The
problem is that shell scripting is not a closed system.

I'm not sure I'd agree that day-to-day usage of the command line and the
programs available thereon should be considered part of "shell
programming", or that a person who does such should be considered a
shell programmer.

The biggest issue I have with cut in practical use is the rigid
definition of fields separated by single TAB characters.  Awk, Perl,
Ruby, all have a more liberal definition of fields separated by
whitespace.  (Perl was not always that way but evolved into it.  The
perl 'split' default with no arguments was changed and was like cut
in the old days and now is like awk.)

Not necessarily single TAB characters, but yes, single characters. I
agree that the inability of cut to treat even a sequence of consecutive
identical characters as a single field separator, much less to treat
sequences of different characters (whitespace) as one, is annoying and
limiting. I've run up against that in one of the few shell scripts I've
actually written to date.
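(The rigidity in question is easy to demonstrate: cut treats every
occurrence of its delimiter as a field boundary, so a run of spaces
yields empty fields, while awk's default splitting collapses the run.)

```shell
# Same input, two tools: cut sees empty fields between the extra
# spaces; awk's default field splitting squeezes whitespace runs.
line='one   two'
echo "$line" | cut -d' ' -f2     # prints an empty field
echo "$line" | awk '{print $2}'  # prints "two"
```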

--
      The Wanderer

Warning: Simply because I argue an issue does not mean I agree with any
side of it.

Secrecy is the beginning of tyranny.



