bug-coreutils

Re: Support in sort for human-readable numbers


From: Vitali Lovich
Subject: Re: Support in sort for human-readable numbers
Date: Tue, 6 Jan 2009 13:16:15 -0500

On Tue, Jan 6, 2009 at 12:26 PM, Pádraig Brady <address@hidden> wrote:
> Vitali Lovich wrote:
>> On Tue, Jan 6, 2009 at 10:19 AM, Pádraig Brady <address@hidden> wrote:
>>> I like the idea.
>>>
>>> So it doesn't support sorting these correctly for example:
>>>
>>> 999MB
>>> 998MiB
>>> 1GiB
>>> 1030MiB
>>>
>>> I.E. a mixture of ^2 and ^10 are not supported,
>>> nor overlapping number ranges.
>
> I'm not complaining about the above. Just clarifying.
>
>>> +  /* FIXME: maybe add option to check for longer suffixes (i.e. gigabyte) */
>>>
>>> You should allow at least G, GiB and GB formats.
>>> Probably should print error if more than one of those
>>> formats used, since that's not supported.
>>
>> I dunno if you read my previous post, but I presented the reasoning
>> that if the user has some kind of longer format, it's better handled
>> by piping the input through a sed script first.  Can you present a
>> situation where it would be better for sort itself to try and parse
>> longer suffixes?
>>
>> On a side note, the XiB format (MiB, GiB) is extremely uncommon in my
>> opinion.
>
> It's debatable, but I think we should support the XiB and XB formats
> as I've seen them quite often, and certain coreutils like dd for example
> take this format as a size specifier. Also look at human_readable() from 
> gnulib.
Perhaps - but for sort, at least the way I would implement this, the
additional logic (to behave correctly on all inputs) would be somewhat
complicated.  Can you please explain why you believe this belongs in
sort rather than being handled by pre-processing the text before sort
and post-processing it afterwards as necessary?

Supporting all the various ways human_readable can format its output
is just not practical or even useful, since the user would have to
refer to the manpage every time to figure out which switches to enable
to get the proper behaviour.  Also, compare the amount of code
human_readable needs to convert a number into a string (a much easier
problem) with how much additional code this one feature involves.

Sure, dd may take that as input, but it's in a different situation -
it needs to understand what number the user is actually representing.
We don't need that extra logic since sort doesn't really need to
know - it can work without converting it from a string.
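
A minimal sketch of comparing such values without converting them to
numbers.  This is not the code from the patch; the names suffix_order,
int_digits and human_compare, and the suffix letters KMGTPEZY, are
assumptions made for illustration.  It assumes well-formed,
non-negative integers with at most a one-letter suffix, and a single
base for the whole input.

```c
#include <ctype.h>
#include <string.h>

/* Rank of a one-letter suffix; 0 means "no suffix".
   The letter set is an assumption of this sketch. */
static int suffix_order(char c)
{
    static const char units[] = "KMGTPEZY";
    const char *p;

    if (c == '\0')
        return 0;
    p = strchr(units, c);
    return p ? (int)(p - units) + 1 : 0;
}

/* Skip leading zeros, leave *s at the first significant digit and
   return the length of the digit run that starts there. */
static size_t int_digits(const char **s)
{
    size_t n = 0;

    while (**s == '0')
        ++*s;
    while (isdigit((unsigned char)(*s)[n]))
        ++n;
    return n;
}

/* Compare two human-readable numbers as strings: the suffix decides
   first, then the digit count, then the digits themselves. */
int human_compare(const char *a, const char *b)
{
    size_t alen = int_digits(&a);
    size_t blen = int_digits(&b);
    int aord = suffix_order(a[alen]);
    int bord = suffix_order(b[blen]);
    int diff;

    if (aord != bord)
        return aord - bord;
    if (alen != blen)
        return alen < blen ? -1 : 1;
    diff = memcmp(a, b, alen);
    return (diff > 0) - (diff < 0);
}
```

Under these assumptions, human_compare("2K", "1M") is negative and
human_compare("10K", "9K") is positive.  It also puts 1030M before 1G,
which is the overlapping-range limitation mentioned earlier in the
thread.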

I'm not saying you're incorrect - I'm just asking you to justify it by
providing a use-case where not providing the logic within sort would
result in a complicated shell-script workaround for the end-user.

> Alternatively you could allow any string starting with [KMGT..]
> to allow things like KB/s KiBuckets, but then it would be
> tricky to flag mixtures of KiB and KiBuckets as an error for example.
That's definitely not an acceptable solution because the behavior
would be incorrect if you had something like 2Klingongs.
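
To make the difference concrete, here is a hedged sketch of a strict
check that accepts only the X, XB and XiB spellings and rejects
anything longer.  The function name is_size_suffix and the letter set
KMGTPEZY are hypothetical; nothing like this is in the patch.

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if s starts with exactly X, XB or XiB (X in KMGTPEZY)
   followed by end of string or whitespace, and 0 otherwise. */
int is_size_suffix(const char *s)
{
    if (*s == '\0' || strchr("KMGTPEZY", *s) == NULL)
        return 0;
    ++s;
    if (*s == 'i') {
        /* "Ki" must be completed by 'B' to count as KiB. */
        if (s[1] != 'B')
            return 0;
        s += 2;
    } else if (*s == 'B') {
        ++s;
    }
    /* Anything still attached to the token disqualifies it. */
    return *s == '\0' || isspace((unsigned char)*s);
}
```

With this check "K", "GB" and "GiB" are accepted, while "Klingongs"
and "KiBuckets" are rejected instead of being treated as kilo-something.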

>>> +  /* FIXME:  a_order - b_order || raw_comparison can be used - would that
>>> +     be faster? */
>>>
>>> Yep if you're not supporting overlapping number ranges then
>>> you can skip the number comparison totally if the suffixes don't match.
>> Actually it has nothing to do with that.  I was just thinking that the
>> equality operation I'm testing for is already essentially doing a
>> subtraction and then I'm returning the actual subtraction itself.
>
> Oh right.
> Anyway the optimization I mentioned would probably be useful.
Debatable.  You'd still have to scan the string to find the end of the
number in order to find the suffix.  And if you get a miss (i.e. same
suffix-level), then you'll have to scan the strings again to perform
the comparison.  So it's not even obvious that there would be an
advantage when the suffixes differ: it might be faster, but I don't
think it can possibly be more than a 2-3% difference, since you're
just skipping the comparison of two characters that are presumably
already in registers, or at least in the cache.  And there's
definitely a hit (about 2x slower) when they don't.
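
On the FIXME itself, a small sketch of the "subtract once and reuse
the result" idea.  Read literally as C, a_order - b_order ||
raw_comparison evaluates to 0 or 1, so the sketch uses a conditional
instead.  The function name order_then_raw is hypothetical, and the
parameters simply mirror the names in the FIXME.

```c
/* Return the suffix-order difference when it is non-zero, otherwise
   fall back to the result of the raw comparison. */
int order_then_raw(int a_order, int b_order, int raw_comparison)
{
    int diff = a_order - b_order;

    return diff ? diff : raw_comparison;
}
```

This keeps the sign and magnitude of whichever comparison decided the
result, with a single subtraction and no separate equality test.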

Vitali
