Re: bug in xtime.h

From: Bruno Haible
Subject: Re: bug in xtime.h
Date: Mon, 23 Dec 2019 07:18:03 +0100
User-agent: KMail/5.1.3 (Linux/4.4.0-166-generic; KDE/5.18.0; x86_64; ; )

Hi Paul,

> >  xtime_sec (xtime_t t)
> >  {
> >    return (t < 0
> > -          ? (t + XTIME_PRECISION - 1) / XTIME_PRECISION - 1
> > +          ? (t + 1) / XTIME_PRECISION - 1
> >            : xtime_nonnegative_sec (t));
> Thanks for pointing out the bug. We can simplify the fix further (and speed it
> up a bit on typical hosts).

While I like the code you installed (it is simpler than the one I proposed),
I must point out that it's hard to predict what speed characteristics
"typical hosts" will show. When I compile this file with gcc-9.2.0 -O2 -S
(or similarly with clang):

long long sec1 (long long t)
{ return (t < 0 ? (t + 1) / 1000000000 - 1 : t / 1000000000); }

long long sec2 (long long t)
{ return t / 1000000000 - (t % 1000000000 < 0); }

I get this assembly code:

sec1:
        testq   %rdi, %rdi
        js      .L5
        movabsq $1237940039285380275, %rdx
        movq    %rdi, %rax
        sarq    $63, %rdi
        imulq   %rdx
        movq    %rdx, %rax
        sarq    $26, %rax
        subq    %rdi, %rax
        ret
.L5:
        movabsq $1237940039285380275, %rdx
        addq    $1, %rdi
        movq    %rdi, %rax
        sarq    $63, %rdi
        imulq   %rdx
        sarq    $26, %rdx
        subq    %rdi, %rdx
        leaq    -1(%rdx), %rax
        ret

sec2:
        movabsq $1237940039285380275, %rdx
        movq    %rdi, %rax
        imulq   %rdx
        movq    %rdx, %rax
        movq    %rdi, %rdx
        sarq    $63, %rdx
        sarq    $26, %rax
        subq    %rdx, %rax
        imulq   $1000000000, %rax, %rdx
        subq    %rdx, %rdi
        shrq    $63, %rdi
        subq    %rdi, %rax
        ret

Similarly with clang 9:

sec1:
        movq    %rdi, %rax
        testq   %rdi, %rdi
        js      .LBB0_1
        shrq    $9, %rax
        movabsq $19342813113834067, %rcx
        mulq    %rcx
        movq    %rdx, %rax
        shrq    $11, %rax
        retq
.LBB0_1:
        addq    $1, %rax
        movabsq $1237940039285380275, %rcx
        imulq   %rcx
        movq    %rdx, %rax
        shrq    $63, %rax
        sarq    $26, %rdx
        addq    %rdx, %rax
        addq    $-1, %rax
        retq

sec2:
        movabsq $1237940039285380275, %rcx
        movq    %rdi, %rax
        imulq   %rcx
        movq    %rdx, %rax
        shrq    $63, %rax
        sarq    $26, %rdx
        addq    %rax, %rdx
        imulq   $1000000000, %rdx, %rax
        subq    %rax, %rdi
        sarq    $63, %rdi
        leaq    (%rdi,%rdx), %rax
        retq

So, sec1 has one more conditional jump, whereas sec2 has one more 64-bit
multiplication instruction in its path. How well will the branch
prediction unit be able to optimize the conditional jump? I measured
it with this program:

#include <stdlib.h>

static inline long long sec1 (long long t)
{ return (t < 0 ? (t + 1) / 1000000000 - 1 : t / 1000000000); }

static inline long long sec2 (long long t)
{ return t / 1000000000 - (t % 1000000000 < 0); }

/* volatile keeps the compiler from computing the result once and
   hoisting it out of the loop.  */
volatile long long t = 1576800000000000000LL;
volatile long long x;

int
main (int argc, char *argv[])
{
  int repeat = argc > 1 ? atoi (argv[1]) : 1000000000;
  int i;

  for (i = repeat; i > 0; i--)
    x = sec1 (t); // or sec2 (t)

  return 0;
}

Results (each compiled with -O2 and run with argument 1000000000,
on an Intel Core m3 CPU; times are per loop iteration):

                 gcc             clang
sec1           1.28 ns          1.04 ns
sec2           1.78 ns          1.78 ns

And on sparc64:

sec1           7.79 ns
sec2           8.06 ns

And on aarch64:

sec1          27.5 ns
sec2          55.0 ns

Again: I'm not asking to optimize this particular function. Simply,
from time to time, I like to question the assumptions we make about
the compiler and about "typical hosts".

