Re: Optionally using more advanced CPU features

From: Dave Love
Subject: Re: Optionally using more advanced CPU features
Date: Fri, 01 Sep 2017 11:46:16 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)

Ludovic Courtès <address@hidden> writes:

>> That may be the best way to handle it, but it's not widely available,
>> and isn't possible generally (as far as I know), e.g. for Fortran code.
>> See also below.  This issue surfaced again recently in Fedora.
> Right.  Do you have examples of Fortran packages in mind?

Not much off-hand because, shall we say, there's a shortage of the sort
of profiling information that's necessary for system performance
engineering and procurement.  It's not in Guix, but cp2k is a (mainly)
Fortran program that is, or was, used as a performance regression test
for GCC.  I only know about its profile for cases where time in MPI or
fftw is most relevant.  However, two of its kernels, ELPA and libsmm (as
libxsmm), have low-level optimized versions for x86_64, but only Fortran
implementations for other architectures as far as I know.

Otherwise, BLAS/LAPACK for any micro-architectures that don't have
support in free optimized variants like OpenBLAS.

>> In cases that don't dispatch on cpuid (or whatever), I think the
>> relevant missing OS/tool support is SIMD-specific hwcaps in the loader.
>> Hwcaps seem to be essentially undocumented, but there is, or has been,
>> support for instruction set capabilities on some architectures, just not
>> x86_64 apparently.  (An ancient example was for missing instructions on
>> some SPARC systems which greatly affected crypto operations in ssh et
>> al.)
> But that sounds similar to IFUNC in that application code would need to
> actually use hwcap info to select the right implementation at load time,
> right?

As far as I know, it's a loader feature.  See "Hardware capabilities" in

> >> There’s probably scientific software out there that can benefit from
> >> using the latest SSE/AVX/whatever extension, and yet doesn’t use any of
> >> the tricks above.  When we find such a piece of software, I think we
> >> should investigate and (1) see whether it actually benefits from those
> >> ISA extensions, and (2) see whether it would be feasible to just use
> >> ‘target_clones’ or similar on the hot spots.
> >
> >> One example which has been investigated, and you can't, is BLIS.  You
> (Why “you can’t?”  It’s free software AFAICS on
> <>.)

Well, you could embark on some sort of (GCC-specific?) re-write, but it
would be better to work on <>.
I don't think there's anywhere you can just attach GCC attributes, and
certainly no magic will happen for currently-unsupported architectures.
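
For readers who haven't met the attribute being discussed, here is a
minimal sketch of what "attaching" target_clones to a hot spot looks like
in plain C.  It is not taken from BLIS or any other package: the function
name axpy_sketch and the MULTIVERSION macro are hypothetical, and the
guard assumes that multi-versioning is only wanted where the compiler
knows the attribute, on x86_64, with glibc (which provides the IFUNC
mechanism the clones rely on).  Everywhere else the macro expands to
nothing and only a baseline version is built, which is the "no magic"
case above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>   /* a libc header, so that __GLIBC__ is defined on glibc */

/* Enable target_clones only where it is likely to work: a compiler that
   recognises the attribute, x86_64, and glibc (clones are selected at
   load time through IFUNC).  Otherwise MULTIVERSION is empty and a
   single baseline version of the function is compiled. */
#if defined(__has_attribute)
#  if __has_attribute(target_clones) && defined(__x86_64__) && defined(__GLIBC__)
#    define MULTIVERSION __attribute__((target_clones("avx2", "default")))
#  endif
#endif
#ifndef MULTIVERSION
#  define MULTIVERSION
#endif

/* Hypothetical hot spot: y[i] += a * x[i] for i in [0, n). */
MULTIVERSION
void axpy_sketch(size_t n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Callers just call axpy_sketch() as usual; where the attribute is active,
the compiler emits an AVX2 clone and a default clone, and the choice
between them is made once when the symbol is resolved.  This only helps
for loops the compiler can vectorize by itself.  It does nothing for
hand-written kernels or for Fortran, which is the limitation described
above.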

>> need it for vaguely competitive avx512 linear algebra.  (OpenBLAS is
>> basically fine for previous Intel and AMD SIMD.)  See, e.g.,
>> <>
>> et seq.  I don't know if there's any good reason to, but if you want
>> ATLAS you have the same issue -- along with extra issues building it.
> ATLAS is a problem because it does build-time ISA selection (and maybe
> profile-guided optimization?).

Yes, that's what I meant.  (I can't remember to what extent you can just
specify the architecture and build it without the parameter sweep.)

> I sympathize with the idea of having several ABI-compatible BLAS
> implementations for the reasons you give.  That somewhat conflicts with
> the idea of reproducibility, but after all we can have our cake and eat
> it too: the user can decide to have LD_LIBRARY_PATH point to an
> alternate ABI-compatible BLAS, or they can keep using the one that
> appears in RUNPATH.
> Thoughts?

Right, about the cake -- as with other packaging systems -- and
LD_LIBRARY_PATH/LD_PRELOAD are important for debugging and measurement
anyway.  [I know too much about computing and experimental science to
believe in reproducibility as it's normally talked about, though
facilities for reproducible builds and environment components are good.]
