guix-science

Re: Conda environments and reproducibility


From: Simon Tournier
Subject: Re: Conda environments and reproducibility
Date: Tue, 29 Nov 2022 21:10:20 +0100

Hi Hugo, all,

On Tue, 29 Nov 2022 at 14:12, Hugo Buddelmeijer <hugo@buddelmeijer.nl> wrote:

>                                                      However, conda seems
> to work fine for most people. It would therefore be instructive to have
> concrete 'failure stories' in order to show people that conda is not enough.

Here is what I would do if I were trying to convince my colleagues that
Conda is not enough.

1. Target one or two common environments; for example,
   (Python+Numpy+Scipy+Matplotlib) for one, and (R+Seurat) for two.

2. Generate both environments following the Conda documentation.

Up to here, all should work smoothly. :-)

3. Commit the Conda files in a Git repository; for instance,

       for e in py rseurat
       do
         conda activate "$e"
         conda env export > "environment-$e.yml"
         conda list --explicit > "explicit-spec-$e.txt"
         conda deactivate
       done

4.
   a) on the same machine, try to recreate the 2 environments.
   b) on another machine, idem.
   c) Commit to the Git repository how it goes.
   d) Remove the two environments, and more, on both machines.

5. Every new month, do #4.
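
Steps 4 and 5 could look roughly like the sketch below.  It is only a
sketch: the CONDA and LOG variables, the "check-" environment names and
the log format are my assumptions, not something Conda provides, and the
file names are the ones written at step 3.

```shell
#!/bin/sh
# Sketch of steps 4a/4c/4d.  CONDA and LOG are hypothetical knobs.
CONDA="${CONDA:-conda}"
LOG="${LOG:-recreate-log.tsv}"

# Run one recreation attempt and append "date, host, label, outcome" to $LOG.
attempt() {
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then outcome=ok; else outcome=FAILED; fi
  printf '%s\t%s\t%s\t%s\n' \
    "$(date +%Y-%m-%d)" "$(uname -n)" "$label" "$outcome" >> "$LOG"
}

check_env() {
  e="$1"
  # Route 1: recreate from the YAML written by "conda env export".
  attempt "$e-yml" "$CONDA" env create \
    --name "check-$e-yml" --file "environment-$e.yml"
  # Route 2: recreate from the URLs written by "conda list --explicit".
  attempt "$e-explicit" "$CONDA" create --yes \
    --name "check-$e-explicit" --file "explicit-spec-$e.txt"
  # Step 4d: remove what was just created (errors ignored on purpose).
  "$CONDA" env remove --yes --name "check-$e-yml" >/dev/null 2>&1
  "$CONDA" env remove --yes --name "check-$e-explicit" >/dev/null 2>&1
  return 0
}

main() {
  for e in py rseurat; do check_env "$e"; done
  # Step 4c: commit how it went.
  git add "$LOG" && git commit -m "Recreation attempt on $(uname -n)"
}

# Run only when asked explicitly, so the file can also be sourced.
if [ "${1:-}" = "--run" ]; then main; fi
```

Running it with --run from the Git repository, once a month on each
machine, would leave one line per attempt in the log, so the history of
successes and failures is itself versioned.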


Maybe it can be automated with a cron job.  And maybe we could
collectively run this experiment.  And we could do the same with
Guix. :-)
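
For the Guix side, the counterpart of steps 3 and 4 could be sketched as
below.  Again, only a sketch under assumptions: the GUIX variable, the
function names and the file names channels.scm and manifest-py.scm are
mine, and the package list is just the Python environment from step 1.

```shell
#!/bin/sh
# Guix counterpart, sketch only.  GUIX is a hypothetical knob.
GUIX="${GUIX:-guix}"

# Step 3 equivalent: record the Guix revision and the package list.
guix_freeze() {
  "$GUIX" describe --format=channels > channels.scm &&
  "$GUIX" shell python python-numpy python-scipy python-matplotlib \
    --export-manifest > manifest-py.scm
}

# Step 4 equivalent: rebuild the environment from the two files and
# check that the libraries at least import.
guix_recreate() {
  "$GUIX" time-machine --channels=channels.scm -- \
    shell --manifest=manifest-py.scm -- \
    python3 -c 'import numpy, scipy, matplotlib'
}
```

Committing channels.scm and manifest-py.scm plays the role of the two
Conda files, and guix_recreate is what would be tried again every month.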

Well, we have not yet talked about actually running something.  We could
also write a small Python script plotting something using Numpy and/or
Scipy, and try to run the Seurat vignette.

From my experience, after some months (from 2-3 to 6), Conda will fail.
Especially after an update of the system (apt upgrade), and it can be
worse with a 'dist-upgrade'. :-)


> On Tue, 29 Nov 2022 at 11:32, Thibault Lestang <t.lestang@imperial.ac.uk>
> wrote:
>
>> That's fair enough. Conda & pip are everywhere around me, and I'd like
>> to form an accurate picture of their shortcomings before mentioning
>> alternative approaches to people who use these tools everyday!
>
> I agree, let me share my perspective.

Conda and pip work very well when we have in mind a forward view of
history.  By design, they fail when going backward.  For engineering,
they are very efficient, and personally I would rely on them **if** I
had some systems to maintain where I only cared about upgrading them.
Well, Conda, pip or some other distro package manager.

The trouble starts when you try to restore the past.  The 10 Years
Challenge [1] provides very good examples.  This report [2] (in French,
but an English version is probably around) provides very good insights,
IMHO, about the limitations of classical package managers (such as
Debian's, Conda, pip, etc.).

For what my biased opinion is worth, many shortcomings are around. :-)
For instance, this paper [3] points out that the reproduction was «so
time-consuming and resulted in only 11 out of 28 (39%) figure panels
conveying the same information».  Well, for sure it is hard to know if
the students tried hard or not, and the paper does not say much about
the computational environment.

(Well, aside from the transparency of the computational stack, which
Conda barely provides, but that's another story. :-))

1: <https://www.nature.com/articles/d41586-020-02462-7>
2: <https://hpc.guix.info/static/videos/atelier-reproductibilit%C3%A9-2021/arnaud-legrand.webm>
3: <https://doi.org/10.1371/journal.pcbi.1010615>


> That is, "conda env export" should contain entries like
> "scipy=1.8.0=py39hee8e79c_1", where the hee8e79c should uniquely define the
> dependencies 'that matter', like which compiler is used. What goes into the
> hash seems rather complicated, and grows over time.
>
> This hash is a great step forward in reproducibility. But it is too
> fragile. I can't directly see how, but I can easily assume that this
> dependency-hash mechanism leads to the problem that Konrad faced even when
> no files are overwritten. Maybe because a new dependency resolver in conda
> would have stricter rules on interoperability. (It is still possible that
> files indeed were overwritten though; it was probably an incident like this
> that made them change the hashes.)

Well, I think the Conda documentation [4] about the dependency solver
puts some warnings around this explicit mechanism.  It has been a long
time since I last looked at Conda, but from my understanding of the
solver documentation, this “failure” reported by Konrad appears to me
expected, by design of Conda. ;-)

If the solver tries to satisfy many constraints, then the problem gets
more complex as time goes on.  So, Conda probably fails to find a
working combination.

If the solver is bypassed, then there is no guarantee that the generated
state is a working computational environment.  Conda recommends
updating in order to fix the potential issues.
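
To make the two cases concrete: as far as I understand, a file written
by "conda list --explicit" carries an @EXPLICIT marker followed by
package URLs and is installed as-is, while the YAML from "conda env
export" is a set of constraints handed to the solver again.  A small
sketch telling them apart ("route_for" is a hypothetical helper of mine,
not a Conda command):

```shell
#!/bin/sh
# Sketch: report which route Conda would take for a given file.
# "route_for" is a hypothetical helper, not part of Conda.
route_for() {
  if grep -q '^@EXPLICIT' "$1"; then
    # List of package URLs: installed as-is by "conda create --file",
    # so nothing checks that the result is consistent.
    echo "explicit: solver bypassed"
  else
    # Constraints (e.g. YAML from "conda env export"): the solver must
    # find a combination that satisfies them at recreation time.
    echo "constraints: solver runs"
  fi
}
```

For example, "route_for explicit-spec-py.txt" and "route_for
environment-py.yml" should report the first and the second case
respectively, which are the two ways of failing described above.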

4: <https://conda.io/projects/conda/en/latest/dev-guide/deep-dives/solvers.html>


> One thing that conda (or actually conda-forge) does well, are their bots.
> I'm a maintainer of some conda packages and once a month or so I get a
> fully automated pull request to update my package [4], e.g. when the
> upstream package is updated, or when a dependency is updated. They even
> have a tracking system for migrating dependencies that are used by many
> packages, such as compilers. This makes maintaining conda-forge packages a
> breeze. Having such bots also within the guix-ecosystem would probably help
> attract developers.

Cool!  Do you know if the code of these bots is available?


> By the way, it is quite hard to use conda in guix,

Maybe you could open bugs and/or report on help-guix or guix-devel the
annoyances you are observing.  For my part, I fully removed Conda from
my toolbox so I never hit them. ;-)


Cheers,
simon


