From: Simon Tournier
Subject: how to deal with large dataset? (was Re: Where should we put machine learning model parameters ?)
Date: Thu, 06 Apr 2023 20:55:55 +0200


Well, we already discussed, in the GWL (Guix Workflow Language) context,
where to put “large” data sets, without reaching a conclusion.  Having
“large” data sets inside the store is probably not a good idea.  But
maybe these model data are not “large” enough to worry about the store.

On Mon., 03 April 2023 at 18:48, Nicolas Graves via "Development of GNU Guix
and the GNU System distribution." wrote:

> In the case of nerd-dictation, the model parameters that can be used
> are listed here :

Here, it is not that large…

--8<---------------cut here---------------start------------->8---
vosk-model-en-us-0.22              1.8G
vosk-model-en-us-0.42-gigaspeech   2.3G
vosk-model-ru-0.10                 2.5G
--8<---------------cut here---------------end--------------->8---

…compared to some existing data packages:

--8<---------------cut here---------------start------------->8---
$ for p in $(guix build -S $(guix package -A 'r\-' | grep genome | cut -f1)); \
    do du -sh "$p"; done | sort -hr | head -9
--8<---------------cut here---------------end--------------->8---

but still.  Well, I do not know whether a data set of 2G belongs in the
store, but I have nothing better to propose.

> One caveat is that using all these models can take a lot of space on the
> servers, a burden which is not useful because no build step are really
> needed (except an unzip step). In this case, we can use the
> #:substitutable? #f flag. You can find an example of some of these
> packages right here :

That is what is done for some packages in gnu/packages/bioconductor.scm.
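
A minimal sketch of what such a data-only package might look like is
given below.  It is not taken from bioconductor.scm or from the packages
mentioned above.  The module name follows the file name floated in this
thread; the URL, hash, install path, home page and license are
placeholders; and it assumes that copy-build-system accepts
#:substitutable? in the same way as the build system used for the
Bioconductor data packages.

--8<---------------cut here---------------start------------->8---
;; Sketch only.  Everything marked "placeholder" or "assumption" is
;; hypothetical and must be replaced with real values.
(define-module (gnu packages machine-learning-models)
  #:use-module ((guix licenses) #:prefix license:)
  #:use-module (guix packages)
  #:use-module (guix download)
  #:use-module (guix build-system copy)
  #:use-module (gnu packages compression))

(define-public vosk-model-en-us
  (package
    (name "vosk-model-en-us")
    (version "0.22")
    (source
     (origin
       (method url-fetch)
       ;; Placeholder URL, not the real download location.
       (uri (string-append "https://example.org/vosk-model-en-us-"
                           version ".zip"))
       (sha256
        ;; Placeholder hash; use the one printed by 'guix download'.
        (base32
         "0000000000000000000000000000000000000000000000000000"))))
    (build-system copy-build-system)
    (arguments
     ;; The "build" is only an unzip, so do not offer substitutes.
     ;; Assumption: copy-build-system honors #:substitutable?.
     '(#:substitutable? #f
       ;; Placeholder install location.
       #:install-plan '(("." "share/vosk-models/en-us/"))))
    (native-inputs (list unzip))        ;to unpack the .zip source
    (home-page "https://example.org/")  ;placeholder
    (synopsis "English speech recognition model for Vosk")
    (description
     "This package provides pre-trained model parameters only; it
contains no code.")
    (license license:asl2.0)))          ;assumption; check the real license
--8<---------------cut here---------------end--------------->8---

With #:substitutable? #f, each user machine would download and unpack
the archive itself instead of fetching the result from a substitute
server, which is the trade-off discussed in the quoted paragraph.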

> So my question is: Should we add this type of models in packages for
> Guix? If yes, where should we put them? In machine-learning.scm? In a
> new file machine-learning-models.scm (such a file would never need new
> modules, and it might avoid some confusion between the tools and the
> parameters needed to use the tools)?

Well, gnu/packages/machine-learning-data.scm or s/data/models sounds
good to me.

