emacs-devel

Re: [NonGNU ELPA] New package: llm


From: Daniel Fleischer
Subject: Re: [NonGNU ELPA] New package: llm
Date: Mon, 21 Aug 2023 09:36:42 +0300
User-agent: Gnus/5.13 (Gnus v5.13)

Jim Porter <jporterbugs@gmail.com> writes:

> The link says that this model has been pretrained, which is certainly
> useful for the average person who doesn't want (or doesn't have the
> resources) to perform the training themselves, but from the
> documentation, it's not clear how I *would* perform the training
> myself if I were so inclined. (I've only toyed with LLMs, so I'm not
> an expert at more "advanced" cases like this.)

When I say people can train models themselves, I mean "fine-tuning":
the process of taking an existing model and making it learn a specific
task by showing it a small number of examples, as few as 1000. There
are advanced techniques that train a model by modifying only a small
percentage of its weights; this type of training can be done in a few
hours on a laptop. See https://huggingface.co/docs/peft/index for a
tool to do that.
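
To give a feel for how small "a small percentage" can be, here is a
rough sketch based on LoRA, one of the methods that library implements.
LoRA keeps the original matrix frozen and trains two thin matrices next
to it. The function name and the sizes (a 4096 x 4096 matrix, rank 8)
are my own illustrative assumptions, not taken from any particular
model.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Size of the trained LoRA matrices relative to one frozen matrix.

    The frozen matrix W has d_out * d_in entries. LoRA trains two small
    matrices, B (d_out x rank) and A (rank x d_in), and uses W + B @ A.
    """
    frozen = d_out * d_in
    trainable = rank * (d_out + d_in)
    return trainable / frozen


# Assumed, illustrative sizes: one 4096 x 4096 projection, rank 8.
fraction = lora_trainable_fraction(4096, 4096, 8)
print(f"trained parameters relative to the frozen matrix: {fraction:.4%}")
```

With these assumed sizes the trained part is well under one percent of
the matrix it adapts, which is why this fits on modest hardware.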

> I do see that the documentation mentions the training datasets used,
> but it also says that "great efforts have been taken to clean the
> pretraining data". Am I able to access the cleaned datasets? I looked
> over their blog post[1], but I didn't see anything describing this in
> detail.
>
> While I certainly appreciate the effort people are making to produce
> LLMs that are more open than OpenAI (a low bar), I'm not sure if
> providing several gigabytes of model weights in binary format is
> really providing the *source*. It's true that you can still edit these
> models in a sense by fine-tuning them, but you could say the same
> thing about a project that only provided the generated output from GNU
> Bison, instead of the original input to Bison.

To a large degree, the model is the weights. Today's models mainly share
a single architecture, called a transformer decoder. Once you specify
the architecture and a few hyper-parameters in a config file, the model
is entirely determined by the weights. 

https://huggingface.co/mosaicml/mpt-7b/blob/main/config.json

Put differently, today's models differ mainly in their weights, not in
their architecture.
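
As a rough illustration of how few hyper-parameters fix the shape of
such a model, the sketch below estimates the number of weights of a
transformer decoder from four config-style values. The key names and
numbers are assumptions chosen to resemble a 7B-class model; check the
config.json linked above for the real ones. The estimate ignores
biases, layer norms and positional embeddings and assumes the output
layer shares the embedding matrix.

```python
def decoder_param_count(d_model, n_layers, vocab_size, expansion_ratio=4):
    """Approximate weight count of a transformer decoder."""
    # Attention: query, key, value and output projections, each d_model^2.
    attention = 4 * d_model * d_model
    # Feed-forward: one up-projection and one down-projection.
    mlp = 2 * expansion_ratio * d_model * d_model
    # Token embedding table, assumed to be shared with the output layer.
    embedding = vocab_size * d_model
    return n_layers * (attention + mlp) + embedding


# Hypothetical values, for illustration only.
config = {
    "d_model": 4096,
    "n_layers": 32,
    "vocab_size": 50432,
    "expansion_ratio": 4,
}
total = decoder_param_count(**config)
print(f"about {total / 1e9:.2f} billion weights")
```

Everything else about the model, i.e. what it actually knows, lives in
the values of those weights.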

As for reproducibility, the truth is that one cannot reproduce the
models, either in theory or in practice. The models can contain 7, 14,
30 or 60 billion parameters, which are floating-point numbers; it is
impossible to reproduce them exactly, as there are many sources of
randomness in the training process. Practically, pretraining is
expensive: it requires hundreds of GPUs, and training costs run from
around $100,000 for small models up to millions for larger ones.
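
One concrete reason exact reproduction is hard, offered as my own
example rather than an exhaustive list: floating-point addition is not
associative, so the same numbers summed in a different order can give a
different result. Parallel hardware does not guarantee a fixed
summation order, and training performs a very large number of such
sums.

```python
# The same three numbers, grouped in two different ways.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# The two results differ in the last bits, so the comparison is False.
print(left == right)
print(abs(left - right))
```

A difference this small in one step gets fed back into every later
step, so two runs drift apart even with identical data and code.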

Some projects do release their training data; see e.g.
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T

A side note: we are at a stage where our theoretical understanding is
lacking while practical applications are flourishing. Things move very
fast, and there is a strong drive to productize this technology, which
leads people and companies to invest a lot of resources in it. However,
the open source aspect is amazing: the architecture, code and insights
are shared between everyone, and some companies even share the models
they pretrained under open licensing (taking upon themselves the high
cost of training). That is a huge win for everyone, including the open
source and scientific communities, because now the innovation can come
from anywhere.

-- 
Daniel Fleischer


