emacs-devel

Re: [NonGNU ELPA] New package: llm


From: Jim Porter
Subject: Re: [NonGNU ELPA] New package: llm
Date: Sun, 20 Aug 2023 23:03:30 -0700

On 8/20/2023 10:12 PM, Andrew Hyatt wrote:
> The training of these is fairly straightforward, at least if you are familiar with the area. ... the LLMs we are talking about here use this technique to train and execute, changing some parameters and adding things like more attention heads, but keeping the fundamental architecture the same.

I think the parameters would be a key part of this (or potentially all of the code they used for the training, if it does something unique), as well as the *actual* training datasets. That's why I'm especially concerned about the line in their docs saying "great efforts have been taken to clean the pretraining data". I couldn't find out whether they provided the cleaned data or only the "raw" data. From my understanding, properly cleaning the data is labor-intensive, and you wouldn't be able to reproduce another team's efforts in that area unless they gave you a diff or something equivalent.

> I'm not an expert, but I believe that due to the use of stochastic processes in training, even if you had the exact code, parameters and data used in training, you would never be able to reproduce the model they make available. It should be equivalent in quality, perhaps, but not the same.

This is a problem for reproducibility (it would be nice if you could *verify* that a model was built the way its makers said it was), but I don't think it's a critical problem for freedom.

> To me, I believe it should be about freedom. Not absolute freedom, but relative freedom: do you, the user, have the same amount of freedom as anyone else, including the creator? For LLMs like the ones on huggingface and many other research LLMs, the answer is yes.

So long as the creators provide all the necessary parameters to retrain the model from "scratch", I think I'd agree. If some of these aren't provided (cleaned datasets, training parameters, any direct human intervention if applicable, etc.), then I think the answer is no. For example, the creator could decide that one data source is bad for some reason and retrain their model without it. Would I be able to do that work independently with just what the creator has given me?

I see that there was a presentation at LibrePlanet 2023 (or maybe shortly after) by Leandro von Werra of HuggingFace on the ethics of code-generating LLMs[1]. The FSF bulletin says the recording hasn't been published online yet, though. This might not be the final answer to all the concerns about incorporating LLMs into Emacs, but hopefully it would help.

In practice, though, I think that if Emacs were to support communicating with LLMs, it would be good if, at minimum, we could direct users to an essay explaining the potential ethical/freedom issues with them. On that note, maybe we could also take a bit of inspiration from Emacs dynamic modules: they require a GPL compatibility symbol[2] in order to load, and perhaps a hypothetical 'llm-foobar' package that interfaces with the 'foobar' LLM could announce whether it respects users' freedom via some variable/symbol. Freedom-respecting LLMs wouldn't need a warning message then. We could even forbid packages that talk to particularly "bad" LLMs. (I suppose we can't stop users from writing their own packages and just lying about whether they're OK, but we could prevent their inclusion in ELPA.)
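To make that concrete, here is a minimal Emacs Lisp sketch of what such a declaration could look like, loosely modeled on the 'plugin_is_GPL_compatible' symbol that dynamic modules must export. All of the names here ('llm-foobar-respects-freedom', 'llm-maybe-warn') are hypothetical and not part of any existing package:

;; A hypothetical provider package, llm-foobar.el, declares whether
;; the LLM it talks to meets whatever freedom criteria we settle on.
(defvar llm-foobar-respects-freedom nil
  "Non-nil means the foobar LLM is considered to respect users' freedom.")

;; The core library could then check for that declaration before use
;; and warn the user when it is absent or nil.
(defun llm-maybe-warn (provider)
  "Warn unless PROVIDER declares that it respects the user's freedom.
PROVIDER is a symbol such as `llm-foobar'."
  (let ((flag (intern-soft (format "%s-respects-freedom" provider))))
    (unless (and flag (boundp flag) (symbol-value flag))
      (lwarn 'llm :warning
             "The %s provider may not respect your freedom as a user"
             provider))))

;; Example: (llm-maybe-warn 'llm-foobar) emits the warning above,
;; since `llm-foobar-respects-freedom' is nil; a freedom-respecting
;; provider would set it to t and load silently.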

[1] https://www.fsf.org/bulletin/2023/spring/trademarks-volunteering-and-code-generating-llm

[2] https://www.gnu.org/software/emacs/manual/html_node/elisp/Module-Initialization.html


