emacs-devel

Re: [NonGNU ELPA] New package: llm


From: Jim Porter
Subject: Re: [NonGNU ELPA] New package: llm
Date: Sun, 20 Aug 2023 23:03:30 -0700

On 8/20/2023 10:12 PM, Andrew Hyatt wrote:
> The training of these is fairly straightforward, at least if you are familiar with the area. ... the LLMs we are talking about here use this technique to train and execute, changing some parameters and adding things like more attention heads, but keeping the fundamental architecture the same.

I think the parameters would be a key part of this (or potentially all of the code they used for the training, if it does something unique), as well as the *actual* training datasets. That's why I'm especially concerned about the line in their docs saying "great efforts have been taken to clean the pretraining data". I couldn't find out whether they provided the cleaned data or only the "raw" data. From my understanding, properly cleaning the data is labor-intensive, and you wouldn't be able to reproduce another team's efforts in that area unless they gave you a diff or something equivalent.

> I'm not an expert, but I believe that due to the use of stochastic processes in training, even if you had the exact code, parameters and data used in training, you would never be able to reproduce the model they make available. It should be equivalent in quality, perhaps, but not the same.

This is a problem for reproducibility (it would be nice if you could *verify* that a model was built the way its makers said it was), but I don't think it's a critical problem for freedom.

> To me, I believe it should be about freedom. Not absolute freedom, but relative freedom: do you, the user, have the same amount of freedom as anyone else, including the creator? For LLMs like the ones on huggingface and many other research LLMs, the answer is yes.

So long as the creators provide all the necessary parameters to retrain the model from "scratch", I think I'd agree. If some of these aren't provided (cleaned datasets, training parameters, any direct human intervention if applicable, etc.), then I think the answer is no. For example, the creator could decide that one data source is bad for some reason and retrain their model without it. Would I be able to do that work independently with just what the creator has given me?

I see that there was a presentation at LibrePlanet 2023 (or maybe shortly after) by Leandro von Werra of HuggingFace on the ethics of code-generating LLMs[1]. The FSF bulletin says the recording hasn't been published online yet, though. This might not be the final answer to all the concerns about incorporating LLMs into Emacs, but hopefully it would help.

In practice, though, I think that if Emacs were to support communicating with LLMs, it would be good if, at minimum, we could direct users to an essay explaining the potential ethical/freedom issues with them. On that note, maybe we could also take a bit of inspiration from Emacs dynamic modules: they require a GPL compatibility symbol[2] in order to load, and perhaps a hypothetical 'llm-foobar' package that interfaces with the 'foobar' LLM could announce whether it respects users' freedom via some variable/symbol. Freedom-respecting LLMs wouldn't need a warning message then. We could even forbid packages that talk to particularly "bad" LLMs. (I suppose we can't stop users from writing their own packages and just lying about whether they're OK, but we could prevent their inclusion in ELPA.)
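To make that concrete, here is a minimal Emacs Lisp sketch of what such a declaration could look like, loosely modeled on the 'plugin_is_GPL_compatible' symbol that dynamic modules must export. All of the names here ('llm-foobar-respects-freedom', 'llm-maybe-warn') are hypothetical and not part of any existing package:

;; A hypothetical provider package, llm-foobar.el, declares whether
;; the LLM it talks to meets whatever freedom criteria we settle on.
(defvar llm-foobar-respects-freedom nil
  "Non-nil means the foobar LLM is considered to respect users' freedom.")

;; The core library could then check for that declaration before use
;; and warn the user when it is absent or nil.
(defun llm-maybe-warn (provider)
  "Warn unless PROVIDER declares that it respects the user's freedom.
PROVIDER is a symbol such as `llm-foobar'."
  (let ((flag (intern-soft (format "%s-respects-freedom" provider))))
    (unless (and flag (boundp flag) (symbol-value flag))
      (lwarn 'llm :warning
             "The %s provider may not respect your freedom as a user"
             provider))))

;; Example: (llm-maybe-warn 'llm-foobar) emits the warning above,
;; since `llm-foobar-respects-freedom' is nil; a freedom-respecting
;; provider would set it to t and load silently.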

[1] https://www.fsf.org/bulletin/2023/spring/trademarks-volunteering-and-code-generating-llm

[2] https://www.gnu.org/software/emacs/manual/html_node/elisp/Module-Initialization.html


