Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators


From: Michal Suchánek
Subject: Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators
Date: Thu, 23 Nov 2023 18:37:47 +0100
User-agent: Mutt/1.10.1 (2018-07-13)

On Thu, Nov 23, 2023 at 12:57:42PM +0000, Alex Bennée wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > There has been an explosion of interest in so called "AI" (LLM)
> > code generators in the past year or so. Thus far though, this
> > has not been matched by a broadly accepted legal interpretation
> > of the licensing implications for code generator outputs. While
> > the vendors may claim there is no problem and a free choice of
> > license is possible, they have an inherent conflict of interest
> > in promoting this interpretation. More broadly there is, as yet,
> > no broad consensus on the licensing implications of code generators
> > trained on inputs under a wide variety of licenses.
> >
> > The DCO requires contributors to assert they have the right to
> > contribute under the designated project license. Given the lack
> > of consensus on the licensing of "AI" (LLM) code generator output,
> > it is not considered credible to assert compliance with the DCO
> > clause (b) or (c) where a patch includes such generated code.
> >
> > This patch thus defines a policy that the QEMU project will not
> > accept contributions where use of "AI" (LLM) code generators is
> > either known, or suspected.
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > ---
> >  docs/devel/code-provenance.rst | 40 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 40 insertions(+)
> >
> > diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
> > index b4591a2dec..a6e42c6b1b 100644
> > --- a/docs/devel/code-provenance.rst
> > +++ b/docs/devel/code-provenance.rst
> > @@ -195,3 +195,43 @@ example::
> >    Signed-off-by: Some Person <some.person@example.com>
> >    [Rebased and added support for 'foo']
> >    Signed-off-by: New Person <new.person@example.com>
> > +
> > +Use of "AI" (LLM) code generators
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +TL;DR:
> > +
> > +  **Current QEMU project policy is to DECLINE any contributions
> > +  which are believed to include or derive from "AI" (LLM)
> > +  generated code.**
> > +
> > +The existence of "AI" (`Large Language Model
> > +<https://en.wikipedia.org/wiki/Large_language_model>`__
> > +/ LLM) code generators raises a number of difficult legal questions, a
> > +number of which impact on Open Source projects. As noted earlier, the
> > +QEMU community requires that contributors certify their patch submissions
> > +are made in accordance with the rules of the :ref:`dco` (DCO). When a
> > +patch contains "AI" generated code this raises difficulties with code
> > +provenance and thus DCO compliance.
> 
> I agree this is going to be a field that keeps lawyers well remunerated
> for the foreseeable future. However, I suspect this glosses over the main
> use case for LLM generators, which is non-novel transformation. One good
> example is generating test fixtures, where you write a piece of original
> code and then ask the code completion engine to fill out some unit tests
> to exercise the code. It's boring mechanical work, but one an LLM is well
> suited to (even if you might tweak the final result).

It may be suited to producing such code (disputable), but the code is not
suitable for inclusion in the project, for legal reasons.

> > +To satisfy the DCO, the patch contributor has to fully understand
> > +the origins and license of code they are contributing to QEMU. The
> > +license terms that should apply to the output of an "AI" code generator
> > +are ill-defined, given that both training data and operation of the
> > +"AI" are typically opaque to the user. Even where the training data
> > +is said to all be open source, it will likely be under a wide variety
> > +of license terms.
> > +
> > +While the vendors of "AI" code generators may promote the idea that
> > +code output can be taken under a free choice of license, this is not
> > +yet considered to be a generally accepted, nor tested, legal opinion.
> > +
> > +With this in mind, the QEMU maintainers do not consider it
> > +currently possible to comply with DCO terms (b) or (c) for most "AI"
> > +generated code.
> 
> There is a load of code out there that isn't eligible for copyright
> protection because it doesn't demonstrate much originality or creativity.
> In the experimentation I've done so far I've not seen much sign of genuine
> creativity. LLMs benefit from having access to a wide corpus of
> training data and tend to do a better job of inferring solutions from
> semi-related posts than, say, a human manually comparing posts after
> pasting an error message into Google.

And the license of that corpus of training data is not defined.

If you could erase the copyright on anything by feeding it into a
statistical model and pulling it back out, some big content license
holders would object, so that is very unlikely to happen.

Consequently, for all practical purposes the "AI"/LLM output is
derivative work of the input with all legal consequences.

This is, of course, only a problem for *generative* use of AI/LLM, where
the output can contain copies of substantial parts of the input.

Thanks

Michal