Re: Concerns/questions around Software Heritage Archive
From: Ludovic Courtès
Subject: Re: Concerns/questions around Software Heritage Archive
Date: Thu, 02 May 2024 12:28:56 +0200
User-agent: Gnus/5.13 (Gnus v5.13)
Hi Ian,
Ian Eure <ian@retrospec.tv> writes:
> Summarizing the situation:
>
> - SHF has an opaque, difficult, and undocumented process for
> handling name changes. I’d like to stress again that this is
> *not* strictly a transgender issue (though it likely affects them
> more, or in worse/different ways) -- it is a human respect issue.
> Many, many more cisgender people change their name than
> transgender people.
It is also not strictly an SWH issue: how does the Internet Archive handle
name changes? What about append-only storage in general? We’ve
discussed this already.
> - SHF gave their archive to HuggingFace, an "AI" company which is
> generating derived works with no attribution or provenance, in
> ways which violate both the licenses of the projects used to train
> their model, and the SHF principles for LLMs.
[...]
> - Has Guix reached out to SHF to express these concerns / get a
> response?
I’ve seen and participated in informal discussions, but that’s all I
know. Maintainers?
> - Whether a public or private response, what would Guix consider to
> be an acceptable response? An unacceptable response?
> - How long is Guix willing to wait for a response?
Free software people, myself included, have expressed disappointment
regarding the use of code harvested by SWH for HuggingFace’s training.
Stefano Zacchiroli of SWH responded to these concerns on Mastodon back
in March, as you probably saw.
One important point is that copyleft code is excluded from the training
dataset; I was able to anecdotally check that for GPL code such as Guix
using their interface (there was a thread on Mastodon but I can’t find
it): <https://huggingface.co/spaces/bigcode/in-the-stack>. That
addresses my main concern.
Remaining concerns include the weak wording of the principles put
forward by SWH in its statement on LLMs:
<https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/>.
I think this is something worth discussing further with them (it’s
already been brought up, notably on Mastodon). It’s not clear to me
whether this is a task for Guix as a project.
(I do not forget that, in the meantime, Microsoft ingests everything
that’s on GitHub, including copyleft code, and including clones of repos
that were not initially hosted there.)
I’m not sure this is the kind of answer you expected, but I hope it
makes sense!
Ludo’.