Re: Concerns/questions around Software Heritage Archive
From: Maxim Cournoyer
Subject: Re: Concerns/questions around Software Heritage Archive
Date: Thu, 09 May 2024 12:00:10 -0400
User-agent: Gnus/5.13 (Gnus v5.13)
Hi Ian, Ludovic.
Ludovic Courtès <ludo@gnu.org> writes:
> Hi Ian,
>
> Ian Eure <ian@retrospec.tv> writes:
>
>> Summarizing the situation:
>>
>> - SHF has an opaque, difficult, and undocumented process for
>> handling name changes. I’d like to stress again that this is
>> *not* strictly a transgender issue (though it likely affects them
>> more, or in worse/different ways) -- it is a human respect issue.
>> Many, many more cisgender people change their name than
>> transgender people.
>
> It is also not strictly an SWH issue: how does Internet Archive handle
> name changes? What about append-only storage in general? We’ve
> discussed this already.
>> - SHF gave their archive to HuggingFace, an "AI" company which is
>> generating derived works with no attribution or provenance, in
>> ways which violate both the licenses of the projects used to train
>> their model, and the SHF principles for LLMs.
>
> [...]
>
>> - Has Guix reached out to SHF to express these concerns / get a
>> response?
>
> I’ve seen and participated in informal discussions, but that’s all I
> know. Maintainers?
We haven't. Given that some improvements were apparently already made by
SWH in response to concerns raised, it seems the dialogue should continue.
>> - Whether a public or private response, what would Guix consider to
>> be an acceptable response? An unacceptable response?
>> - How long is Guix willing to wait for a response?
>
> Free software people, myself included, have expressed disappointment
> regarding the use of code harvested by SWH for HuggingFace’s training.
> Stefano Zacchiroli of SWH responded to these concerns on Mastodon back
> in March, as you probably saw.
>
> One important point is that copyleft code is excluded from the training
> dataset; I was able to anecdotally check that for GPL code such as Guix
> using their interface (there was a thread on Mastodon but I can’t find
> it): <https://huggingface.co/spaces/bigcode/in-the-stack>. That
> addresses my main concern.
>
> Remaining concerns include the weak wording of the principles put
> forward by SWH in its statement on LLMs:
> <https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/>.
> I think this is something worth discussing further with them (it’s
> already been brought up notably on Mastodon). It’s not clear to me
> whether this is a task for Guix as a project.
I don't think it is a task for Guix specifically, but rather for all
users of SWH or interested parties.
--
Thanks,
Maxim