It's a bit like whitewashing, because it
reconstructs text generatively by finding
artificial or contrived associations between
different works, associations the authors had
not intended but that may have been part of
their inspiration, and it compresses the
information based on these associations.
It's a bit like running a lossy 'zip' on the
internet and then decompressing
probabilistically.
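To make the 'lossy zip' analogy concrete, here is a toy sketch. It is not how GPT works internally: a word-pair table stands in for the model, and the names `compress` and `decompress` and the sample corpus are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def compress(text):
    """Lossy 'compression': keep only counts of which word follows which."""
    words = text.split()
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table

def decompress(table, start, length, rng):
    """Regenerate text by sampling each next word from the stored counts."""
    out = [start]
    # Stop at the requested length or when the last word has no known follower.
    while len(out) < length and out[-1] in table:
        followers = table[out[-1]]
        choice = rng.choices(list(followers), weights=list(followers.values()), k=1)[0]
        out.append(choice)
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug"  # made-up corpus
table = compress(corpus)
text = decompress(table, "the", 8, random.Random(0))
print(text)
```

The original word order is thrown away; what comes back is a plausible recombination of fragments that were in the corpus.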
When run deterministically (set the temperature of GPT to 0), you may actually
see 'snippets' from various places, every time, with the same input generating
the same snippets.
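A minimal sketch of why temperature 0 gives repeatable output, assuming the usual softmax-with-temperature sampling scheme. The function name `sample_token` and the scores are hypothetical; this is not GPT's actual decoder.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw scores (logits) at a given temperature."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature; subtract the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens
rng = random.Random(0)
greedy = [sample_token(logits, 0, rng) for _ in range(5)]     # same index every time
sampled = [sample_token(logits, 1.0, rng) for _ in range(5)]  # can vary between draws
print(greedy, sampled)
```

At temperature 0 the highest-scoring token always wins, so the same input walks the same path through the model and surfaces the same snippets.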
So the source material is important.
What GitHub did was very, very bad but they
did it anyway.
That doesn't mean GPT is bad; it just means
they zipped up content they should not have
and created this language 'index' (or 'codex',
as they call it).
What they really should do, if they are honest
people, is train separate models on subsets of
GitHub code grouped by license, and release
each model under the same license as its
training code.
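As a rough sketch of that split, assuming each repository record carries a declared license identifier. The record layout, the `partition_by_license` helper and the "UNLICENSED" bucket are all hypothetical, not anything GitHub provides.

```python
from collections import defaultdict

def partition_by_license(repos):
    """Group repository names by their declared license identifier."""
    groups = defaultdict(list)
    for repo in repos:
        # Repos with no declared license get their own bucket instead of a guess.
        key = repo.get("license") or "UNLICENSED"
        groups[key].append(repo["name"])
    return dict(groups)

# Made-up records for illustration only.
repos = [
    {"name": "alpha", "license": "MIT"},
    {"name": "beta", "license": "GPL-3.0"},
    {"name": "gamma", "license": "MIT"},
    {"name": "delta", "license": None},
]
corpora = partition_by_license(repos)
for license_id, names in corpora.items():
    print(license_id, names)
```

Each bucket would then be the training set for one model, released under that bucket's license.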
Shane Mulligan