libreplanet-discuss

[libreplanet-discuss] Machine learning and copyleft


From: Amias Hartley
Subject: [libreplanet-discuss] Machine learning and copyleft
Date: Sat, 10 Dec 2016 03:42:34 +0300

Let's consider a machine learning system consisting of two parts:

1. Training program. It takes a dataset and produces a trained model. A trained model is usually stored as a few serialized arrays of floating point numbers.
2. Inference program. This one takes a pre-trained model and some data as input and produces some output based on them.
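The two-part split above can be sketched in a few lines. This is a hypothetical illustration, not code from any real framework: the toy one-parameter linear model, the function names, and the JSON serialization are all my own choices, standing in for the "serialized arrays of floating point numbers" and the train/infer division described above.

```python
import json

# --- Training program (hypothetical sketch) ---
# Fits a one-parameter linear model y ~ w * x by least squares and
# returns the learned weight as a serialized-friendly array of floats.
def train(dataset):
    xs = [x for x, _ in dataset]
    w = sum(x * y for x, y in dataset) / sum(x * x for x in xs)
    return {"weights": [w]}  # the "trained model": just numbers

# --- Inference program ---
# Takes a pre-trained model and some input data; produces output.
def infer(model, x):
    (w,) = model["weights"]
    return w * x

# Toy dataset: pairs (x, y) generated by y = 2 * x.
dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

model = train(dataset)
serialized = json.dumps(model)   # what actually gets distributed
restored = json.loads(serialized)
print(infer(restored, 5.0))      # → 10.0
```

Note that the distributed artifact (`serialized`) carries no trace of the training code that produced it, which is exactly what makes the licensing question below possible.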

Both training and inference programs are free software licensed under GPL.

Let's also suppose there is a model that is the result of running the training program on some publicly available dataset. This dataset (for example, a set of labeled photographs or natural-language texts) may be used without limitation for training machine learning models, and the publisher places no restrictions on the use of the trained models.

So the entire system is distributed as: the training program with sources, the inference program with sources, the training dataset, and the trained model.

Someone could take this system, modify the training program, and train a new model on the same dataset. They could then publish only the inference program with its sources, the unmodified training dataset, and the new trained model. The end user doesn't need the modified training program to run the inference program with the new model, so it is simply not distributed: technically, the only user of the modified training program is the person who trained the new model with it, so the GPL doesn't require its distribution.

However, in this case the freedom of users of the distributed system (the inference program and the new model) is violated: they can neither retrain the model on new data, nor improve the training code and retrain on the same data to improve the model's performance.

My question is: how is it possible to protect users' freedom by requiring everyone who distributes a trained model to also distribute the sources of the training program that was used to train it, along with instructions for obtaining the training dataset?

Could the GPL solve this problem, or is it not enough for this case? If not, is there a license that provides the required guarantees? Note that, by the definition of the problem, the dataset is published by a third party: while it can be used without restriction for any machine learning task, it cannot be relicensed. In any case, the protection of the freedom to obtain modified sources of the training program should be preserved even if some other dataset is used to train a new model instead of the original one.

P. S. It's an interesting question whether model weights can be considered software. Some machine learning models can in theory contain arbitrary logic, such as neural Turing machines (https://arxiv.org/abs/1410.5401); others, such as convolutional neural networks (https://en.wikipedia.org/wiki/Convolutional_neural_network), are more limited in their capabilities but very expressive in practice; and still others, for example logistic regression, are much more limited. It is desirable to have a way to protect users' freedom regardless of the complexity of a particular model.
