Problem To Solve
It’s not always possible to tell what data a model was trained on, even with full access to the model weights, because training a model compresses its training data into the weights. This introduces several challenges that do not exist in traditional software:
• It complicates copyright claims.
• It is harder to fairly compensate the owners of the training data.
• It is harder to attribute who trained which part of the model when training is split across multiple parties, e.g. Model Zoo.
• It is easier to introduce biases into models.
Proposed Solution
Make the training process itself verifiable: build tools that break down how a model was trained and check whether it contains a given piece of data. Several approaches could be explored. One is to integrate cryptographic primitives into the training process itself. For example, a PyTorch NFT callback hashes the current network weights, some metadata (dataset, accuracy, etc.), and the trainer’s Ethereum address every N epochs, producing a verifiable record of who trained the model; a sketch of such a callback is given below. Note: this approach adds a performance overhead to training.
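Below is a minimal sketch of what such a callback could look like, assuming a hand-rolled PyTorch training loop rather than any particular framework. The class name ProofOfTrainingCallback, the on_epoch_end hook, and the helpers in the commented usage example are illustrative, not part of the primer.

```python
import hashlib
import json

import torch


class ProofOfTrainingCallback:
    """Commits to the model weights every N epochs.

    Each checkpoint digest binds the exact weights, some training
    metadata, and the trainer's Ethereum address together, so the
    digests could later be published (e.g. as NFT metadata) and
    re-verified by anyone holding the weights.
    """

    def __init__(self, eth_address: str, every_n_epochs: int = 10):
        self.eth_address = eth_address
        self.every_n_epochs = every_n_epochs
        self.checkpoints: list[tuple[int, str]] = []  # (epoch, digest)

    def _hash_weights(self, model: torch.nn.Module) -> str:
        # Hash parameters in sorted-name order so identical weights
        # always yield the identical digest.
        h = hashlib.sha256()
        for name, tensor in sorted(model.state_dict().items()):
            h.update(name.encode())
            h.update(tensor.detach().cpu().numpy().tobytes())
        return h.hexdigest()

    def on_epoch_end(self, epoch: int, model: torch.nn.Module, metrics: dict) -> None:
        if epoch % self.every_n_epochs != 0:
            return
        record = {
            "epoch": epoch,
            "weights_sha256": self._hash_weights(model),
            "metrics": metrics,           # e.g. {"accuracy": 0.91}
            "trainer": self.eth_address,  # who is claiming this training run
        }
        digest = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        self.checkpoints.append((epoch, digest))


# Hypothetical usage inside a plain training loop; train_one_epoch and
# evaluate stand in for whatever training/eval code the project uses.
# cb = ProofOfTrainingCallback(eth_address="0xYourAddress", every_n_epochs=5)
# for epoch in range(100):
#     train_one_epoch(model, loader, optimizer)
#     cb.on_epoch_end(epoch, model, {"accuracy": evaluate(model)})
```

Publishing only the digests keeps the on-chain footprint small: anyone who later obtains the weights and metadata can recompute the hashes and check them against the published record. The performance overhead noted above comes from serializing and hashing the full state dict at each checkpoint.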
Inspiration
Source: Mohamed Baioumy & Alex Cheema (AI x Crypto Primer). Full credit goes to these two legends.