It's only open source if the training data is and it probably isn't, is it?
I don't know, though DeepSeek talk of theirs being "fully" open-source.
Part of the advantage of doing this (apart from helping bleed your rivals dry) is to get the benefit of others working on your model. So it makes sense to maximise openness and access.
Realistically, no LLM that’s large enough to be competitive will be able to remain open-source, even if it started out that way (and, as you point out, most that claim to be open-source never actually were), because so much training data is needed.
Often the training data can’t be redistributed in the first place. Even when it can be, making it available makes it much more likely that someone will request the takedown of some of the data in the set: even if the data was licensed, someone who holds the copyright might claim that the person who submitted it to the set wasn’t permitted to do so. At that point, unless the takedown request is refused or the model itself is re-trained without that data (which would be quite expensive), the published data is no longer sufficient to reproduce the model.