We consistently find across all our experiments that, across concepts, the frequency of a concept in the pretraining dataset is a strong predictor of the model’s performance on test examples containing that concept. Notably, model performance scales linearly as the concept frequency in pretraining data grows exponentially.
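To spell out the quoted relationship: "linear performance for exponentially growing frequency" is the same as saying performance is linear in the logarithm of frequency. Here is a minimal sketch of that form. The function name and the coefficients are hypothetical, picked only for illustration, and are not taken from the paper.

```python
import math

def predicted_accuracy(freq, intercept, slope_per_decade):
    """Log-linear form: accuracy rises by slope_per_decade for each 10x in frequency."""
    return intercept + slope_per_decade * math.log10(freq)

# Made-up coefficients for illustration only (not fitted to any real data).
INTERCEPT, SLOPE = 0.10, 0.15

# Each step multiplies the concept frequency by 10.
for freq in (10, 100, 1_000, 10_000):
    print(freq, round(predicted_accuracy(freq, INTERCEPT, SLOPE), 2))
```

Under this form, every 10x increase in concept frequency buys the same fixed gain, so each additional point of performance costs exponentially more data.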
This reminds me of an older paper on how LLMs can't even do basic math when examples fall outside the training distribution (note that this was GPT-J, and as far as I'm aware no such analysis is possible with GPT-4, I wonder why), so this phenomenon is not exclusive to multimodal stuff. It's one thing to pre-train a large-capacity model on a general task that might benefit downstream tasks, but wanting these models to be general purpose is really, really silly.
I'm of the opinion that we're approaching a crisis in AI: we've hit a barrier on what current approaches are capable of achieving, and no amount of data, labelers, tinkering with architectural minutiae or (god forbid) "prompt engineering" can fix that. My hope is that with the bubble bursting the field will have to reckon with the need for algorithmic and architectural innovation, more robust standards for what constitutes a proper benchmark and reproducibility at the very least, and maybe, just maybe, extend its collective knowledge from other fields of study past 1960s neuroscience and explore the ethical and societal implications of its work more deeply than the oftentimes tiny obligatory ethics section of a paper. That is definitely an overgeneralization, so sorry to any researchers out here <3, I'm just disillusioned with the general state of the field.
You're correct about the C-suites though: all they needed to see was one of those stupid graphs that showed line going up, with model capacity on the x-axis and performance on the y-axis, and their greed did the rest.