What it means when AI cannot do something?
Inability, Accidental Mistake, Under-elicitation, Sandbagging
I evaluate model and look at evaluation results all the time. I am impressed by models' capability to solve all kinds of hard task. But sometimes I find it surprisingly dumb and can't get some simple answers right. So I wonder, what does it mean when a model cannot do something correctly? After some thoughts, I summarized four reasons below:
Inability
Accidental Mistake
Under-elicitation
Sandbagging

Inability
The model is just dumb. This is likely the most common reason but might not hold true for future models (you will see why in Sandbagging section)
Accidental Mistake
The model is doing next token prediction and it chooses the next token by chance. It can choose a wrong token by small chance, especially when the temperature is low. Hence, we need to run evaluation multiple times to overcome it.
Under Eliciting
This is related with inability but I think we should only start blaming the model is dumb after a decent amount of elicitation. This means we should (from low elicitation to high)
Do CoT
Use XML or JSON or whatever formats properly --- don't make the model's life too hard.
Use thinking mode with decent amount of thinking budget
Give proper tools without leaking the answer.
Small amount of fine-tuning to familiarize it with the task setup
…
I recommend reading METR’s Guidelines for capability elicitation for some good practices.
Sandbagging
This is probably the most unexpected and interesting reason. Sandbagging means the model intentionally underperform a task. This can happen if a model is incentivized to perform poorly. For example, researches have shown that if we tell the model to perform poorly, the model can achieve it.
How about when the model is not asked to do so? It might sandbagging still. As we discussed before, this requires two components:
AI has a goal. The goal can be
Taught by human: be helpful/harmless/honest to the user, for example.
Instrumental: self-preservation, for example. Concretely, that could be deploying to the world.
That goal is not achievable if the model achieves a high score on certain tasks. For example, If an AI reaches high score in dangerous capability evaluation (DCE), then it won’t go out to the world immediately.
This is concerning because if a model becomes smart enough and our evaluation is not realistic enough, the model might recognize its situation and start sandbagging. In fact, existing models can already distinguish between real life and evaluation environment. This reduces the amount of signal from DCE.
How to deal with sandbagging? Well, likely we need to train the model to be always helpful and that’s usually costly, especially under tight evaluation constraint. You might find this post helpful. But most importantly, even after training, we still can’t be sure that the model is not sandbagging.
How to measure sandbagging? This is outside the scope of this post. But just briefly: We require a real task + many different incentives. We use a real task because creating a realistic evaluation is hard and we have many real tasks like chatting with the user and resolving bugs to track. We prefix the AI with many different incentives so that we can track whether the model will behave differently, if so, then yes, the model can sandbag. Those incentives need to be realistic as well, which is kinda tricky but probably doable. I might write a post about this later.
I’m curious to learn more about sandbagging and example scenarios you’ve dealt with in future posts! Nice graphic illustration :)