Evaluating LLMs with benchmarks will be over very soon. Here is why:
1. Benchmarks are based on human knowledge
3. Benchmarks are difficult to collect and often biased
3. LLMs already score close to 100% on benchmarks like the SAT and the bar exam
From the three points above, it is clear that we will run out of benchmarks soon.
And shortly after, we will run out of human referees too, as LLMs will do things that are beyond human measure.
The Elo system described by David Hershey looks promising:
LLMs competing against each other, in their own non-human league, like chess-bot competitions.
Chess is a good example. Nowadays, top human players cannot explain most of a bot's evaluations of a position. The same goes for Go.
Without the Elo system, it would be impossible to evaluate and compare chess bots.
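To make the Elo mechanism concrete, here is a minimal sketch in Python of how pairwise match results update ratings. The K-factor of 32, the starting rating of 1500, and the model names are illustrative assumptions, not details from Hershey's proposal.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one match.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k controls how fast ratings move; 32 is a common choice.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Hypothetical head-to-head: "model_a" beats "model_b", with the winner
# decided by an automated check rather than a human referee.
ratings = {"model_a": 1500.0, "model_b": 1500.0}
ratings["model_a"], ratings["model_b"] = update_elo(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
print(ratings)  # model_a gains 16 points, model_b loses 16
```

After enough matches, the ratings order the models even when no human can judge the quality of an individual answer, just as Elo orders chess bots whose moves no grandmaster can explain.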