How do we evaluate an LLM that keeps getting more complex and more skilled?

The era of evaluating LLMs with benchmarks will be over very soon. Here is why:

1. Benchmarks are based on human knowledge

2. Benchmarks are difficult to collect, and they are biased

3. LLMs already score close to 100% on benchmarks like the SAT and the bar exam

Taken together, these three points suggest that we will run out of useful benchmarks soon.

And shortly after that, we will run out of human referees too, as LLMs start doing things that humans can no longer judge.

The Elo rating system described by David Hershey looks promising.

LLMs competing against each other, in their own non-human league, like chess bot competitions.

Chess is a good example.

Nowadays, even top human players cannot explain most of a bot's evaluations of a position. The same is true for Go.

Without the Elo system, it would be impossible to evaluate and compare chess bots.
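
To make the idea concrete, here is a minimal sketch of how an Elo-style league for LLMs could update ratings after each head-to-head match. It uses the standard Elo formulas from chess. The model names, the match results, the starting rating of 1500 and the K-factor of 32 are all illustrative assumptions, not details of the setup David Hershey describes.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like score that A is expected to get against B (standard Elo)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one match.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k = 32 is a common chess default, assumed here for illustration.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating in the league is conserved.
    return rating_a + delta, rating_b - delta


# Hypothetical league: every model starts at the same rating.
ratings = {"model_a": 1500.0, "model_b": 1500.0, "model_c": 1500.0}

# Hypothetical match log: (first model, second model, score of the first model).
matches = [
    ("model_a", "model_b", 1.0),
    ("model_b", "model_c", 0.5),
    ("model_a", "model_c", 1.0),
]

for first, second, score in matches:
    ratings[first], ratings[second] = update_elo(ratings[first], ratings[second], score)

# Print the resulting leaderboard, strongest model first.
for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

The open question in such a league is who decides the outcome of each match. In chess the rules decide it. For LLMs it would have to be a task with a verifiable result, or another judge, and that choice is left out of this sketch.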