Turing Pi

Bad design of the benchmark imo. Of course you can benchmark everything; it only boils down to getting reproducible results, designing meaningful tests and controlling for potential confounders.
Surely difficult at times, but not impossible. The main difference with those models is that they are fairly difficult to compare, as no one has designed an accepted benchmark test set so far, let alone a test procedure, i.e. how do you deal with the statistical nature of the output?
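
To make the "statistical nature" point a bit more concrete, here is a rough sketch of one possible approach: run the same test several times, score each run, and report the mean together with a bootstrap confidence interval instead of a single number. The function name `bootstrap_ci` and the scores in the usage note are made up for illustration and are not from any real benchmark.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) for a list of per-run scores.

    Hypothetical helper: resamples the runs with replacement and takes
    percentiles of the resampled means as a rough confidence interval.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(scores) for _ in scores]
        means.append(statistics.fmean(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lower, upper
```

Calling `bootstrap_ci([0.6, 0.7, 0.65, 0.8, 0.55])` on five made-up run scores gives the mean plus an interval around it. If the intervals of two models overlap heavily, the test probably can't tell them apart.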
That's why this chatbot arena with head-to-head duels, just like chess matches, is actually a good way forward. Eventually it converges to a decent "general ability" leaderboard. Of course such scores only order the models and don't allow statements like "x% better than" or so. Better than nothing though.
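
For anyone wondering how duels turn into a leaderboard, here is a minimal sketch of an Elo-style update as used in chess. This is an assumption about how such a ranking could be computed, not the arena's actual implementation; the model names, the starting rating of 1000 and the K-factor of 32 are placeholders.

```python
def expected_score(r_a, r_b):
    """Expected score of A against B under the Elo model (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, a, b, result, k=32):
    """Update ratings in place after one duel.

    result is 1.0 if a won, 0.0 if b won, 0.5 for a tie.
    k (assumed value) controls how fast ratings move.
    """
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (result - e_a)
    ratings[b] += k * ((1.0 - result) - (1.0 - e_a))

# Hypothetical models, all starting at the same rating.
ratings = {"model_x": 1000.0, "model_y": 1000.0, "model_z": 1000.0}
update(ratings, "model_x", "model_y", 1.0)  # x beats y
update(ratings, "model_y", "model_z", 0.5)  # y and z tie
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Note that the ratings only give an ordering plus a win probability between two models. A rating gap does not translate into "x% better" on any task.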