
LiveBench: A Challenging, Contamination-Free LLM Benchmark

This paper introduces LiveBench, a benchmark designed to mitigate both test set contamination and the biases of LLM-based and human judging. It includes a variety of challenging tasks across math, coding, reasoning, data analysis, and more, with questions drawn from recently released competitions, arXiv papers, and news articles, and answers scored automatically against objective ground truth. New questions are released on an ongoing basis to keep the benchmark contamination-free and relevant. The authors evaluate a wide range of prominent models and find that even the top performers achieve below 65% accuracy, highlighting the benchmark's difficulty.