July 11, 2024

Smarter Tests for LLMs on the Rise

Large language models (LLMs) are rapidly approaching human-level performance on popular tests, prompting Hugging Face to revamp its influential Open LLM Leaderboard with tougher challenges.

The new system uses more rigorous benchmarks designed to resist manipulation and better reflect real-world capabilities. This shake-up resulted in significant changes, with some models jumping or dropping as many as 59 places.

The revised leaderboard tackles benchmark saturation and manipulation with a fresh set of tests (a sketch of how to run them locally follows the list):

  • MMLU-Pro: This updated version of the original MMLU multiple-choice set offers ten answer choices (instead of four), eliminates overly simple questions, and raises the difficulty further with more misleading distractors. Its scores align well with human preferences as measured by the LMSYS Chatbot Arena.
  • GPQA: Designed to challenge even PhD-level expertise, these biology, physics, and chemistry questions are difficult for non-specialists to answer even with unrestricted web search.
  • MuSR: This benchmark assesses complex multi-step reasoning by asking models to solve mysteries, assign tasks in narratives, and identify the location of objects within stories.
  • MATH lvl 5: Only the most challenging (level-5) multi-step problems from the MATH competition dataset.
  • IFEval: Testing a model’s ability to follow explicit instructions, such as ‘no capital letters allowed’ or structuring a response into sections.
  • BIG-Bench Hard: This benchmark evaluates skills like understanding complex logic, detecting sarcasm, and recognizing shapes from graphical data. 
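
For readers curious how these benchmarks are actually run, the sketch below shows one way to score a model on them locally with EleutherAI’s open-source lm-evaluation-harness, which the Open LLM Leaderboard also builds on. The leaderboard_* task names, the example model id, and the batch size are illustrative assumptions based on recent harness releases, not an exact reproduction of Hugging Face’s setup.

    # Hypothetical sketch: scoring a model on the new leaderboard-style benchmarks
    # with EleutherAI's lm-evaluation-harness (pip install lm-eval).
    # Task names follow the harness's "leaderboard_*" groups and may vary by version.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",  # Hugging Face transformers backend
        model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct",  # example model id (assumption)
        tasks=[
            "leaderboard_mmlu_pro",
            "leaderboard_gpqa",
            "leaderboard_musr",
            "leaderboard_math_hard",  # MATH level-5 subset
            "leaderboard_ifeval",
            "leaderboard_bbh",        # BIG-Bench Hard
        ],
        batch_size=8,
    )

    # Print the metrics reported for each task
    for task, metrics in results["results"].items():
        print(task, metrics)

Running all six suites on a full-size model is compute-heavy, so in practice you might start with a single task (for example, leaderboard_ifeval) to sanity-check the setup before committing to a full run.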

Combating Leakage: Test questions leaking into training data and compromising evaluation is a growing concern. While Hugging Face relies on open benchmarks, some organizations are exploring alternative solutions:

  • Vals.AI: This independent testing company develops proprietary, industry-specific tests for the legal and financial sectors.
  • Scale AI: This data consultancy maintains its own leaderboards based on proprietary tests covering natural languages, math, and coding.

Understanding the Importance

With two million unique visitors in the past year and over 300,000 active Hugging Face community members, the Open LLM Leaderboard wields significant influence. Developers rely on its scores to select models and track progress in the open-source LLM landscape. The tougher benchmarks should make evaluation more rigorous and the resulting rankings more objective.

Picture from Freepik
