As large language models (LLMs) rapidly advance, benchmarking their capabilities has become essential for assessing progress and guiding future research. A diverse array of benchmarks has emerged, each designed to evaluate specific facets of language understanding and generation, spanning domains such as commonsense reasoning, mathematical problem-solving, code generation, and question-answering.
We analyzed the most popular open- and closed-source LLMs to compile a comprehensive list of the benchmarks most widely used to evaluate state-of-the-art models.
Below are some of the most popular LLMs and their accompanying papers, according to the LMSYS Chatbot Arena Leaderboard, a crowdsourced open platform with over 500,000 human preference votes:
Claude 3: Technical Report
Gemini 1.5: Technical Report
GPT-4: Technical Report
Command R+: Blog Post
Qwen1.5: Blog Post
Mistral Large: Blog Post
Mixtral 8x7B: Blog Post
Llama 2: Technical Report
Now, on to the main list of benchmarks!
Commonsense Reasoning
HellaSwag
Objective: Test how well an LLM can understand and apply everyday knowledge to logically complete scenarios.
Format: Multiple choice; given a situation, the model must pick the most plausible continuation from four candidate endings (see the scoring sketch below).
Challenge: Difficult for LLMs but easy for humans (>95% accuracy) due to the need for real-world knowledge and logical reasoning.
Original paper: HellaSwag: Can a Machine Really Finish Your Sentence?
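
A common way to evaluate a causal LLM on HellaSwag-style items is to score each candidate ending by the (length-normalized) log-likelihood the model assigns to it given the context, then pick the highest-scoring one. The sketch below illustrates this under stated assumptions: the model (gpt2), the use of Hugging Face transformers, and the example item (paraphrased from the paper's illustrative "dog bath" scenario) are chosen for demonstration and are not the official evaluation harness.

```python
# Minimal sketch: length-normalized log-likelihood scoring for a
# HellaSwag-style multiple-choice item. Model choice (gpt2) and the
# example item are illustrative assumptions, not the official setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def score_continuation(context: str, ending: str) -> float:
    """Average log-probability the model assigns to `ending` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Each position t predicts token t+1, so shift logits/targets by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_log_probs = log_probs[torch.arange(targets.shape[0]), targets]
    # Average only over the tokens belonging to the ending.
    ending_len = full_ids.shape[1] - ctx_ids.shape[1]
    return token_log_probs[-ending_len:].mean().item()

# Example item in the HellaSwag format: context plus four candidate endings.
context = ("A woman is outside with a bucket and a dog. "
           "The dog is running around trying to avoid a bath. She")
endings = [
    "rinses the bucket off with soap and blow-dries the dog's head.",
    "uses a hose to keep it from getting soapy.",
    "gets the dog wet, then it runs away again.",
    "gets into a bathtub with the dog.",
]
scores = [score_continuation(context, e) for e in endings]
prediction = max(range(len(endings)), key=lambda i: scores[i])
print(f"Predicted ending index: {prediction}")
```

Accuracy on the benchmark is then simply the fraction of items where the predicted index matches the gold label; evaluation harnesses differ mainly in tokenization details and whether they normalize by ending length.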
