KOR-BENCH

A Benchmark for Knowledge-Orthogonal Reasoning Tasks


1Multimodal Art Projection, 2ByteDance.Inc, 301.AI, 42077.AI, 5Tongji University, 6École Polytechnique, 7University of Illinois at Urbana-Champaign, 8University of Manchester, 9Carnegie Mellon University

*Equal Contribution
†Corresponding to: mkj3085003@gmail.com, duxinrun2000@gmail.com, gezhang@umich.edu

Introduction

Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench) is designed to evaluate models' intrinsic reasoning and planning abilities by minimizing interference from pretrained knowledge. It introduces new rules that are independent of prior knowledge, allowing for a more accurate assessment of how models adapt to novel rule-driven tasks. KOR-Bench consists of five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. Leading models, such as Claude-3.5-Sonnet and GPT-4o, score around 58% on this challenging benchmark.

KOR-Bench

Overview

KOR-Bench contains five categories, each with 25 manually defined rules that are deliberately modified so that they do not appear in common pre-training data, keeping the setting orthogonal to domain-specific knowledge. Each rule is accompanied by 10 problem instances designed to evaluate reasoning based on that rule. A detailed classification of the five task categories, including the number of rules in each and the distribution of answer formats, is given in the paper.
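To make the data layout concrete, here is a minimal, purely illustrative sketch in Python (field names and the example rule are hypothetical, not taken from the released data) of how one rule and its problem instances could be represented:

# One rule with its 10 rule-driven problem instances; the actual schema
# is defined in the official KOR-Bench repository and may differ.
example_record = {
    "category": "cipher",     # operation, logic, cipher, puzzle, or counterfactual
    "rule_id": "cipher-07",   # 25 manually defined rules per category
    "rule": "Encrypt by shifting each letter forward by the length of the word.",
    "questions": [            # 10 problem instances per rule
        {"question": "Encrypt the plaintext 'KOR'.", "answer": "NRU"},
        # ... nine more instances
    ],
}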


The five task categories are designed to test a model's reasoning ability by introducing new elements and rules. Each category is based on one of the following new elements: new symbols, new concepts, new execution rules, new problem-solving frameworks, and new story-context settings. They are defined as follows, with an illustrative sketch after the list:

  • Operation Reasoning Task: Understand new definitions of mathematical symbols and apply this knowledge to perform calculations in mathematical reasoning tasks.
  • Logic Reasoning Task: Reason and solve problems based on new logical rules and newly categorized logical concepts in logical reasoning tasks.
  • Cipher Reasoning Task: Perform encryption and decryption operations according to new execution rules in cryptography reasoning tasks.
  • Puzzle Reasoning Task: Solve various puzzles and intellectual games based on newly defined problem-solving frameworks in conditional constraint and combinatorial reasoning tasks.
  • Counterfactual Reasoning Task: Engage in hypothetical thinking and reasoning within new story contexts in conjectural scenario reasoning tasks.
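As a sketch of what a knowledge-orthogonal rule looks like in practice, consider an Operation task that introduces a new symbol whose meaning must be read from the rule text rather than recalled from pre-training. The operator and its definition below are invented for illustration and are not taken from the benchmark:

# Hypothetical Operation rule: a ⊕ b = a * b + a + b.
def circled_plus(a: int, b: int) -> int:
    return a * b + a + b

# A rule-driven question then asks the model to apply the definition,
# e.g. "Compute 3 ⊕ 4"; the reference answer is 19.
assert circled_plus(3, 4) == 19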

Statistics

Below are statistics on the total number of rules, average and maximum rule length, total number of questions, and average question length for KOR-Bench. The answer formats are categorized as Numerical Response (NR), Mathematical Expression (ME), Textual Response (TR), Multiple Choice (MC), and Structured Data (SD). For more detailed information, see Appendix A of the paper.
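Because answers come in several formats, correctness checks must be format-aware. The following is a hypothetical sketch of such a check; the actual evaluation code is in the official repository and may differ, and the function and variable names here are our own:

import json

def is_correct(prediction: str, reference: str, fmt: str) -> bool:
    p, r = prediction.strip(), reference.strip()
    if fmt == "NR":                  # Numerical Response
        return float(p) == float(r)
    if fmt == "MC":                  # Multiple Choice
        return p.upper() == r.upper()
    if fmt == "SD":                  # Structured Data
        return json.loads(p) == json.loads(r)
    return p == r                    # ME / TR: exact match as a simple fallback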

Experiment Results

🏆Leaderboard

We evaluate a range of state-of-the-art LLMs on KOR-Bench. The experiments focus on two model types: chat models and base models. Chat models are evaluated with a zero-shot prompting strategy, generating responses directly from the newly defined rules and questions; base models are evaluated with a three-shot prompting strategy, in which three generic Q&A pairs for each rule support in-context learning.
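The two settings can be summarized with the following sketch of prompt construction; the prompt wording here is hypothetical and only illustrates the distinction between the zero-shot and three-shot settings:

def chat_prompt(rule: str, question: str) -> str:
    # Zero-shot for chat models: the newly defined rule plus the question.
    return f"Rule:\n{rule}\n\nQuestion:\n{question}\n\nAnswer:"

def base_prompt(rule: str, question: str, examples: list[tuple[str, str]]) -> str:
    # Three-shot for base models: prepend three generic Q&A pairs for the rule.
    shots = "\n\n".join(f"Question:\n{q}\nAnswer:\n{a}" for q, a in examples[:3])
    return f"Rule:\n{rule}\n\n{shots}\n\nQuestion:\n{question}\nAnswer:"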


The leaderboard reports chat models and base models separately (toggleable on the project page) and distinguishes open-source from proprietary models. Columns: Model, Size, Submit Date, Overall, Operation, Logic, Cipher, Puzzle, Counterfactual.

The values in parentheses represent the proportion of real-life answers the models give in the counterfactual setting; lower is better. For all other values, higher is better. The best-performing model in each column is in bold, and the second best is underlined. Submit Date is the date the results were submitted for evaluation, providing context for model progress over time.

Showcase

This showcase presents a selection of examples, one per rule, for a total of 25 examples in each category. All responses are sourced from the Claude-3.5-Sonnet model (2024-06-20). The Multi-Q, Multi-R, and Multi-RQ categories each contain 10 examples, illustrating the three settings of Complex Task Processing.

Each showcase entry includes the rule, the rule-driven question, the reference answer, and the model's response.

Further Analysis

In this section, we present additional analytical experiments that deepen our understanding of model performance across tasks. For a comprehensive overview of these analyses, please refer to the paper and the repository.

BibTeX


@misc{ma2024korbenchbenchmarkinglanguagemodels,
  title={KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks}, 
  author={Kaijing Ma and Xinrun Du and Yunran Wang and Haoran Zhang and Zhoufutu Wen and Xingwei Qu and Jian Yang and Jiaheng Liu and Minghao Liu and Xiang Yue and Wenhao Huang and Ge Zhang},
  year={2024},
  eprint={2410.06526},
  archivePrefix={arXiv},
  primaryClass={cs.DB},
  url={https://arxiv.org/abs/2410.06526}, 
}